Choosing a Database in 2017

Choosing a Database in 2017

Tom Weiss, Thu 10 August 2017

The unstructured data hype of the last few years seems to be abating and we are are increasingly finding clients wanting to implement structured datasets. This is primarily because media data does fit relatively simply into a taxonomy: the consumer has various attributes and is consuming media which also has many different attributes.

The previous approach of dumping large volumes of unstructured data into a Hadoop cluster because “it might be useful later” is providing both expensive, inefficient, and provides a significant security risk.

This blog covers the current state of the art options

SQL database options

Although there are many options available for large-scale data warehouses, the most common implemented at scale are:

  • Redshift
  • Hive
  • Oracle
  • SQL Server

Redshift

Amazon’s Redshift provides a simple-to-deploy, secure, and fast data warehouse infrastructure. It is highly cost effective when deployed correctly but it is hard to find experienced staff who know how to correctly deploy it. Particularly troublesome areas are in establishing distribution and sort keys and choosing the correct architecture of the different nodes in the product. Redshift supports storage and querying of JSON object data within referential tables enabling the most efficient data structures.

Hive

Hive is a very popular addition to provide SQL functionality in a Hadoop environment and it is also available from AWS as part of the AMR product. It is fast and cost effective when Hadoop infrastructure is already available or used for other purposes but in a pure AWS environment is probably less effective than Redshift.

Hive supports storage and querying of JSON object data within referential tables.

Oracle

Oracle supports storage and querying of JSON object data within referential tables and the Exadata product provides a columnar database with strong performance against both Redshift and Hive in an on-premises model. We would only really recommend Oracle as a platform for a data lake for an organization that already has a strong internal DBA function familiar with the Oracle platform.

SQL server

SQL server is Microsoft’s relational database and can be deployed both on premises and within the cloud. It is very much a traditional database and supports columnar storage.

We would only really recommend SQL server for an organization that has strongly aligned their technology roadmap to the Microsoft platform.

SQL Server does not support storage and querying of JSON object data within referential tables

NOSQL database options

NOSQL databases provide massive data stores with alternative query languages for accessing the data and are typically used when developing applications rather than for a data analytics data store.

Key examples are:

  • Google BigTable – the database behind google search, analytics, and gmail can handle massive workloads in realtime and is provided by Google as a service through their cloud
  • Hadoop HBase – provide high performance read/write of large amounts of data but minimal queries of data
  • Cassandra – provides scalability and high availability with good performance with excellent cross-regional replication and query performance comparable with a relational database.

Need help? Get in touch...

Sign up below and one of our data consultants will get right back to you

Other articles about Data Engineering


Dativa is a global consulting firm providing data consulting and engineering services to companies that want to build and implement strategies to put data to work. We work with primary data generators, businesses harvesting their own internal data, data-centric service providers, data brokers, agencies, media buyers and media sellers.

145 Marina Boulevard
San Rafael
California
94901

Registered in Delaware

Thames Tower
Station Road
Reading
RG1 1LX

Registered in England & Wales, number 10202531