Best Practice Data Architectures in 2017

Clive Skinner, Thu 28 September 2017

We have since published a more detailed architecture for data pipelines on AWS.

With an ever-increasing number of technologies available for data processing and three highly competitive cloud platform vendors, we at Dativa have to stay on top of exactly which technologies are the best choices for our clients.

Most of the work we undertake involves processing and loading data into a data lake, providing post-processing on top of that, and then reporting on it. We've developed a standard data processing pipeline architecture that covers both historical data analysis in a data lake and real-time data processing through a separate stream.

A standard data pipeline architecture

[Diagram: standard data pipeline architecture]

There are many options for how we can implement this in 2017, but they broadly fall into four categories:

  • Amazon-centric using the AWS platform
  • Google-centric using the Google Cloud Platform
  • Microsoft-centric using the Azure platform
  • Platform independent using open source software

Most of our customers tend to align themselves with one of the cloud vendors and then take some components from the platform-independent option.

Best-of-breed options for each platform

Each component is covered below, together with the options for implementing it using each platform's standard tools.

For each component below, the description is followed by the technology options for each context, listed in the order: Amazon Web Services | Google Cloud | Microsoft Azure | platform independent. A dash marks a context for which we have no preferred option on that platform.
Data Source
Point of delivery, or source, of batched or streamed data to be ingested into the system. Any data transfer undertaken is via a secure mechanism, e.g. HTTPS, SFTP or SCP.

  • Pull small batch: AWS Lambda | Cloud Functions | Azure Functions | Python (over HTTPS, SFTP, SCP etc.)
  • Pull large batch: AWS Batch | - | Azure Batch | -
  • Push batch: Amazon S3 | Cloud Storage | Azure Storage | -
  • Streamed: Amazon Kinesis | Cloud Pub/Sub, Cloud Dataflow | Azure Stream Analytics, Azure Event Hub | -
  • Discover and map: AWS Glue | Cloud DataPrep | Azure Data Catalog | -
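To make the "pull small batch" context concrete, here is a minimal sketch of a scheduled AWS Lambda handler in Python that fetches a file over HTTPS and drops it into S3; the source URL, bucket and key prefix are hypothetical.

```python
import datetime

import boto3
import requests

S3_BUCKET = "example-ingest-bucket"                  # hypothetical bucket
SOURCE_URL = "https://example.com/export/daily.csv"  # hypothetical source

s3 = boto3.client("s3")

def handler(event, context):
    """Scheduled Lambda entry point: pull a small batch over HTTPS into S3."""
    response = requests.get(SOURCE_URL, timeout=30)
    response.raise_for_status()

    # A timestamped key keeps each pull as an immutable raw object.
    key = "raw/daily/{:%Y-%m-%d-%H%M}.csv".format(datetime.datetime.utcnow())
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=response.content)
    return {"written": key}
```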
Orchestration
Optimised automation of the processes and workflow required to extract, clean, transform, aggregate and load the Source Data into the Data Lake or Warehouse.

  • Distributed services: AWS Step Functions | App Engine | Microsoft Flow | Apache Airflow
  • Complex processing: AWS Data Pipeline | Cloud Dataflow | Azure Data Factory | -
  • Managed service: AWS Glue | Cloud DataPrep | Azure Data Catalog | -
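On the platform-independent side, the same extract-clean-load workflow can be expressed as an Apache Airflow DAG. A minimal sketch; the pipeline.tasks module and its three callables are hypothetical stand-ins for your own task code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical module holding the actual task implementations.
from pipeline.tasks import extract_fn, clean_fn, load_fn

dag = DAG(
    "daily_ingest",
    start_date=datetime(2017, 9, 1),
    schedule_interval="@daily",
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)

extract = PythonOperator(task_id="extract", python_callable=extract_fn, dag=dag)
clean = PythonOperator(task_id="clean", python_callable=clean_fn, dag=dag)
load = PythonOperator(task_id="load", python_callable=load_fn, dag=dag)

extract >> clean >> load  # run the three steps strictly in sequence
```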
Pre-processing and Loader
Loading of the extracted, cleansed and transformed data into the Data Lake or Warehouse. The frequency and strategy (e.g. append, replace, etc.) used will vary depending on the business case.

  • Scripted loading: AWS Lambda | Cloud Functions | Azure Functions | Python, Apache Spark, Apache Hive
  • Managed loading: AWS Glue | Cloud DataPrep | Azure Data Catalog | -
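For scripted loading with Apache Spark, a small PySpark job can read the raw batch, apply light transformations and append the result to the lake. A sketch only; the S3 paths and the append strategy are assumptions chosen to match the ingest example above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("load_raw_batch").getOrCreate()

# Hypothetical raw drop zone written by the ingest step.
raw = spark.read.csv("s3://example-ingest-bucket/raw/daily/", header=True)

cleaned = (
    raw.dropDuplicates()
       .withColumn("load_date", F.current_date())  # partition column
)

# Append strategy; replace or merge would suit other business cases.
(cleaned.write
        .mode("append")
        .partitionBy("load_date")
        .parquet("s3://example-data-lake/events/"))
```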
Cleansing Service
Filtering of data to improve reliability and quality by correcting errors, checking for consistency and testing against pre-defined acceptance criteria.

  • Scripted: AWS Lambda | Cloud Functions | Azure Functions | Python, Pipeline API
  • Managed service: Pipeline API (platform independent)
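A scripted cleansing step in Python might correct obvious errors and then enforce an acceptance criterion before anything reaches the lake. A sketch using pandas; the column names and loss threshold are hypothetical.

```python
import pandas as pd

def cleanse(df: pd.DataFrame, max_loss: float = 0.1) -> pd.DataFrame:
    """Correct errors, then test the batch against an acceptance criterion."""
    original = len(df)

    df = df.drop_duplicates()
    df["country"] = df["country"].str.strip().str.upper()        # normalise values
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # fix bad numbers
    df = df.dropna(subset=["country", "amount"])

    # Acceptance criterion: reject the whole batch if too many rows were lost.
    if len(df) < (1 - max_loss) * original:
        raise ValueError("batch rejected: too many rows failed cleansing")
    return df
```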
Fact Generation
Aggregation and processing of data to improve query efficiency for Historical Reporting and Other Tools.

  • Small scale: AWS Lambda | Cloud Functions | Azure Functions | SQL/HiveQL, Python, Presto
  • Large scale: AWS Batch | - | Azure Batch | -
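Small-scale fact generation is often just an aggregation job. In PySpark it could look like the sketch below, pre-computing a daily fact table so reporting queries stay cheap; table locations and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_facts").getOrCreate()

events = spark.read.parquet("s3://example-data-lake/events/")  # hypothetical path

# Pre-aggregate events to one row per day and channel.
daily_facts = (
    events.groupBy(F.to_date("event_time").alias("event_date"), "channel")
          .agg(F.count(F.lit(1)).alias("events"),
               F.countDistinct("device_id").alias("devices"))
)

daily_facts.write.mode("overwrite").parquet("s3://example-data-lake/facts/daily/")
```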
Data Lake
Storage repository that can hold large amounts of structured, semi-structured and/or unstructured data. Where only structured data is stored, a data warehouse may be used instead to improve query efficiency.

  • Structured data: Amazon Redshift | BigQuery | Azure SQL Data Warehouse | Apache Hadoop, Apache Hive
  • Semi-structured, unstructured data: Amazon S3 | Cloud Storage | Azure Storage | -
  • Any data: Amazon EMR | Cloud Dataflow, Cloud Dataproc | Azure HDInsight (Hadoop) | -
Machine Learning Data Enrichment
Discovering known properties of data sets, or learning from and making predictions about previously unknown properties of data sets.

  • Managed service: Amazon Machine Learning | Cloud Machine Learning Engine | Azure Machine Learning | Apache Spark MLlib
  • Scripted: AWS Batch, Amazon EMR | Cloud Dataflow, Cloud Dataproc | Azure Batch, Azure HDInsight (Hadoop) | -
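As a platform-independent illustration, a Spark MLlib pipeline can learn a simple model from facts already in the lake and write the predictions back as an enrichment. The feature columns, label and paths are all hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn_enrichment").getOrCreate()

history = spark.read.parquet("s3://example-data-lake/facts/accounts/")

assembler = VectorAssembler(
    inputCols=["events_30d", "devices", "tenure_days"],  # hypothetical features
    outputCol="features",
)
model = Pipeline(stages=[
    assembler,
    LogisticRegression(labelCol="churned", featuresCol="features"),
]).fit(history)

# Score the current population and store the enrichment alongside the facts.
scored = model.transform(history).select("account_id", "prediction", "probability")
scored.write.mode("overwrite").parquet("s3://example-data-lake/enriched/churn/")
```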
Historical Reporting
Analytics tools allowing visualisation and reporting of historical (i.e. non-real-time) data.

  • Platform: Amazon QuickSight | Data Studio | Power BI | Tableau, Periscope, Qlik
  • Third-party: Tableau, Periscope, Qlik (any platform)
Other Tools
Interfaces allowing third-party tools and services access to historical data.

  • Structured data APIs: Amazon API Gateway | Cloud Endpoints | Azure API Management | Apache Impala, Apache Spark SQL
  • Semi-structured and unstructured data queries: Amazon Athena, Amazon Redshift Spectrum | BigQuery | Azure Data Lake Analytics | -
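On AWS, for example, Amazon Athena lets tools query semi-structured data sitting in S3 directly, and the same query can be issued from Python with boto3. The database, table and output location below are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical Athena database and table defined over JSON files in S3.
response = athena.start_query_execution(
    QueryString=("SELECT channel, count(*) AS events "
                 "FROM events_raw WHERE dt = '2017-09-28' GROUP BY channel"),
    QueryExecutionContext={"Database": "example_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Athena runs asynchronously: poll get_query_execution until it completes.
print(response["QueryExecutionId"])
```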
Real Time Data Shipping
Either batched or streamed information collected and delivered in near real-time.

  • Streamed data: Amazon Kinesis | Cloud Pub/Sub, Cloud Dataflow | Azure Stream Analytics, Azure Event Hub | Apache Spark Streaming, Logstash/Beats, Apache Storm
  • Log data: Amazon CloudWatch | Cloud Monitoring, Cloud Logging | Azure Application Insights, Azure Log Analytics | -
  • Managed service: Logstash/Beats (platform independent)
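On the AWS side, a producer ships a streamed record into Amazon Kinesis with a single put. A minimal sketch with a hypothetical stream name and event shape:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def ship(event):
    """Send one event to a Kinesis stream, partitioned by device."""
    kinesis.put_record(
        StreamName="example-events",             # hypothetical stream
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["device_id"],         # keeps a device's events ordered
    )

ship({"device_id": "abc123", "event": "play", "ts": "2017-09-28T10:00:00Z"})
```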
Real Time Data Store
Storage repository for real-time data, either shared with historical data or stored separately.

  • Structured data: Amazon Redshift | BigQuery | Azure SQL Data Warehouse | Apache Hadoop, Elasticsearch/Beats, Splunk
  • Semi-structured, unstructured data: Amazon S3 | Cloud Storage | Azure Storage | -
  • Managed service: Elasticsearch/Beats, Splunk (platform independent)
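Platform-independently, the same events could land in Elasticsearch for "live" querying. A sketch using the official elasticsearch-py client of this era; the host and daily index naming are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # hypothetical host

doc = {"device_id": "abc123", "event": "play", "ts": "2017-09-28T10:00:00Z"}

# One index per day keeps retention management simple for real-time data.
es.index(index="events-2017.09.28", doc_type="event", body=doc)
```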
Real Time Reporting
Typically 'live' dashboards or reports based on the most recently delivered real-time data.

  • Streamed data: Amazon Kinesis Analytics | Cloud Pub/Sub, Cloud Dataflow, Cloud DataPrep | Azure Stream Analytics, Azure Event Hub | Kibana, Splunk
  • Managed service: Kibana, Splunk (platform independent)
Containers
Software packaged to run in isolation on a shared operating system, guaranteeing that it will always run the same regardless of where it is deployed.

  • Managed service: EC2 Container Service | Container Engine, Container Registry | Azure Container Service | Docker
Autoscaling
Allows automated scaling of services up or down based on definable conditions, balancing performance against resource cost.

  • Managed service: Auto Scaling | Autoscaler | Azure Autoscale | Platform specific (e.g. Apache CloudStack, Docker Swarm etc.)
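On AWS, for instance, a simple scale-out policy takes a few lines of boto3; the group and policy names are hypothetical, and the CloudWatch alarm that triggers the policy is configured separately.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical worker group: add one instance each time the linked alarm fires.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="example-workers",
    PolicyName="scale-out-on-queue-depth",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,  # seconds to wait before scaling again
)
```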
