Last year I wrote about how Dativa makes the best technology choices for their clients from an ever-increasing number of data-processing options across three highly competitive cloud platform vendors. As an Amazon Web Services Consulting Partner, Dativa has been recognized by AWS for their ability to design, architect, build, migrate, and manage data services on the Amazon cloud. Therefore, in this post I want to examine more deeply the AWS services we deploy most often for clients looking to migrate their data workloads to the cloud.
The role of data architecture is to gather and prepare information in a form that allows data scientists to perform their tasks quickly and efficiently. As the data sets our clients use are often extremely large, scalability and performance of the data architecture are critical. For this reason we tend to favor ‘serverless’ deployments: not only do they provide scalability and resilience, but they also significantly reduce system administration costs over the lifetime of the deployment.
A fully-featured AWS data pipeline architecture deployed by us might look something like this:
Simpler deployments might include only a subset of the above features, but in each case the resources used will have been carefully chosen for their intended application; short code sketches showing how we typically drive several of these services follow the table:
| Resource | Description | Applications |
| --- | --- | --- |
| Amazon S3 | Highly durable and available cloud object store. | Often used for data delivery (import and export) as well as for storing interim data during data processing. Can also be used as a data lake for storing large amounts of structured or unstructured data. |
| AWS Lambda | Serverless environment to run code without provisioning or managing servers. | We typically use Python as the runtime environment. Execution time is limited to a maximum of 5 minutes, although compute power and memory can be significantly increased if required. Long-running or asynchronous processing is usually better done with AWS Batch. Our Data Pipeline API, used for cleansing and tokenizing, also runs on Lambda. |
| AWS Batch | Dynamically provisioned compute resources for running large numbers of batch jobs. | While not ‘serverless’ in the generally understood sense, AWS Batch does all the provisioning and scaling of compute resources automatically, allowing jobs to be scheduled and executed efficiently with minimal administration. Using a Lambda-like container, we schedule jobs in much the same way as the Lambda service does, with the advantage that they can run for as long as we like. |
| AWS Glue | Managed extract, transform and load (ETL) service. | Data catalogs generated by Glue can be used by Amazon Athena. Glue jobs can prepare and load data to S3 or Redshift on a scheduled or manual basis. The underlying technology is Apache Spark, and the generated ETL code is customizable, allowing flexibility such as invoking Lambda functions or other external services. |
| Amazon Athena | Interactive query service allowing analysis of data in S3 using standard SQL. | Integrated with AWS Glue catalogs. Serverless, with no data warehouse and therefore no ETL required. We use this a lot for one-off analysis of large or small data sets that would otherwise require far more time and infrastructure to analyze by conventional means. |
| Amazon Redshift | Managed data warehouse allowing complex analytic queries to be run using standard SQL and business intelligence tools. | Redshift’s big attraction is its fast and consistent query performance, even across extremely large data sets, with data loading scaling linearly with cluster size. An added bonus is the ability to create temporary Spectrum tables to query data in S3, allowing us to easily perform transformation and loading of data within Redshift itself. |
| Amazon Redshift Spectrum | Direct running of SQL queries against large amounts of unstructured data in S3 with no loading or transformation. | As we’ve shared in a previous article, we sometimes use Redshift Spectrum in place of staging tables, avoiding the need to physically load data into Redshift before transformation. Also, as with Athena, Redshift Spectrum is an efficient way of quickly performing one-off analysis of large or small data sets without the need to load them. |
| Amazon API Gateway | Service for creating, maintaining, monitoring, and securing APIs at any scale. | API Gateway can be used to provide a RESTful API into a data lake or warehouse, querying data in real time using Lambda or Athena. For longer queries (API Gateway has a 30-second timeout) we put asynchronously invoked Lambda calls or AWS Batch behind the API. Our Data Pipeline API cleansing and tokenizing service is implemented in this way. |
| Amazon Simple Queue Service (SQS) | Managed message queuing service. | The real pipe of the data pipeline, used primarily to decouple different services to improve scalability and reliability. Also useful for batching streamed data. SQS is a polled service, as opposed to SNS, which is subscription based. |
| Amazon Simple Notification Service (SNS) | Managed publish/subscribe notification service. | Beyond the obvious use of distributing alarms and other status notifications, this service can also be used to deliver data to clients (e.g. via an HTTPS subscription) and to send batches of data for processing by Lambda functions. |
| Amazon DynamoDB | Fast, flexible, non-relational database with low latency and automatic scaling. | Depending on the application, we sometimes find that storing logs and status in DynamoDB gives us greater flexibility for report generation and service monitoring. Its low latency and autoscaling mean there is practically no administrative overhead, and the Time-To-Live feature means we can expire (or archive) old data automatically. |
| Amazon CloudWatch | Monitoring service for AWS cloud resources and applications. | Custom metrics allow us to monitor the performance of every part of our data architecture, with alarms generated for significant events. Log files allow easy debugging and basic monitoring of services where a little more detail is required. |
| AWS CloudTrail | Allows governance, compliance, and auditing of AWS resources. | A much more rigorous form of audit than CloudWatch, this service allows logging, continuous monitoring, and retention of all account activity across your AWS infrastructure. |
| Amazon Glacier | Low-cost, durable and secure storage service for data archiving. | We use this service for long-term archiving of logs and data, usually for audit purposes. Objects in S3 can be configured to be automatically archived to Glacier after a set period of time, significantly reducing storage costs. |
| AWS CloudFormation | A common format to describe and provision all AWS cloud infrastructure. | We love CloudFormation! Imagine being able to define and deploy an entire data center from a single text file: that is what it can do for you. The most powerful aspect is having your cloud infrastructure under version control and being able to deploy and update it from your CI/CD pipeline. This lets us automatically spin up development, test, and staging instances of all of our deployments and run thorough automated test cycles before we even consider going live with any changes to our production deployments. |