AWS offers dozens of isolated tools for managing huge amounts of data under the four V’s of Big Data: Volume, Variety, Velocity and Veracity. The main challenge is how to connect the dots in a way that generates a comprehensive Data Pipeline which runs either periodically (in batches) or constantly, as soon as a new event or message enters the Analytic system (streams) , a.k.a. an Event Driven Analytic Platform.
Traditionally, data pipelines ran batches in Hadoop clusters. It could be done on premise, within your own infrastructure, or in the cloud, for example, by consuming AWS EC2 or GCP Compute Engine resources.
Hadoop comes in as many flavors as there are vendors. Just to mention a few of them:
- Cloudera and its open source version CDH (Cloudera Distribution Hadoop)
- Hortonworks Data Platform (HDP)
- IBM & Intel also provide their own service based on Hadoop Framework
And finally, the top Cloud providers, Google and AWS:
- AWS EMR
- GCP Dataproc
Cluster setup is becoming obsolete because new analytic pipelines are running in Serverless mode.
What does that mean? It doesn’t strictly mean that there are no servers. When the code is running, you of course need a server to run it on.
The main difference with Serverless architectures is that, where before you had to preemptively setup the needed infrastructure (cpu, memory, disk, etc), now your code runs no matter which infrastructure is behind the process. Your cloud provider takes care of providing the required infrastructure and you pay only for the costs of the code you execute on it. Architecture is often used for real time data processing. AWS Lambda and AWS Kinesis are good examples of this.
This makes it seem like EMR is an obsolete tool for running batch processes under pre-provisioned infrastructure, and that AWS Lambda is the replacement tool for running real time computation in a serverless architecture, right? Well… that is partially correct or partially incorrect, depending on whether you choose to see the glass as half empty or half full.
On one hand,
- AWS Lambda is a “cutting-edge“ tool
- Lambda works for event-driven platforms, real time processing.
- Lambda is serverless
On the other hand,
- EMR is still a great tool
- EMR is useful for batch processing
- EMR is not serverless
But, a great complement for EMR is the AWS Data Pipeline Tool. Thanks to AWS Data Pipeline we can run EMR Batch processes on a schedule, in a serverless architecture.
That is not the only purpose of AWS Data Pipeline, though. It also allows us to connect all the dots (EMR, Lambda, S3, RDS, Glue, SNS, etc) that we referred to at the beginning of this post, in a simple and straightforward way, thanks to its intuitive graphical user interface.