Date

December 7, 2023

3 min read

How to Deploy a Data Analytic Pipeline in AWS

AWS offers dozens of separate tools for managing huge amounts of data under the four Big Data V’s: Volume, Variety, Velocity and Veracity. The main challenge is connecting the dots to build a comprehensive data pipeline that runs either periodically (in batches) or continuously, as soon as a new event or message enters the analytic system (streams), a.k.a. an event-driven analytic platform.

Author Jose Alvarez Muguerza

Traditionally, data pipelines ran as batch jobs on Hadoop clusters. This could be done on premises, within your own infrastructure, or in the cloud, for example by consuming AWS EC2 or GCP Compute Engine resources.

Hadoop comes in as many flavors as there are vendors. Just to mention a few of them:

Cloudera and its open source version CDH (Cloudera Distribution Hadoop)
Hortonworks Data Platform (HDP)
MapR
IBM and Intel also provide their own services based on the Hadoop framework

And finally, the top Cloud providers, Google and AWS:

AWS EMR
GCP Dataproc

Cluster setup is becoming obsolete because new analytic pipelines are running in Serverless mode.
What does that mean? It doesn’t strictly mean that there are no servers. When the code is running, you of course need a server to run it on.

The main difference with serverless architectures is that, where before you had to set up the needed infrastructure in advance (CPU, memory, disk, etc.), now your code runs regardless of the infrastructure behind the process. Your cloud provider takes care of providing the required infrastructure, and you pay only for the resources your code consumes while it runs. This architecture is often used for real-time data processing. AWS Lambda and AWS Kinesis are good examples of this.
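As a rough illustration of the serverless model described above, here is a minimal sketch of a Lambda function that consumes a batch of Kinesis records. It assumes the usual Kinesis event shape, in which each payload arrives base64-encoded under `Records[].kinesis.data`. The JSON message format and the `event_type` field are hypothetical and only serve the example.

```python
import base64
import json


def handler(event, context):
    """Entry point Lambda would invoke once per batch of Kinesis records."""
    totals = {}
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded in record["kinesis"]["data"].
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # "event_type" is a hypothetical field of our own messages.
        event_type = payload.get("event_type", "unknown")
        totals[event_type] = totals.get(event_type, 0) + 1
    return {"processed": sum(totals.values()), "by_type": totals}


def fake_record(payload):
    """Build a record shaped like the ones Kinesis hands to Lambda (local testing only)."""
    data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


# Local dry run: no AWS account or servers needed to exercise the logic.
sample_event = {
    "Records": [
        fake_record({"event_type": "click"}),
        fake_record({"event_type": "click"}),
        fake_record({"event_type": "view"}),
    ]
}
result = handler(sample_event, None)
print(result)
```

Note that nothing in the function mentions CPU, memory or disk: capacity is configured on the function and supplied by the provider, which is the point of the serverless approach.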

This makes it seem like EMR is an obsolete tool for running batch processes under pre-provisioned infrastructure, and that AWS Lambda is the replacement tool for running real time computation in a serverless architecture, right? Well… that is partially correct or partially incorrect, depending on whether you choose to see the glass as half empty or half full.

On one hand,

AWS Lambda is a “cutting-edge” tool
Lambda suits event-driven platforms and real-time processing
Lambda is serverless

On the other hand,

EMR is still a great tool
EMR is useful for batch processing
EMR is not serverless

However, a great complement to EMR is the AWS Data Pipeline tool. Thanks to AWS Data Pipeline, we can run EMR batch processes on a schedule, in a serverless architecture.
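To make the idea of a scheduled EMR batch more concrete, the sketch below builds a pipeline definition as plain Python data, in the `pipelineObjects` shape that boto3's `put_pipeline_definition` call accepts. It is only an outline under stated assumptions: the object ids, the S3 path, the instance types and the release label are all placeholders, and the roles are the default names that Data Pipeline normally creates, so check each value against your own account before using it.

```python
def field(key, value, ref=False):
    """One field entry; refValue points at another object's id."""
    return {"key": key, "refValue" if ref else "stringValue": value}


def build_emr_pipeline(step, period="1 day"):
    """Return pipeline objects for an EMR batch job that runs on a schedule."""
    return [
        {"id": "Default", "name": "Default", "fields": [
            field("scheduleType", "cron"),
            field("schedule", "DailySchedule", ref=True),
            # Default role names; an assumption about the target account.
            field("role", "DataPipelineDefaultRole"),
            field("resourceRole", "DataPipelineDefaultResourceRole"),
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            field("type", "Schedule"),
            field("period", period),
            field("startAt", "FIRST_ACTIVATION_DATE_TIME"),
        ]},
        {"id": "BatchCluster", "name": "BatchCluster", "fields": [
            field("type", "EmrCluster"),
            # Placeholder release label and instance sizes.
            field("releaseLabel", "emr-6.15.0"),
            field("masterInstanceType", "m5.xlarge"),
            field("coreInstanceType", "m5.xlarge"),
            field("coreInstanceCount", "2"),
            # The cluster exists only for the run, then is torn down.
            field("terminateAfter", "2 Hours"),
        ]},
        {"id": "BatchJob", "name": "BatchJob", "fields": [
            field("type", "EmrActivity"),
            field("runsOn", "BatchCluster", ref=True),
            field("step", step),
        ]},
    ]


# Hypothetical step: the jar name followed by comma-separated arguments.
pipeline_objects = build_emr_pipeline(
    "command-runner.jar,spark-submit,s3://example-bucket/jobs/daily_report.py"
)
print([obj["id"] for obj in pipeline_objects])
```

With boto3 installed and credentials configured, a list like this could be passed as `pipelineObjects` to `put_pipeline_definition` on a `datapipeline` client and then started with `activate_pipeline`. The cluster is declared as a resource of the pipeline rather than provisioned by hand, which is what lets the batch run on a schedule without a standing cluster.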

That is not the only purpose of AWS Data Pipeline, though. It also lets us connect all the dots (EMR, Lambda, S3, RDS, Glue, SNS, etc.) that we referred to at the beginning of this post, in a simple and straightforward way, thanks to its intuitive graphical user interface.

We encourage you to explore the official AWS Data Pipeline documentation site, as well as Edureka’s free educational content on YouTube.
