Slides from the August 2021 St. Louis Big Data IDEA meeting, presented by Sam Portillo. The presentation covers AWS EMR, including comparisons to similar services and lessons learned. A recording is available in the comments for the meeting.
IPPON 2020
IPPON 2019
Introduction.
Data engineer at Ippon
Technologies, a boutique
consulting firm specializing in
Data, Cloud, DevOps, and Full
Stack Application
Development.
Ippon has deep expertise
across all major cloud
platforms:
★ AWS
★ Azure
★ GCP
Delivers Data Initiatives
Architecture consulting from scoping
Project delivery from POC
Robust client portfolio
Led by Peter Choe
We’re hiring!
About Ippon’s Data Practice
Engineer with hands-on
experience with AWS and
Azure across multiple data
projects.
Currently working with
Cassandra (NoSQL) and
AWS compute and analytics
services.
Fun fact: My team won first
place at the 2020 Virginia
Governor's Datathon
I also organize the
Richmond (VA) Data
Engineering Meetup
Sam Portillo
Data Engineer
Agenda.
1. Introduction
2. Architecture and Design
3. Contrast with Similar Services
4. Lessons Learned
5. Q&A
Background on Big Data Computing.
❖ Pre 1994 - mainframe era
❖ 1994 to 2010 - cluster era
➢ Identical machines connected to the same network to run
workloads
➢ Price varies from 5-6 figures for hardware alone
❖ 2000s to present - GPU era
➢ GPU processing accelerates specific types of
problems (ML training, gaming graphics, etc.)
➢ Hardware can also vary from 5-6 figures
❖ 2010s to present - Cloud era
➢ Introduces pay-as-you-go IaaS and PaaS; firms can run
Big Data applications without buying/managing hardware
Benefits of IaaS.
❖ No longer need to buy/manage
hardware
❖ “Wait minutes not months”
❖ Pay-as-you-go beats financing
hardware in most cases
❖ Scale without physical hardware
limits
❖ User needs to manage OS,
middleware, applications
Benefits of PaaS.
❖ Cloud provider manages OS,
database, development tools
❖ Can be cheaper than IaaS
❖ Faster development time because
software comes preconfigured
❖ User still needs to manage applications,
data, etc.
Managed Cluster Services.
❖ PaaS offering that allows cloud customers to easily run Big Data
workloads
❖ Support for open source tools like Spark and Hadoop
❖ Examples are
➢ AWS Elastic MapReduce
➢ Azure HDInsight
➢ Google Cloud Dataproc
Core Benefits of Managed Cluster Services.
❖ Cost savings associated with PaaS
❖ Ease of use for developers
❖ Integration with other cloud services
❖ Elastic
➢ Scales up and down as needed
❖ Flexibility
➢ Users can customize environments to solve a variety of problems
Use Cases.
❖ ETL Pipelines
➢ Build reliable data pipelines with Spark
❖ Big data migration
➢ Utilize the power of a cluster to transfer data in a distributed way
❖ Interactive analytics
➢ Quickly ingest terabytes of data to do initial analysis
❖ Machine learning
➢ Train ML models with Tensorflow or Spark MLlib
Takeaway.
❖ Managed cluster services can be versatile and used in different workloads
❖ They offer the benefits that cloud services are typically known for
❖ A good understanding of architecture principles will help an engineer
support a variety of projects
Architecture in AWS EMR.
❖ Each node is an EC2 instance
❖ Master node
➢ Responsible for coordinating applications
among the cluster
➢ Runs driver component for Spark apps
➢ SSH access
❖ Core nodes
➢ Do heavy lifting for applications
➢ Store data for Hadoop Distributed File
System
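As a concrete illustration, a cluster like the one described above can be launched from the AWS CLI. This is a sketch, not a production setup: the cluster name, key pair, log bucket, and instance types are placeholders, and the release label is just one of the versions available around the time of this talk.

```shell
# Launch a small EMR cluster: 1 master node + 2 core nodes, with Spark and
# Hadoop installed. Names, the key pair, and the log bucket are placeholders.
aws emr create-cluster \
  --name "demo-cluster" \
  --release-label emr-6.3.0 \
  --applications Name=Spark Name=Hadoop \
  --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
      InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-log-bucket/emr-logs/
```

The master node coordinates the cluster (and hosts the Spark driver), while the two core nodes do the heavy lifting and store HDFS data, matching the roles on this slide.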
How distributed computing speeds up workloads.
❖ Workloads get distributed among a cluster
❖ Hadoop MapReduce
❖ Heavily relies on disk; stores data back to HDFS after each operation
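To make the map/shuffle/reduce flow concrete, here is a stdlib-only toy in plain Python (no Hadoop involved) that counts animal occurrences the way a MapReduce word count would: mappers emit `(animal, 1)` pairs per line, the shuffle groups pairs by key, and reducers sum each group.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit an (animal, 1) pair for each animal on the line
    return [(animal, 1) for animal in line.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across the mappers
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each animal
    return {key: sum(values) for key, values in groups.items()}

lines = ["cat dog cat", "dog dog bird"]
counts = reduce_phase(shuffle([map_phase(line) for line in lines]))
print(counts)  # {'cat': 2, 'dog': 3, 'bird': 1}
```

In real Hadoop MapReduce, each of these stages would run across many machines, with intermediate results written back to HDFS between operations, which is exactly the disk overhead this slide points out.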
How distributed computing speeds up workloads.
❖ Apache Spark uses a directed acyclic graph (DAG) to plan workloads
❖ Advantages over MapReduce:
➢ Utilizes RAM and caches datasets between map-reduce jobs
➢ Doesn’t need to write to disk between map-reduce jobs
➢ Rich API that allows for dataset transformations
➢ Potentially 100 times faster than Hadoop MapReduce.
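The lazy, DAG-based planning above can be illustrated with a small toy class. This is not the Spark API, just a sketch of the idea: transformations only record a plan, nothing executes until an action is called, and a cached dataset keeps its result in memory instead of being recomputed.

```python
class LazyDataset:
    """Toy illustration of Spark-style lazy evaluation (not the real Spark API).

    Transformations (map/filter) only extend a recorded plan; work happens
    when an action like collect() runs, and cache() memoizes the result.
    """

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []  # the recorded DAG of transformations
        self._cached = None

    def map(self, fn):
        # Transformation: returns a new dataset with an extended plan, no work yet
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    def cache(self):
        # Mark this dataset so its result is kept in memory after first compute
        self._cached = "pending"
        return self

    def collect(self):
        # Action: execute the whole recorded plan now
        if self._cached not in (None, "pending"):
            return self._cached
        result = list(self._data)
        for op, fn in self._plan:
            if op == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        if self._cached == "pending":
            self._cached = result
        return result

nums = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
print(nums.collect())  # [0, 4, 16]
```

Because the plan is known up front, Spark can optimize the whole chain and keep intermediate datasets in RAM rather than writing to disk between stages, which is where the speedup over MapReduce comes from.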
Databricks.
❖ Databricks is a managed Spark service
➢ Databricks manages compute
➢ Workflow automation and data pipelines
➢ Integrated workspace
❖ A managed cluster service, by contrast, just runs Spark jobs
ETL Tools.
❖ Can have similar ETL functions, but managed cluster
services offer more than ETL
❖ ETL with dedicated ETL tools is generally much easier to configure,
but less flexible
❖ AWS Glue works on top of a Spark environment
Batch Services.
❖ Batch services are meant to run any batch computing job
at any scale.
❖ They may be able to do some ETL use cases
❖ Not cluster services; they don’t run Spark or Hadoop
❖ Not an environment for interactive analytics
Things to keep in mind.
1. Tuning can be difficult
2. Adopt best practices early
3. Assess what parts of a workload warrant a cluster service
4. Know your data
5. Explore different options before committing to a solution
Connect with me :D
❖ Sam Portillo
❖ https://www.linkedin.com/in/portillosc
❖ Email: sportillo@ippon.fr
Introduce myself and stuff
Talk about rvade maybe or the cats
Been with Ippon for about a year
Things I like
I’m on the QOMPLX project and use a lot of AWS services and Cassandra
I like Python
Skim over this but focus on the about me
Offer:
Ippon helps companies deliver Data initiatives, from massive workloads (Fast Data) to massive storage (Big Data).
Services:
Architecture consulting from scoping
Data Engineering industrialization
Project delivery from POC
US Reference:
Ippon is delivering the full Data capability of SwissRe in the US, from real-time analysis to legal long-term storage in a secure Cloud.
Read off slides
Small demo with architecture and design
Pre 1994 - supercomputer/mainframe; a single big computer
1994-2010 - Linux got popular, and this era started with the Beowulf project. Folks started buying commodity hardware with two network cards and distributing workloads
Present - there’s still benefit in on-prem hardware, which is why the GPU era is still relevant. I know CarMax has on-prem resources for their machine learning workloads. Makes sense when you do the quick math between running a workload in the cloud vs. buying the hardware (not the same as managing an entire data center).
Acknowledge the overlap
Examples of these are Amazon EC2, Azure VMs, GCP Compute Engine
AWS Elastic Beanstalk is a PaaS: an orchestration service offered by Amazon Web Services for deploying applications, which coordinates various AWS services
Here’s an overview of IaaS, PaaS, and SaaS. I see new as-a-service acronyms each year, but most cloud services fall into these categories
Most cloud certification courses ask questions about these
A SaaS example would be the Microsoft Office 365 suite
Tell audience we’re focusing on EMR because I’ve used it on a couple projects within the Data Practice
Not a sales pitch for EMR; I’ve just used it a lot
Ease of use: easy for devs to get started with; learning curve is pretty low
Use “emerging problems”
Any questions so far?
At its core, EMR is a platform as a service that offers on-demand distributed computing clusters. This service comes with benefits such as scalability and cost savings that AWS services are typically known for. AWS also manages the installation of a variety of popular distributed computing and data frameworks like Spark, Hadoop, Tensorflow, and many more. This makes EMR especially versatile, as engineers can work on almost any type of problem without too much trouble switching contexts.
With EMR being a cornerstone for many workloads, it’s an incredibly helpful tool to keep in the toolbox. A strong understanding of its architecture principles, coupled with its variety of popular frameworks available, makes it a good choice for emerging problems.
Ignore the “MapR node” thing. I stole this image
Most services are phasing out Hadoop verbiage on their product pages, and Apache retired some Hadoop-related projects recently
Explanation of graphic: say we want to count the occurrences of each animal in this input dataset
In the Map phase, each line gets parsed and all of the animal occurrences are extracted. The output of each step of the Map phase is just a list of each animal with a 1 associated with it.
Now that we have all of the animal occurrences by line, we want a count of occurrences across the dataset for each animal.
In the Reduce phase, we combine all of the same animals from different mappers and "reduce" them to a single animal and a count.
Any questions?
Introducing Spark
Go straight to the console
There’s a lot of similar services out there.
I’d like to clear up some differences.
The bottom line will be that the service you choose depends on your workload
Spark tuning is really hard; figure out what needs to be done correctly, otherwise you may end up just pulling a bunch of different levers
Whenever working with a new service or framework, get to know the best practices. You do not want to be three weeks into something when you realize you’re doing something that’s an anti-pattern
There are a lot of components in an ETL pipeline, but just because one part needs to be done in Spark doesn’t mean the whole thing does
Self-explanatory: exhaustively study your data to make sure nothing will come in to break your pipeline/business logic. I’ve found hundred-line SQL statements in breach data
Don’t fall into the hype of the latest and greatest service. Extensively research your use case. There’s a lot of ways to get something done; make sure you pick the right tool for the job
If you like to talk about data, tinker with tools/technology/services, OR like to throw it down on the dance floor: connect with me, let’s be friends
I also welcome philosophical debate, whether it’s about material I presented, thoughts on architecture paradigms/strategies, or the future of AWS/data.