DATA ENGINEERING QUICK GUIDE
ASIM JALIS
GALVANIZE
BIG DATA
WHY HADOOP?
How can we create a supercomputer
Using cheap Linux boxes?
WHAT IS HADOOP?
Operating system for cluster of machines
Combines small weak computers
To create Big Data system
Unified disk and processing power
WHY HDFS?
How can we store a petabyte-sized file
Using cheap Linux boxes?
WHAT IS HDFS?
Split petabyte file into 128 MB blocks
Distribute blocks across Hadoop cluster
Make 3 copies of each block for insurance
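The three steps above can be sketched in a few lines of Python. This is only an illustration: the round-robin placement rule and the node names are assumptions made for the example, not how HDFS actually chooses where replicas go.

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB, the block size quoted above
REPLICAS = 3                 # copies kept of each block


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file of file_size bytes."""
    return (file_size + block_size - 1) // block_size  # integer ceiling


def place_blocks(n_blocks: int, nodes: list, replicas: int = REPLICAS) -> dict:
    """Assign each block to `replicas` distinct nodes, round-robin.

    Hypothetical placement rule, for illustration only.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replicas)]
        for block in range(n_blocks)
    }


one_petabyte = 1024**5
print(block_count(one_petabyte))                        # 8388608
print(place_blocks(4, ["n1", "n2", "n3", "n4", "n5"]))  # block -> 3 nodes
```

Losing any one node still leaves two copies of every block it held.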
WHY MAPREDUCE?
How can we process the data in HDFS
Without pulling it out and pushing the result back?
WHAT IS MAPREDUCE?
Send program to where the data is on HDFS
Process petabyte file by processing each block
Then combining the result
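A minimal single-machine sketch of that idea, assuming the job is to total some sale amounts. The sample data and the names `map_block` and `combine` are made up for the example; a real MapReduce job would run the per-block step on the machine that holds the block.

```python
from functools import reduce

# Hypothetical data: each inner list stands for one HDFS block of sale amounts.
blocks = [
    [120, 450, 90],
    [700, 30],
    [410, 55, 500],
]


def map_block(block: list) -> int:
    """Runs where the block lives: shrink the block to a partial total."""
    return sum(block)


def combine(left: int, right: int) -> int:
    """Merge two partial results into one."""
    return left + right


partials = [map_block(b) for b in blocks]   # one small result per block
total = reduce(combine, partials, 0)        # combine the per-block results
print(partials, total)                      # [660, 730, 965] 2355
```

Only the small partial totals travel over the network, not the blocks themselves.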
WHY HIVE?
How can people who don’t know Java
Write MapReduce jobs?
WHAT IS HIVE?
Hive translates SQL to MapReduce jobs
HIVE
SELECT *
FROM sales
WHERE amount > 400;
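To see what this query returns without a cluster, the same SQL can be tried on a toy table. The sketch below uses Python's built-in SQLite as a stand-in for Hive; the `sales` columns and rows are invented for the example, and SQLite runs the query directly rather than translating it to MapReduce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")  # assumed schema
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 250.0), (2, 410.0), (3, 999.5), (4, 400.0)],
)

# Same query as on the slide; sorted in Python so the output order is fixed.
high_sales = sorted(
    conn.execute("SELECT * FROM sales WHERE amount > 400").fetchall()
)
print(high_sales)  # [(2, 410.0), (3, 999.5)]
conn.close()
```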
WHY PIG?
How can people who don’t know Java or SQL
Write MapReduce jobs?
WHAT IS PIG?
Pig translates PigLatin to MapReduce jobs
PigLatin is a scripting language comparable to SQL
PIG
high_sales =
FILTER sales_data
BY amount > 400;
WHY SPARK?
How can we make MapReduce faster
And the API less clunky?
WHAT IS SPARK?
Spark is like MapReduce
Spark has a cleaner API and is faster
Faster because it keeps intermediate results in memory
SPARK
sc.textFile("shakespeare.txt").
flatMap(line => line.split("\\W+")).
map(word => (word,1)).
reduceByKey((count1,count2) => (count1 + count2)).
saveAsTextFile("output")
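For readers new to these operators, here is a rough plain-Python walk-through of the same three stages on two sample lines. It is not Spark code: the sample input is made up, everything runs on one machine, and empty strings from the split are dropped for simplicity.

```python
import re
from collections import defaultdict

# Stand-in for the contents of shakespeare.txt (assumed sample lines).
lines = ["To be, or not to be", "that is the question"]

# flatMap: split each line on runs of non-word characters, then flatten.
words = [w for line in lines for w in re.split(r"\W+", line) if w]

# map: pair each word with a count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts that share the same word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

The intermediate lists `words` and `pairs` stay in memory between stages, which is the point the slide makes about Spark's speed.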
WHY SPARK SQL?
How can people who don’t know Scala, Python, or Java
Write Spark code?
WHAT IS SPARK SQL?
Spark SQL is like Hive for Spark
Hive translates SQL to MapReduce
Spark SQL translates SQL to Spark
SPARK SQL
SELECT *
FROM sales
WHERE amount > 400;
WHAT IS SPARK MLLIB?
Machine Learning algorithms on Spark
Analyze data to extract insights
WHAT IS MACHINE LEARNING?
Technique        Question
Regression       Predict revenue next month
Classification   Is tumor cancerous or benign
Clustering       Which customers are similar to each other
Recommendation   Which movie will you like
REAL-TIME TECHNOLOGIES
WHAT IS THE DIFFERENCE BETWEEN REAL-TIME AND BATCH?
Term        Means                          Example
Real-Time   Process data when it arrives   Reject credit card transaction
Batch       Process data periodically      Flag suspicious transaction at night
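The credit card example can be sketched as two functions that apply the same rule at different times. The threshold, field names, and sample transactions are assumptions made for illustration.

```python
LIMIT = 1000  # hypothetical threshold for a suspicious amount


def handle_realtime(txn: dict) -> str:
    """Real-time: decide on one transaction the moment it arrives."""
    return "reject" if txn["amount"] > LIMIT else "accept"


def nightly_batch(txns: list) -> list:
    """Batch: scan the whole day's transactions at once, return flagged ids."""
    return [t["id"] for t in txns if t["amount"] > LIMIT]


day = [
    {"id": 1, "amount": 40},
    {"id": 2, "amount": 5000},
    {"id": 3, "amount": 1200},
]

decisions = [handle_realtime(t) for t in day]  # made one by one, on arrival
flagged = nightly_batch(day)                   # made once, after the fact
print(decisions, flagged)  # ['accept', 'reject', 'reject'] [2, 3]
```

The rule is identical; what differs is whether the answer comes in time to stop the transaction.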
BATCH
Processing Layer   SQL Layer
MapReduce          Hive, Pig
Spark              Spark SQL
REAL-TIME
HBase
Kafka
Spark Streaming
Lambda Architecture
WHY HBASE?
How can we store petabytes of data on HDFS
And do fast read and writes like a database?
WHAT IS HBASE?
HBase is a NoSQL database on top of HDFS
Can store petabytes of data
Supports fast random reads and writes, which plain HDFS files do not
WHY KAFKA?
How can we hold onto incoming data and not lose it
When we are getting a million messages per second?
WHAT IS KAFKA?
Kafka is TiVo for the cluster
It stores real-time data as it comes in
Can store a week of data
Queuing system for Hadoop cluster
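The TiVo analogy can be made concrete with a toy append-only log that keeps recent messages and lets a reader replay from any position. This is a sketch of the idea only; the class `RetainedLog` and its methods are invented for the example and are not Kafka's API.

```python
class RetainedLog:
    """Toy append-only log that drops messages older than a retention window."""

    def __init__(self, retention_seconds: float):
        self.retention = retention_seconds
        self.messages = []      # list of (offset, timestamp, payload)
        self.next_offset = 0

    def append(self, payload, now: float) -> int:
        """Store a message with its arrival time and return its offset."""
        offset = self.next_offset
        self.messages.append((offset, now, payload))
        self.next_offset += 1
        self._expire(now)
        return offset

    def _expire(self, now: float) -> None:
        """Drop everything that arrived before the retention cutoff."""
        cutoff = now - self.retention
        self.messages = [m for m in self.messages if m[1] >= cutoff]

    def read_from(self, offset: int) -> list:
        """Replay all retained messages at or after the given offset."""
        return [(o, p) for o, _, p in self.messages if o >= offset]


WEEK = 7 * 24 * 3600
log = RetainedLog(retention_seconds=WEEK)
log.append("a", now=0)
log.append("b", now=3600)
log.append("c", now=WEEK + 1800)   # "a" is now older than a week
print(log.read_from(0))            # [(1, 'b'), (2, 'c')]
```

A slow consumer can come back later and pick up from its last offset, as long as the data is still inside the retention window.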
WHY SPARK STREAMING?
How can we process data as it comes in
Instead of every night (using Spark or MapReduce)?
WHAT IS SPARK STREAMING?
Spark Streaming is a library on top of Spark
It allows processing data as soon as it comes in
Commonly reads its input from Kafka
WHY LAMBDA ARCHITECTURE?
How can we watch historical trends and what is happening right now?
How can we show bestsellers from this year and from the last hour?
WHAT IS LAMBDA ARCHITECTURE?
Big Data system which can handle both batch and real-time
Uses historical data as well as real-time data
Best of both worlds
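The bestseller question gives a simple way to picture this. In the sketch below, one view is assumed to come from a batch job and the other from a real-time layer; a query merges them. The titles, counts, and the `bestsellers` helper are all hypothetical.

```python
from collections import Counter

# Assumed output of the batch layer: sales this year up to the last batch run.
batch_view = Counter({"book-a": 900, "book-b": 750, "book-c": 20})

# Assumed output of the real-time layer: sales in the last hour.
realtime_view = Counter({"book-c": 40, "book-a": 5})


def bestsellers(view: Counter, n: int = 2) -> list:
    """Top-n titles by units sold in the given view."""
    return [title for title, _ in view.most_common(n)]


this_year = bestsellers(batch_view + realtime_view)  # historical plus fresh data
last_hour = bestsellers(realtime_view)               # fresh data only
print(this_year, last_hour)  # ['book-a', 'book-b'] ['book-c', 'book-a']
```

Merging at query time keeps the yearly ranking current without rerunning the batch job for every new sale.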
REVIEW
BATCH REVIEW
Technology   Description
Hadoop       Cluster operating system
HDFS         Stores petabytes of data on 100s or 1000s of machines
MapReduce    Processes data in HDFS
Hive         Translates SQL to MapReduce
Pig          Translates PigLatin to MapReduce
Spark        Faster MapReduce
Spark SQL    Translates SQL to Spark
REAL-TIME REVIEW
Technology            Description
HBase                 Fast NoSQL database on top of HDFS
Kafka                 Queues incoming data into cluster
Spark Streaming       Processes data in real time
Lambda Architecture   Combines real-time and batch
