Supercharging ETL with Spark 
Rafal Kwasny 
First Spark London Meetup 
2014-05-28
Who are you?
About me 
• Sysadmin/DevOps background 
• Worked as DevOps @Visualdna 
• Now building a game analytics platform @Sony Computer Entertainment Europe
Outline 
• What is ETL 
• How do we do it in the standard Hadoop stack 
• How can we supercharge it with Spark 
• Real-life use cases 
• How to deploy Spark 
• Lessons learned
Standard technology stack 
• Get the data 
• Load into HDFS / S3 
• Extract & Transform & Load 
• Query, Analyze, train ML models 
• Real Time pipeline
Hadoop 
• Industry standard 
• Have you ever looked at Hadoop code and tried to fix something?
How simple is simple? 
”Simple YARN application to run n copies of a unix command - 
deliberately kept simple (with minimal error handling etc.)” 
➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git 
(…) 
➜ $ find simple-yarn-app -name "*.java" |xargs cat | wc -l 
232
ETL Workflow 
• Get some data from S3/HDFS 
• Map 
• Shuffle 
• Reduce 
• Save to S3/HDFS
Repeat 10 times
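As a rough illustration of what one such stage does, the map, shuffle and reduce steps can be sketched on plain Scala collections, with no Hadoop or Spark involved. The object and method names below (EtlStageSketch, mapPhase, shufflePhase, reducePhase) are hypothetical, and counting events per user is an assumed example workload, not one taken from the talk. In a chained MapReduce pipeline every stage like this one also reads its input from, and writes its output to, S3/HDFS.

```scala
// One MapReduce-style stage on local collections: map -> shuffle -> reduce.
// All names here are illustrative assumptions, not code from the talk.
object EtlStageSketch {
  // Map: emit one (key, value) pair per input record; here (user_id, 1).
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.map(line => (line.split("\t")(0), 1))

  // Shuffle: bring all values that share a key together.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reduce: collapse each key's values into a single result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (key, values) => (key, values.sum) }

  // A full stage; a chained pipeline would persist this result and start over.
  def run(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))
}
```

Calling EtlStageSketch.run on a few tab-separated lines returns a count per user_id; repeating such a stage ten times means ten rounds of job startup and storage I/O.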
Issue: Test run time 
• Job startup time ~20s to run a job that does nothing 
• Hard to test the code without a cluster (Cascading simulation mode != real life)
Issue: new applications 
MapReduce awkward for key big data workloads: 
• Low latency dispatch (e.g. quick queries) 
• Iterative algorithms (e.g. ML, Graph…) 
• Streaming data ingest
Issue: hardware is moving on 
Hardware has advanced since Hadoop started: 
• Very large RAMs, Faster networks (10Gb+) 
• Bandwidth to disk not keeping up 
• 1 GB of RAM ~ $0.75/month * 
*based on a spot price of AWS r3.8xlarge instance
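The per-GB figure can be turned back into the hourly spot price it implies. This is a back-of-the-envelope sketch under stated assumptions: 244 GiB of RAM per r3.8xlarge (the published instance spec), a 30-day month, and the whole instance price attributed to RAM. The slide does not state the actual spot price, so the hourly value is derived, not quoted, and RamCostSketch is a hypothetical name.

```scala
// Back-of-the-envelope check of the "1 GB of RAM ~ $0.75/month" figure.
object RamCostSketch {
  val ramGiB = 244.0             // assumption: r3.8xlarge memory size
  val costPerGbMonth = 0.75      // figure from the slide
  val hoursPerMonth = 30 * 24.0  // assumption: 30-day month

  // Monthly cost of one instance if all of it is attributed to RAM.
  val instanceCostPerMonth: Double = costPerGbMonth * ramGiB

  // Hourly spot price implied by the slide's figure.
  val impliedSpotPricePerHour: Double = instanceCostPerMonth / hoursPerMonth
}
```

Under these assumptions one instance costs about $183 per month, which corresponds to a spot price of roughly $0.25 per hour.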
How can we 
supercharge our ETL?
Use Spark 
• Fast and Expressive Cluster Computing Engine 
• Compatible with Apache Hadoop 
• In-memory storage 
• Rich APIs in Java, Scala, Python
Why Spark? 
• Up to 40x faster than Hadoop MapReduce 
( for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/ ) 
• Jobs can be scheduled and run in <1s 
• Typically less code (2-5x) 
• Seamless Hadoop/HDFS integration 
• REPL 
• Accessible Source in terms of LOC and modularity
Why Spark? 
• Berkeley Data Analytics Stack ecosystem: 
• Spark, Spark Streaming, Shark, BlinkDB, MLlib 
• Deep integration into Hadoop ecosystem 
• Read/write Hadoop formats 
• Interoperability with other ecosystem components 
• Runs on Mesos & YARN, also MR1 
• EC2, EMR 
• HDFS, S3
Why Spark?
Using RAM for in-memory caching
Fault recovery
Stack 
Also: 
• SHARK ( Hive on Spark ) 
• Tachyon ( off heap caching ) 
• SparkR ( R wrapper ) 
• BlinkDB ( Approximate Queries)
Real-life use
Spark use-cases 
• next-generation ETL platform 
• No more “multiple chained MapReduce jobs” architecture 
• Fewer jobs to worry about 
• Better sleep for your DevOps team
Sessionization 
Add session_id to events
Why add session id? 
Combine all user activity into user sessions
Adding session ID 
user_id timestamp Referrer URL 
user1 1401207490 http://fb.com http://webpage/ 
user2 1401207491 http://twitter.com http://webpage/ 
user1 1401207543 http://webpage/ http://webpage/login 
user1 140120841 http://webpage/login http://webpage/add_to_cart 
user2 1401207491 http://webpage/ http://webpage/product1
Group by user 
user_id timestamp Referrer URL 
user1 1401207490 http://fb.com http://webpage/ 
user1 1401207543 http://webpage/ http://webpage/login 
user1 140120841 http://webpage/login http://webpage/add_to_cart 
user2 1401207491 http://twitter.com http://webpage/ 
user2 1401207491 http://webpage/ http://webpage/product1
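The grouping step can be tried out on local collections before moving to a cluster. The sketch below assumes tab-separated input with exactly the four columns shown in the table; the Event case class and the GroupByUserSketch object are hypothetical names introduced for illustration.

```scala
// Parse tab-separated events and group them per user, ordered by time.
// Names and the four-column layout are assumptions based on the table above.
object GroupByUserSketch {
  case class Event(userId: String, timestamp: Long, referrer: String, url: String)

  // Parse one line; assumes exactly four well-formed, non-empty fields.
  def parse(line: String): Event = {
    val Array(userId, ts, referrer, url) = line.split("\t")
    Event(userId, ts.toLong, referrer, url)
  }

  // Group events by user_id and sort each user's events by timestamp.
  def groupByUser(events: Seq[Event]): Map[String, Seq[Event]] =
    events.groupBy(_.userId).map { case (user, evs) => (user, evs.sortBy(_.timestamp)) }
}
```

Sorting inside each group is a design choice made here so that a later step can walk a user's events in order; the slides only show the grouping itself.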
Add unique session id 
user_id timestamp session_id Referrer URL 
user1 1401207490 8fddc743bfbafdc45e071e5c126ceca7 http://fb.com http://webpage/ 
user1 1401207543 8fddc743bfbafdc45e071e5c126ceca7 http://webpage/ http://webpage/login 
user1 140120841 8fddc743bfbafdc45e071e5c126ceca7 http://webpage/login http://webpage/add_to_cart 
user2 1401207491 c00e7421525008584d9d1ff4201cbf65 http://twitter.com http://webpage/ 
user2 1401207491 c00e7421525008584d9d1ff4201cbf65 http://webpage/ http://webpage/product1
Join with external data 
user_id timestamp session_id new_user Referrer URL 
user1 1401207490 8fddc743bfbafdc45e071e5c126ceca7 TRUE http://fb.com http://webpage/ 
user1 1401207543 8fddc743bfbafdc45e071e5c126ceca7 TRUE http://webpage/ http://webpage/login 
user1 140120841 8fddc743bfbafdc45e071e5c126ceca7 TRUE http://webpage/login http://webpage/add_to_cart 
user2 1401207491 c00e7421525008584d9d1ff4201cbf65 FALSE http://twitter.com http://webpage/ 
user2 1401207491 c00e7421525008584d9d1ff4201cbf65 FALSE http://webpage/ http://webpage/product1
Sessionize user clickstream 
• Filter interesting events 
• Group by user 
• Add unique sessionId 
• Join with external data sources 
• Write output
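val input = sc.textFile("file:///tmp/input") 

// Split each tab-separated event line into fields 
val rawEvents = input 
  .map(line => line.split("\t")) 

// Key the user info rows by user_id (first column) 
val userInfo = sc.textFile("file:///tmp/userinfo") 
  .map(line => line.split("\t")) 
  .map(user => (user(0), user)) 

val processedEvents = rawEvents 
  .map(arr => (arr(0), arr))  // key events by user_id 
  .cogroup(userInfo)          // collect events and user info per user 
  .flatMapValues(k => { 
    // "true" when the user appears in the userinfo data 
    val new_user = if (k._2.nonEmpty) "true" else "false" 
    // one session id shared by all of this user's events 
    val session_id = java.util.UUID.randomUUID.toString 
    // insert session_id and new_user after the first three fields 
    k._1.map(line => 
      line.slice(0, 3) ++ Array(session_id) ++ Array(new_user) ++ line.drop(3) 
    ) 
  }) 
  .map(k => k._2)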
Why is it better? 
• Single spark job 
• Easier to maintain than 3 consecutive MapReduce stages 
• Can be unit tested
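One way to make the job unit-testable is to pull the per-user logic out of the closure into a pure function that takes plain collections. The sketch below is an assumption about how that could look, not the author's code: SessionizeSketch, enrich, newSessionId and insertAt are hypothetical names. The slide code inserts the new fields after the third column while the example tables show them after the timestamp, so the position is left as a parameter here.

```scala
// Pure, cluster-free version of the per-user enrichment step.
// All names are illustrative; the session id generator is injected so that
// tests can supply a deterministic value instead of a random UUID.
object SessionizeSketch {
  def enrich(events: Seq[Array[String]],
             userInfo: Seq[Array[String]],
             newSessionId: () => String,
             insertAt: Int): Seq[Array[String]] = {
    // "true" when at least one userinfo row matched this user
    val newUser = if (userInfo.nonEmpty) "true" else "false"
    // one session id shared by all of this user's events
    val sessionId = newSessionId()
    // splice session_id and new_user into each event at the given position
    events.map(e => e.take(insertAt) ++ Array(sessionId, newUser) ++ e.drop(insertAt))
  }
}
```

In the Spark job the same function could be called from flatMapValues with () => java.util.UUID.randomUUID.toString as the generator, while a test passes a fixed string and compares the resulting rows directly.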
From the DevOps 
perspective
v1.0 - running on EC2 
• Start with an EC2 script 
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name> 
If it does not work for you, modify it: it’s just a simple Python + boto script
v2.0 - Autoscaling on spot instances 
1x Master - on-demand (c3.large) 
XX Slaves - spot instances depending on usage patterns (r3.*) 
• no HDFS 
• persistence in memory + S3
Other options 
• Mesos 
• YARN 
• MR1
Lessons learned
JVM issues 
• java.lang.OutOfMemoryError: GC overhead limit exceeded 
• add more memory? 
val sparkConf = new SparkConf() 
.set("spark.executor.memory", "120g") 
.set("spark.storage.memoryFraction","0.3") 
.set("spark.shuffle.memoryFraction","0.3") 
• increase parallelism: 
sc.textFile("s3://..path", 10000) 
groupByKey(10000)
Full GC 
2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs] 
2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds 
we want to avoid this 
• Use G1GC + Java 8 
• Store data serialized 
set("spark.serializer","org.apache.spark.serializer.KryoSerializer") 
set("spark.kryo.registrator","scee.SceeKryoRegistrator")
Bugs 
• For example: CDH5 does not work with Amazon S3 out of the box (thanks to Sean, it will be fixed in the next release) 
• If in doubt use the provided ec2/spark-ec2 script 
• ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
Tips & Tricks 
• You do not need to package the whole of Spark with your app; just mark the dependencies as "provided" in sbt 
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided" 
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided" 
assembly jar size from 120MB -> 5MB 
• Always ensure you are compiling against the same versions of the artifacts as the cluster runs; if not, ”bad things will happen”™
Future - Spark 1.0 
• Voting in progress to release Spark 1.0.0 RC11 
• Spark SQL 
• History server 
• Job Submission Tool 
• Java 8 support
Spark - Hadoop done right 
• Faster to run, less code to write 
• Deploying Spark can be easy and cost-effective 
• Still rough around the edges but improves quickly
Thank you for listening 
:)

More Related Content

What's hot

Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Databricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Summit
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
Databricks
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
DataWorks Summit
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark Summit
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Spark Summit
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
Databricks
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Databricks
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
Spark Summit
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
DataWorks Summit
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Databricks
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Databricks
 

What's hot (20)

Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 

Viewers also liked

Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Databricks
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
datamantra
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
DataWorks Summit
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better Storage
Sandeep Patil
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
DataWorks Summit
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
DataWorks Summit
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
DataWorks Summit/Hadoop Summit
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
gethue
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
Sandy Ryza
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
Databricks
 
Proxy Servers
Proxy ServersProxy Servers
Proxy Servers
Sourav Roy
 
Proxy Server
Proxy ServerProxy Server
Proxy Server
guest095022
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Zeppelin(Spark)으로 데이터 분석하기
Zeppelin(Spark)으로 데이터 분석하기Zeppelin(Spark)으로 데이터 분석하기
Zeppelin(Spark)으로 데이터 분석하기
SangWoo Kim
 

Viewers also liked (16)

Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better Storage
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
SocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean ManualSocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean Manual
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Proxy Servers
Proxy ServersProxy Servers
Proxy Servers
 
Proxy Server
Proxy ServerProxy Server
Proxy Server
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Zeppelin(Spark)으로 데이터 분석하기
Zeppelin(Spark)으로 데이터 분석하기Zeppelin(Spark)으로 데이터 분석하기
Zeppelin(Spark)으로 데이터 분석하기
 

Similar to ETL with SPARK - First Spark London meetup

20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
whoschek
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
Giivee The
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
Jean-Georges Perrin
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
PROIDEA
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
Jakub Hajek
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 

Similar to ETL with SPARK - First Spark London meetup (20)

20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
ETL with SPARK - First Spark London meetup

  • 1. Supercharging ETL with Spark Rafal Kwasny First Spark London Meetup 2014-05-28
  • 3. About me • Sysadmin/DevOps background • Worked as DevOps @Visualdna • Now building game analytics platform @Sony Computer Entertainment Europe
  • 4. Outline • What is ETL • How do we do it in the standard Hadoop stack • How can we supercharge it with Spark • Real-life use cases • How to deploy Spark • Lessons learned
  • 6. Standard technology stack Load into HDFS / S3
  • 7. Standard technology stack Extract & Transform & Load
  • 8. Standard technology stack Query, Analyze, train ML models
  • 9. Standard technology stack Real Time pipeline
  • 10. Hadoop • Industry standard • Have you ever looked at Hadoop code and tried to fix something?
  • 11. How simple is simple? "Simple YARN application to run n copies of a unix command - deliberately kept simple (with minimal error handling etc.)"
       $ git clone https://github.com/hortonworks/simple-yarn-app.git
       (…)
       $ find simple-yarn-app -name "*.java" | xargs cat | wc -l
       232
  • 12. ETL Workflow • Get some data from S3/HDFS • Map • Shuffle • Reduce • Save to S3/HDFS
  • 13. ETL Workflow • Get some data from S3/HDFS • Map • Shuffle • Reduce • Save to S3/HDFS Repeat 10 times
  • 14. Issue: Test run time • Job startup time ~20s to run a job that does nothing • Hard to test the code without a cluster ( cascading simulation mode != real life )
  • 15. Issue: new applications MapReduce is awkward for key big data workloads: • Low-latency dispatch (e.g. quick queries) • Iterative algorithms (e.g. ML, graph…) • Streaming data ingest
  • 16. Issue: hardware is moving on Hardware has advanced since Hadoop started: • Very large RAM, faster networks (10Gb+) • Bandwidth to disk not keeping up • 1 GB of RAM ~ $0.75/month * *based on the spot price of an AWS r3.8xlarge instance
  • 17. How can we supercharge our ETL?
  • 18. Use Spark • Fast and Expressive Cluster Computing Engine • Compatible with Apache Hadoop • In-memory storage • Rich APIs in Java, Scala, Python
  • 19. Why Spark? • Up to 40x faster than Hadoop MapReduce ( for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/ ) • Jobs can be scheduled and run in <1s • Typically less code (2-5x) • Seamless Hadoop/HDFS integration • REPL • Accessible Source in terms of LOC and modularity
  • 20. Why Spark? • Berkeley Data Analytics Stack ecosystem: • Spark, Spark Streaming, Shark, BlinkDB, MLlib • Deep integration into Hadoop ecosystem • Read/write Hadoop formats • Interoperability with other ecosystem components • Runs on Mesos & YARN, also MR1 • EC2, EMR • HDFS, S3
  • 22. Using RAM for in-memory caching
  • 24. Stack Also: • SHARK ( Hive on Spark ) • Tachyon ( off heap caching ) • SparkR ( R wrapper ) • BlinkDB ( Approximate Queries)
  • 27. Spark use-cases • Next-generation ETL platform • No more "multiple chained MapReduce jobs" architecture • Fewer jobs to worry about • Better sleep for your DevOps team
  • 29. Why add session id? Combine all user activity into user sessions
  • 30. Adding session ID
       user_id  timestamp   Referrer              URL
       user1    1401207490  http://fb.com         http://webpage/
       user2    1401207491  http://twitter.com    http://webpage/
       user1    1401207543  http://webpage/       http://webpage/login
       user1    140120841   http://webpage/login  http://webpage/add_to_cart
       user2    1401207491  http://webpage/       http://webpage/product1
  • 31. Group by user
       user_id  timestamp   Referrer              URL
       user1    1401207490  http://fb.com         http://webpage/
       user1    1401207543  http://webpage/       http://webpage/login
       user1    140120841   http://webpage/login  http://webpage/add_to_cart
       user2    1401207491  http://twitter.com    http://webpage/
       user2    1401207491  http://webpage/       http://webpage/product1
  • 32. Add unique session id
       user_id  timestamp   session_id                        Referrer              URL
       user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  http://fb.com         http://webpage/
       user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  http://webpage/       http://webpage/login
       user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  http://webpage/login  http://webpage/add_to_cart
       user2    1401207491  c00e7421525008584d9d1ff4201cbf65  http://twitter.com    http://webpage/
                140120749…  c00e742152500…                                          http://webpage/product…
  • 33. Join with external data
       user_id  timestamp   session_id                        new_user  Referrer              URL
       user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://fb.com         http://webpage/
       user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/       http://webpage/login
       user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/login  http://webpage/add_to_cart
       user2    1401207491  c00e7421525008584d9d1ff4201cbf65  FALSE     http://twitter.com    http://webpage/
                            c00e7421525…
  • 34. Sessionize user clickstream • Filter interesting events • Group by user • Add unique sessionId • Join with external data sources • Write output
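  • 35.
       // Events and user records are tab-separated text files
       val input = sc.textFile("file:///tmp/input")
       val rawEvents = input
         .map(line => line.split("\t"))
       val userInfo = sc.textFile("file:///tmp/userinfo")
         .map(line => line.split("\t"))
         .map(user => (user(0), user))          // key by user_id
       val processedEvents = rawEvents
         .map(arr => (arr(0), arr))             // key by user_id
         .cogroup(userInfo)                     // group events and user info per user
         .flatMapValues(k => {
           // "true" when the user has at least one record in userInfo
           val new_user = k._2.length match {
             case x if x > 0 => "true"
             case _ => "false"
           }
           // one id shared by all of this user's events
           val session_id = java.util.UUID.randomUUID.toString
           k._1.map(line =>
             line.slice(0, 3) ++ Array(session_id) ++ Array(new_user) ++ line.drop(3)
           )
         })
         .map(k => k._2)                        // drop the key, keep the enriched rows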
  • 36. Why is it better? • Single spark job • Easier to maintain than 3 consecutive map reduce stages • Can be unit tested
  • 37. From the DevOps perspective
  • 38. v1.0 - running on EC2 • Start with an EC2 script ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name> If it does not work for you, modify it; it's just a simple Python + boto script
  • 39. v2.0 - Autoscaling on spot instances 1x Master - on-demand (c3.large) XX Slaves - spot instances depending on usage patterns (r3.*) • no HDFS • persistence in memory + S3
  • 40. Other options • Mesos • YARN • MR1
  • 42. JVM issues • java.lang.OutOfMemoryError: GC overhead limit exceeded • add more memory?
       val sparkConf = new SparkConf()
         .set("spark.executor.memory", "120g")
         .set("spark.storage.memoryFraction", "0.3")
         .set("spark.shuffle.memoryFraction", "0.3")
     • increase parallelism:
       sc.textFile("s3://..path", 10000)
       groupByKey(10000)
  • 43. Full GC
       2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]
       2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds
     We want to avoid this • Use G1GC + Java 8 • Store data serialized
       set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
  • 44. Bugs • For example: cdh5 does not work with Amazon S3 out of the box (thanks to Sean, it will be fixed in the next release) • If in doubt, use the provided ec2/spark-ec2 script • ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
  • 45. Tips & Tricks • You do not need to package the whole of Spark with your app; just mark the dependencies as "provided" in sbt:
       libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"
       libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"
     Assembly jar size goes from 120MB -> 5MB • Always ensure you are compiling against the same version of the artifacts; if not, "bad things will happen"™
  • 46. Future - Spark 1.0 • Voting in progress to release Spark 1.0.0 RC11 • Spark SQL • History server • Job Submission Tool • Java 8 support
  • 47. Spark - Hadoop done right • Faster to run, less code to write • Deploying Spark can be easy and cost-effective • Still rough around the edges but improves quickly
  • 48. Thank you for listening :)
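
Slide 36 notes that the sessionizing job "can be unit tested". One way to get there, sketched here under stated assumptions rather than taken from the talk, is to pull the per-user step of slide 35 out into a plain function that needs no cluster. The names `Sessionize`, `enrich`, and the injectable `newId` generator are hypothetical. The sketch inserts the new columns after the second field so that the output matches the column order shown on slide 33; the slide 35 code inserts after the third field instead.

```scala
object Sessionize {
  // Per-user step of the slide 35 job, written against plain Scala collections.
  //   events   - all rows for one user, already split on tabs
  //              (user_id, timestamp, referrer, url)
  //   userInfo - rows for the same user from the external user table (may be empty)
  //   newId    - session id generator; injectable so tests are deterministic
  //              (hypothetical parameter, not part of the original code)
  def enrich(
      events: Seq[Array[String]],
      userInfo: Seq[Array[String]],
      newId: () => String = () => java.util.UUID.randomUUID.toString
  ): Seq[Array[String]] = {
    // Same rule as the slide: "true" when the user appears in userInfo
    val newUser = if (userInfo.nonEmpty) "true" else "false"
    // One id shared by every event of this user
    val sessionId = newId()
    // Insert session_id and new_user after (user_id, timestamp)
    events.map(line => line.slice(0, 2) ++ Array(sessionId, newUser) ++ line.drop(2))
  }
}
```

Inside the Spark job, the body of `flatMapValues` would then presumably reduce to a call such as `Sessionize.enrich(k._1.toSeq, k._2.toSeq)`, while the function itself can be checked with ordinary assertions on small in-memory inputs.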

Editor's Notes

  1. My experience supercharging Extract Transform Load workloads with Spark
  2. Get the data (access logs + application logs )
  3. Put it into S3 / load into HDFS
  4. Transform using Hive/Streaming/Cascading/Scalding into a flat structure you can query
  5. Load into an MPP database / query using Hive
  6. Rewrite all the logic for real time, on top of a completely different technology (Storm, Samza, etc.)
  7. Is it the best option?
  8. read–eval–print loop