January 31, 2015
Using Spark in Production
Pete Gamache / @gamache / pete@localytics.com
Quick Introduction to Localytics
•Localytics is an app marketing service with a strong analytics
engine at its core.
•We process billions of events per day in real time, using a
backend made of Scala and AWS.
•Every data point we’ve ever received is available for querying,
and we integrate analytics insight with marketing automation.
Original Architecture
•Rails
•MySQL
•It was 2009 and people did that back then
Current Backend Architecture
•Scala services loosely coupled through queues
•“Queuer” service is our front line, and collects and enqueues
events sent from mobile or web apps
•“Processor” service handles deduplication, device recognition,
and a lot more
•Data ends up in MPP DB (querying) and in S3 (export)
Our Processor Service
•Processor service is the place we have traditionally added new
functionality
•After four years of this, it’s rather monolithic
•Parts are absolutely critical to our company; other parts are not
•Different parts scale differently
Rethinking Our Data Processing
•Treat the Processor service as the “source of truth”
•Build outboard services which consume this truth
•Simplify the existing Processor by moving some logic out
•Add specialized data sources to take burden off MPP DB
The Source of Truth
•S3 bucket containing batches of events, filed by date and hour
of processing
•Events may (will) be out of order
•A single user’s session may (will) be split across files
•At press time: each file ~500MB of JSON objects; ~500 files per
hour (~6 TB/day)
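The daily volume follows directly from the per-file and per-hour figures above. A minimal sketch of that arithmetic, plus one way files could be filed by date and hour; the `yyyy/MM/dd/HH` key layout and the names `SourceOfTruthVolume` and `hourPrefix` are assumptions for illustration, not the actual bucket layout:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object SourceOfTruthVolume {
  val fileSizeMb   = 500L // ~500 MB per file (from the slide)
  val filesPerHour = 500L // ~500 files per hour (from the slide)

  // Decimal units: 1 TB = 1,000,000 MB.
  def dailyVolumeTb: Double = fileSizeMb * filesPerHour * 24 / 1e6

  // Hypothetical key prefix for "filed by date and hour of processing".
  private val hourFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
  def hourPrefix(processedAt: LocalDateTime): String = processedAt.format(hourFormat)
}
```

500 MB x 500 files x 24 hours is 6,000,000 MB, which is the ~6 TB/day quoted on the slide.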
Side Note: Lambda Architecture
•Batch and stream processing, at once
•Most of what we do right now is batch processing, where
“batch” means “all of our data ever”
•MPP DB provides arbitrary querying of entire data set
•Outboard services can provide a speed layer for simple/common
queries
Feature Requirements
•Every event we receive has a number of attributes passed
along with it
•e.g., Country, Timezone, OS Version, etc.
•We want to keep the most recent attribute values for each of
our users in a key-value store
•This provides fast real-time access to current truths, rather
than sifting through a series of events for point-in-time facts
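Because events may arrive out of order, "most recent" has to be decided by event time, not arrival order. A minimal sketch in plain Scala collections of how that reduction could look; the `Event` shape and its field names are assumptions, not the Localytics schema:

```scala
object LatestAttributes {
  // Hypothetical event shape; field names are assumptions.
  final case class Event(userId: String, timestamp: Long, attributes: Map[String, String])

  // Per-user state: attribute name -> (timestamp of the event that set it, value).
  type Profile = Map[String, (Long, String)]

  // Apply one event to a profile, attribute by attribute.
  def merge(profile: Profile, event: Event): Profile =
    event.attributes.foldLeft(profile) { case (acc, (name, value)) =>
      acc.get(name) match {
        // The stored value is at least as recent: keep it (ties keep the first seen).
        case Some((seenAt, _)) if seenAt >= event.timestamp => acc
        // Otherwise this event carries the newest value for the attribute.
        case _ => acc.updated(name, (event.timestamp, value))
      }
    }

  // Reduce a batch of events to the latest attribute values per user.
  def latest(events: Iterable[Event]): Map[String, Map[String, String]] =
    events.groupBy(_.userId).map { case (user, userEvents) =>
      val profile = userEvents.foldLeft(Map.empty[String, (Long, String)])(merge)
      user -> profile.map { case (name, (_, value)) => name -> value }
    }
}
```

Keeping the timestamp next to each value is what lets a late-arriving older event be ignored per attribute rather than per user.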
Apache Spark
•Most often envisioned as multi-node, long-lived cluster à la
Hadoop
•This is a great fit for many batch-layer tasks
•Spark Streaming makes sliding-window querying very simple
•Spark SQL provides a familiar interface to non-production
coders (business intelligence, data science, etc.)
A Few Gripes about Spark Clusters
•Weak tooling around cluster spin-up, monitoring, maintenance,
scaling
•Tough to scale horizontally, compared to stateless servers behind an Amazon ELB
•Building fat JARs when you use any non-Spark dependencies
at all is a god-damned nightmare
I’m Not Kidding About the Fat JARs
libraryDependencies ++= Seq(
  // Spark SQL, minus transitive dependencies that collide in the fat JAR
  ("org.apache.spark" %% "spark-sql" % "1.1.0").
    exclude("com.typesafe.akka", "akka-actor_2.10").
    exclude("com.typesafe.akka", "akka-slf4j_2.10").
    exclude("org.mortbay.jetty", "servlet-api").
    exclude("javax.transaction", "jta").
    exclude("commons-beanutils", "commons-beanutils-core").
    exclude("commons-logging", "commons-logging").
    exclude("org.slf4j", "jcl-over-slf4j").
    exclude("org.slf4j", "slf4j-log4j12").
    exclude("commons-collections", "commons-collections").
    exclude("com.esotericsoftware.minlog", "minlog"),
  // play-json, with the same kind of exclusions
  ("com.typesafe.play" %% "play-json" % "2.4.0-M2").
    exclude("com.typesafe.akka", "akka-slf4j_2.10").
    exclude("javax.transaction", "jta").
    exclude("commons-logging", "commons-logging").
    exclude("org.slf4j", "jcl-over-slf4j")
)
Spark Standalone FTW
•Great for small bites of data (tens of GB)
•Few dependency issues
•Horizontal scale comes easy/free
•Hadoop file input adapters are excellent
•Works great for tests, too!
A Proposed Product Architecture
•Processor service spits truth into S3 bucket
•S3 Event Notifications are posted to an SQS queue
•A pool of post-processing servers, each running Spark
standalone, drains the queue. Scale up on queue depth
•Each S3 file generates ~25K updates to be applied
•Actor pool of HTTP clients applies updates to key-value web
service
•Akka ask pattern provides per-server synchronization
•Play framework as app container
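The queue-draining step above can be sketched without any AWS dependency. This is a minimal illustration, not the production code: `Queue`, `Message`, `InMemoryQueue`, `drainOnce`, and `serversNeeded` are hypothetical names, the in-memory queue stands in for SQS, and the scaling rule is just one plausible reading of "scale up on queue depth".

```scala
object QueueDrain {
  // Hypothetical stand-in for an SQS message pointing at one S3 file.
  final case class Message(receiptHandle: String, s3Key: String)

  // Minimal queue interface; a real implementation would wrap the AWS SDK.
  trait Queue {
    def receive(max: Int): Seq[Message]
    def delete(message: Message): Unit
  }

  // In-memory queue for illustration. Received messages stay queued until
  // deleted, which loosely mimics redelivery after an SQS visibility timeout.
  final class InMemoryQueue(initial: Seq[Message]) extends Queue {
    private var pending = initial.toVector
    def receive(max: Int): Seq[Message] = pending.take(max)
    def delete(message: Message): Unit = { pending = pending.filterNot(_ == message) }
    def depth: Int = pending.size
  }

  // Take up to batchSize messages, process their files as one batch, and
  // delete the messages only after processing succeeds. If `process` throws,
  // nothing is deleted and the exception propagates to the caller.
  def drainOnce(queue: Queue, batchSize: Int)(process: Seq[String] => Unit): Int = {
    val batch = queue.receive(batchSize)
    if (batch.nonEmpty) {
      process(batch.map(_.s3Key))
      batch.foreach(queue.delete)
    }
    batch.size
  }

  // Hypothetical scaling rule: enough servers to cover the queue depth at
  // filesPerServer files each, and never fewer than one.
  def serversNeeded(queueDepth: Int, filesPerServer: Int): Int =
    math.max(1, (queueDepth + filesPerServer - 1) / filesPerServer)
}
```

In this sketch the `process` function is where a Spark standalone job would read the batch of files and hand the resulting updates to the HTTP client actors.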
How’s it work?
•Great, that’s how
•Each server ingests 5-10 files at a time (~3-5GB)
•Pre-tuning: T_spark=1m20s, 25K updates per minute on
c3.8xlarge EC2 instance (60GB RAM, 32 cores, 10Gb network)
•Post-tuning: T_spark=1m20s, 75K updates per minute on
c3.2xlarge EC2 instance (15GB RAM, 8 cores, 1Gb network)
•Per core, that’s a 4x speedup on Spark (same T_spark on a quarter
of the cores) and 12x on HTTP (3x the update rate on a quarter of
the cores)
Tuning
•Ingesting files in batches of 5 or more
•Replacing default Play logger with AsyncLogger
•HTTP tuning
•Gave up play-ws for Apache HttpClient
•Single-threaded client per actor
•Keepalive
•Preëmptive HTTP Basic Auth
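Preemptive Basic Auth means sending the `Authorization` header on the first request instead of waiting for a 401 challenge, which saves a round trip per connection. A minimal sketch of building that header value with the JDK only; the object and method names are hypothetical, and wiring the header into Apache HttpClient is left out:

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object PreemptiveBasicAuth {
  // Value of the Authorization header for HTTP Basic auth:
  // "Basic " followed by base64("user:password").
  def headerValue(user: String, password: String): String = {
    val credentials = s"$user:$password".getBytes(StandardCharsets.UTF_8)
    s"Basic ${Base64.getEncoder.encodeToString(credentials)}"
  }
}
```

For the well-known RFC example, `headerValue("Aladdin", "open sesame")` yields `Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==`.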
Gotchas and Issues
•Mostly transient Spark issues
•Network timeouts
•File not found
•Too many open files
•Solution:
•Akka SupervisorStrategy + actor restart
•SQS message timeout ensures data will be processed
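The restart-on-failure idea can be illustrated without Akka. The sketch below is a plain-Scala stand-in, not an Akka SupervisorStrategy: it retries a unit of work a bounded number of times and returns the last failure if every attempt fails. The names `TransientRetry` and `withRetries` are hypothetical.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Try}

object TransientRetry {
  // Run `work`; on a non-fatal exception, retry up to maxRetries more times.
  // Returns the first Success, or the last Failure once retries run out.
  @tailrec
  def withRetries[A](maxRetries: Int)(work: () => A): Try[A] =
    Try(work()) match {
      case Failure(_) if maxRetries > 0 => withRetries(maxRetries - 1)(work)
      case result                       => result
    }
}
```

If the final result is still a failure, the server simply does not delete the SQS message; once its visibility timeout expires the message is delivered again, so the file is eventually processed.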
Future Work
•Establish long-lived Spark Streaming cluster(s) for batch layer
•Create more standalone Spark services
•Disassemble the Processor monolith brick by brick
•???
We’re Hiring!
If this kind of thing sounds like fun, let’s talk!
Email me at pete@localytics.com or visit
www.localytics.com/jobs/.
Q&A
Thanks!
Now let’s go have a beer.
