© 2013 MediaCrossing, Inc. All rights reserved.
Spark’s Role at
MediaCrossing
Gary Malouf
Architect at MediaCrossing
@GaryMalouf
Boston Spark User Group July 15, 2014
About Me
• Functional Programming enthusiast (Formerly a Java
Developer)
• Enjoy building fault-tolerant, highly scalable software
• Continuously looking for ways to make software easier
to reason about
• Leading development of an ad trading system at
MediaCrossing
MediaCrossing
• A Market Maker for digital media
• Treat online ads as a financial instrument
• Trade the rights to deliver ad impressions on behalf of
clients, but bear the risk ourselves to get the best
possible price points
• Hundreds of thousands of ad impression opportunities
can be bought or sold every second; servicing even a
slice of them means handling large volumes of data
Our Development Approach
• Functional programming is the ‘default’; we use
mutable state and imperative approaches where they
make sense
• We compose microservices together to form a more
‘antifragile’ system
• Scala is a great fit for this, and most of us had
already used it successfully
• Company inception December 2012 – approximately
99% of our code base to date is in Scala
System Responsibilities
• Two Major Focuses
    • High throughput, low latency trading
    • Analytical feedback loop to enrich strategies and alter
      behavior based on market conditions
• The ‘feedback loop’ is where much of the secret sauce
is created for the execution platform to act on
• Once we had an interface between our focuses, we
could choose the best technologies possible to
address each system’s needs individually
Concerning the Feedback Loop
• Inspired by Nathan Marz’s “Lambda Architecture”
principles, our team leverages a unified view of
realtime and historic user behavior to constantly adjust
our buying and selling models
• Realtime data is aggregated via Storm and stored in
time series within Cassandra
• Historic data is fed into HDFS via Storm -> Flume; we
then use Spark to build the time series aggregates and
write them to Cassandra
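
The serving side of such a Lambda-style feedback loop can be sketched in plain Python. This is only an illustration, not MediaCrossing's implementation: the dictionaries stand in for the Cassandra time series, the event tuples and the names `hour_bucket`, `aggregate`, `merged_view` and `batch_horizon` are hypothetical, and hourly buckets are an assumed granularity.

```python
from collections import Counter


def hour_bucket(ts):
    """Truncate a Unix timestamp (seconds) to the start of its hour."""
    return ts - (ts % 3600)


def aggregate(events):
    """Count impressions per (campaign, hour) bucket."""
    counts = Counter()
    for campaign, ts in events:
        counts[(campaign, hour_bucket(ts))] += 1
    return counts


def merged_view(batch_view, realtime_view, batch_horizon):
    """Unified view: batch counts for buckets before batch_horizon,
    realtime counts for buckets at or after it."""
    view = {k: v for k, v in batch_view.items() if k[1] < batch_horizon}
    view.update(
        {k: v for k, v in realtime_view.items() if k[1] >= batch_horizon}
    )
    return view


# Hypothetical events: (campaign id, Unix timestamp in seconds).
historic = [("c1", 0), ("c1", 10), ("c2", 3599), ("c1", 3600)]
recent = [("c1", 7200), ("c1", 7260), ("c2", 7300)]

batch = aggregate(historic)    # stands in for the Spark-built aggregates
realtime = aggregate(recent)   # stands in for the Storm-built aggregates
view = merged_view(batch, realtime, batch_horizon=7200)
```

Keying both views by the same (campaign, hour) bucket is what lets a reader treat historic and realtime data as one series; the horizon marks how far the batch layer has caught up.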
Why Choose Spark?
• Smart use of memory (vs disk-based processing with
Map/Reduce)
• Spark’s API is focused on solving business problems;
the Map/Reduce API forces developers to think much
more about infrastructure
• General aversion to what is now a bloated Hadoop
‘ecosystem’ – much of what you need is built
into Spark
• Spark is written in Scala >> Synergy!
How we use Spark
• Data Aggregation – outputs used for reporting, live
system decision-making and analysis
• Ad-hoc Queries via Spark Shell – quantitative
analysis, issue investigations, sample data
• Machine Learning via MLlib and custom queries
• Spark SQL for those less eager to do all of their work in
Scala
Designing your Stack
• Before Building Out: Think about what you want to do
with your data and how it will get into your system
• Sequence Files vs Text Files
• Automate your deployment/configuration
• Co-locate Spark workers with your raw data
• Ideal World: Separate research cluster from
‘production’
• Choose a cluster manager
    • Standalone/YARN/Mesos
    • We went with Mesos – Berkeley stack preference
Designing your Stack (Cont.)
• How will data get into your system?
    • Largely depends on your requirements
    • Continuous streaming data – Apache Flume, Spark
      Streaming
    • Large batches – Spark jobs or plain old scripts
Things to be aware of
• Spark’s development cycle is fast – do not trust that
the latest release will work for your unique
combination of libraries – TEST!!
• For real-world applications, you need to understand
your storage options and their limitations (such as
the HDFS small-files problem)
• Lazy Evaluation – data does not start moving around
until some type of result/side-effect is specified
• Be conscious about how data is serialized during
transformations
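
The lazy evaluation point can be illustrated without Spark. The sketch below uses Python's built-in lazy `map` and `filter` as a stand-in for Spark transformations, and `sum` as a stand-in for an action; `parse`, `log` and the sample input are hypothetical names introduced only for this example.

```python
log = []  # records which input lines were actually processed


def parse(line):
    """Convert a line to an int, noting that work was done."""
    log.append(line)
    return int(line)


lines = ["1", "2", "3", "4"]

# "Transformations": building the pipeline does no work yet.
parsed = map(parse, lines)
evens = filter(lambda n: n % 2 == 0, parsed)
work_before_action = len(log)  # nothing has been parsed so far

# "Action": asking for a result pulls the data through the pipeline.
total = sum(evens)
work_after_action = len(log)  # every line has now been parsed
```

One difference worth keeping in mind: a Python iterator is exhausted after a single pass, whereas Spark recomputes an uncached dataset from its lineage each time an action runs on it.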
Where we are Going...
• Continue to ramp up quant/analytic usage of Spark
• We have intentionally minimized our focus on the
Hadoop ecosystem (Hive, Pig, Parquet, etc.) and plan
to continue this approach
• Increasingly focusing on the Berkeley stack, planning
to investigate BlinkDB next as a way to derive
probabilistic results from our data quickly
Final Thoughts
• Spark is awesome – but if you do not have actual big
data there are plenty of other solutions for you
• You do not have to be a Scala expert to have a very
positive experience with Spark
• If you are lucky enough to be starting fresh – prefer
Mesos or Spark Standalone over YARN
• Most of today’s Hadoop libraries exist to work around
the problems that Map/Reduce presented – Spark is a
reset on how we work with big data
Thank you for your time!
Questions?

