© 2013 MediaCrossing, Inc. All rights reserved.
Spark’s Role at
MediaCrossing
Gary Malouf
Architect at MediaCrossing
@GaryMalouf
Boston Spark User Group July 15, 2014
About Me
• Functional Programming enthusiast (Formerly a Java
Developer)
• Enjoy building fault-tolerant, highly scalable software
• Continuously looking for ways to make software easier
to reason about
• Leading development of an ad trading system at
MediaCrossing
2
MediaCrossing
• A Market Maker for digital media
• Treat online ads as a financial instrument
• Trade the rights to deliver ad impressions on behalf of
clients, but bear the risk ourselves to get the best
possible price points
• Hundreds of thousands of ad impression
opportunities per second are available to buy or sell –
servicing even a slice of this volume means handling
big data.
3
Our Development Approach
• Functional programming is the ‘default’; we use
mutable state/imperative approaches where it makes
sense.
• We compose microservices together to form a more
‘antifragile’ system
• Scala is a great fit for this, and a language most of us
had already used with success
• Company inception December 2012 – approximately
99% of our code base to date is in Scala
4
System Responsibilities
• Two Major Focuses
• High throughput, low latency trading
• Analytical feedback loop to enrich strategies and alter
behavior based on market conditions
• The ‘feedback loop’ is where much of the secret sauce
is created for the execution platform to act on
• Once we defined an interface between the two, we
could choose the best technologies to address each
system’s needs individually
5
Concerning the Feedback Loop
• Inspired by Nathan Marz’s “Lambda Architecture”
principles, our team leverages a unified view of
realtime and historic user behavior to constantly adjust
our buying and selling models
• Realtime data is aggregated via Storm and stored in
time series within Cassandra
• Historic data is fed into HDFS via Storm -> Flume; we
then use Spark to build the time series aggregates and
write them to Cassandra
6
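The batch half of this loop is essentially keying events by (dimension, time bucket) and reducing. A minimal sketch in plain Scala collections – the event fields, bucket width, and names are hypothetical, and on Spark the `groupBy`/`map` pair below would become `map(...).reduceByKey(_ + _)` on an RDD:

```scala
// Hypothetical impression event: (adId, epoch millis, price in cents)
case class Impression(adId: String, ts: Long, priceCents: Long)

// Bucket events into 1-minute time-series windows per ad, summing spend.
def minuteBuckets(events: Seq[Impression]): Map[(String, Long), Long] =
  events
    .groupBy(e => (e.adId, e.ts / 60000L))            // key: (ad, minute bucket)
    .map { case (key, es) => key -> es.map(_.priceCents).sum }

val sample = Seq(
  Impression("ad-1", 0L,     120),
  Impression("ad-1", 30000L, 80),   // same minute as the first event
  Impression("ad-1", 61000L, 50),   // next minute
  Impression("ad-2", 5000L,  200)
)
val buckets = minuteBuckets(sample)
// buckets(("ad-1", 0L)) == 200
```

Rows in the resulting map line up naturally with a Cassandra time-series layout: partition key (adId), clustering key (bucket), value (spend).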
Why Choose Spark?
• Smart use of memory (vs disk-based processing with
Map/Reduce)
• Spark’s API is focused on solving business problems;
the MapReduce API forces developers to think a lot
more about infrastructure
• General aversion to what is now a bloated Hadoop
‘ecosystem’ – much of what you need is built into
Spark
• Spark is written in Scala >> Synergy!
7
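The API point is easiest to see with word count: in classic MapReduce this is a Mapper class, a Reducer class, and a job driver; in Spark it is a few chained calls. The sketch below uses plain Scala collections, which share the operator names with RDDs almost one-for-one – against an RDD you would start from `sc.textFile(...)` and end with `reduceByKey(_ + _)` instead of `groupBy`:

```scala
val lines = Seq("spark at mediacrossing", "spark on mesos")

// Same shape as sc.textFile(path).flatMap(...).map(...).reduceByKey(_ + _)
val counts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))                 // split lines into words
    .groupBy(identity)                        // group identical words
    .map { case (w, ws) => w -> ws.size }     // count each group

// counts("spark") == 2
```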
How we use Spark
• Data Aggregation – outputs used for reporting, live
system decision-making and analysis
• Ad-hoc Queries via Spark Shell – quantitative
analysis, issue investigations, sample data
• Machine Learning via MLlib and custom queries
• SparkSQL for those less eager to do all of their work in
Scala
8
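An ad-hoc investigation in the Spark shell is usually a one-liner filter-and-sample plus a quick aggregate. The record type and field names here are hypothetical; in the shell, `records` would be an RDD loaded via `sc.textFile`/`sc.sequenceFile`, with `take(n)` pulling a sample back to the driver:

```scala
// Hypothetical bid record for illustration
case class Bid(exchange: String, priceCents: Long, won: Boolean)

val records = Seq(
  Bid("exchA", 120, won = true),
  Bid("exchA", 80,  won = false),
  Bid("exchB", 300, won = true)
)

// "Why is exchange A's win rate off?" – filter, then eyeball a sample
val suspect = records.filter(b => b.exchange == "exchA" && !b.won).take(5)

// Quick aggregate: average winning price per exchange
val avgWin: Map[String, Double] =
  records.filter(_.won)
    .groupBy(_.exchange)
    .map { case (ex, bs) => ex -> bs.map(_.priceCents).sum.toDouble / bs.size }
```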
9
Designing your Stack
• Before Building Out: Think about what you want to do
with your data and how it will get into your system
• Sequence Files vs Text Files
• Automate your deployment/configuration
• Co-locate Spark workers with your raw data
• Ideal World: Separate research cluster from
‘production’
• Choose a cluster manager
• Standalone/Yarn/Mesos
• We went with Mesos – Berkeley stack preference
10
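Pointing Spark at Mesos is mostly configuration. A minimal sketch of the relevant `spark-defaults.conf` entries from the Spark-on-Mesos docs of this era – the ZooKeeper hosts and executor tarball path are placeholders for your own cluster:

```
spark.master           mesos://zk://zk1:2181,zk2:2181/mesos
spark.executor.uri     hdfs:///frameworks/spark/spark-1.0.0.tgz
spark.executor.memory  4g
```

`spark.executor.uri` tells each Mesos slave where to fetch the Spark distribution, which is what makes the automated-deployment bullet above practical.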
Designing your Stack (Cont.)
• How will data get into your system?
• Largely depends on your requirements
• Continuous streaming data - Apache Flume, Spark
Streaming
• Large batches – Spark jobs or plain old scripts
11
Things to be aware of
• Spark’s development cycle is fast – do not trust that
the latest release will work for your unique
combination of libraries – TEST!!
• For real-world applications, you need to understand
your storage options and their limitations (e.g. the
HDFS small-files problem)
• Lazy Evaluation – data does not start moving around
until some type of result/side-effect is specified
• Be conscious about how data is serialized during
transformations
12
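Lazy evaluation is worth internalizing: transformations only describe a computation, and nothing runs until an action. Scala's collection views behave the same way, so this is a faithful miniature – in Spark, `pipeline` would be an RDD built with `map`, and forcing it would be an action like `count` or `saveAsTextFile`:

```scala
var evaluated = 0

// Transformation: nothing is computed yet, just like rdd.map(...)
val pipeline = (1 to 5).view.map { n => evaluated += 1; n * 2 }

assert(evaluated == 0)   // no work has happened yet

// Action: forcing a result runs the whole chain, like rdd.collect()
val total = pipeline.sum
// evaluated is now 5, total is 30
```

This is also why serialization matters: since the work only happens when an action fires, every closure in the chain must ship to the executors at that point.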
Where we are Going…
• Continue to ramp up quant/analytic usage of Spark
• We intentionally have minimized our focus on the
Hadoop ecosystem (Hive, Pig, Parquet, etc) and plan
to continue this approach
• Increasingly focusing on the Berkeley stack, planning
to investigate BlinkDB next as a way to derive
probabilistic results from our data quickly
13
Final Thoughts
• Spark is awesome – but if you do not have actual big
data there are plenty of other solutions for you
• You do not have to be a Scala expert to have a very
positive experience with Spark
• If you are lucky enough to be starting fresh – prefer
Mesos or Spark Standalone over Yarn
• Most of today’s Hadoop libraries exist to work around
the problems that Map/Reduce presented – Spark is a
reset on how we work with big data
14
Thank you for your time!
Questions?
15