2. About Me
• Functional Programming enthusiast (Formerly a Java
Developer)
• Enjoy building fault-tolerant, highly scalable software
• Continuously looking for ways to make software easier
to reason about
• Leading development of an ad trading system at
MediaCrossing
3. MediaCrossing
• A Market Maker for digital media
• Treat online ads as a financial instrument
• Trade the rights to deliver ad impressions on behalf of
clients, but bear the risk ourselves to get the best
possible price points
• Hundreds of thousands of ad impression
opportunities per second are available to buy or sell –
servicing even a slice of this means handling very
large volumes of data.
4. Our Development Approach
• Functional programming is the default; we use
mutable state and imperative approaches where they
make sense.
• We compose microservices together to form a more
‘antifragile’ system
• Scala is a great fit for this, and a language most of us
had already used with success
• Company inception December 2012 – approximately
99% of our code base to date is in Scala
5. System Responsibilities
• Two Major Focuses
• High throughput, low latency trading
• Analytical feedback loop to enrich strategies and alter
behavior based on market conditions
• The ‘feedback loop’ is where much of the secret sauce
is created for the execution platform to act on
• Once we had an interface between our focuses, we
could choose the best technologies possible to
address each system’s needs individually
6. Concerning the Feedback Loop
• Inspired by Nathan Marz’s “Lambda Architecture”
principles, our team leverages a unified view of
realtime and historic user behavior to constantly adjust
our buying and selling models
• Realtime data is aggregated via Storm and stored in
time series within Cassandra
• Historic data is fed into HDFS via Storm -> Flume; we
then use Spark to build the time-series aggregates and
write them to Cassandra
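The batch side of the loop boils down to grouping raw events into time buckets and aggregating each bucket. A minimal sketch of that shape, using hypothetical impression records and plain Scala collections (with a Spark RDD, the pipeline would be `events.map(...).reduceByKey(_ + _)` with the same per-event logic):

```scala
// Sketch: roll raw impression events up into per-minute time-series
// counts, the same shape a Spark batch job would apply to HDFS data
// before writing to Cassandra. Field names here are hypothetical.
case class Impression(adId: String, timestampMs: Long)

// Truncate an epoch-millisecond timestamp to its minute bucket.
def minuteBucket(tsMs: Long): Long = tsMs - (tsMs % 60000L)

val events = Seq(
  Impression("ad-1", 1000L),
  Impression("ad-1", 2000L),
  Impression("ad-2", 61000L)
)

// Key each event by (ad, minute), then count per key.
val counts: Map[(String, Long), Int] =
  events
    .map(e => ((e.adId, minuteBucket(e.timestampMs)), 1))
    .groupBy(_._1)
    .map { case (key, hits) => key -> hits.map(_._2).sum }
```

Each resulting `(adId, minuteBucket) -> count` row maps naturally onto a Cassandra time-series partition.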
7. Why Choose Spark?
• Smart use of memory (vs disk-based processing with
Map/Reduce)
• Spark’s API is focused on solving business problems;
the MapReduce API forces developers to think far
more about infrastructure
• General aversion to what is now a bloated Hadoop
‘ecosystem’ – many of the things you need are built
into Spark
• Spark is written in Scala >> Synergy!
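The "business problem first" claim is easiest to see in the canonical word count. In Spark it is a one-liner: `sc.textFile(path).flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)` – no mapper/reducer classes, no job driver boilerplate. The same pipeline over plain Scala collections is nearly identical, which is exactly the synergy point:

```scala
// Word count over plain Scala collections. The Spark RDD version is
// almost character-for-character the same pipeline, swapping the Seq
// for sc.textFile(...) and the groupBy/sum for reduceByKey(_ + _).
val lines = Seq("spark is fast", "spark is fun")

val wordCounts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))          // tokenize each line
    .map(word => (word, 1))            // pair each word with a count
    .groupBy(_._1)                     // gather pairs by word
    .map { case (word, ones) => word -> ones.map(_._2).sum }
```

Compare this with the several classes of boilerplate the equivalent Hadoop MapReduce job requires.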
8. How we use Spark
• Data Aggregation – outputs used for reporting, live
system decision-making and analysis
• Ad-hoc Queries via Spark Shell – quantitative
analysis, issue investigations, sample data
• Machine Learning via MLlib and custom queries
• SparkSQL for those less eager to do all of their work in
Scala
10. Designing your Stack
• Before Building Out: Think about what you want to do
with your data and how it will get into your system
• Sequence Files vs Text Files
• Automate your deployment/configuration
• Co-locate Spark workers with your raw data
• Ideal world: separate your research cluster from
‘production’
• Choose a cluster manager
• Standalone/Yarn/Mesos
• We went with Mesos – Berkeley stack preference
11. Designing your Stack (Cont.)
• How will data get into your system?
• Largely depends on your requirements
• Continuous streaming data - Apache Flume, Spark
Streaming
• Large batches – Spark jobs or plain old scripts
12. Things to be aware of
• Spark’s development cycle is fast – do not trust that
the latest release will work for your unique
combination of libraries – TEST!!
• For real-world applications, you need to understand your
storage options and their limitations (e.g. the HDFS
small-files problem)
• Lazy Evaluation – data does not start moving around
until some type of result/side-effect is specified
• Be conscious of how data is serialized during
transformations
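The laziness point is worth internalizing: like a Spark RDD transformation, mapping over a Scala view only records *what* to do; no work happens until a result (an "action", in Spark terms) is demanded. A small self-contained illustration using Scala's lazy views as a stand-in for RDDs:

```scala
// Lazy evaluation: transformations build a plan, actions execute it.
// We use a counter as a side effect to observe when work happens.
var evaluations = 0

val pipeline = (1 to 5).view.map { n =>
  evaluations += 1   // only runs when an element is actually demanded
  n * 2
}

val before = evaluations   // still 0: the map has not executed
val result = pipeline.sum  // the "action": forces the whole pipeline
val after  = evaluations   // now 5: every element was evaluated
```

An RDD behaves the same way: `rdd.map(f)` returns instantly, and `f` runs only when an action such as `count`, `collect`, or `saveAsTextFile` fires, which is also why an error in a transformation can surface far from where it was written.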
13. Where we are Going…
• Continue to ramp up quant/analytic usage of Spark
• We intentionally have minimized our focus on the
Hadoop ecosystem (Hive, Pig, Parquet, etc) and plan
to continue this approach
• Increasingly focusing on the Berkeley stack, planning
to investigate BlinkDB next as a way to derive
probabilistic results from our data quickly
14. Final Thoughts
• Spark is awesome – but if you do not have actual big
data there are plenty of other solutions for you
• You do not have to be a Scala expert to have a very
positive experience with Spark
• If you are lucky enough to be starting fresh – prefer
Mesos or Spark Standalone over Yarn
• Most of today’s Hadoop libraries exist to work around
the problems that Map/Reduce presented – Spark is a
reset on how we work with big data