Taboola's experience with Apache Spark (presentation @ Reversim 2014)


At Taboola we receive a constant feed of data (many billions of user events a day) and use Apache Spark together with Cassandra for both real-time stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.

Apache Spark is an open source, Hadoop-compatible computing engine that makes big data analysis drastically faster, through in-memory computing, and simpler to write, through easy APIs in Java, Scala and Python. The project was born as part of PhD work at UC Berkeley's AMPLab (part of BDAS, pronounced "Bad Ass") and turned into an incubating Apache project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already has large production clusters of Spark on YARN.
Spark can run as a standalone cluster, on Apache Mesos with ZooKeeper, or on YARN, and it can run side by side with Hadoop/Hive on the same data.

One of the biggest benefits of Spark is that the API is very simple and the same analytics code can be used for both streaming data and offline data processing.

Published in: Technology

  • One of the most exciting things you’ll find. Growing all the time. NASCAR slide. Including several sponsors of this event that are just starting to get involved… If your logo is not up here, forgive us: it’s hard to keep up!
  • RDD: colloquially referred to as RDDs (e.g. caching in RAM). Lazy operations to build RDDs from other RDDs. Return a result or write it to storage.
  • Let me illustrate this with some bad PowerPoint diagrams and animations. This diagram is LOGICAL,
  • Add “variables” to the “functions” in functional programming
  • NOT a modified version of Hadoop

    1. Taboola's Experience with Apache Spark
    2. Engine Focused on Maximizing CTR & Post Click Engagement
       • Context: Metadata
       • Geo: Region-based Recommendations
       • User Behavior: Cookie Data
       • Social: Facebook/Twitter API
       • Collaborative Filtering: Bucketed Consumption Groups
    3. Largest Content Discovery and Monetization Network
       • 3B Daily recommendations
       • 1M+ sourced content providers
       • 0M monthly unique users
       • 1M+ sourced content items
    4. What Does it Mean?
       • 5 data centers across the globe
       • Terabytes of data per day (many billions of events)
       • Data must be processed and analyzed in real time, for example:
         – Real-time, per-user content recommendations
         – Real-time expenditure reports
         – Automated campaign management
         – Automated calibration of recommendation algorithms
         – Real-time analytics
    5. About Spark
       • Open sourced
       • Apache top-level project (since Feb. 19th)
       • DataBricks: a commercial company that supports it
       • Hadoop-compatible computing engine
       • Can run side by side with Hadoop/Hive on the same data
       • Drastically faster than Hadoop through in-memory computing
       • Multiple H/A options: standalone cluster, Apache Mesos and ZooKeeper, or YARN
    6. Spark Development Community
       • With over 100 developers and 25 companies, one of the most active communities in big data
       • Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
       • Past 6 months: more active devs than Hadoop MapReduce!
    7. The Spark Community
    8. Spark Performance
       [Charts comparing Spark with other engines. Streaming: throughput (MB/s/node), Spark vs. Storm. SQL: response time (s), Shark (mem), Shark (disk), Impala (mem), Impala (disk), Hive. Graph: response time (min), GraphX vs. Giraph vs. Hadoop.]
    9. Spark API
       • Simple to write through easy APIs in Java, Scala and Python
       • The same analytics code can be used for both streaming data and offline data processing
    10. Spark Key Concepts
        Write programs in terms of transformations on distributed datasets
        Resilient Distributed Datasets
        • Collections of objects spread across a cluster, stored in RAM or on disk
        • Built through parallel transformations
        • Automatically rebuilt on failure
        Operations
        • Transformations (e.g. map, filter, groupBy)
        • Actions (e.g. count, collect, save)
    11. Working With RDDs
        [Diagram: RDD → transformations → RDD → action → value]
        textFile = sc.textFile("SomeFile.txt")
        linesWithSpark = textFile.filter(lambda line: "Spark" in line)
        linesWithSpark.count()    # 74
        linesWithSpark.first()    # "# Apache Spark"
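The split between lazy transformations and eager actions on slides 10 and 11 can be modelled without a cluster. The sketch below is a toy stand-in written in plain Python, not Spark's implementation: `LazyDataset`, `from_lines`, `has_spark` and the three sample lines are all hypothetical names and data made up for illustration.

```python
class LazyDataset:
    """Toy model of an RDD: transformations are recorded, actions run them."""

    def __init__(self, make_iter):
        # make_iter is a zero-argument callable returning a fresh iterator.
        self._make_iter = make_iter

    @classmethod
    def from_lines(cls, lines):
        return cls(lambda: iter(lines))

    # Transformations: build a new dataset, do no work yet.
    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._make_iter() if predicate(x)))

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._make_iter()))

    # Actions: run the whole chain and return a plain value.
    def count(self):
        return sum(1 for _ in self._make_iter())

    def first(self):
        return next(self._make_iter())


calls = []  # records every line the predicate actually sees


def has_spark(line):
    calls.append(line)
    return "Spark" in line


text_file = LazyDataset.from_lines(
    ["# Apache Spark", "Spark is fast", "Hadoop MapReduce"]
)
lines_with_spark = text_file.filter(has_spark)
calls_before_action = len(calls)       # the filter has not run yet
matching = lines_with_spark.count()    # action: evaluates the filter
first_line = lines_with_spark.first()  # action: re-evaluates from the source
```

Each action here recomputes the chain from the source, which is the cost that caching (slide 12) is meant to avoid.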
    12. Example: Log Mining
        Load error messages from a log into memory, then interactively search for various patterns
        lines = spark.textFile("hdfs://...")                       # base RDD
        errors = lines.filter(lambda s: s.startswith("ERROR"))     # transformed RDD
        messages = errors.map(lambda s: s.split("\t")[2])
        messages.cache()
        messages.filter(lambda s: "mysql" in s).count()            # action
        messages.filter(lambda s: "php" in s).count()
        . . .
        [Diagram: the driver sends tasks to three workers and receives results; each worker reads one block (Block 1 to 3) and keeps its own cache (Cache 1 to 3)]
        Full-text search of Wikipedia
        • 60GB on 20 EC2 machines
        • 0.5 sec vs. 20 s for on-disk
    13. Task Scheduler
        • General task graphs
        • Automatically pipelines functions
        • Data locality aware
        • Partitioning aware to avoid shuffles
        [Diagram: a task graph over RDDs A through F, split into Stages 1 to 3, with groupBy, map, filter and join operations; the legend marks RDDs and cached partitions]
    14. Software Components
        • Spark runs as a library in your program (1 instance per app)
        • Runs tasks locally or on cluster
          – Mesos, YARN or standalone mode
        • Accesses storage systems via Hadoop InputFormat API
          – Can use HBase, HDFS, S3, …
        [Diagram: your application holds a SparkContext, which uses local threads or a cluster manager; each worker runs a Spark executor; executors read HDFS or other storage]
    15. System Architecture & Data Flow @ Taboola
        [Diagram components: FE Servers, Driver + Consumers, Spark Cluster, MySQL Cluster, C* Cluster]
    16. Execution Graph @ Taboola
        rdd1 = Context.parallelize([data])        # data start point (dates, etc.)
        rdd2 = rdd1.mapPartitions(loadfunc())     # loading data from external sources (Cassandra, MySQL, etc.)
        rdd3 = rdd2.reduce(reduceFunc())          # aggregating the data and storing results
        rdd4 = rdd3.mapPartitions(saverfunc())    # saving the results to a DB
        rdd4.count()                              # executing the above graph by forcing an output operation
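The shape of this graph (seed data, load per partition, aggregate, save per partition, force with an output operation) can be tried locally. What follows is a plain-Python sketch, not the Taboola code: `EVENTS_BY_DAY`, `SAVED`, and every function name are hypothetical stand-ins. It also assumes the aggregation step is keyed, in the style of `reduceByKey`, because in Spark's RDD API a plain `reduce` is an action that returns a value rather than another RDD.

```python
from operator import add

# Hypothetical stand-in for an external source (Cassandra, MySQL, ...).
EVENTS_BY_DAY = {
    "2014-02-19": [("campaign-a", 3), ("campaign-b", 1)],
    "2014-02-20": [("campaign-a", 2)],
}
SAVED = {}  # hypothetical stand-in for the results database


def parallelize(items, num_partitions):
    """Split the start-point data (here: days) into partitions."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def map_partitions(fn, partitions):
    """Lazily apply fn to each partition; nothing runs until consumed."""
    return (fn(part) for part in partitions)


def load_func(days):
    """Load the (key, value) records for the days in one partition."""
    for day in days:
        yield from EVENTS_BY_DAY[day]


def reduce_by_key(fn, partitions):
    """Keyed aggregation; returns a single lazy partition of (key, value)."""
    def run():
        totals = {}
        for part in partitions:
            for key, value in part:
                totals[key] = fn(totals[key], value) if key in totals else value
        yield from sorted(totals.items())
    return [run()]


def saver_func(records):
    """Write each aggregated record to the fake DB, yielding the saved key."""
    for key, value in records:
        SAVED[key] = value
        yield key


def count(partitions):
    """Output operation: consuming every partition forces the graph to run."""
    return sum(1 for part in partitions for _ in part)


rdd1 = parallelize(sorted(EVENTS_BY_DAY), num_partitions=2)
rdd2 = map_partitions(load_func, rdd1)
rdd3 = reduce_by_key(add, rdd2)
rdd4 = map_partitions(saver_func, rdd3)
saved_before = dict(SAVED)  # still empty: nothing has executed yet
rows_saved = count(rdd4)    # the loads, the reduce and the saves all run here
```

The final `count` plays the same role as `rdd4.count()` on the slide: its return value is incidental, and it exists to trigger the side effects in the save step.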
    17. Cassandra as a Distributed Storage
        • Event log files saved as blobs to a dedicated keyspace
        • C* tables holding the event log files are partitioned by day, with a new table per day. This makes maintenance easier and loading into Spark simpler
        • Using Astyanax driver + CQL3
          – Recipe to load all keys of a table very fast (hundreds of thousands / sec)
          – Split by keys and then load data by key in batches, in parallel partitions
        • Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
          – The DataStax InputFormat had issues and at the time was not formally supported
        • Worked well, but ended up not using it, instead using mapPartitions
          – Very simple, no overhead, no need to be tied to Hadoop
          – Will probably use the InputFormat when we deploy a Shark solution
        • Plans to open source all this
        Tables userevent_2014-02-19 and userevent_2014-02-20 share one layout:
        Key (String)                     | Data (blob)
        GUID (originally log file name)  | Gzipped file
        GUID                             | Gzipped file
        …                                | …
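Three pieces of this layout are easy to sketch in isolation: naming the per-day tables, splitting a table's keys into batches for parallel loading, and turning one gzipped blob into text lines. The helpers below are hypothetical illustrations, not the Taboola loader. The `userevent_` prefix comes from the table names on the slide, while the batch size, the sample GUIDs and the UTF-8 encoding are assumptions. No Cassandra access is shown.

```python
import gzip
from datetime import date, timedelta


def daily_tables(first_day, last_day, prefix="userevent_"):
    """Names of the per-day tables covering first_day..last_day inclusive."""
    days = (last_day - first_day).days + 1
    return [
        prefix + (first_day + timedelta(days=i)).isoformat() for i in range(days)
    ]


def key_batches(keys, batch_size):
    """Split a table's keys into fixed-size batches, one unit of work each."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def blob_to_lines(blob):
    """Decode one gzipped event-log blob into text lines (UTF-8 assumed)."""
    return gzip.decompress(blob).decode("utf-8").splitlines()


tables = daily_tables(date(2014, 2, 19), date(2014, 2, 20))
batches = key_batches(["guid-1", "guid-2", "guid-3"], batch_size=2)
lines = blob_to_lines(gzip.compress(b"event one\nevent two\n"))
```

In a mapPartitions-based loader, each batch would presumably become the input of one partition, with the blob decoding done inside the partition function.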
    18. Sample: Click Counting for Campaign Stopping
        1. mapPartitions: mapping from strings to objects with a pre-designed click key
        2. reduceByKey: removing duplicate clicks (see next slide)
        3. map: switch keys to a campaign+day key
        4. reduceByKey: aggregate the data by campaign+day
    19. Campaign Stopping: Removing Dup Clicks
        • When more than one click is found from the same user on the same item, keep only the oldest
        • Using accumulators to track the number of duplicates
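The four steps on slide 18 and the duplicate rule on slide 19 can be traced on a handful of records. This is a local, plain-Python sketch under stated assumptions: the comma-separated line layout, the `Click` fields and the sample data are invented for illustration, `reduce_by_key` is a local stand-in for Spark's `reduceByKey`, and a `Counter` plays the role of a Spark accumulator.

```python
from collections import Counter, namedtuple

Click = namedtuple("Click", "user item campaign day timestamp")

stats = Counter()  # stand-in for a Spark accumulator


def parse(line):
    """Step 1: string -> ((user, item), Click). The CSV layout is assumed."""
    user, item, campaign, day, ts = line.split(",")
    return (user, item), Click(user, item, campaign, day, int(ts))


def reduce_by_key(fn, pairs):
    """Local stand-in for reduceByKey: fold values that share a key."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return list(out.items())


def keep_oldest(a, b):
    """Step 2: of two clicks with the same key, keep the older one."""
    stats["duplicates"] += 1
    return a if a.timestamp <= b.timestamp else b


lines = [
    "u1,item9,camp1,2014-02-19,100",
    "u1,item9,camp1,2014-02-19,90",  # same user and item, but older
    "u2,item9,camp1,2014-02-19,120",
    "u3,item4,camp2,2014-02-19,130",
]

unique = reduce_by_key(keep_oldest, map(parse, lines))
# Step 3: re-key each surviving click by (campaign, day).
by_campaign_day = [((c.campaign, c.day), 1) for _, c in unique]
# Step 4: sum the clicks per (campaign, day).
totals = dict(reduce_by_key(lambda a, b: a + b, by_campaign_day))
kept_timestamp = dict(unique)[("u1", "item9")].timestamp
```

Because `keep_oldest` is called once per merged pair, the counter ends up holding the number of clicks that were discarded as duplicates.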
    20. Our Deployment
        • 7 nodes, each
          – 24 cores
          – 256G RAM
          – 6 1TB SSD disks, JBOD configuration
          – 10G Ethernet
        • Total cluster power
          – 1760GB RAM
          – 168 CPUs
          – 42 TB storage (effective space is less, Cassandra keyspaces defined with replication factor 3)
        • Symmetric deployment
          – Mesos + Spark
          – Cassandra
        • More
          – RabbitMQ on 3 nodes
          – ZooKeeper on 3 nodes
          – MySQL cluster outside this cluster
        • Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
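The cluster totals and the implied throughput follow from the per-node figures, and the arithmetic is short enough to check directly. The sketch below uses only numbers from the slide, with two assumptions: 1 TB is treated as 1000 GB, and the effective-storage figure assumes every byte is stored under replication factor 3 with no other overhead. Note that 7 × 256 gives 1792 GB, slightly above the 1760GB on the slide, and the transcript does not explain the difference.

```python
nodes = 7
cores_per_node = 24
ram_gb_per_node = 256
ssd_tb_per_node = 6 * 1  # six 1 TB SSDs per node
replication_factor = 3

total_cores = nodes * cores_per_node      # matches the slide's 168 CPUs
total_ram_gb = nodes * ram_gb_per_node    # straight product of the node specs
raw_storage_tb = nodes * ssd_tb_per_node  # matches the slide's 42 TB
effective_storage_tb = raw_storage_tb / replication_factor

# "~1 terabyte in ~3 minutes", taking 1 TB as 1000 GB.
data_gb = 1000
elapsed_seconds = 3 * 60
cluster_gb_per_second = data_gb / elapsed_seconds
node_gb_per_second = cluster_gb_per_second / nodes

print(f"{total_cores} cores, {total_ram_gb} GB RAM, {raw_storage_tb} TB raw")
print(f"about {effective_storage_tb:.0f} TB effective at RF {replication_factor}")
print(f"about {cluster_gb_per_second:.1f} GB/s cluster-wide, "
      f"{node_gb_per_second:.2f} GB/s per node")
```

At roughly 0.8 GB/s per node, the load rate sits comfortably under the line rate of 10G Ethernet (1.25 GB/s), which is consistent with the hardware tips later in the deck.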
    21. Things that work well with Spark (from our experience)
        • Very easy to code complex jobs
          – Harder than SQL, but better than other MapReduce options
          – Simple concepts, “small” API
        • Easy to unit test
          – Runs in local mode, so ideal for micro E2E tests
          – Each mapper/reducer can be unit tested without Spark, if you do not use anonymous classes
        • Very resilient
        • Can read/write to/from any data source, including RDBMS, Cassandra, HDFS, local files, etc.
        • Great monitoring
        • Easy to deploy & upgrade
        • Blazing fast
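The point about testing mappers and reducers without Spark carries over to Python: a named function can be imported and asserted on directly, whereas a lambda buried in a job definition cannot (the Python counterpart of the anonymous-class caveat). The sketch below is hypothetical: `parse_event`, `add_clicks` and the tab-separated input format are invented for illustration.

```python
import unittest


def parse_event(line):
    """Mapper: 'campaign<TAB>clicks' -> (campaign, clicks). Format assumed."""
    campaign, clicks = line.split("\t")
    return campaign, int(clicks)


def add_clicks(a, b):
    """Reducer: combine two click counts for the same campaign."""
    return a + b


class MapperReducerTest(unittest.TestCase):
    def test_parse_event(self):
        self.assertEqual(parse_event("camp1\t3"), ("camp1", 3))

    def test_add_clicks(self):
        self.assertEqual(add_clicks(3, 4), 7)


# Run the tests in-process, with no SparkContext involved.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MapperReducerTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same functions can then be passed unchanged to `map` and `reduceByKey` in the real job, so the tested code and the deployed code are identical.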
    22. Things that do not work that well (from our experience)
        • Long (endless) running tasks require some workarounds
          – Temp files: Spark creates a lot of files in spark.local.dir, which requires periodic cleanup
          – Use spark.cleaner.ttl for long running tasks
          – Periodic driver restarts (Spark 0.8.1)
        • Spark Streaming is not fully mature
          – Some edge cases can cause loss of data
          – Sliding window / batch model does not fit our needs
            • We always load some history to deal with late-arriving data
            • State management is left to the user and is not trivial
          – BUT we were able to easily implement a bulletproof, home-grown, near-real-time streaming solution with a minimal amount of code
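The slide does not say how the periodic cleanup of spark.local.dir was done. One possible approach, offered purely as an assumption, is a scheduled script that deletes files not modified for longer than some age limit. The function name, the one-hour limit and the scratch directory in the sketch below are hypothetical, and the limit would need to exceed the lifetime of any file a running job still reads.

```python
import os
import tempfile
import time


def remove_stale_files(root, max_age_seconds, now=None):
    """Delete regular files under root not modified for max_age_seconds."""
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if now - os.path.getmtime(path) > max_age_seconds:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # The file vanished or is locked; skip it and carry on.
                continue
    return removed


# Demonstration on a throwaway directory instead of a real spark.local.dir.
with tempfile.TemporaryDirectory() as scratch:
    old_path = os.path.join(scratch, "old.tmp")
    new_path = os.path.join(scratch, "new.tmp")
    for path in (old_path, new_path):
        open(path, "w").close()
    two_hours_ago = time.time() - 7200
    os.utime(old_path, (two_hours_ago, two_hours_ago))

    removed = remove_stale_files(scratch, max_age_seconds=3600)
    remaining = sorted(os.listdir(scratch))
```

Empty directories are left in place here, which keeps the sketch from racing with a job that is about to write into them.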
    23. General / Optimization Tips
        • Use Spark accumulators to collect and report operational data
        • 10G Ethernet
        • Multiple SSD disks per node, JBOD configuration
        • A lot of memory for the cluster
        • Use leader election for driver H/A
          – In Spark 0.9 this may not be needed, given the new option to run the driver inside the cluster
    24. Technologies Taboola Uses for Spark
        • Spark: computing cluster
        • Mesos: cluster resource manager
          – Better resource allocation (coarse grained) for Spark
        • ZooKeeper: distributed coordination
          – Enables multi-master for Mesos & Spark
        • Curator: leader election
          – Taboola’s Spark driver
        • Cassandra: distributed data store
        • Monitoring:
    25. Attributions
        Many of the general Spark slides were taken from the DataBricks Spark Summit 2013 slides. There are great materials at:
    26. Thank You!