Accelerating Data Science with Better Data Engineering on Databricks


Whether you’re processing IoT data from millions of sensors or building a recommendation engine to provide a more engaging customer experience, the ability to derive actionable insights from massive volumes of diverse data is critical to success. MediaMath, a leading adtech company, relies on Apache Spark to process billions of data points ranging from ads, user cookies, impressions, clicks, and more — translating to several terabytes of data per day. To support the needs of the data science teams, data engineering must build data pipelines for both ETL and feature engineering that are scalable, performant, and reliable.

Join this webinar to learn how MediaMath leverages Databricks to simplify mission-critical data engineering tasks that surface data directly to clients and drive actionable business outcomes. This webinar will cover:

- Transforming TBs of data with RDDs and PySpark responsibly
- Using the JDBC connector to write results to production databases seamlessly
- Comparisons with a similar approach using Hive

Published in: Software

  2. WHAT IS MEDIAMATH?
     • MediaMath is a demand-side media buying platform.
     • We bid on ad opportunities from exchanges like Google and Facebook and serve our clients’ ads in those spots.
     • We leverage any information we can to increase the performance of ads and target more efficiently.
  3. WHAT IS ANALYTICS AT MEDIAMATH?
     • A bunch of wannabe Data Scientists turned Data Engineers
     • Had a ton of data and good ideas but were limited by computational capabilities
     • Learned as we went. Databricks accelerated this journey
  4. WE NEED TO PROCESS TERABYTES OF DATA WRITTEN TO S3 EVERY DAY IN ORDER TO:
     • Build new features for models
     • Build new reporting for clients
     • Set up internal data pipelines
  5. THE VISION
     Aggregate hundreds of TBs from S3 and write results to PostgreSQL to get things like…
  6. THE AUDIENCE INDEX REPORT (AIR)
     Allows advertisers to gain insights into the demographics of users visiting their sites
  8. THE INDEX FOR SITE P
     • A measure of how many users of a certain type we saw vs. how many we expected to see
     • To find the index we need to compute the sizes of 4 groups of users
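
The slide does not spell out the formula or name the four groups. As a rough sketch, assuming the common audience-index definition (share of site visitors in a segment divided by the share of all users in that segment, scaled to 100), the four sizes could be: users on site P in the segment, all users on site P, all users in the segment, and all users. The function and parameter names below are hypothetical.

```python
def audience_index(site_and_segment, site_total, segment_total, all_users):
    """Assumed definition: 100 * (observed share on the site) / (expected share overall).

    observed = site_and_segment / site_total
    expected = segment_total / all_users
    """
    if min(site_total, segment_total, all_users) <= 0:
        raise ValueError("group sizes must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return 100.0 * site_and_segment * all_users / (site_total * segment_total)


# Made-up numbers: 30% of site visitors are in the segment vs 20% of all users.
example_index = audience_index(
    site_and_segment=300, site_total=1_000, segment_total=20_000, all_users=100_000
)
```

Under this assumed definition, a value above 100 means the segment is over-represented among the site's visitors.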
  9. THE RAW DATA IN S3
     • The data is provided by our partners and is made available for our team in S3.
     • It consists of at least one record for each user-segment combination per day.
     • MASSIVE redundancy (all we care about is membership as of a certain time)

     User ID (String)   Segment ID (Integer)   Unix Timestamp (Integer)
     A                  1                      1495129113
     A                  2                      1495129245
     A                  2                      1495129250
     B                  1                      1495129245
  10. This is really all we’re trying to do
  12. WE STARTED BY TRYING HIVE. WE STRUGGLED.
     • There is a lot of data: the segments table has about 3 trillion rows. The pixel table has 16 billion.
     • Naively joining and aggregating with Hive is the worst way to do it
     • Data can be transformed into a manageable format, but one that is awkward to express with SQL
  13. LIFE BEFORE DATABRICKS (HIVE)
     • One row per user/segment joined to one row per user/pixel on userID
     • Ran once a week and took days to complete on a cluster of 65 m4.2xlarge nodes.
     • Had to take care: one M/R job would write more than 1 TB to HDFS
     • The join and then the shuffle after the join were killing us
  14. KEY:VALUE FORMAT (UDB)
     • One record per user, with the user as the key
     • Value is a Python dictionary of segment information (max/min timestamps)
     • Much wider than the original but far fewer records (on the order of 200x fewer)
     • Users expire out of UDB after not being loaded to a segment for 30 days
     • Persisted to S3 as sequence files bucketed by key
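
To make the key:value idea concrete, here is a minimal plain-Python sketch that collapses the sample rows from the raw-data slide into one record per user. The dictionary layout (`min_ts` / `max_ts` per segment) and the function name `build_udb` are assumptions for illustration; the deck only says the value holds max/min timestamps.

```python
from collections import defaultdict

# Sample rows from the raw-data slide: (user_id, segment_id, unix_timestamp).
RAW_ROWS = [
    ("A", 1, 1495129113),
    ("A", 2, 1495129245),
    ("A", 2, 1495129250),
    ("B", 1, 1495129245),
]


def build_udb(rows):
    """Collapse rows into {user: {segment: {"min_ts": ..., "max_ts": ...}}}."""
    udb = defaultdict(dict)
    for user_id, segment_id, ts in rows:
        seen = udb[user_id].get(segment_id)
        if seen is None:
            # First time this user/segment pair appears.
            udb[user_id][segment_id] = {"min_ts": ts, "max_ts": ts}
        else:
            # Redundant row: only widen the timestamp range.
            seen["min_ts"] = min(seen["min_ts"], ts)
            seen["max_ts"] = max(seen["max_ts"], ts)
    return dict(udb)


udb = build_udb(RAW_ROWS)
```

Four input rows become two user records here; the two rows for user A in segment 2 fold into a single timestamp range.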
  15. WHY NOT USE DATAFRAMES?
     • The nesting of the records makes them awkward to deal with using SQL
     • flatMap() and combineByKey() are the main drivers of increased performance
  16. THE MAIN DRAWBACK: MAINTAINING UDB
     • Add new users
     • Expire old users
     • Update existing users
     • Hard, but worth it if we run this multiple times
     • All this logic is conveniently expressed with Python
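
The three maintenance steps (add, update, expire) could look roughly like this in plain Python. This is a sketch under assumptions: the record layout (`min_ts` / `max_ts` per segment) and the names `update_udb`, `daily_rows`, and `now_ts` are hypothetical; only the 30-day expiry comes from the slides.

```python
SECONDS_PER_DAY = 86_400
EXPIRY_DAYS = 30  # from the slides: users expire after 30 days without a load


def update_udb(udb, daily_rows, now_ts):
    """Return a new UDB with the day's rows merged in and stale users dropped."""
    # Copy so the previous day's UDB is left untouched.
    merged = {
        user: {seg: dict(info) for seg, info in segs.items()}
        for user, segs in udb.items()
    }
    for user_id, segment_id, ts in daily_rows:
        segs = merged.setdefault(user_id, {})  # add new users
        info = segs.setdefault(segment_id, {"min_ts": ts, "max_ts": ts})
        info["min_ts"] = min(info["min_ts"], ts)  # update existing users
        info["max_ts"] = max(info["max_ts"], ts)
    cutoff = now_ts - EXPIRY_DAYS * SECONDS_PER_DAY
    # Expire users with no segment load at or after the cutoff.
    return {
        user: segs
        for user, segs in merged.items()
        if any(info["max_ts"] >= cutoff for info in segs.values())
    }


now = 1_500_000_000
old_udb = {
    "A": {1: {"min_ts": now - 40 * SECONDS_PER_DAY, "max_ts": now - 35 * SECONDS_PER_DAY}},
    "B": {1: {"min_ts": now - 10 * SECONDS_PER_DAY, "max_ts": now - 5 * SECONDS_PER_DAY}},
}
new_udb = update_udb(old_udb, [("B", 2, now), ("C", 1, now)], now)
```

In this toy run, user A is expired, user B gains a segment, and user C is added.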
  17. THE JOIN
     • Super easy, as the data is already in pair RDDs.
     • Since the records are wide, this step is not as painful as it used to be
     • Data must be shuffled (this sucks). It’s already clustered by user (the join key) in S3, but Spark doesn’t know that and shuffles anyway
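
For readers new to pair RDDs: a join on two keyed datasets returns records shaped like (key, (left_value, right_value)). The plain-Python sketch below mimics that shape, assuming one record per user on each side as the UDB format implies. The helper `join_by_key` and the sample data are hypothetical and stand in for the Spark call; nothing here runs on a cluster.

```python
def join_by_key(left, right):
    """Inner join of two {key: value} mappings -> [(key, (left_value, right_value))]."""
    shared_keys = sorted(left.keys() & right.keys())
    return [(key, (left[key], right[key])) for key in shared_keys]


# Hypothetical per-user records: segment membership and pixels fired.
segments_by_user = {
    "user1": {"seg1", "segA"},
    "user2": {"seg1", "seg3"},
    "user3": {"seg2"},
}
pixels_by_user = {
    "user1": {"pixelA", "pixelB"},
    "user2": {"pixelA"},
}

joined = join_by_key(segments_by_user, pixels_by_user)
```

Because each side holds one wide record per user, the join produces one output record per user instead of one per segment/pixel pair.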
  19. COUNTING RESULTS IS NOT EASY
     • Each record represents a unique user. Just need to count up how many pixel/segment pairs I see across all records.
     • Initially tried exploding on pixel/segment and converting to a DataFrame
     • Records are so heavily nested that fully exploding causes spill to disk. We were filling up the 100 GB EBS volumes attached to the nodes. Our production cluster uses 65 nodes.
     • Skew was not an issue in this case
  20. EXPLODE, BUT CAREFULLY
     • flatMap: create one row per pixel for each user
         (pixelA, {seg1, segA, …}),  # from user 1
         (pixelB, {seg1, segB, …}),  # from user 1
         (pixelA, {seg1, seg3, …}),  # from user 2
     • combineByKey: keep a running tally of how many times each segment is seen with each pixel. Since data is first combined by pixel on the nodes, the shuffle stage has far less data to deal with
         (pixelA, {seg1:2, seg2:1, seg3:1, …}),
         (pixelB, {seg1:1, seg3:1, …})
     • flatMap again: make the final dataset with one row per pixel/segment combination
         pixelA, seg1, 2
         pixelA, seg2, 1
     • Much better. Now nothing spills to disk, and it takes about half the time.
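
The three steps on this slide can be sketched in plain Python without a cluster. The functions `create_combiner`, `merge_value`, and `merge_combiners` play the roles of the three arguments that `combineByKey` takes in PySpark; `combine_by_key` is a hypothetical local stand-in that combines inside each partition first and only then merges across partitions, which is the property the slide credits for the smaller shuffle. The input record shape and sample data are assumptions.

```python
from collections import Counter


def explode_by_pixel(record):
    """flatMap step: one (pixel, segments) row per pixel for each user."""
    _user, (segments, pixels) = record
    return [(pixel, segments) for pixel in sorted(pixels)]


def create_combiner(segments):
    """Start a tally from the first segment set seen for a pixel."""
    return Counter(segments)


def merge_value(tally, segments):
    """Fold another user's segments into a partition-local tally."""
    tally.update(segments)
    return tally


def merge_combiners(left, right):
    """Merge tallies for the same pixel coming from different partitions."""
    left.update(right)
    return left


def combine_by_key(partitions):
    """Local stand-in for combineByKey: combine per partition, then across."""
    merged = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        # Only the compact per-partition tallies cross the "shuffle".
        for key, tally in local.items():
            merged[key] = merge_combiners(merged[key], tally) if key in merged else tally
    return merged


def to_rows(item):
    """Second flatMap step: one (pixel, segment, count) row per pair."""
    pixel, tally = item
    return [(pixel, segment, count) for segment, count in sorted(tally.items())]


# Two toy partitions of joined records: (user, (segments, pixels)).
partitions = [
    [("user1", ({"seg1", "seg2"}, {"pixelA", "pixelB"}))],
    [("user2", ({"seg1", "seg3"}, {"pixelA"}))],
]
exploded = [[row for rec in part for row in explode_by_pixel(rec)] for part in partitions]
tallies = combine_by_key(exploded)
rows = sorted(row for item in tallies.items() for row in to_rows(item))
```

The design point is that the first explosion is per pixel rather than per pixel/segment pair, so each intermediate row stays wide and the tallies shrink before any data moves between nodes.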
  22.
     • Convert the aggregated RDD to a DataFrame. Filter out garbage records and use the JDBC connector
     • df.write.jdbc(jdbcURL, MyPostgresTable, mode='overwrite')
     • Fair performance (1.6 hours for about 41 million rows), not affected by the presence of an index: the write takes just as long on an unindexed table
     • Write to a staging table, then swap the staging and prod tables so the view that the app uses points to the refreshed data
     • Careful indexing of tables makes selects fast!
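
The slide does not say how the staging/prod swap is done. One possibility, assumed here, is to alternate between two tables with the same schema and repoint the app-facing view at whichever one was just loaded. The sketch below only builds the SQL string; the table and view names (`air_results_a`, `air_results_b`, `air_report`) and both helper functions are hypothetical, and the Spark write itself is shown only as a comment.

```python
def next_table(current, table_a="air_results_a", table_b="air_results_b"):
    """Pick the table NOT currently behind the view; it becomes the staging target."""
    return table_b if current == table_a else table_a


def repoint_view_sql(view_name, new_table):
    """Build SQL that makes the app-facing view read from the freshly loaded table.

    Names are interpolated directly, so pass only trusted, hard-coded identifiers.
    Assumes both tables have identical columns.
    """
    return f"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {new_table};"


# The view currently reads from air_results_a, so the next load goes to air_results_b.
staging = next_table("air_results_a")
# In the Spark job this is where the slide's write would happen, e.g.
#   df.write.jdbc(jdbcURL, staging, mode='overwrite')
statement = repoint_view_sql("air_report", staging)
```

With this approach the app keeps reading the old table while the 1.6-hour write runs, and only sees the new data once the view is repointed.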
  23. PUTTING IT ALL TOGETHER
     • Wrap logic, execution and monitoring into objects/functions in a notebook
     • Run the notebook as a job
     • Schedule that job right from the Databricks UI.
     • Reporting, monitoring and some retry logic come for free!
  24. CONCLUSION: HIVE VS SPARK
     • Cannot compare directly because the implementations are different
     • The Spark implementation performs FAR better: run time of approximately 11 hours vs. two days on similar hardware. About 1/4 of the time and price!
     • Development time is the real win. It’s far faster and easier to develop new pipelines
  25. OUR CURRENT WORKFLOW
     Now we have a very versatile tool that allows us to monetize the data in S3
  26. LIFE WITH DATABRICKS
     • If you are familiar with Python, then Databricks and PySpark unlock a huge range of capabilities.
     • We are much more productive.
     • Our jobs are fun again! PySpark RDD APIs make it easy to work with big data in an accessible way.