Real-Time Forecasting at Scale using Delta Lake and Delta Caching

GumGum receives around 30 billion programmatic inventory impressions, amounting to 25 TB of data, each day. An inventory impression is the real estate available to show potential ads on a publisher page. By generating a near-real-time inventory forecast based on campaign-specific targeting rules, GumGum enables account managers to set up successful future campaigns.

  1. REAL-TIME FORECASTING AT SCALE USING DELTA LAKE AND DELTA CACHING. Jatinder Assi, Data Engineering Manager; Rashmina Menon, Senior Data Engineer
  2. Introducing GumGum AI Company
  3. Our Divisions
      ● Advertising: We leverage our computer vision and NLP technology to detect unsafe text and imagery, allowing us to deliver ads in brand safe, contextually relevant environments.
      ● Sports Valuation: We help marketers and rights holders capture the full media value of sponsorships across broadcast TV, streaming and social media.
  4. Agenda: 01 Programmatic Inventory Intro; 02 Architecture; 03 Data Sampling; 04 Search and Forecast Application; 05 Data Caching; 06 Forecast Accuracy; 07 Q & A
  5. Programmatic Inventory Intro
  6. Advertising Inventory: Real estate to show potential ads on a publisher page
  7. Programmatic Advertising Ecosystem: Technology ecosystem to automatically buy and sell targeted online advertising in real time
      ● Advertisers: looking for a cost-effective way to place media buys
      ● DSPs: bid on the inventory impression
      ● Ad exchanges: where the auction happens
      ● Publishers: make inventory available and display ads
  8. Why Forecast Inventory?
      ● Our sellers are trying to set up an ad campaign with certain targeting rules and would like to know whether the GumGum publisher network has enough inventory to fulfill it.
      ● Faster forecasting response time allows our sellers to iterate, propose and sell ad campaigns faster.
      Here are some scenarios:
      ● Forecast the inventory available in the US in the cities of Los Angeles and San Diego from premium websites for the next 30 days for bidder XYZ and ad product 1
      ● Forecast the inventory available in the US and Canada for pages related to Sports and Entertainment targeting males aged 25 to 40 for the next 30 days for bidder ABC and ad product 2
  9. Inventory Forecast At Scale: Set up future campaigns for success by generating a real-time inventory forecast subject to campaign-specific restrictions
      ● Programmatic inventory per day: 30B+
      ● Compressed size per day: 25 TB+
      ● Avg. forecast response time: < 30 secs
  10. Architecture
  11. Architecture
      ● Daily data pipeline: 25+ TB/day in, 1.5 GB/day out in Delta format
      ● Search: samples for the past 365 days, ready to forecast
      ● Real-time forecast (< 30 secs avg. response time)
  12. Data Sampling
  13. Why Sampling?
      ● Processing all of the inventory data for the past 365 days (~9 PB, i.e. 25 TB/day × 365 days) is a waste of compute resources
      ● It is hard to attain a forecasting response time of < 30 secs even with the most optimized forecasting model
      Instead, we can pre-process the impressions per day using a distributed sampling algorithm to capture the most relevant subset of the inventory population
  14. Sampling Approach
      ● Query on base data -> exact result
      ● Query on sample data -> estimator* -> approximate result
      *The estimator (scale-up factor) lets us relate results about the sample to the original dataset
  15. Types Of Sampling
      ● Uniform sampling: there is an equal probability of selecting any particular item. Problem: biased toward commonly occurring items (not great for frequency cap).
      ● Distinct item sampling: sample uniformly from the distinct items to support frequency cap per user
      ● AMIND(M): augmented min-hash distinct item sampling with <= M distinct items
      Frequency cap: One of the key factors in how GumGum serves ads, where we don’t show ads to the same user too frequently (for example, at most once an hour or once a day)
  16. AMIND Sampling Daily Job: AMIND(M) is parallelized per hour, and the separate hourly AMIND(M) samples are combined
      Diagram: hourly data h00, h01, ..., h23 -> hourly samples h00*, h01*, ..., h23* -> pre-daily -> daily sample and daily multiplier
      ● Create samples for all hours in parallel with M distinct hash values of user IP per hour
      ● Combine samples, grouping by hash values of user IP
      ● Sort and take items with the M smallest hash values to generate the daily sample and multiplier (scale-up factor)
  17. Search And Forecast Application
  18. Search And Forecast: Real-time forecast (<30 secs avg. response time)
      Diagram labels: API (invoke job, get results); SEARCH APPLICATION; FORECASTING APPLICATION
  19. Search Application: Reads past 365 days of data based on user filters and builds time series
      ● Inputs: user inputs; sample data (365 days) and multiplier data
      ● # impressions = # sampled impressions * multiplier
      ● Chart: impressions per day, plotted by day
  20. Data Caching
  21. To Cache Or Not To Cache?
      ● NO CACHING (read directly from S3): minimal cost; slow, cannot guarantee the SLA
      ● SPARK IN MEMORY CACHE (StorageLevel.MEMORY_ONLY): fast; expensive; requires daily refresh
      ● SPARK DISK CACHE (StorageLevel.DISK_ONLY): not very efficient; utilize memory for compute; requires daily refresh
      ● DELTA LAKE WITH DELTA CACHE
  22. Delta Lake: Open source storage layer that brings ACID transactions to Apache Spark
      Diagram: Streaming and Batch -> Ingestion Table (Bronze) -> Refined Table (Silver) -> Feature/Agg Data Store (Gold) -> Analytics & Machine Learning, on your existing data lake
  23. Delta Lake (Contd.)
      ● Instead of parquet: dataframe.write.format("parquet").save("/data")
      ● use delta: dataframe.write.format("delta").save("/data")
  24. Delta Cache: Accelerated data reads by creating copies of remote files in nodes’ local storage using a fast intermediate format
  25. Create Delta Table; Cache Delta Table; Refresh cache associated with the Table
  26. Optimize the data layout or run compaction on Delta Lake; delete compacted files (default retention = 7 days); collect statistics about the table; ZORDER to colocate the same information in the same set of files
  27. To Cache Or Not To Cache?
      ● NO CACHING (read directly from S3): minimal cost; slow, cannot guarantee the SLA of 30 secs
      ● SPARK IN MEMORY CACHE (StorageLevel.MEMORY_ONLY): fast; expensive; requires daily refresh
      ● SPARK DISK CACHE (StorageLevel.DISK_ONLY): not very performant; utilize memory for compute; requires daily refresh
      ● DELTA LAKE WITH DELTA CACHE: utilize memory for compute; least expensive; warm-up queries take longer
  28. Search Application (Cont.)
      Diagram labels: Caching on Cluster + Delta Cache; Format; Storage; Application
  29. Search Application (Cont.): Reads past 365 days of data based on user filters and builds time series
      ● READ: Read data from Delta Cache based on partition keys matching user inputs. Partition keys: Date, Product, Bidder
      ● TRANSFORM: Further filter the data on the remaining fields based on user inputs. Examples: Country, BrowserType, 3rd party segments, Page categories
      ● AGGREGATE: If a frequency cap is present, cap the number of impressions per user at the frequency cap. Aggregate data by day to build the time series for impressions. Multiply by the multiplier to get the projected impressions
  30. Forecasting Application: Forecasts the time series for the next n days based on the time series trends in the past 365 days
  31. Forecasting Application (Cont.): Time series forecasting in R
      ● ARIMA (AutoRegressive Integrated Moving Average): popular time series forecasting model (non-seasonal) used to describe the autocorrelations in the data
      ● Problem: How do we generate different ARIMA models for different user inputs, which result in different time series?
      ● Solution: AUTO ARIMA in R finds the best ARIMA model for a given time series. Hence the model generated for time series X will be different from the model generated for time series Y
  32. Forecasting Application (Cont.)
      ● Forecast = General trend (Auto ARIMA) + Weekly Trend (sinusoids) + Quarterly Trend (sinusoids)
      ● Average forecast execution time < 2 seconds, runs on driver
  33. Forecast Result: chart of actual vs. forecast
  34. Measure Forecasting Accuracy: Compute mean absolute percentage error for predefined forecasting requests
  35. Q&A
  36. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
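
The hourly-then-daily AMIND(M) job described on slide 16 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not GumGum's implementation: the SHA-256 based `user_hash`, the dictionary layout of an impression, and the definition of the multiplier as total impressions divided by sampled impressions are illustrative choices that the slides do not specify.

```python
import hashlib
from collections import defaultdict


def user_hash(user_ip):
    """Stable 64-bit hash of the user key (the hash function is an assumption)."""
    digest = hashlib.sha256(user_ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hourly_sample(impressions, m):
    """Keep all impressions of the m users with the smallest hash values."""
    by_hash = defaultdict(list)
    for imp in impressions:
        by_hash[user_hash(imp["user_ip"])].append(imp)
    smallest = sorted(by_hash)[:m]
    return {h: by_hash[h] for h in smallest}


def daily_sample(hourly_samples, total_impressions, m):
    """Combine hourly samples and keep the m smallest hash values of the day.

    The multiplier (scale-up factor) is assumed here to be
    total impressions / sampled impressions.
    """
    merged = defaultdict(list)
    for sample in hourly_samples:
        for h, imps in sample.items():
            merged[h].extend(imps)  # group by hash value of the user key
    smallest = sorted(merged)[:m]
    sample = {h: merged[h] for h in smallest}
    sampled = sum(len(imps) for imps in sample.values())
    multiplier = total_impressions / sampled if sampled else 0.0
    return sample, multiplier
```

Because the users seen in any one hour are a subset of the users seen that day, every impression of the day's M smallest-hash users survives the hourly step, so combining hourly samples yields the same daily sample as sampling the whole day at once.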
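
The READ, TRANSFORM and AGGREGATE steps of the search application (slides 19 and 29) can be illustrated with a small in-memory sketch. The row layout, the function name `build_time_series`, and the choice to apply the frequency cap per user per day are assumptions made for illustration; the real application reads from Delta Cache with Spark rather than from a Python list.

```python
from collections import defaultdict


def build_time_series(rows, multipliers, partition_filter, field_filter,
                      frequency_cap=None):
    """Build projected impressions per day from sampled impression rows.

    rows: one dict per sampled impression (hypothetical layout).
    multipliers: day -> scale-up factor produced by the sampling job.
    partition_filter: partition key -> required value (e.g. product, bidder).
    field_filter: field -> set of allowed values (e.g. country).
    """
    per_user = defaultdict(int)  # (day, user) -> sampled impressions
    for row in rows:
        # READ: keep only rows whose partition keys match the user inputs.
        if any(row[key] != value for key, value in partition_filter.items()):
            continue
        # TRANSFORM: filter further on the remaining fields.
        if any(row[key] not in allowed for key, allowed in field_filter.items()):
            continue
        per_user[(row["date"], row["user"])] += 1

    # AGGREGATE: cap per user (assumed per day), then sum by day.
    per_day = defaultdict(int)
    for (day, _user), count in per_user.items():
        if frequency_cap is not None:
            count = min(count, frequency_cap)
        per_day[day] += count

    # impressions = sampled impressions * multiplier
    return {day: per_day[day] * multipliers[day] for day in sorted(per_day)}
```

The resulting day-to-impressions mapping is the time series that the forecasting application would take as input.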
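
The decomposition on slide 32 (general trend plus weekly and quarterly sinusoids) can be illustrated with a harmonic regression in plain Python. Note the substitution: the talk fits the general trend with Auto ARIMA in R, whereas this sketch uses a simple linear trend as a stand-in. The 7-day and 91-day periods, the single sine/cosine pair per period, and all function names are assumptions made for illustration.

```python
import math

WEEKLY_PERIOD = 7.0
QUARTERLY_PERIOD = 91.0  # assumed length of a quarter in days


def features(t):
    """Regressors for day t: intercept, linear trend, weekly and quarterly sinusoids."""
    row = [1.0, float(t)]
    for period in (WEEKLY_PERIOD, QUARTERLY_PERIOD):
        angle = 2.0 * math.pi * t / period
        row += [math.sin(angle), math.cos(angle)]
    return row


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(series):
    """Least-squares coefficients via the normal equations."""
    rows = [features(t) for t in range(len(series))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, series)) for i in range(k)]
    return solve(xtx, xty)


def forecast(series, horizon):
    """Forecast the next `horizon` days as trend + weekly + quarterly terms."""
    coef = fit(series)
    start = len(series)
    return [
        sum(c * f for c, f in zip(coef, features(t)))
        for t in range(start, start + horizon)
    ]
```

Calling `forecast(history, 30)` on a daily impression series returns 30 projected values. A production version would replace the linear term with a fitted ARIMA model, as the slides describe.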
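
The accuracy measure named on slide 34, mean absolute percentage error (MAPE), can be computed as below. This is a minimal sketch: the function name and the decision to skip days whose actual value is zero (where the percentage error is undefined) are assumptions, not details taken from the talk.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Days with an actual value of zero are skipped (an assumption), since the
    percentage error is undefined there.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to compare against")
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)
```

Running this over a fixed set of forecasting requests, comparing each forecast with the impressions later observed, gives one number per request that can be tracked over time.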
