Designing Data Pipelines Using Hadoop

This presentation covers the design principles and techniques used to build data pipelines, taking into account architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussion is set in the context of managing a pipeline with multi-petabyte data sets and a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs. We'll talk about how to make sure that data pipelines have the following features: 1) assurance that the input data is ready at each step; 2) workflows that are easy to maintain; 3) data quality and validation built into the architecture. Part of the presentation shows how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) Raw Input (logs, messages, etc.); 2) Logical Input (scrubbed data); 3) Foundational Warehouse Data (most relevant joins); 4) Departmental/Project Data Sets; and 5) Report Data Sets (used by traditional report engines). The final part discusses the design of a rule-based system to perform validation and trend reporting.
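
The rule-based validation mentioned in the description is not spelled out on this page, so the following is only a minimal Python sketch of the general idea, not the presenters' system. It assumes each job emits a small profile of counters; the profile keys (`input_records`, `joined_records`, `output_records`), the rule names, and the 0.95 threshold are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# A profile is a flat set of counters collected for one data set (hypothetical keys).
Profile = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """A named predicate over a data set profile."""
    name: str
    check: Callable[[Profile], bool]


def join_success_ratio(profile: Profile) -> float:
    """Fraction of input records that joined; 0.0 when there was no input."""
    inputs = profile["input_records"]
    return profile["joined_records"] / inputs if inputs else 0.0


# Illustrative rules only; the threshold is an assumption, not a recommendation.
RULES: Sequence[Rule] = (
    Rule("output_not_empty", lambda p: p["output_records"] > 0),
    Rule("join_success_ratio_min_0.95", lambda p: join_success_ratio(p) >= 0.95),
)


def failed_rules(profile: Profile, rules: Sequence[Rule] = RULES) -> List[str]:
    """Return the names of all rules the profile violates, in rule order."""
    return [rule.name for rule in rules if not rule.check(profile)]


if __name__ == "__main__":
    degraded = {"input_records": 1000, "joined_records": 900, "output_records": 900}
    print(failed_rules(degraded))
```

Keeping rules as data rather than as code scattered through jobs means new checks can be added without touching the jobs themselves, and the list of failed rule names can feed an alarm or a trend report.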

Published in: Technology, Business

  1. Rocket Fuel: Big Data and Artificial Intelligence for Digital Advertising. Abhijit Pol, Marilson Campos. Designing Data Pipelines, July 2013.
  2. What We Do? Diagram labels: Data Partners*; Optimize; Bid Request; Rocket Fuel; Winning Ad; Ad Request; Ad Served to User; Page Request; Bid & Ad; Web Browser; Rocket Fuel Platform; Real-time Bidder; Automated Decisions; Response Prediction Model; Publishers; User Engagement Recorded; User Engages with Ad; Refresh learning; Campaign & User Data Warehouse; Qualify Audience; Some Exchange Partners; Ad Exchange; Ads & Budget.
  3. How Big Is This Problem Each Day? Trades on NASDAQ; Facebook Page Views; Searches on Google; Bid Requests Considered by Rocket Fuel.
  4. How Big Is This Problem Each Day? Trades on NASDAQ; Facebook Page Views; Searches on Google; Bid Requests Considered by Rocket Fuel. Values shown: ~5 billion; 10 million; 30 billion; ~20 billion.
  5. BIG DATA + AI
  6. Advertising That Learns
  7. Outline • Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
  8. Architecture for Growth • 20 GB/month to 2 PB/month in 3 years • New and complex requirements • More consumers • Rapid growth
  9. How We Started?
  10. Architecture 2.0
  11. Current Architecture
  12. Outline • Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
  13. Hurdles and Challenges Faced • Exponential data growth and user queries • Network issues • Bots • Bad user queries
  14. Outline • Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
  15. Data Pipeline Design Best Practices: Job Design / Consistency; Job Features / Avoid Re-work; Golden Input / Shadow Cluster; Data Collection / Dashboard.
  16. Job Design / Consistency • Idempotent • Execution by different users • Account for Execution Time
  17. Job Execution Timeline
  18. Job Features / Re-Work • Smaller Jobs • Record completion of steps
  19. Recording completion times. Flowchart labels: Start; Step of workflow, job or script; Is mark already there? (Yes / No); Execute work for the step; Create the mark; Collect other data (Optional); End.
  20. Golden Input / Shadow Cluster • Integration tests on realistic data sets. • Safe environment to innovate.
  21. Data Collection - Delivery time view. Diagram labels: Data product; Workflow; Job; Hive/Pig; SSH; Script; Hive; Pig.
  22. Data Collection - Data profiles view. Diagram labels: Data product; Data set; Transformation; Record Size & Type; Job Counts; Join success ratios; Data Set Consistency.
  23. Data Collection Hierarchy. Diagram labels: wk_external_events; wk_build_profile; user_profile; extract_fields; consolidate_metrics; load_into_data_centers; extract_features; compact_user_profile. Legend: Workflow/Job/Script Step; Data Product.
  24. Golden Input / Shadow Cluster • Integration tests on realistic data sets. • Safe environment to innovate.
  25. Dashboard • Delivery Time • Data Profile Ratios • Counters • Alarms
  26. Thank you. www.rocketfuel.com
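
The "Recording completion times" flowchart (slide 19) together with the "Idempotent" and "Record completion of steps" bullets suggests a simple pattern: check for a completion mark, skip the step if it exists, otherwise do the work and then create the mark. The Python sketch below is one possible reading of that flowchart, not the presenters' code. It uses an empty file on the local filesystem as the mark, which is an assumption; `run_step`, the mark path, and the example step are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def run_step(mark: Path, work: Callable[[], None]) -> bool:
    """Run `work` only if `mark` does not exist yet.

    Returns True if the work was executed, False if it was skipped.
    The mark is created only after `work` returns, so a step that
    raises leaves no mark behind and is retried on the next run.
    """
    if mark.exists():
        return False  # "Is mark already there?" -> Yes: nothing to do
    work()  # "Execute work for the step"
    mark.parent.mkdir(parents=True, exist_ok=True)
    mark.touch()  # "Create the mark"
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical mark location for a step named after slide 23's extract_fields.
        mark = Path(tmp) / "marks" / "extract_fields.done"
        print(run_step(mark, lambda: print("extracting fields")))  # work runs
        print(run_step(mark, lambda: print("extracting fields")))  # work is skipped
```

Because rerunning a workflow skips every step that already has a mark, a failed run can be restarted from the top without repeating completed work, which is what makes smaller steps cheap to retry.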