Designing Data Pipelines Using Hadoop

This presentation covers the design principles and techniques used to build data pipelines, taking into consideration architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussion is set in the context of managing a pipeline with multi-petabyte data sets and a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs. We'll talk about how to make sure that data pipelines have the following properties: 1) assurance that the input data is ready at each step, 2) workflows that are easy to maintain, and 3) data quality and validation built into the architecture.

Part of the presentation is dedicated to showing how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) Raw Input (logs, messages, etc.), 2) Logical Input (scrubbed data), 3) Foundational Warehouse Data (the most relevant joins), 4) Departmental/Project Data Sets, and 5) Report Data Sets (used by traditional report engines). The final part discusses the design of a rule-based system to perform validation and trend reporting.
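As a concrete illustration of the layered warehouse described above, here is a minimal sketch that encodes the five layers as a directory-path convention. The /warehouse prefix, the directory names, and the Java enum itself are assumptions for illustration, not taken from the talk.

    /** Hypothetical path convention for the five warehouse layers. */
    public enum WarehouseLayer {
        RAW_INPUT("/warehouse/raw"),             // raw input: logs, messages, etc.
        LOGICAL_INPUT("/warehouse/logical"),     // scrubbed data
        FOUNDATIONAL("/warehouse/foundational"), // most relevant joins
        DEPARTMENTAL("/warehouse/departmental"), // departmental/project data sets
        REPORT("/warehouse/reports");            // used by traditional report engines

        private final String root;

        WarehouseLayer(String root) { this.root = root; }

        /** Root directory under which data sets of this layer are stored. */
        public String root() { return root; }
    }

Keeping each layer under its own root makes it easy to apply layer-wide policies such as retention, permissions, and validation rules.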

1. Designing Data Pipelines. Rocket Fuel: Big Data and Artificial Intelligence for Digital Advertising. Abhijit Pol, Marilson Campos. July 2013.
2. What We Do? [Diagram of the Rocket Fuel platform: a page request from the web browser leads the publisher to send an ad request to an ad exchange, which sends Rocket Fuel a bid request; the real-time bidder, backed by a response prediction model and the campaign & user data warehouse, makes an automated decision and returns a bid & ad; the winning ad is served to the user, user engagement with the ad is recorded, and the learning is refreshed. Data partners help qualify the audience, ads & budget feed the optimizer, and logos of some exchange partners are shown.]
3. How Big Is This Problem Each Day? • Trades on NASDAQ • Facebook Page Views • Searches on Google • Bid Requests Considered by Rocket Fuel
4. How Big Is This Problem Each Day? • Trades on NASDAQ: 10 million • Facebook Page Views: 30 billion • Searches on Google: ~5 billion • Bid Requests Considered by Rocket Fuel: ~20 billion
5. BIG DATA + AI
6. Advertising That Learns
7. Outline • Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
8. Architecture for Growth • 20 GB/month to 2 PB/month in 3 years • New and complex requirements • More consumers • Rapid growth
9. How We Started
10. Architecture 2.0
11. Current Architecture
12. Outline • Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
13. Hurdles and Challenges Faced • Exponential data growth and user queries • Network issues • Bots • Bad user queries
14. Outline • Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
15. Data Pipeline Design Best Practices • Job Design / Consistency • Job Features / Avoid Re-work • Golden Input / Shadow Cluster • Data Collection • Dashboard
16. Job Design / Consistency • Idempotent • Execution by different users • Account for Execution Time
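Idempotence, the first bullet above, is commonly achieved by staging output in a temporary directory and publishing it with a single atomic rename, so a re-run either finds the finished output already present or starts over cleanly. Below is a minimal sketch against the Hadoop FileSystem API; the class, method, and path names are hypothetical, and the talk does not prescribe this particular mechanism.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IdempotentPublish {

        /** Publishes a job's staged output atomically; safe to call again on re-runs. */
        public static void publish(Configuration conf, Path tmpDir, Path finalDir)
                throws IOException {
            FileSystem fs = finalDir.getFileSystem(conf);
            if (fs.exists(finalDir)) {
                return;                              // already published by an earlier run
            }
            if (!fs.rename(tmpDir, finalDir)) {      // single atomic publish step
                throw new IOException("Failed to publish " + tmpDir + " to " + finalDir);
            }
        }
    }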
17. Job Execution Timeline
18. Job Features / Re-Work • Smaller Jobs • Record completion of steps
19. Recording completion times [Flowchart for each step of a workflow, job, or script: start; is the mark already there? If yes, end; if no, execute the work for the step, create the mark, optionally collect other data, then end.]
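The flowchart above translates almost directly into code. Here is a minimal sketch of the mark check using the Hadoop FileSystem API; the _DONE file name and the StepRunner class are assumptions for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StepRunner {

        /** Runs a step only if its completion mark is absent, then records the mark. */
        public static void runIfNotDone(Configuration conf, Path stepOutput, Runnable work)
                throws IOException {
            FileSystem fs = stepOutput.getFileSystem(conf);
            Path mark = new Path(stepOutput, "_DONE"); // hypothetical mark file name

            if (fs.exists(mark)) {
                return;                                // mark already there: skip re-work
            }
            work.run();                                // execute the work for the step
            fs.create(mark).close();                   // create the mark (empty file)
            // Optionally collect other data here (timings, counters, ...)
        }
    }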
20. Golden Input / Shadow Cluster • Integration tests on realistic data sets • Safe environment to innovate
21. Data Collection: Delivery-Time View [Diagram: a tree rooted at a data product, fanning out to its workflows and then to the jobs inside them (map/reduce, Hive, Pig, SSH scripts); delivery times are collected at each node of the tree.]
22. Data Collection: Data Profiles View [Diagram: a data product is built from data sets via transformations; per data set, the profile tracks record size & type, job counts, join success ratios, and data set consistency.]
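Profile metrics such as record counts and join success ratios can be collected with standard Hadoop job counters while the join itself runs, with no extra pass over the data. A sketch of a reduce-side join instrumented this way follows; the counter group, counter names, and the "L|"/"R|" tagging convention are hypothetical, not taken from the deck.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class JoinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            boolean leftSeen = false, rightSeen = false;
            for (Text v : values) {
                if (v.toString().startsWith("L|")) leftSeen = true;  // tag added map-side
                else rightSeen = true;                               // "R|" side
                // ... buffer and emit joined records here ...
            }
            context.getCounter("data_profile", "keys_total").increment(1);
            if (leftSeen && rightSeen) {
                context.getCounter("data_profile", "keys_joined").increment(1);
            }
            // Join success ratio = keys_joined / keys_total, read from the job
            // counters after completion and published to the dashboard.
        }
    }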
23. Data Collection Hierarchy [Diagram: the hierarchy runs data product, then workflow/job/script, then step. For example, the user_profile data product is built by workflows such as wk_external_events and wk_build_profile, with steps like extract_fields, consolidate_metrics, load_into_data_centers, extract_features, and compact_user_profile.]
24. Golden Input / Shadow Cluster • Integration tests on realistic data sets • Safe environment to innovate
25. Dashboard • Delivery Time • Data Profile Ratios • Counters • Alarms
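The alarms on this slide, together with the rule-based validation system mentioned in the abstract, come down to comparing observed profile ratios against expected ranges. A minimal sketch follows, with hypothetical thresholds and a stderr message standing in for a real notification hook.

    public class RatioAlarm {

        /** Fires an alert when an observed ratio falls outside [min, max]. */
        public static boolean check(String metric, double observed, double min, double max) {
            if (observed < min || observed > max) {
                System.err.printf("ALARM %s: %.4f outside [%.4f, %.4f]%n",
                                  metric, observed, min, max);
                return false;
            }
            return true;
        }

        public static void main(String[] args) {
            // e.g. a join success ratio expected to stay above 95%
            check("join_success_ratio", 0.91, 0.95, 1.0);
        }
    }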
26. Thank you. www.rocketfuel.com
