Designing Data Pipelines Using Hadoop
 


This presentation will cover the design principles and techniques used to build data pipelines, taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussion will be based on the experience of managing a pipeline with multi-petabyte data sets and a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs.

We'll talk about how to make sure that data pipelines have the following features: 1) assurance that the input data is ready at each step, 2) workflows that are easy to maintain, and 3) data quality and validation built into the architecture.

Part of the presentation will be dedicated to showing how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) raw input (logs, messages, etc.), 2) logical input (scrubbed data), 3) foundational warehouse data (the most relevant joins), 4) departmental/project data sets, and 5) report data sets (used by traditional report engines).

The final part will discuss the design of a rule-based system to perform validation and trending reporting.
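As a rough illustration of the rule-based validation and trending idea mentioned above, here is a minimal Java sketch (not code from the presentation; all class and method names are hypothetical) that compares a data set's record count against the previous run and raises an alert when the ratio drifts outside a configured band:

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stats a pipeline step might publish for a data set. */
class DataSetStats {
    final String dataSet;
    final long recordCount;
    DataSetStats(String dataSet, long recordCount) {
        this.dataSet = dataSet;
        this.recordCount = recordCount;
    }
}

/** A validation rule returns a problem description, or null when it passes. */
interface ValidationRule {
    String check(DataSetStats current, DataSetStats previous);
}

/** Trending rule: alert when the record-count ratio between runs drifts too far. */
class RecordCountTrendRule implements ValidationRule {
    private final double minRatio;
    private final double maxRatio;
    RecordCountTrendRule(double minRatio, double maxRatio) {
        this.minRatio = minRatio;
        this.maxRatio = maxRatio;
    }
    public String check(DataSetStats current, DataSetStats previous) {
        if (previous == null || previous.recordCount == 0) {
            return null; // nothing to trend against yet
        }
        double ratio = (double) current.recordCount / previous.recordCount;
        if (ratio < minRatio || ratio > maxRatio) {
            return current.dataSet + ": record count ratio " + ratio
                    + " outside [" + minRatio + ", " + maxRatio + "]";
        }
        return null;
    }
}

/** Runs every rule and collects the alerts to surface on a dashboard or in email. */
class ValidationRunner {
    static List<String> run(List<ValidationRule> rules,
                            DataSetStats current, DataSetStats previous) {
        List<String> alerts = new ArrayList<String>();
        for (ValidationRule rule : rules) {
            String problem = rule.check(current, previous);
            if (problem != null) {
                alerts.add(problem);
            }
        }
        return alerts;
    }
}
```

In practice the stats could come from Hadoop job counters or Hive queries, and the resulting alerts could feed the dashboard and alarms discussed at the end of the deck.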


    Presentation Transcript

    • Rocket Fuel: Big Data and Artificial Intelligence for Digital Advertising. Designing Data Pipelines, Abhijit Pol and Marilson Campos, July 2013
    • What We Do? [Diagram of the real-time bidding flow] A page request in the user's web browser leads to an ad request and a bid request to the Rocket Fuel platform (real-time bidder, response prediction model, automated decisions); Rocket Fuel returns a bid and ad, the winning ad is served to the user, user engagement is recorded, and learning is refreshed in the campaign & user data warehouse, which also qualifies the audience. Data partners, publishers, ad exchange partners, and ads & budget feed the platform.
    • How Big Is This Problem Each Day? Trades on NASDAQ: 10 million. Facebook page views: 30 billion. Searches on Google: ~5 billion. Bid requests considered by Rocket Fuel: ~20 billion.
    • BIG DATA + AI
    • Advertising That Learns
    • Outline: Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
    • Architecture for Growth: 20 GB/month to 2 PB/month in 3 years • New and complex requirements • More consumers • Rapid growth
    • How We Started?
    • Architecture 2.0
    • Current Architecture
    • Outline: Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
    • Hurdles and Challenges Faced: Exponential data growth and user queries • Network issues • Bots • Bad user queries
    • Outline: Architecture Evolution • Hurdles and Challenges Faced • Data Pipelines Best Practices
    • Data Pipeline Design Best Practices: Job Design / Consistency • Job Features / Avoid Re-work • Golden Input / Shadow Cluster • Data Collection • Dashboard
    • Job Design / Consistency: Idempotent • Execution by different users • Account for Execution Time
    • Job Execution Timeline
    • Job Features / Re-Work: Smaller Jobs • Record completion of steps
    • Recording completion times: [Flowchart] For each step of a workflow, job, or script: Start → is the mark already there? If yes, go to End; if no, execute the work for the step, create the mark, optionally collect other data, then End. (A minimal sketch of this pattern follows the transcript.)
    • Golden Input / Shadow Cluster: Integration tests on realistic data sets • Safe environment to innovate
    • Data Collection - Delivery time view: [Diagram] A data product is produced by workflows, which are made up of jobs (Java map/reduce, Hive, Pig, SSH scripts); delivery time is tracked at each level of this breakdown.
    • Data Collection - Data profiles view: [Diagram] A data product is made up of data sets, and each data set is the result of a transformation; for each one, collect record size & type, job counts, join success ratios, and data set consistency.
    • Data Collection Hierarchy: [Diagram] Data Product → Workflow/Job/Script → Step; for example, the user_profile data product, workflows such as wk_external_events and wk_build_profile, and steps such as extract_fields, consolidate_metrics, load_into_data_centers, extract_features, and compact_user_profile.
    • Dashboard: Delivery Time • Data Profile Ratios • Counters • Alarms
    • Thank you www.rocketfuel.com
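
The "Job Design / Consistency" and "Recording completion times" slides describe steps that are idempotent and are skipped once a completion mark exists. Below is a minimal sketch of that pattern over HDFS, assuming marks are empty files named _DONE_<step> inside the output directory and that output is published with a temp-then-rename step; the class and naming are illustrative, not taken from the deck:

```java
// Sketch of the completion-mark pattern: a step is skipped if its mark exists;
// otherwise it runs against a temporary directory, publishes its output via
// rename, and only then creates the mark.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MarkedStep {

    /** The unit of work for one step of a workflow, job, or script. */
    public interface StepWork {
        void execute(FileSystem fs, Path tmpOutput) throws IOException;
    }

    /**
     * Runs the step only if its completion mark is absent, so re-running the
     * workflow (by the same or a different user) does not redo finished work.
     */
    public static void runIfNeeded(Configuration conf, String stepName,
                                   Path finalOutput, StepWork work) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path mark = new Path(finalOutput, "_DONE_" + stepName); // hypothetical mark name

        if (fs.exists(mark)) {
            return; // mark already there: step finished in an earlier run, skip it
        }

        // Write into a temporary location first so partial output is never visible.
        Path tmpOutput = new Path(finalOutput.getParent(), finalOutput.getName() + ".tmp");
        fs.delete(tmpOutput, true);          // clear leftovers from a failed attempt
        work.execute(fs, tmpOutput);         // the actual map/reduce job, Hive script, etc.

        fs.delete(finalOutput, true);        // idempotent: replace, don't append
        fs.rename(tmpOutput, finalOutput);   // publish the output in one rename
        fs.create(mark).close();             // create the mark last: it certifies completion
    }
}
```

Because the mark is created only after the output has been renamed into place, a re-run either skips the step entirely or redoes it from scratch, which is what keeps the step idempotent.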