Hadoop Frameworks Panel – Hadoop Summit 2010
Hadoop Summit 2010 - Developers Track
Hadoop Frameworks Panel: Pig, Hive, Cascading, Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird
Moderator: Sanjay Radia, Yahoo!

  • Corresponding video:
    http://developer.yahoo.net/blogs/theater/archives/2010/07/hadoop_frameworks_panel.html

Transcript

  • 1. Hadoop Frameworks and Tools Panel
    • Moderator: Sanjay Radia
    Yahoo!
  • 2. Hadoop Frameworks and Tools Panel
    • Pig – Alan Gates, Yahoo!
    • Hive – Ashish Thusoo, Facebook
    • Cascading – Chris K Wensel, Concurrent, Inc.
    • Elephant Bird – Kevin Weil, Twitter
    • Voldemort – Jay Kreps, LinkedIn
    • Hue (Desktop) – Philip Zeyliger, Cloudera
  • 3. Format
    • Each framework/tool:
      • The problem space and target audience
      • Plans for future enhancements
    • Questions and discussion
  • 4. Pig
    • Alan F. Gates
    Yahoo! [email_address]
  • 5.
    • Data pipelines
      • Live in the “data factory” where data is cleansed and transformed
      • Run at regular time intervals
    • Research
      • data is often short lived
      • data often semi-structured or unstructured
      • rapid prototyping
    Target Use Cases
  • 6.
    • Turing complete
      • Add loops, branches, functions, modules
      • Will enable more complex pipelines, iterative computations, and cleaner programming
    • Workflow integration
      • What interfaces does Pig need to provide to enable better workflow integration?
    Future Enhancements
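The control flow the slide asks for (loops, branches, iterative computation) today has to live in a host script that re-runs a Pig job until it converges. A minimal sketch of such a driver, where `run_pig_script` is a hypothetical stand-in for submitting a Pig job, not a real API:

```python
# Iterative driver around a batch job: re-run until the amount of change
# reported by the job stops shrinking. `run_pig_script` is a stand-in
# callable, not part of Pig; it returns e.g. the number of rows updated.

def iterate_until_converged(run_pig_script, max_iters=10):
    previous = None
    for i in range(max_iters):
        changed = run_pig_script(iteration=i)
        if changed == previous or changed == 0:
            return i  # converged after i iterations
        previous = changed
    return max_iters

# Stand-in job whose "updates" shrink each round, as in PageRank-style loops.
iters = iterate_until_converged(lambda iteration: max(0, 3 - iteration))
```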
  • 7. What’s Missing: Table Management Service
    • What we have now
    • Hive has its own data catalog
    • Pig and MapReduce can
      • Use an InputFormat or loader that knows the schema (e.g. ElephantBird)
      • Describe the schema in code: A = load ‘foo’ as (x:int, y:float)
      • Still have to know where to read and write the files themselves
    • Must write a Loader and a SerDe to read a new file type in Pig and Hive
    • Workflow systems must poll HDFS to see when data is available
  • 8. What We Want
    • Given an InputFormat and OutputFormat, only one piece of code is needed to read/write data across all tools
    • Schema shared across tools
    • Disk location and storage format abstracted by service
    • Workflow notified of data availability by service
    [Diagram: a table management service sits between the tools (Pig, Hive, MapReduce, Streaming) and the storage formats (RCFile, SequenceFile, text file)]
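The service the slide describes is essentially a shared catalog: tools look a table up by name and get back its location, storage format, and schema instead of hard-coding paths and writing per-tool readers. A minimal sketch of that idea; all names here are illustrative, not a real API:

```python
# Toy table-management service: one registration, many consumers.

class TableCatalog:
    def __init__(self):
        self._tables = {}

    def register(self, name, location, storage_format, schema):
        """Record where a table lives, how it is stored, and its schema."""
        self._tables[name] = {
            "location": location,
            "format": storage_format,
            "schema": schema,  # e.g. [("x", "int"), ("y", "float")]
        }

    def describe(self, name):
        """Any tool (Pig, Hive, MapReduce) resolves a table the same way."""
        return self._tables[name]

catalog = TableCatalog()
catalog.register("foo", "/data/foo", "RCFile", [("x", "int"), ("y", "float")])
info = catalog.describe("foo")
```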
  • 9. Hive
    • Ashish Thusoo
    Facebook
  • 10.
    • A system for managing, querying and analyzing structured data stored in Hadoop
      • Stores metadata in an RDBMS
      • Stores data in HDFS
      • Uses Map/Reduce for Computation
    Hive – Brief Introduction
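The metadata/data split above means the metastore (an RDBMS) records the table definition while rows live as files in HDFS; partitioned tables map partition values onto directory names. The layout below follows Hive's usual warehouse convention, shown purely for illustration:

```python
# How a Hive partition resolves to an HDFS directory under the warehouse.

def partition_path(warehouse, table, partitions):
    """Build the HDFS directory for one partition of a Hive table."""
    parts = "/".join(f"{col}={val}" for col, val in partitions)
    return f"{warehouse}/{table}/{parts}"

path = partition_path("/user/hive/warehouse", "page_views",
                      [("dt", "2010-06-29")])
```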
  • 11.
    • Easy to Use
      • Familiar Data Organization (Tables, Columns and Partitions)
      • SQL like language for querying this data
    • Easy to Extend
      • Interfaces to add User Defined Functions
      • Language constructs to embed user programs in the data flow
    • Flexible
      • Different storage formats
      • Support for user defined types
    Hive – Core Principles
  • 12.
    • Transparent Optimizations
      • Optimizations for data skews
      • Different types of join and group by optimizations
    • Interoperable
      • JDBC and ODBC drivers
      • Thrift interfaces
    Hive – Core Principles
  • 13.
    • Where we are
      • Diverse user community
    • Where we want to be
      • Diverse developer community
    Hive – Major Future Goals
  • 14.
    • JDBC / MicroStrategy compatibility
    • Integration with Pig
    • Predicate Pushdown to HBase
    • Cost-based Optimizer
    • Quotas
    • Security/ACLs
    • Indexing
    • SQL compliance
    • Unstructured Data
    Hive – Things to work on
    • Statistics
    • Archival (HAR)
    • HBase Integration
    • Improvements to Test Frameworks
    • Storage Handlers
  • 15. Cascading
    • Chris K Wensel
    Concurrent, Inc. http://cascading.org/ [email_address] @cwensel
  • 16.
    • An alternative API to MapReduce for assembling complex data processing applications
    • Provides implementations of all common MR patterns
    • Fail fast query planner and topological job scheduler
    • Integration is first class
    • Works with structured and unstructured data
    Cascading
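Cascading assembles processing as pipes over tuple streams rather than as raw map and reduce calls. This toy Python version mimics that model with three illustrative steps named after Cascading's pipe types (Each, GroupBy, Every); it is a sketch of the idea, not the Cascading API:

```python
# Word count expressed as a pipe assembly over an in-memory tuple stream.
from itertools import groupby

def each(stream, fn):
    """Apply a function to every tuple (like Cascading's Each)."""
    return (fn(t) for t in stream)

def group_by(stream, key):
    """Group tuples on a key (like GroupBy)."""
    for k, grp in groupby(sorted(stream, key=key), key=key):
        yield k, list(grp)

def every(groups, agg):
    """Aggregate each group (like Every)."""
    return {k: agg(grp) for k, grp in groups}

lines = ["a b a", "b a"]
words = (w for line in lines for w in line.split())
counts = every(group_by(each(words, str.lower), key=lambda w: w), agg=len)
```

On a cluster the same assembly would be planned into MapReduce jobs by Cascading's query planner; the point is that the developer composes pipes, not jobs.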
  • 17. Cascading
  • 18.
    • Not a syntax (like PigLatin or SQL)
    • Allows developers to build tools on top
      • Cascalog – interactive query language (Backtype)
      • Bixo – scalable web-mining toolkit (Bixo Labs)
      • Cascading.JRuby – JRuby based DSL (Etsy)
      • More >> http://www.cascading.org/modules.html
    • Is complementary to alternative tools
      • Riffle annotations to bridge the gap with Mahout
    Cascading
  • 19.
    • Implement the simplest thing possible
    • Focus on the problem, not the system
    • ETL, processing, analytics become logical
    Cascading
  • 20.
    • Log processing
      • Amazon CloudFront log analyzer
    • Machine Learning
      • Predicting flight delays (FlightCaster)
    • Behavioral ad-targeting
      • RazorFish (see case study on Amazon site)
    • Integration
      • With HBase (StumbleUpon), Hypertable (ZVents), & AsterData (ShareThis)
    • Social Media Analytics
      • BackType (see Cascalog)
    Cascading – Common Uses
  • 21.
    • Ships with Karmasphere Studio
    • Runs great on Appistry CloudIQ Storage
    • Tested and optimized for Amazon Elastic MapReduce
    Cascading - Compatibility
  • 22. Elephant Bird
    • Kevin Weil @kevinweil
    Twitter
  • 23.
    • A framework for working with structured data within the Hadoop ecosystem
    Elephant Bird
  • 24.
    • A framework for working with structured data within the Hadoop ecosystem
      • Protocol Buffers
      • Thrift
      • JSON
      • W3C Logs
    Elephant Bird
  • 25.
    • A framework for working with structured data within the Hadoop ecosystem
      • InputFormats
      • OutputFormats
      • Hadoop Writables
      • Pig LoadFuncs
      • Pig StoreFuncs
      • HBase LoadFuncs
    Elephant Bird
  • 26.
    • A framework for working with structured data within the Hadoop ecosystem… plus:
      • LZO Compression
      • Code Generation
      • Hadoop Counter Utilities
      • Misc Pig UDFs
    Elephant Bird
  • 27.–32. (built up across six slides)
    • You should only need to specify the (flexible, forward-backward compatible, self-documenting) data schema
    • Everything else can be codegen’d.
    • Less Code. Efficient Storage. Focus on the Data.
    • Underlies 20,000 Hadoop jobs at Twitter every day.
    • http://github.com/kevinweil/elephant-bird : contributors welcome!
    Why?
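The point of the "Why?" slides is that once the schema is declared, the plumbing follows mechanically. This toy generator turns a field list into encode/decode functions, standing in for what Elephant Bird derives from Protocol Buffer or Thrift definitions (InputFormats, writables, Pig loaders); purely illustrative, and real Elephant Bird codegen targets Java, not Python:

```python
# Declare the schema once; generate the serialization plumbing from it.

def make_codec(schema):
    """schema: ordered list of field names. Returns (encode, decode)."""
    def encode(record):
        return "\t".join(str(record[f]) for f in schema)
    def decode(line):
        return dict(zip(schema, line.split("\t")))
    return encode, decode

encode, decode = make_codec(["user_id", "tweet", "ts"])
line = encode({"user_id": 7, "tweet": "hi", "ts": 1277769600})
record = decode(line)
```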
  • 33. Project Voldemort
    • Jay Kreps
    LinkedIn
  • 34.
    • Key-value storage
    • No single point of failure
    • Focused on “live serving”, not offline analysis
    • Excellent support for online/offline data cycle
    • Used for many parts of linkedin.com
    Project Voldemort
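"No single point of failure" comes from partitioning keys across nodes on a hash ring and replicating each key on the next N nodes, so any single node can be lost. A bare-bones illustration of that routing, not Voldemort's actual implementation:

```python
# Consistent-hashing style routing: a key is served by the first
# `replicas` nodes clockwise from its position on the ring.
import hashlib

def ring_position(value, ring_size=2**16):
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % ring_size

def preference_list(key, nodes, replicas=2):
    ordered = sorted(nodes, key=ring_position)
    pos = ring_position(key)
    start = next((i for i, n in enumerate(ordered)
                  if ring_position(n) >= pos), 0)
    return [ordered[(start + i) % len(ordered)] for i in range(replicas)]

owners = preference_list("member:42", ["node-a", "node-b", "node-c"])
```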
  • 35. Online/Offline architecture
  • 36. Why online/offline split?
  • 37. Project Voldemort: Hadoop integration
    • Three key metrics to balance
      • Build time
      • Load time
      • Live request performance
    • Meets lots of other needs:
      • Atomic swap of data sets & rollback
      • Failover, checksums
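The atomic swap and rollback mentioned above can be pictured as immutable store versions behind a "current" pointer: a Hadoop-built data set is loaded offline, serving flips the pointer, and a bad push is undone by pointing back. A sketch only; the class and method names are illustrative:

```python
# Online/offline data cycle: push new versions offline, swap atomically,
# roll back by restoring the previous pointer.

class VersionedStore:
    def __init__(self):
        self.versions = {}   # version number -> data set
        self.current = None
        self.previous = None

    def push(self, version, data):
        """Load a new version offline; serving is untouched until swap."""
        self.versions[version] = data

    def swap(self, version):
        """Atomically make `version` live, remembering the old one."""
        self.previous, self.current = self.current, version

    def rollback(self):
        """Point back at the previously live version."""
        self.current, self.previous = self.previous, self.current

    def get(self, key):
        return self.versions[self.current].get(key)

store = VersionedStore()
store.push(1, {"k": "old"})
store.swap(1)
store.push(2, {"k": "new"})
store.swap(2)
store.rollback()
```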
  • 38. Hue (formerly Cloudera Desktop)
    • Philip Zeyliger
    [email_address] @philz42
  • 39.
  • 40.
  • 41. What’s Hue?
    • a unified web-based UI for interacting with Hadoop
    • includes applications for looking at running jobs, launching jobs, browsing the file system, and interacting with Hive
    • is an environment for building additional applications near the existing ones
  • 42. Why Hue SDK?
    • Re-use components for talking to Hadoop
    • Re-use patterns for developing apps that talk to Hadoop
    • Centralize Hadoop usage through one interface
  • 43.
    • Open Source
    • Apache 2.0 licensed
    • http://github.com/cloudera/hue
    Oh, by the way
  • 44. Questions for the panel
    • What is missing in the overall space?
    • Questions from the audience