Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Summit 2010 - Developers Track
Hadoop Frameworks Panel: Pig, Hive, Cascading, Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird
Moderator: Sanjay Radia, Yahoo!

Corresponding video: http://developer.yahoo.net/blogs/theater/archives/2010/07/hadoop_frameworks_panel.html
Transcript

    • 1. Hadoop Frameworks and Tools Panel
       Moderator: Sanjay Radia, Yahoo!
    • 2. Hadoop Frameworks and Tools Panel
       - Pig - Alan Gates, Yahoo!
       - Hive - Ashish Thusoo, Facebook
       - Cascading - Chris K Wensel, Concurrent, Inc.
       - Elephant Bird - Kevin Weil, Twitter
       - Voldemort - Jay Kreps, LinkedIn
       - Hue (Desktop) - Philip Zeyliger, Cloudera
    • 3. Format
       - Each framework/tool:
         - The problem space and target audience
         - Plans for future enhancements
       - Questions and discussion
    • 4. Pig
       Alan F. Gates, Yahoo!
       [email_address]
    • 5. Target Use Cases
       - Data pipelines
         - Live in the "data factory" where data is cleansed and transformed
         - Run at regular time intervals
       - Research
         - Data is often short-lived
         - Data is often semi-structured or unstructured
         - Rapid prototyping
    • 6. Future Enhancements
       - Turing complete
         - Add loops, branches, functions, modules
         - Will enable more complex pipelines, iterative computations, and cleaner programming
       - Workflow integration
         - What interfaces does Pig need to provide to enable better workflow integration?
    • 7. What's Missing: Table Management Service
       - What we have now:
         - Hive has its own data catalog
         - Pig and MapReduce can:
           - Use an InputFormat or loader that knows the schema (e.g., Elephant Bird)
           - Describe the schema in code: A = load 'foo' as (x:int, y:float)
           - But they still have to know where to read and write the files themselves
       - Must write a Loader (Pig) and a SerDe (Hive) to read a new file type
       - Workflow systems must poll HDFS to see when data is available
    • 8. What We Want <ul><li>Given an InputFormat and OutputFormat only need to write one piece of code to read/write data for all tools </li></ul><ul><li>Schema shared across tools </li></ul><ul><li>Disk location and storage format abstracted by service </li></ul><ul><li>Workflow notified of data availability by service </li></ul>table mgmt service Pig Hive Map Reduce Streaming RCFile Sequence File Text File
    • 9. Hive
       Ashish Thusoo, Facebook
    • 10. Hive - Brief Introduction
       - A system for managing, querying, and analyzing structured data stored in Hadoop
         - Stores metadata in an RDBMS
         - Stores data in HDFS
         - Uses Map/Reduce for computation
    • 11. Hive - Core Principles
       - Easy to use
         - Familiar data organization (tables, columns, and partitions)
         - SQL-like language for querying this data
       - Easy to extend
         - Interfaces to add user-defined functions
         - Language constructs to embed user programs in the data flow
       - Flexible
         - Different storage formats
         - Support for user-defined types
    • 12. Hive - Core Principles (continued)
       - Transparent optimizations
         - Optimizations for data skews
         - Different types of join and group-by optimizations
       - Interoperable
         - JDBC and ODBC drivers
         - Thrift interfaces
    • 13. Hive - Major Future Goals
       - Where we are: a diverse user community
       - Where we want to be: a diverse developer community
    • 14. Hive - Things to Work On
       - JDBC / MicroStrategy compatibility
       - Integration with Pig
       - Predicate pushdown to HBase
       - Cost-based optimizer
       - Quotas
       - Security/ACLs
       - Indexing
       - SQL compliance
       - Unstructured data
       - Statistics
       - Archival (HAR)
       - HBase integration
       - Improvements to test frameworks
       - Storage handlers
    • 15. Cascading
       Chris K Wensel, Concurrent, Inc.
       http://cascading.org/
       [email_address]
       @cwensel
    • 16. Cascading
       - An alternative API to MapReduce for assembling complex data processing applications
       - Provides implementations of all common MR patterns
       - Fail-fast query planner and topological job scheduler
       - Integration is first class
       - Works with structured and unstructured data
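The pipe-assembly idea behind "assembling complex data processing applications" can be mimicked with composable operations over streams of tuples. The following toy Python sketch only illustrates the style; Cascading's actual API is Java and far richer:

```python
def each(fn):
    # Apply a function to every tuple flowing through (in the spirit of an Each pipe).
    def op(tuples):
        return (fn(t) for t in tuples)
    return op

def filter_by(pred):
    # Keep only the tuples matching a predicate.
    def op(tuples):
        return (t for t in tuples if pred(t))
    return op

def assemble(*ops):
    # Chain operations head-to-tail into a single flow over a source.
    def flow(source):
        for op in ops:
            source = op(source)
        return list(source)
    return flow
```

A pipeline is then built by listing operations, e.g. `assemble(filter_by(...), each(...))`, and run by calling the resulting flow on any iterable of records.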
    • 17. Cascading (diagram)
    • 18. Cascading
       - Not a syntax (like Pig Latin or SQL)
       - Allows developers to build tools on top
         - Cascalog - interactive query language (BackType)
         - Bixo - scalable web-mining toolkit (Bixo Labs)
         - Cascading.JRuby - JRuby-based DSL (Etsy)
         - More: http://www.cascading.org/modules.html
       - Is complementary to alternative tools
         - Riffle annotations bridge the gap with Mahout
    • 19. Cascading
       - Implement the simplest thing possible
       - Focus on the problem, not the system
       - ETL, processing, and analytics become logical
    • 20. Cascading - Common Uses
       - Log processing
         - Amazon CloudFront log analyzer
       - Machine learning
         - Predicting flight delays (FlightCaster)
       - Behavioral ad targeting
         - Razorfish (see case study on the Amazon site)
       - Integration
         - With HBase (StumbleUpon), Hypertable (Zvents), and Aster Data (ShareThis)
       - Social media analytics
         - BackType (see Cascalog)
    • 21. <ul><li>Ships with Karmasphere Studio </li></ul><ul><li>Runs great on Appistry CloudIQ Storage </li></ul><ul><li>Tested and optimized for Amazon Elastic MapReduce </li></ul>Cascading - Compatibility
    • 22. Elephant Bird
       Kevin Weil, Twitter
       @kevinweil
    • 23. Elephant Bird
       - A framework for working with structured data within the Hadoop ecosystem
    • 24. Elephant Bird
       - A framework for working with structured data within the Hadoop ecosystem:
         - Protocol Buffers
         - Thrift
         - JSON
         - W3C logs
    • 25. Elephant Bird
       - A framework for working with structured data within the Hadoop ecosystem:
         - InputFormats
         - OutputFormats
         - Hadoop Writables
         - Pig LoadFuncs
         - Pig StoreFuncs
         - HBase LoadFuncs
    • 26. Elephant Bird
       - A framework for working with structured data within the Hadoop ecosystem... plus:
         - LZO compression
         - Code generation
         - Hadoop counter utilities
         - Miscellaneous Pig UDFs
    • 27.-32. Why? (built up over six slides)
       - You should only need to specify the (flexible, forward-backward compatible, self-documenting) data schema
       - Everything else can be codegen'd
       - Less code. Efficient storage. Focus on the data.
       - Underlies 20,000 Hadoop jobs at Twitter every day
       - http://github.com/kevinweil/elephant-bird : contributors welcome!
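The "specify only the schema, generate the rest" idea can be sketched as deriving a record parser from a schema declaration. This is a hypothetical Python illustration of the principle; Elephant Bird's real code generation works from Protocol Buffers and Thrift definitions, and none of these names are its API:

```python
# Hypothetical illustration: derive a typed record parser from a schema.
# The schema format and CASTERS table are invented for this sketch.
CASTERS = {"int": int, "float": float, "string": str}

def make_parser(schema):
    """schema: list of (field_name, type_name) pairs, e.g. [("x", "int")]."""
    names = [name for name, _ in schema]
    casts = [CASTERS[type_name] for _, type_name in schema]

    def parse(line):
        # One tab-separated record in, one typed dict out.
        values = line.rstrip("\n").split("\t")
        return {n: cast(v) for n, cast, v in zip(names, casts, values)}

    return parse
```

Everything downstream of the schema (parsing, typing, field names) falls out mechanically, which is the point the slides make about codegen.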
    • 33. Project Voldemort
       Jay Kreps, LinkedIn
    • 34. Project Voldemort
       - Key-value storage
       - No single point of failure
       - Focused on "live serving," not offline analysis
       - Excellent support for the online/offline data cycle
       - Used for many parts of linkedin.com
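Stores in this family typically avoid a single point of failure by partitioning keys with consistent hashing, so that every key maps to a short preference list of interchangeable replica nodes. A generic Python sketch of the idea (illustrative only, not Voldemort's implementation):

```python
import hashlib
from bisect import bisect

def _ring_pos(key):
    # Deterministic position on a 32-bit hash ring (md5 keeps the sketch stable).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Generic consistent-hash ring; illustrative, not Voldemort's code."""

    def __init__(self, nodes, replicas=2, vnodes=100):
        # Virtual nodes smooth the key distribution across physical nodes.
        self.replicas = replicas
        self.ring = sorted((_ring_pos(f"{node}:{i}"), node)
                           for node in nodes for i in range(vnodes))

    def preference_list(self, key):
        # The first `replicas` distinct nodes clockwise from the key's position;
        # with no designated master, any node on the list can serve the key.
        start = bisect(self.ring, (_ring_pos(key), ""))
        nodes = []
        for _, node in self.ring[start:] + self.ring[:start]:
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == self.replicas:
                break
        return nodes
```

Because the preference list has several members, losing any one node leaves the key servable, which is what "no single point of failure" means in practice.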
    • 35. Online/Offline Architecture (diagram)
    • 36. Why the online/offline split? (diagram)
    • 37. Project Voldemort: Hadoop Integration
       - Three key metrics to balance:
         - Build time
         - Load time
         - Live request performance
       - Meets lots of other needs:
         - Atomic swap of data sets, with rollback
         - Failover, checksums
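The "atomic swap of data sets, with rollback" bullet can be pictured as a versioned store where readers only ever see one complete snapshot at a time, and swapping in a new Hadoop-built data set or rolling back is a single pointer change. A hypothetical sketch, not Voldemort's read-only store implementation:

```python
class VersionedStore:
    """Toy versioned key-value store with atomic swap and rollback."""

    def __init__(self):
        self.versions = []   # complete data-set snapshots, oldest first
        self.current = None  # index of the live snapshot

    def push(self, snapshot):
        # A freshly built data set goes live in one step:
        # readers only ever dereference `current`.
        self.versions.append(snapshot)
        self.current = len(self.versions) - 1

    def rollback(self):
        # Step back to the previous snapshot, if one exists.
        if self.current:  # not None and not already the oldest version
            self.current -= 1

    def get(self, key):
        if self.current is None:
            return None
        return self.versions[self.current].get(key)
```

Keeping whole snapshots, rather than mutating in place, is what makes both the swap and the rollback cheap and safe for live traffic.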
    • 38. Hue (formerly Cloudera Desktop)
       Philip Zeyliger, Cloudera
       [email_address]
       @philz42
    • 39.-40. (screenshot slides)
    • 41. What's Hue?
       - A unified web-based UI for interacting with Hadoop
       - Includes applications for looking at running jobs, launching jobs, browsing the file system, and interacting with Hive
       - An environment for building additional applications alongside the existing ones
    • 42. Why a Hue SDK?
       - Reuse components for talking to Hadoop
       - Reuse patterns for developing apps that talk to Hadoop
       - Centralize Hadoop usage through one interface
    • 43. Oh, by the way
       - Open source
       - Apache 2.0 licensed
       - http://github.com/cloudera/hue
    • 44. Questions for the Panel
       - What is missing in the overall space?
       - Questions from the audience
