Hadoop Frameworks Panel__HadoopSummit2010


Published at: Hadoop Summit 2010 - Developers Track
Hadoop Frameworks Panel: Pig, Hive, Cascading, Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird
Moderator: Sanjay Radia, Yahoo!

Corresponding video: http://developer.yahoo.net/blogs/theater/archives/2010/07/hadoop_frameworks_panel.html


1. Hadoop Frameworks and Tools Panel <ul><li>Moderator: Sanjay Radia </li></ul>Yahoo!
2. Hadoop Frameworks and Tools Panel <ul><li>Pig - Alan Gates, Yahoo! </li></ul><ul><li>Hive - Ashish Thusoo, Facebook </li></ul><ul><li>Cascading – Chris K Wensel, Concurrent, Inc. </li></ul><ul><li>Elephant Bird – Kevin Weil, Twitter </li></ul><ul><li>Voldemort – Jay Kreps, LinkedIn </li></ul><ul><li>Hue (Desktop) – Philip Zeyliger, Cloudera </li></ul>
3. Format <ul><li>Each framework/tool: </li></ul><ul><ul><li>The problem space and target audience </li></ul></ul><ul><ul><li>Plans for future enhancements </li></ul></ul><ul><li>Questions and discussion </li></ul>
4. Pig <ul><li>Alan F. Gates </li></ul>Yahoo! [email_address] [email_address] [email_address]
5. <ul><li>Data pipelines </li></ul><ul><ul><li>Live in the “data factory” where data is cleansed and transformed </li></ul></ul><ul><ul><li>Run at regular time intervals </li></ul></ul><ul><li>Research </li></ul><ul><ul><li>Data is often short-lived </li></ul></ul><ul><ul><li>Data is often semi-structured or unstructured </li></ul></ul><ul><ul><li>Rapid prototyping </li></ul></ul>Target Use Cases
6. <ul><li>Turing complete </li></ul><ul><ul><li>Add loops, branches, functions, modules </li></ul></ul><ul><ul><li>Will enable more complex pipelines, iterative computations, and cleaner programming </li></ul></ul><ul><li>Workflow integration </li></ul><ul><ul><li>What interfaces does Pig need to provide to enable better workflow integration? </li></ul></ul>Future Enhancements
7. What’s Missing: Table Management Service <ul><li>What we have now </li></ul><ul><li>Hive has its own data catalog </li></ul><ul><li>Pig and Map Reduce can </li></ul><ul><ul><li>Use an InputFormat or loader that knows the schema (e.g. Elephant Bird) </li></ul></ul><ul><ul><li>Describe the schema in code: A = load 'foo' as (x:int, y:float) </li></ul></ul><ul><ul><li>Still have to know where to read and write files themselves </li></ul></ul><ul><li>Must write a Loader and a SerDe to read a new file type in Pig and Hive </li></ul><ul><li>Workflow systems must poll HDFS to see when data is available </li></ul>
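The “describe the schema in code” approach above can be sketched as a short Pig Latin fragment. Only the load line comes from the slide; the filter and store steps, and the 'foo_filtered' path, are illustrative additions:

```pig
-- The schema travels with the script rather than a shared catalog:
A = load 'foo' as (x:int, y:float);
-- Illustrative downstream steps (not from the slide):
B = filter A by x > 0;
store B into 'foo_filtered';
```

This is exactly the duplication the proposed table management service would remove: every script repeats the schema and the file location.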
8. What We Want <ul><li>Given an InputFormat and OutputFormat, only need to write one piece of code to read/write data for all tools </li></ul><ul><li>Schema shared across tools </li></ul><ul><li>Disk location and storage format abstracted by the service </li></ul><ul><li>Workflow notified of data availability by the service </li></ul>[Diagram: a table management service sits between the tools (Pig, Hive, Map Reduce, Streaming) and the storage formats (RCFile, Sequence File, Text File)]
9. Hive <ul><li>Ashish Thusoo </li></ul>Facebook
10. <ul><li>A system for managing, querying and analyzing structured data stored in Hadoop </li></ul><ul><ul><li>Stores metadata in an RDBMS </li></ul></ul><ul><ul><li>Stores data in HDFS </li></ul></ul><ul><ul><li>Uses Map/Reduce for computation </li></ul></ul>Hive – Brief Introduction
11. <ul><li>Easy to Use </li></ul><ul><ul><li>Familiar Data Organization (Tables, Columns and Partitions) </li></ul></ul><ul><ul><li>SQL-like language for querying this data </li></ul></ul><ul><li>Easy to Extend </li></ul><ul><ul><li>Interfaces to add User Defined Functions </li></ul></ul><ul><ul><li>Language constructs to embed user programs in the data flow </li></ul></ul><ul><li>Flexible </li></ul><ul><ul><li>Different storage formats </li></ul></ul><ul><ul><li>Support for user defined types </li></ul></ul>Hive – Core Principles
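The SQL-like querying and partitioned-table organization described above look roughly like the following HiveQL; the table name, columns, and partition key are made-up examples, not from the slides:

```sql
-- Hypothetical table page_views, partitioned by date column dt.
SELECT page, COUNT(1) AS views
FROM page_views
WHERE dt = '2010-06-29'   -- partition predicate: only one day's data is read
GROUP BY page;
```

Hive compiles a query like this into one or more Map/Reduce jobs over the HDFS files backing the table.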
12. <ul><li>Transparent Optimizations </li></ul><ul><ul><li>Optimizations for data skews </li></ul></ul><ul><ul><li>Different types of join and group by optimizations </li></ul></ul><ul><li>Interoperable </li></ul><ul><ul><li>JDBC and ODBC drivers </li></ul></ul><ul><ul><li>Thrift interfaces </li></ul></ul>Hive – Core Principles
13. <ul><li>Where we are </li></ul><ul><ul><li>Diverse user community </li></ul></ul><ul><li>Where we want to be </li></ul><ul><ul><li>Diverse developer community </li></ul></ul>Hive – Major Future Goals
14. <ul><li>JDBC / MicroStrategy compatibility </li></ul><ul><li>Integration with Pig </li></ul><ul><li>Predicate Pushdown to HBase </li></ul><ul><li>Cost-based Optimizer </li></ul><ul><li>Quotas </li></ul><ul><li>Security/ACLs </li></ul><ul><li>Indexing </li></ul><ul><li>SQL compliance </li></ul><ul><li>Unstructured Data </li></ul>Hive – Things to work on <ul><li>Statistics </li></ul><ul><li>Archival (HAR) </li></ul><ul><li>HBase Integration </li></ul><ul><li>Improvements to Test Frameworks </li></ul><ul><li>Storage Handlers </li></ul>
15. Cascading <ul><li>Chris K Wensel </li></ul>Concurrent, Inc. http://cascading.org/ [email_address] @cwensel
16. <ul><li>An alternative API to MapReduce for assembling complex data processing applications </li></ul><ul><li>Provides implementations of all common MR patterns </li></ul><ul><li>Fail-fast query planner and topological job scheduler </li></ul><ul><li>Integration is first class </li></ul><ul><li>Works with structured and unstructured data </li></ul>Cascading
17. Cascading
18. <ul><li>Not a syntax (like Pig Latin or SQL) </li></ul><ul><li>Allows developers to build tools on top </li></ul><ul><ul><li>Cascalog – interactive query language (BackType) </li></ul></ul><ul><ul><li>Bixo – scalable web-mining toolkit (Bixo Labs) </li></ul></ul><ul><ul><li>Cascading.JRuby – JRuby based DSL (Etsy) </li></ul></ul><ul><ul><li>More >> http://www.cascading.org/modules.html </li></ul></ul><ul><li>Is complementary to alternative tools </li></ul><ul><ul><li>Riffle annotations to bridge the gap with Mahout </li></ul></ul>Cascading
19. <ul><li>Implement the simplest thing possible </li></ul><ul><li>Focus on the problem, not the system </li></ul><ul><li>ETL, processing, analytics become logical </li></ul>Cascading
20. <ul><li>Log processing </li></ul><ul><ul><li>Amazon CloudFront log analyzer </li></ul></ul><ul><li>Machine Learning </li></ul><ul><ul><li>Predicting flight delays (FlightCaster) </li></ul></ul><ul><li>Behavioral ad-targeting </li></ul><ul><ul><li>Razorfish (see case study on the Amazon site) </li></ul></ul><ul><li>Integration </li></ul><ul><ul><li>With HBase (StumbleUpon), Hypertable (Zvents), & AsterData (ShareThis) </li></ul></ul><ul><li>Social Media Analytics </li></ul><ul><ul><li>BackType (see Cascalog) </li></ul></ul>Cascading – Common Uses
21. <ul><li>Ships with Karmasphere Studio </li></ul><ul><li>Runs great on Appistry CloudIQ Storage </li></ul><ul><li>Tested and optimized for Amazon Elastic MapReduce </li></ul>Cascading – Compatibility
22. Elephant Bird <ul><li>Kevin Weil @kevinweil </li></ul>Twitter
23. <ul><li>A framework for working with structured data within the Hadoop ecosystem </li></ul>Elephant Bird
24. <ul><li>A framework for working with structured data within the Hadoop ecosystem </li></ul><ul><ul><li>Protocol Buffers </li></ul></ul><ul><ul><li>Thrift </li></ul></ul><ul><ul><li>JSON </li></ul></ul><ul><ul><li>W3C Logs </li></ul></ul>Elephant Bird
25. <ul><li>A framework for working with structured data within the Hadoop ecosystem </li></ul><ul><ul><li>InputFormats </li></ul></ul><ul><ul><li>OutputFormats </li></ul></ul><ul><ul><li>Hadoop Writables </li></ul></ul><ul><ul><li>Pig LoadFuncs </li></ul></ul><ul><ul><li>Pig StoreFuncs </li></ul></ul><ul><ul><li>HBase LoadFuncs </li></ul></ul>Elephant Bird
26. <ul><li>A framework for working with structured data within the Hadoop ecosystem… plus: </li></ul><ul><ul><li>LZO Compression </li></ul></ul><ul><ul><li>Code Generation </li></ul></ul><ul><ul><li>Hadoop Counter Utilities </li></ul></ul><ul><ul><li>Misc Pig UDFs </li></ul></ul>Elephant Bird
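From Pig, using one of the generated LoadFuncs looks roughly like the fragment below. Everything here is a placeholder: the jar name, the input path, and `MyProtobufPigLoader` are illustrative stand-ins, not actual Elephant Bird class names:

```pig
-- Placeholders throughout; Elephant Bird generates the real LoadFunc
-- from a Protocol Buffers (or Thrift) schema definition.
register my-generated-code.jar;
raw = load '/logs/statuses.lzo' using MyProtobufPigLoader();
```

The point of the codegen: no hand-written parsing, schema, or decompression code appears in the script at all.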
27.–32. (one slide, built up bullet by bullet) <ul><li>You should only need to specify the ( flexible, forward-backward compatible, self-documenting ) data schema </li></ul><ul><li>Everything else can be codegen’d. </li></ul><ul><li>Less Code. Efficient Storage. Focus on the Data. </li></ul><ul><li>Underlies 20,000 Hadoop jobs at Twitter every day. </li></ul><ul><li>http://github.com/kevinweil/elephant-bird : contributors welcome! </li></ul>Why?
33. Project Voldemort <ul><li>Jay Kreps </li></ul>LinkedIn
34. <ul><li>Key-value storage </li></ul><ul><li>No single point of failure </li></ul><ul><li>Focused on “live serving”, not offline analysis </li></ul><ul><li>Excellent support for the online/offline data cycle </li></ul><ul><li>Used for many parts of linkedin.com </li></ul>Project Voldemort
35. Online/Offline architecture
36. Why online/offline split?
37. Project Voldemort: Hadoop integration <ul><li>Three key metrics to balance </li></ul><ul><ul><li>Build time </li></ul></ul><ul><ul><li>Load time </li></ul></ul><ul><ul><li>Live request performance </li></ul></ul><ul><li>Meets many other needs: </li></ul><ul><ul><li>Atomic swap of data sets & rollback </li></ul></ul><ul><ul><li>Failover, checksums </li></ul></ul>
38. Hue (formerly Cloudera Desktop) <ul><li>Philip Zeyliger </li></ul>[email_address] @philz42
41. What’s Hue? <ul><li>a unified web-based UI for interacting with Hadoop </li></ul><ul><li>includes applications for looking at running jobs, launching jobs, browsing the file system, and interacting with Hive </li></ul><ul><li>is an environment for building additional applications near the existing ones </li></ul>
42. Why the Hue SDK? <ul><li>Re-use components for talking to Hadoop </li></ul><ul><li>Re-use patterns for developing apps that talk to Hadoop </li></ul><ul><li>Centralize Hadoop usage through one interface </li></ul>
43. <ul><li>Open Source </li></ul><ul><li>Apache 2.0 licensed </li></ul><ul><li>http://github.com/cloudera/hue </li></ul>Oh, by the way
44. Questions for the panel <ul><li>What is missing in the overall space? </li></ul><ul><li>Questions from the audience </li></ul>