Cloudera - Amr Awadallah - Hadoop World 2010

Business Analyst Tools & Applications for Hadoop

Amr Awadallah
Cloudera


  1. Business Analyst Tools for Hadoop. Amr Awadallah, CTO, Cloudera, Inc. Hadoop World, October 12th, 2010. Copyright 2010 Cloudera Inc. All Rights Reserved.
  2. The Spectrum of Hadoop Users. [Diagram; labels: Logs; Files; Web Data; Enterprise Data Warehouse; Web Application; Enterprise Reporting; BI, Analytics; Analysts; Business Users; Customers; IDEs; Engineers; Relational Databases; Low-Latency Serving Systems; Cloudera Enterprise; Operators]
  3. Evolution of Hadoop Query/Programming Languages
     1. Java MapReduce: Gives the most flexibility and performance, but potentially a long development cycle (the “assembly language” of Hadoop).
     2. Streaming MapReduce (also Pipes): Lets you develop in any programming language of your choice, at slightly lower performance and with less flexibility.
     3. Cascading: A thin Java library that sits on top of MapReduce and lets developers assemble complex processes.
     4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
     5. Hive: A SQL interpreter out of Facebook; it also includes a metastore that maps files to their schemas and associated SerDe.
     6. Oozie: A workflow server engine with an XML-based PDL that enables creating a workflow of jobs composed of any of the above.
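
As an illustration of the Streaming MapReduce item on the slide above, here is a minimal sketch of a Streaming-style job in Python. Hadoop Streaming feeds records to the mapper and reducer as lines of text on standard input and reads tab-separated key/value lines back from standard output. The word-count task, the function names `map_lines`, `reduce_lines` and `run_local`, and the in-process `sorted` call standing in for the shuffle phase are all assumptions made for illustration; none of them come from the talk.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    for out in run_local(["Hadoop streaming", "hadoop pipes"]):
        print(out)
```

On a real cluster the mapper and reducer would typically be two separate scripts that read `sys.stdin`, handed to the streaming jar through its `-mapper` and `-reducer` options; the single-process version here only makes the data flow easy to follow.
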
  4. Hive vs Pig Example (count distinct values > 0)
     • Hive syntax:
         SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 > 0;
     • Pig syntax:
         mytable = LOAD 'myfile' AS (col1, col2, col3);
         mytable = FOREACH mytable GENERATE col1;
         mytable = FILTER mytable BY col1 > 0;
         mytable = DISTINCT mytable;
         mytable = GROUP mytable ALL;
         mytable = FOREACH mytable GENERATE COUNT(mytable);
         DUMP mytable;
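
To make the comparison concrete, the following Python sketch walks through the same computation on an in-memory list: project the column, filter, deduplicate, then count. The sample rows and the function name `count_distinct_positive` are hypothetical; the sketch only illustrates what both queries are meant to compute, not how Hive or Pig execute them.

```python
def count_distinct_positive(rows):
    """Count distinct values of the first column that are greater than 0."""
    col1 = [row[0] for row in rows]        # project col1
    positive = [v for v in col1 if v > 0]  # keep values > 0
    distinct = set(positive)               # drop duplicates
    return len(distinct)                   # count what remains


# Hypothetical sample data: (col1, col2, col3)
rows = [(3, "a", 1.0), (3, "b", 2.0), (0, "c", 3.0), (-1, "d", 4.0), (7, "e", 5.0)]
print(count_distinct_positive(rows))
```

The Hive query states the result declaratively in one statement, while the Pig script, like this sketch, names each intermediate step of the data flow.
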
  5. Hive Features
     • A subset of SQL covering the most common statements
     • Agile data types: Array, Map, Struct, and JSON objects
     • User Defined Functions and Aggregates
     • Regular Expression support
     • MapReduce support
     • JDBC/ODBC support
     • Partitions and Buckets (for performance optimization)
     • In the works: Indices, Columnar Storage, Views, MicroStrategy compatibility, Explode/Collect
     • More details: http://wiki.apache.org/hadoop/Hive
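
The "Partitions and Buckets" item can be illustrated with a rough Python sketch of the general idea: partitioning groups rows by a column value so that a query filtering on that column can skip every other group, and bucketing assigns each row to one of a fixed number of buckets by hashing a column. The column names (`dt`, `user_id`), the bucket count, and the use of `zlib.crc32` as the hash are assumptions for illustration; Hive's actual storage layout and hash function are not described in the slides and are not reproduced here.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Assign a value to a bucket with a stable hash (a stand-in, not Hive's own hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def layout(rows):
    """Group rows as partition value -> bucket number -> list of rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["dt"]][bucket_of(row["user_id"])].append(row)
    return table


def scan_partition(table, dt):
    """Read only the requested partition; other partitions are never touched."""
    return [row for bucket in table.get(dt, {}).values() for row in bucket]


# Hypothetical sample rows
rows = [
    {"dt": "2010-10-11", "user_id": 1},
    {"dt": "2010-10-12", "user_id": 2},
    {"dt": "2010-10-12", "user_id": 3},
]
table = layout(rows)
print(len(scan_partition(table, "2010-10-12")))
```

The performance benefit in this sketch comes from `scan_partition` looking at one group instead of all rows, which is the same reason a filter on a partition column can reduce the amount of data a query reads.
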
  6. The Hadoop Query Tool Ecosystem
     • Cloudera Enterprise, Cloudera’s Distribution for Hadoop
     • In Memory: PowerPivot, QlikTech, EdgeSpring, Tableau
     • ETL: Informatica, Pervasive, IBM DataStage, Microsoft SSIS, Talend, Kettle
     • Query Authoring: Karmasphere, Quest (Toad)
     • Spreadsheet: IBM BigSheets, Datameer
     • BI/OLAP: MicroStrategy, IBM Cognos, SAP BOBJ, Microsoft SSRS, Jaspersoft, Pentaho
     • Developer: Karmasphere, Eclipse, Cascading
     • Stats/Math: SAS, IBM SPSS, Matlab, R/RHIPE, Mahout, Hama
     • Reporting: SAP Crystal, Actuate/BIRT
     Hadoop is very flexible; use the right tool for the job at hand.
  7. Toad for Cloud (for Query Authoring). [Diagram; labels: RDBMS, Hadoop] Learn more at: http://www.ToadForCloud.com
  8. Karmasphere (for Developers and Analysts)
  9. Tableau (for Advanced Visualization)
  10. Datameer (for Analysts, Spreadsheet UI)
  11. MicroStrategy (for interactive Dashboards)
  12. Talend (for Extract-Transform-Load, aka ETL)
  13. General Advice for Choosing the Right Tool
      • First and foremost, what problem are you trying to solve, and what is your skill set? Use the tool that gets you there fastest.
      • What is the learning curve involved with this new tool?
      • Does the tool interoperate with other systems?
      • Does the tool leverage the investment in Pig/Hive?
      • Does the tool lock you in to a proprietary file format?
      • Is the tool certified for Cloudera’s Distribution for Hadoop?
  14. Appendix
  15. Hive Agile Data Types
      • STRUCTS: SELECT mytable.mycolumn.myfield FROM …
      • MAPS (Hashes): SELECT mytable.mycolumn[mykey] FROM …
      • ARRAYS: SELECT mytable.mycolumn[5] FROM …
      • JSON: SELECT get_json_object(mycolumn, objpath) FROM …
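
The four access patterns on this slide have close analogues in most languages. The Python sketch below shows them on a single hypothetical record. The field names and values are invented for illustration, and the small `get_json_path` helper only handles simple dotted paths of the form `$.a.b`; that is an assumption about the kind of path `objpath` stands for, not a reimplementation of Hive's `get_json_object`.

```python
import json


def get_json_path(json_text, path):
    """Follow a simple '$.a.b' style path through a JSON string; None if missing."""
    node = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical record with one column of each kind
record = {
    "mystruct": {"myfield": "x"},                # STRUCT: access a named field
    "mymap": {"mykey": 10},                      # MAP: look up by key
    "myarray": [0, 1, 2, 3, 4, 5],               # ARRAY: look up by position
    "myjson": '{"owner": {"name": "example"}}',  # JSON kept as a string
}

print(record["mystruct"]["myfield"])
print(record["mymap"]["mykey"])
print(record["myarray"][5])
print(get_json_path(record["myjson"], "$.owner.name"))
```

The JSON case differs from the other three in that the column stays an opaque string and is parsed at query time, which is why it goes through a function call rather than dot or bracket syntax.
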
