Business Analyst Tools for Hadoop
CTO, Cloudera, Inc.
October 12th, 2010
Copyright 2010 Cloudera Inc. All Rights Reserved.
The Spectrum of Hadoop Users
[Diagram: data sources (logs, files, web data) and user types (analysts, business users)]
Evolution of Hadoop Query/Programming Languages
1. Java MapReduce: Gives the most flexibility and performance,
but potentially long development cycle (the “assembly
language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility.
3. Cascading: A thin Java library that sits on top of
MapReduce and lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data
processing.
5. Hive: A SQL interpreter out of Facebook that also includes a
metastore mapping files to their schemas and associated SerDes.
6. Oozie: An XML-based (PDL) workflow server engine that enables
creating a workflow of jobs composed of any of the above.
Hive vs Pig Example (count distinct values > 0)
• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
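For comparison with the Streaming MapReduce option listed earlier, the same count could be sketched as a Python mapper and reducer. This is only an illustration under stated assumptions: input rows are tab-separated with col1 first, a single reducer is used, and the function names mapper and reducer are hypothetical. The last lines simulate the map, sort, and reduce phases locally rather than running on a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit col1 (as text) for every row whose col1 is greater than 0."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            value = int(fields[0])
        except ValueError:
            continue  # skip rows with a missing or non-numeric col1
        if value > 0:
            yield str(value)


def reducer(sorted_keys):
    """Count distinct keys; relies on the input being sorted,
    which is how keys arrive at a reducer after the shuffle."""
    return sum(1 for _key, _group in groupby(sorted_keys))


if __name__ == "__main__":
    # Local stand-in for map -> sort -> reduce on a few sample rows.
    rows = ["3\ta\tx", "0\tb\ty", "3\tc\tz", "7\td\tw", "-1\te\tv"]
    print(reducer(sorted(mapper(rows))))
```

With more than one reducer, each would report a partial count, and those partial counts would still need to be added together. That extra bookkeeping is the kind of work the one-line Hive query hides.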
Hive Features
• A subset of SQL covering the most common statements
• Agile data types: Array, Map, Struct, and JSON objects
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• JDBC/ODBC support
• Partitions and Buckets (for performance optimization)
• In the works: Indices, Columnar Storage, Views, MicroStrategy
• More details: http://wiki.apache.org/hadoop/Hive
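The performance benefit of partitions and buckets can be sketched in a few lines of Python. The sketch assumes the usual Hive layout in which each partition is a subdirectory named column=value and each row lands in a bucket chosen by hashing a column modulo the bucket count. The table name, column names, and the use of CRC32 as the hash are illustrative assumptions; Hive applies its own hash function.

```python
import zlib


def partition_dir(table, partition_col, value):
    """Build the directory name for one partition (column=value layout)."""
    return f"{table}/{partition_col}={value}"


def bucket_for(key, num_buckets):
    """Pick a bucket by hashing the key modulo the bucket count.
    CRC32 is a stand-in; it is not the hash Hive itself uses."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def prune(partition_dirs, partition_col, wanted):
    """Keep only the directories that match the wanted partition value."""
    suffix = f"/{partition_col}={wanted}"
    return [d for d in partition_dirs if d.endswith(suffix)]


# Hypothetical table partitioned by date.
dirs = [partition_dir("weblogs", "dt", d)
        for d in ("2010-10-10", "2010-10-11", "2010-10-12")]
print(prune(dirs, "dt", "2010-10-12"))
print(bucket_for("user42", 32))
```

A query that filters on the partition column only has to read the surviving directories, and rows with the same bucketing key always fall into the same bucket, which helps sampling and joins.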
The Hadoop Query Tool Ecosystem
Cloudera’s Distribution for Hadoop
Hadoop is very flexible; use the right tool for the job at hand.
Toad for Cloud (for Query Authoring)
Learn more at: http://www.ToadForCloud.com
Karmasphere (for Developers and Analysts)
Tableau (for Advanced Visualization)
Datameer (for Analysts, Spreadsheet UI)
MicroStrategy (for interactive Dashboards)
Talend (for Extract-Transform-Load, aka ETL)
General Advice for Choosing the Right Tool
• First and foremost, what problem are you trying to solve? And
what is your skill set? Use the tool that gets you there fastest.
• What is the learning curve involved with this new tool?
• Does the tool interoperate with other systems?
• Does the tool leverage your existing investment in Pig/Hive?
• Does the tool lock you in to a proprietary file format?
• Is the tool certified for Cloudera’s Distribution for Hadoop?
Hive Agile Data Types
• STRUCTS:
• SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
• SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
• SELECT mytable.mycolumn[myindex] FROM …
• JSON:
• SELECT get_json_object(mycolumn, objpath) FROM …
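For readers more at home in a scripting language, the access patterns above map roughly onto Python dictionaries, lists, and JSON strings. The sketch below is an analogy, not Hive's implementation: the row contents and the helper get_json_path are hypothetical, and the helper handles only simple dotted paths such as $.user.name, a small subset of what get_json_object accepts.

```python
import json


def get_json_path(json_text, path):
    """Resolve a simple '$.a.b' path against a JSON string.
    Returns None when the text is not valid JSON or the path is absent."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except ValueError:
        return None
    for key in path[1:].split("."):
        if key == "":
            continue  # skip the empty piece left by the leading '$.'
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# One hypothetical row with a column of each kind.
row = {
    "mystruct": {"myfield": "abc"},          # STRUCT: access by field name
    "mymap": {"mykey": 42},                  # MAP: access by key
    "myarray": [10, 20, 30],                 # ARRAY: access by position
    "myjson": '{"user": {"name": "ann"}}',   # JSON: stored as a plain string
}

print(row["mystruct"]["myfield"])
print(row["mymap"]["mykey"])
print(row["myarray"][0])
print(get_json_path(row["myjson"], "$.user.name"))
```

The point of the analogy is that the JSON column stays an ordinary string until query time, whereas the struct, map, and array columns carry their structure in the table schema.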