1. Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apache Hive
Big Data Spain 2012
http://www.bigdataspain.org/
Alan F. Gates
@alanfgates
Page 1
2. Big Data = Terabytes, Petabytes, …
Image Credit: Gizmodo
© Hortonworks 2012
Page 2
3. But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user-defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning on their data:

w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
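This update rule can be sketched directly in code. The following is a generic illustration of stochastic gradient descent for a linear model with squared loss — not Twitter's actual Pig UDF code, and the toy dataset is invented for the example.

```python
# Sketch of the SGD update w(t+1) = w(t) - gamma(t) * grad(loss).
# Generic illustration (linear model, squared loss); not Twitter's Pig UDF code.

def sgd_step(w, x, y, gamma):
    """One SGD update for a linear model f(x; w) = w . x under squared loss."""
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    # Gradient of 0.5 * (f(x; w) - y)^2 with respect to w.
    grad = [(prediction - y) * xi for xi in x]
    return [wi - gamma * gi for wi, gi in zip(w, grad)]

# Toy data generated from the true weights [2.0, -1.0]; repeated passes
# over the data drive w toward them.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
for t in range(200):
    for x, y in data:
        w = sgd_step(w, x, y, gamma=0.1)
```

In the UDF setting described on the slide, a step like this would run inside Pig over distributed data rather than a local loop.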
4. Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (SAS).

[Diagram: separate silos — Data Mart, Statistical Analysis, Data Warehouse, Cube/MOLAP, OLTP]
5. Cloud: Many Tools One Platform
• Users no longer want to be concerned with what platform their data is in – just apply the tool to it
• SQL no longer the only or primary data access tool

[Diagram: the same systems — Data Mart, Statistical Analysis, Data Warehouse, Cube/MOLAP, OLTP — now sharing one platform]
6. Upside - Pick the Right Tool for the Job
7. Downside – Tools Don’t Play Well Together
• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user defined function interfaces
8. Downside – Wasted Developer Time
• Wastes developer time, since each tool supplies redundant functionality

[Diagram: the Pig and Hive stacks side by side — each has its own Parser, Optimizer, Physical Planner, and Executor; Hive also has Metadata]
9. Downside – Wasted Developer Time
• Wastes developer time, since each tool supplies redundant functionality

[Diagram: the same Pig and Hive stacks, with the overlap highlighted — Parser, Optimizer, Physical Planner, and Executor are duplicated across the two tools]
10. Conclusion: We Need Services
• We need to find a way to share services where we can.
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
11. Hadoop = Distributed Data Operating System

Service               | Hadoop Component             | Single Node Analogue
----------------------|------------------------------|-------------------------------------------
Table management      | HCatalog                     | RDBMS
User access control   | Hadoop                       | /etc/passwd, file system permissions, etc.
Resource management   | YARN                         | Process management
Notification          | HCatalog                     | Signals, semaphores, mutexes
REST/Connectors       | HCatalog, Hive, HBase, Oozie | Network layer
Batch data processing | Data Virtual Machine         | JVM

Legend (color-coded on the original slide): Exists | Pieces exist in this component | To be built
13. HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
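The "table paradigm that abstracts away storage details" can be illustrated with a toy catalog: tools ask for a table by name and get records back, without knowing how the bytes are stored. All names below are hypothetical illustrations, not the real HCatalog API.

```python
# Sketch of the idea behind HCatalog's table abstraction: tools ask for a
# table by name and get records back, regardless of how the data is stored.
# All names here are invented for illustration, not the real HCatalog API.

import csv
import io
import json

class TableCatalog:
    """Maps table names to (storage format, raw data) and hides the format."""
    def __init__(self):
        self._tables = {}

    def register(self, name, fmt, raw):
        self._tables[name] = (fmt, raw)

    def read(self, name):
        fmt, raw = self._tables[name]
        if fmt == "csv":
            return [row for row in csv.reader(io.StringIO(raw))]
        if fmt == "json":
            return [list(rec.values()) for rec in json.loads(raw)]
        raise ValueError("unknown format: " + fmt)

catalog = TableCatalog()
catalog.register("clicks", "csv", "alice,3\nbob,5\n")
catalog.register("users", "json", '[{"name": "alice", "age": 30}]')

# Any tool sees plain records; none of them knows the underlying format.
rows = catalog.read("clicks")
```

The point of the sketch is the shared code path: adding a storage format in one place makes it available to every tool that reads through the catalog.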
14. Data Access Without HCatalog

[Diagram: MapReduce, Hive, and Pig each reach HDFS through their own path — MapReduce via InputFormat/OutputFormat, Hive via SerDe, InputFormat/OutputFormat, and a Metastore Client talking to the Metastore, Pig via Load/Store functions]
15. Data & Metadata Access With HCatalog

[Diagram: MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer; both go through the shared SerDe, InputFormat/OutputFormat, and Metastore Client code path to HDFS and the Metastore; external systems reach the Metastore via REST]
16. Without HCatalog

Feature       | MapReduce       | Pig                                            | Hive
--------------|-----------------|------------------------------------------------|------------------------------------------
Record format | Key value pairs | Tuple                                          | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Encoded in app  | Declared in script or read by loader           | Read from metadata
Data location | Encoded in app  | Declared in script                             | Read from metadata
Data format   | Encoded in app  | Declared in script                             | Read from metadata
17. With HCatalog

Feature       | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
--------------|-------------------------------------------|------------------------------------------------|------------------------------------------
Record format | Record                                    | Tuple                                          | Record
Data model    | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Read from metadata                        | Read from metadata                             | Read from metadata
Data location | Read from metadata                        | Read from metadata                             | Read from metadata
Data format   | Read from metadata                        | Read from metadata                             | Read from metadata
18. YARN – Resource Manager
• Hadoop 1.0: HDFS plus MapReduce
• Hadoop 2.0: HDFS plus YARN Resource Manager, an interface for
developers to write parallel applications on top of the Hadoop cluster
• The Resource Manager provides:
– applications a way to request resources in the cluster
– allocation and scheduling of machine resources to the applications
• MapReduce is now an application provided inside YARN
• Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computation)
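The request/allocate cycle described above can be reduced to a toy model: applications ask a resource manager for capacity, and it grants or refuses based on what is free. This is a simplified sketch of the idea, not the actual YARN API.

```python
# Toy sketch of the YARN idea: applications request containers from a
# resource manager, which allocates cluster capacity to them.
# Simplified illustration only, not the real YARN API.

class ResourceManager:
    def __init__(self, total_memory_mb):
        self.free_mb = total_memory_mb
        self.allocations = {}           # app name -> granted memory (MB)

    def request(self, app, memory_mb):
        """Grant the request if capacity remains, else refuse it."""
        if memory_mb <= self.free_mb:
            self.free_mb -= memory_mb
            self.allocations[app] = self.allocations.get(app, 0) + memory_mb
            return True
        return False

    def release(self, app):
        """Return an application's memory to the free pool."""
        self.free_mb += self.allocations.pop(app, 0)

rm = ResourceManager(total_memory_mb=4096)
rm.request("mapreduce-job", 3072)        # granted
granted = rm.request("spark-job", 2048)  # refused: only 1024 MB free
rm.release("mapreduce-job")
```

In real YARN the negotiation happens between per-application ApplicationMasters and the ResourceManager, with scheduling policies far richer than this first-come-first-served sketch.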
20. Data Virtual Machine – Shared Batch Processing
• Recall our previous diagram of Pig and Hive

[Diagram: the Pig and Hive stacks again — Parser, Optimizer, Physical Planner, and Executor duplicated, with the overlap highlighted]
21. A VM That Provides
• Standard operators (equivalent of Java byte codes):
– Project
– Select
– Join
– Aggregate
– Sort
–…
• An optimizer that could:
– Choose the appropriate implementation of an operator based on physical data characteristics
– Dynamically re-optimize the plan based on information gathered while executing the plan
• Shared execution layer
– Can provide its own YARN application master and improve on the MapReduce paradigm for batch processing
• Shared User Defined Function (UDF) framework
– User code works across systems
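A toy version of such a VM — standard operators plus an optimizer that picks a join implementation from physical characteristics (here, just input size) — might look like the sketch below. All of it is invented for illustration, not an existing system.

```python
# Sketch of a tiny "data VM": standard relational operators composed into a
# plan, with the optimizer idea reduced to choosing a join implementation
# based on the size of one input. Invented illustration only.

def project(rows, cols):
    """Keep only the listed column positions of each row."""
    return [tuple(r[c] for c in cols) for r in rows]

def select(rows, pred):
    """Keep only the rows satisfying the predicate."""
    return [r for r in rows if pred(r)]

def join(left, right, lkey, rkey, memory_limit=1000):
    # "Optimizer": build an in-memory hash table when the right side fits;
    # otherwise fall back to a (sketched) nested-loop strategy standing in
    # for an out-of-memory join.
    if len(right) <= memory_limit:
        table = {}
        for r in right:
            table.setdefault(r[rkey], []).append(r)
        return [l + r for l in left for r in table.get(l[lkey], [])]
    return [l + r for l in sorted(left, key=lambda r: r[lkey])
            for r in right if r[rkey] == l[lkey]]

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]
plan = project(join(users, orders, lkey=0, rkey=0), cols=[1, 3])
```

The design point mirrors the slide: scripts compile to a small set of standard operators, and the choice of physical implementation is the optimizer's job, not the script author's.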
22. Taking Advantage of YARN – MR*

[Diagram: two chained MapReduce jobs — the first job's reducers write to HDFS, and the second job's mappers read it back]
23. Taking Advantage of YARN – MR*

[Diagram: the same chained jobs, with a callout on the second job's map phase: "Why do I need these maps?"]
24. Taking Advantage of YARN – MR*

[Diagram: the first job's reducers feed the next job's reducers directly, skipping the intermediate HDFS write and the extra map phase]
• Removed an entire write/read cycle of HDFS
• Still want to checkpoint sometimes
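The trade-off on this slide — skip the per-job HDFS round trip but still checkpoint occasionally for failure recovery — can be sketched as a simple pipeline runner. A hypothetical illustration, not MR* itself.

```python
# Sketch of the MR* trade-off: pass results stage-to-stage in memory instead
# of writing to HDFS between every job, but still materialize a checkpoint
# every few stages so a failure does not rerun the whole pipeline.
# Invented illustration only.

def run_pipeline(data, stages, checkpoint_every=2):
    checkpoints = {}                     # stage index -> materialized result
    for i, stage in enumerate(stages):
        data = stage(data)               # no durable write between stages
        if (i + 1) % checkpoint_every == 0:
            checkpoints[i] = list(data)  # occasional durable checkpoint
    return data, checkpoints

stages = [
    lambda rows: [r * 2 for r in rows],      # "map"-like stage
    lambda rows: [r + 1 for r in rows],      # another stage
    lambda rows: [r for r in rows if r > 3], # filter stage
]
result, checkpoints = run_pipeline([1, 2, 3], stages)
```

Tuning `checkpoint_every` is exactly the "still want to checkpoint sometimes" bullet: fewer checkpoints mean faster runs but more re-execution on failure.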
25. Taking Advantage of YARN – In Memory Data Transfer

[Diagram: map tasks feeding reduce tasks]
26. Taking Advantage of YARN – In Memory Data Transfer

[Diagram: the map-to-reduce shuffle, with a callout: "These are writes to disk"]
Switching the shuffle to in memory instead of on disk:
• Better performance
• Data must also be spilled to disk for retry-ability and to handle memory overflow
• Will benefit from stronger guarantees of simultaneous execution
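The spill behavior in the second bullet can be sketched as a buffer that holds records in memory up to a limit, then writes them to a spill file; draining merges the spills with the in-memory remainder. A simplified illustration, not Hadoop's shuffle implementation.

```python
# Sketch of an in-memory shuffle buffer that spills to disk on overflow, so
# data survives memory pressure and can be re-read for retries.
# Simplified illustration, not Hadoop's actual shuffle code.

import os
import pickle
import tempfile

class ShuffleBuffer:
    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.in_memory = []
        self.spill_files = []

    def add(self, record):
        self.in_memory.append(record)
        if len(self.in_memory) > self.memory_limit:
            self._spill()

    def _spill(self):
        # Memory overflow: write buffered records to a spill file on disk.
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.in_memory, f)
        self.spill_files.append(path)
        self.in_memory = []

    def drain(self):
        # Merge spilled records with whatever is still in memory.
        records = []
        for path in self.spill_files:
            with open(path, "rb") as f:
                records.extend(pickle.load(f))
            os.remove(path)
        records.extend(self.in_memory)
        return records

buf = ShuffleBuffer(memory_limit=3)
for i in range(10):
    buf.add(i)
out = buf.drain()
```

The fast path (no spill) never touches disk; the slow path preserves correctness, which is the balance the slide describes.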
27. On the Fly Optimization
• Traditionally databases do all optimization up front based on statistics
– But often there are no statistics in Hadoop
– Languages like Pig Latin allow very long series of operations that make up-front estimates unrealistic
• Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information

[Diagram: two MapReduce jobs feeding a Hash Join]
28. On the Fly Optimization

[Diagram: the same plan, now annotated with an observation made at run time — the first job's output fits in memory as it feeds the Hash Join]
29. On the Fly Optimization

[Diagram: having observed that the output fits in memory, the plan is rewritten on the fly — the small output is loaded into the distributed cache and the Hash Join becomes a Map-side Join]
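The decision sketched across these slides — measure the actual size of an intermediate output, then pick a map-side (broadcast) join if it fits in memory and a reduce-side join otherwise — can be reduced to a few lines. The function names and threshold below are invented for illustration.

```python
# Sketch of on-the-fly join-strategy selection: the strategy is not fixed up
# front but chosen from statistics observed while the plan runs.
# Names and threshold are invented for illustration.

def choose_join_strategy(observed_rows, memory_limit):
    """Pick a map-side (broadcast) join only when the small side fits in memory."""
    return "map-side join" if observed_rows <= memory_limit else "reduce-side join"

def run_join(left, right, memory_limit=100):
    # In a real system len(right) would be a statistic gathered by the
    # upstream job; here we simply measure the in-memory list.
    strategy = choose_join_strategy(len(right), memory_limit)
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    # Both strategies produce the same rows; only the execution plan differs.
    joined = [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [])]
    return strategy, joined

strategy, joined = run_join([(1, "a"), (2, "b")], [(1, "x"), (3, "y")])
```

The payoff is exactly the slide's: a plan chosen from observed data rather than from estimates that are often unavailable or unrealistic in Hadoop.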
Editor's Notes
• This is how we tend to think of Big Data.
• Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up, and it is hard to apply different types of processing without moving data around.
• Hive is the only SQL-based app in this pile.
• Other apps are still in the picture; it's not like Hadoop is displacing everything.