1. Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apache Hive
Big Data Spain 2012
http://www.bigdataspain.org/
Alan F. Gates
@alanfgates
Page 1
2. Big Data = Terabytes, Petabytes, …
Image Credit: Gizmodo
© Hortonworks 2012
Page 2
3. But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user-defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning on their data:

w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
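This update rule can be sketched directly in code. The following is a generic illustration of stochastic gradient descent for a linear model with squared loss — not Twitter's actual Pig UDF code, and the toy dataset is invented for the example.

```python
# Sketch of the SGD update w(t+1) = w(t) - gamma(t) * grad(loss).
# Generic illustration (linear model, squared loss); not Twitter's Pig UDF code.

def sgd_step(w, x, y, gamma):
    """One SGD update for a linear model f(x; w) = w . x under squared loss."""
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    # Gradient of 0.5 * (f(x; w) - y)^2 with respect to w.
    grad = [(prediction - y) * xi for xi in x]
    return [wi - gamma * gi for wi, gi in zip(w, grad)]

# Toy data generated from the true weights [2.0, -1.0]; repeated passes
# over the data drive w toward them.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
for t in range(200):
    for x, y in data:
        w = sgd_step(w, x, y, gamma=0.1)
```

In the UDF setting described on the slide, a step like this would run inside Pig over distributed data rather than a local loop.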
4. Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (SAS).

[Diagram: separate silos — Data Mart, Statistical Analysis, Data Warehouse, Cube/MOLAP, OLTP]
5. Cloud: Many Tools One Platform
• Users no longer want to be concerned with what platform their data is in – just apply the tool to it
• SQL no longer the only or primary data access tool

[Diagram: the same systems — Data Mart, Statistical Analysis, Data Warehouse, Cube/MOLAP, OLTP — now sharing one platform]
6. Upside - Pick the Right Tool for the Job
7. Downside – Tools Don’t Play Well Together
• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user defined function interfaces
8. Downside – Wasted Developer Time
• Wastes developer time, since each tool supplies redundant functionality

[Diagram: the Pig and Hive stacks side by side — each has its own Parser, Optimizer, Physical Planner, and Executor; Hive also has Metadata]
9. Downside – Wasted Developer Time
• Wastes developer time, since each tool supplies redundant functionality

[Diagram: the same Pig and Hive stacks, with the overlap highlighted — Parser, Optimizer, Physical Planner, and Executor are duplicated across the two tools]
10. Conclusion: We Need Services
• We need to find a way to share services where we can.
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
11. Hadoop = Distributed Data Operating System

Service               | Hadoop Component             | Single Node Analogue
----------------------|------------------------------|-------------------------------------------
Table management      | HCatalog                     | RDBMS
User access control   | Hadoop                       | /etc/passwd, file system permissions, etc.
Resource management   | YARN                         | Process management
Notification          | HCatalog                     | Signals, semaphores, mutexes
REST/Connectors       | HCatalog, Hive, HBase, Oozie | Network layer
Batch data processing | Data Virtual Machine         | JVM

Legend (color-coded on the original slide): Exists | Pieces exist in this component | To be built
13. HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
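The "table paradigm that abstracts away storage details" can be illustrated with a toy catalog: tools ask for a table by name and get records back, without knowing how the bytes are stored. All names below are hypothetical illustrations, not the real HCatalog API.

```python
# Sketch of the idea behind HCatalog's table abstraction: tools ask for a
# table by name and get records back, regardless of how the data is stored.
# All names here are invented for illustration, not the real HCatalog API.

import csv
import io
import json

class TableCatalog:
    """Maps table names to (storage format, raw data) and hides the format."""
    def __init__(self):
        self._tables = {}

    def register(self, name, fmt, raw):
        self._tables[name] = (fmt, raw)

    def read(self, name):
        fmt, raw = self._tables[name]
        if fmt == "csv":
            return [row for row in csv.reader(io.StringIO(raw))]
        if fmt == "json":
            return [list(rec.values()) for rec in json.loads(raw)]
        raise ValueError("unknown format: " + fmt)

catalog = TableCatalog()
catalog.register("clicks", "csv", "alice,3\nbob,5\n")
catalog.register("users", "json", '[{"name": "alice", "age": 30}]')

# Any tool sees plain records; none of them knows the underlying format.
rows = catalog.read("clicks")
```

The point of the sketch is the shared code path: adding a storage format in one place makes it available to every tool that reads through the catalog.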
14. Data Access Without HCatalog

[Diagram: MapReduce, Hive, and Pig each reach HDFS through their own path — MapReduce via InputFormat/OutputFormat, Hive via SerDe, InputFormat/OutputFormat, and a Metastore Client talking to the Metastore, Pig via Load/Store functions]
15. Data & Metadata Access With HCatalog

[Diagram: MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer; both go through the shared SerDe, InputFormat/OutputFormat, and Metastore Client code path to HDFS and the Metastore; external systems reach the Metastore via REST]
16. Without HCatalog

Feature       | MapReduce       | Pig                                            | Hive
--------------|-----------------|------------------------------------------------|------------------------------------------
Record format | Key value pairs | Tuple                                          | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Encoded in app  | Declared in script or read by loader           | Read from metadata
Data location | Encoded in app  | Declared in script                             | Read from metadata
Data format   | Encoded in app  | Declared in script                             | Read from metadata
17. With HCatalog

Feature       | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
--------------|-------------------------------------------|------------------------------------------------|------------------------------------------
Record format | Record                                    | Tuple                                          | Record
Data model    | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Read from metadata                        | Read from metadata                             | Read from metadata
Data location | Read from metadata                        | Read from metadata                             | Read from metadata
Data format   | Read from metadata                        | Read from metadata                             | Read from metadata
18. YARN – Resource Manager
• Hadoop 1.0: HDFS plus MapReduce
• Hadoop 2.0: HDFS plus YARN Resource Manager, an interface for
developers to write parallel applications on top of the Hadoop cluster
• The Resource Manager provides:
– applications a way to request resources in the cluster
– allocation and scheduling of machine resources to the applications
• MapReduce is now an application provided inside YARN
• Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computation)
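The request/allocate cycle described above can be reduced to a toy model: applications ask a resource manager for capacity, and it grants or refuses based on what is free. This is a simplified sketch of the idea, not the actual YARN API.

```python
# Toy sketch of the YARN idea: applications request containers from a
# resource manager, which allocates cluster capacity to them.
# Simplified illustration only, not the real YARN API.

class ResourceManager:
    def __init__(self, total_memory_mb):
        self.free_mb = total_memory_mb
        self.allocations = {}           # app name -> granted memory (MB)

    def request(self, app, memory_mb):
        """Grant the request if capacity remains, else refuse it."""
        if memory_mb <= self.free_mb:
            self.free_mb -= memory_mb
            self.allocations[app] = self.allocations.get(app, 0) + memory_mb
            return True
        return False

    def release(self, app):
        """Return an application's memory to the free pool."""
        self.free_mb += self.allocations.pop(app, 0)

rm = ResourceManager(total_memory_mb=4096)
rm.request("mapreduce-job", 3072)        # granted
granted = rm.request("spark-job", 2048)  # refused: only 1024 MB free
rm.release("mapreduce-job")
```

In real YARN the negotiation happens between per-application ApplicationMasters and the ResourceManager, with scheduling policies far richer than this first-come-first-served sketch.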
20. Data Virtual Machine – Shared Batch Processing
• Recall our previous diagram of Pig and Hive

[Diagram: the Pig and Hive stacks again — Parser, Optimizer, Physical Planner, and Executor duplicated, with the overlap highlighted]
21. A VM That Provides
• Standard operators (equivalent of Java byte codes):
– Project
– Select
– Join
– Aggregate
– Sort
–…
• An optimizer that could:
– Choose the appropriate implementation of an operator based on physical data characteristics
– Dynamically re-optimize the plan based on information gathered while executing the plan
• Shared execution layer
– Can provide its own YARN application master and improve on the MapReduce paradigm for batch processing
• Shared User Defined Function (UDF) framework
– User code works across systems
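A toy version of such a VM — standard operators plus an optimizer that picks a join implementation from physical characteristics (here, just input size) — might look like the sketch below. All of it is invented for illustration, not an existing system.

```python
# Sketch of a tiny "data VM": standard relational operators composed into a
# plan, with the optimizer idea reduced to choosing a join implementation
# based on the size of one input. Invented illustration only.

def project(rows, cols):
    """Keep only the listed column positions of each row."""
    return [tuple(r[c] for c in cols) for r in rows]

def select(rows, pred):
    """Keep only the rows satisfying the predicate."""
    return [r for r in rows if pred(r)]

def join(left, right, lkey, rkey, memory_limit=1000):
    # "Optimizer": build an in-memory hash table when the right side fits;
    # otherwise fall back to a (sketched) nested-loop strategy standing in
    # for an out-of-memory join.
    if len(right) <= memory_limit:
        table = {}
        for r in right:
            table.setdefault(r[rkey], []).append(r)
        return [l + r for l in left for r in table.get(l[lkey], [])]
    return [l + r for l in sorted(left, key=lambda r: r[lkey])
            for r in right if r[rkey] == l[lkey]]

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]
plan = project(join(users, orders, lkey=0, rkey=0), cols=[1, 3])
```

The design point mirrors the slide: scripts compile to a small set of standard operators, and the choice of physical implementation is the optimizer's job, not the script author's.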
22. Taking Advantage of YARN – MR*

[Diagram: two chained MapReduce jobs — the first job's reducers write to HDFS, and the second job's mappers read it back]
23. Taking Advantage of YARN – MR*

[Diagram: the same chained jobs, with a callout on the second job's map phase: "Why do I need these maps?"]
24. Taking Advantage of YARN – MR*

[Diagram: the first job's reducers feed the next job's reducers directly, skipping the intermediate HDFS write and the extra map phase]
• Removed an entire write/read cycle of HDFS
• Still want to checkpoint sometimes
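The trade-off on this slide — skip the per-job HDFS round trip but still checkpoint occasionally for failure recovery — can be sketched as a simple pipeline runner. A hypothetical illustration, not MR* itself.

```python
# Sketch of the MR* trade-off: pass results stage-to-stage in memory instead
# of writing to HDFS between every job, but still materialize a checkpoint
# every few stages so a failure does not rerun the whole pipeline.
# Invented illustration only.

def run_pipeline(data, stages, checkpoint_every=2):
    checkpoints = {}                     # stage index -> materialized result
    for i, stage in enumerate(stages):
        data = stage(data)               # no durable write between stages
        if (i + 1) % checkpoint_every == 0:
            checkpoints[i] = list(data)  # occasional durable checkpoint
    return data, checkpoints

stages = [
    lambda rows: [r * 2 for r in rows],      # "map"-like stage
    lambda rows: [r + 1 for r in rows],      # another stage
    lambda rows: [r for r in rows if r > 3], # filter stage
]
result, checkpoints = run_pipeline([1, 2, 3], stages)
```

Tuning `checkpoint_every` is exactly the "still want to checkpoint sometimes" bullet: fewer checkpoints mean faster runs but more re-execution on failure.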
25. Taking Advantage of YARN – In Memory Data Transfer

[Diagram: map tasks feeding reduce tasks]
26. Taking Advantage of YARN – In Memory Data Transfer

[Diagram: the map-to-reduce shuffle, with a callout: "These are writes to disk"]
Switching the shuffle to in memory instead of on disk:
• Better performance
• Data must also be spilled to disk for retry-ability and to handle memory overflow
• Will benefit from stronger guarantees of simultaneous execution
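The spill behavior in the second bullet can be sketched as a buffer that holds records in memory up to a limit, then writes them to a spill file; draining merges the spills with the in-memory remainder. A simplified illustration, not Hadoop's shuffle implementation.

```python
# Sketch of an in-memory shuffle buffer that spills to disk on overflow, so
# data survives memory pressure and can be re-read for retries.
# Simplified illustration, not Hadoop's actual shuffle code.

import os
import pickle
import tempfile

class ShuffleBuffer:
    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.in_memory = []
        self.spill_files = []

    def add(self, record):
        self.in_memory.append(record)
        if len(self.in_memory) > self.memory_limit:
            self._spill()

    def _spill(self):
        # Memory overflow: write buffered records to a spill file on disk.
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.in_memory, f)
        self.spill_files.append(path)
        self.in_memory = []

    def drain(self):
        # Merge spilled records with whatever is still in memory.
        records = []
        for path in self.spill_files:
            with open(path, "rb") as f:
                records.extend(pickle.load(f))
            os.remove(path)
        records.extend(self.in_memory)
        return records

buf = ShuffleBuffer(memory_limit=3)
for i in range(10):
    buf.add(i)
out = buf.drain()
```

The fast path (no spill) never touches disk; the slow path preserves correctness, which is the balance the slide describes.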
27. On the Fly Optimization
• Traditionally databases do all optimization up front based on statistics
– But often there are no statistics in Hadoop
– Languages like Pig Latin allow very long series of operations that make up-front estimates unrealistic
• Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information

[Diagram: two MapReduce jobs feeding a Hash Join]
28. On the Fly Optimization

[Diagram: the same plan, now annotated with an observation made at run time — the first job's output fits in memory as it feeds the Hash Join]
29. On the Fly Optimization

[Diagram: having observed that the output fits in memory, the plan is rewritten on the fly — the small output is loaded into the distributed cache and the Hash Join becomes a Map-side Join]
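The decision sketched across these slides — measure the actual size of an intermediate output, then pick a map-side (broadcast) join if it fits in memory and a reduce-side join otherwise — can be reduced to a few lines. The function names and threshold below are invented for illustration.

```python
# Sketch of on-the-fly join-strategy selection: the strategy is not fixed up
# front but chosen from statistics observed while the plan runs.
# Names and threshold are invented for illustration.

def choose_join_strategy(observed_rows, memory_limit):
    """Pick a map-side (broadcast) join only when the small side fits in memory."""
    return "map-side join" if observed_rows <= memory_limit else "reduce-side join"

def run_join(left, right, memory_limit=100):
    # In a real system len(right) would be a statistic gathered by the
    # upstream job; here we simply measure the in-memory list.
    strategy = choose_join_strategy(len(right), memory_limit)
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    # Both strategies produce the same rows; only the execution plan differs.
    joined = [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [])]
    return strategy, joined

strategy, joined = run_join([(1, "a"), (2, "b")], [(1, "x"), (3, "y")])
```

The payoff is exactly the slide's: a plan chosen from observed data rather than from estimates that are often unavailable or unrealistic in Hadoop.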
Editor's Notes
• This is how we tend to think of Big Data.
• Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up, and it is hard to apply different types of processing without moving data around.
• Hive is the only SQL-based app in this pile.
• Other apps are still in the picture; it's not like Hadoop is displacing everything.