Coordinating the Many
        Tools of Big Data
Strata 2013

Alan F. Gates
@alanfgates




                              Page 1
Big Data = Terabytes, Petabytes, …




Image Credit: Gizmodo
             © Hortonworks 2013
But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit
  2012 on calculations Twitter is doing via UDFs in Pig.
  This equation uses stochastic gradient descent to do
  machine learning with their data:



   w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)

   (ℓ is the loss function, γ(t) the learning rate at step t)




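The update above can be sketched in a few lines. This is an illustrative stand-in (plain Python, squared loss on a one-feature linear model), not Twitter's actual Pig UDF code; all names and data are invented for the example:

```python
# Illustrative stochastic gradient descent, mirroring the slide's update:
#   w(t+1) = w(t) - gamma(t) * gradient of loss(f(x; w(t)), y)

def sgd(samples, w=0.0, gamma=0.1, epochs=50):
    """samples: list of (x, y) pairs; fits y ~ w * x."""
    for t in range(epochs):
        step = gamma / (1 + t)            # decaying learning rate gamma(t)
        for x, y in samples:
            pred = w * x                  # f(x; w(t))
            grad = 2 * (pred - y) * x     # d/dw of (pred - y)^2
            w -= step * grad              # the update on the slide
    return w

# Fit y = 3x from a handful of points.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w_hat = sgd(data)
```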
And New Tools
• Apache Hadoop brings with it a large selection of tools
  and paradigms
   – Apache HBase, Apache Cassandra – Distributed, high-volume
     reads and writes of individual data records
   – Apache Hive - SQL
   – Apache Pig, Cascading – Data flow programming for ETL, data
     modeling, and exploration
   – Apache Giraph – Graph processing
   – MapReduce – Batch processing
   – Storm, S4 – Stream processing
   – Plus lots of commercial offerings




Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g.
  SAS).



[Diagram: separate systems, each on its own platform – Data Warehouse, Data Mart, Cube/MOLAP, OLTP, Statistical Analysis]
Cloud: Many Tools One Platform
   • Users no longer want to be concerned with what platform their data is in – just
     apply the tool to it
   • SQL no longer the only or primary data access tool

[Diagram: the same workloads – Data Warehouse, Data Mart, Cube/MOLAP, OLTP, Statistical Analysis – converging on one shared platform]
Upside - Pick the Right Tool for the Job




Downside – Tools Don’t Play Well Together

• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user-defined function interfaces




Downside – Wasted Developer Time
• Wastes developer time since each tool supplies redundant
  functionality

[Diagram: Pig's stack (Parser, Optimizer, Physical Planner, Executor) beside Hive's (Parser, Metadata, Optimizer, Physical Planner, Executor), with the overlapping components highlighted]
Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense




Hadoop = Distributed Data Operating System

Service                      Hadoop Component
Table management             Hive
Access to metadata           HCatalog
User authentication          Knox
Resource management          YARN
Notification                 HCatalog
REST/connectors              WebHCat, WebHDFS, Hive, HBase, Oozie
Relational data processing   Tez

(Slide color legend: exists / pieces exist in this component / new project)
HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside
  Hadoop
• Presents tools with a table paradigm that abstracts away
  storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access




[Diagram: Hive and the Metastore at the center; Pig reads and writes tables through HCatLoader, MapReduce through HCatInputFormat, and external systems reach the metadata over REST via WebHCat]
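To make the REST path concrete, an external system might ask WebHCat for a database's tables with a URL like the one below. The host and user are assumptions for illustration; WebHCat's default port is 50111:

```python
from urllib.parse import urlencode

def webhcat_list_tables_url(host, db, user, port=50111):
    """Build the WebHCat (Templeton) URL that lists the tables in a database."""
    qs = urlencode({"user.name": user})
    return f"http://{host}:{port}/templeton/v1/ddl/database/{db}/table?{qs}"

# An external system would GET this URL and receive a JSON list of tables.
url = webhcat_list_tables_url("hcat.example.com", "default", "alan")
```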
Tez – Moving Beyond MapReduce
• Low level data-processing execution engine
• Use it for the base of MapReduce, Hive, Pig, Cascading
  etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of
  the queue between steps in the pipeline
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• Built on YARN



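The difference between materializing output between jobs and pipelining can be sketched with generators. This is a toy analogy, not Tez code: the MR-style version writes out each stage's full result before the next stage reads it, while the pipelined version streams rows straight through:

```python
def mr_style(rows):
    # Each "job" materializes its full output (as each MapReduce job
    # writes to HDFS) before the next one starts.
    stage1 = [r * 2 for r in rows]
    stage2 = [r + 1 for r in stage1]
    return sum(stage2)

def pipelined(rows):
    # Generators stream rows through both steps with no intermediate copy,
    # roughly how Tez avoids writing intermediate output.
    stage1 = (r * 2 for r in rows)
    stage2 = (r + 1 for r in stage1)
    return sum(stage2)
```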
Pig/Hive-MR versus Pig/Hive-Tez

    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state

[Diagram: on MapReduce this plan runs as three jobs (Job 1, Job 2, Job 3) with an I/O synchronization barrier (an intermediate HDFS write) between each pair; on Tez the same plan runs as a single job with no barriers]
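To make the plan concrete, here is the query's logic in plain Python over made-up rows (the tables and values are invented for illustration; the talk shows only the SQL, and this simple lookup join assumes unique keys in b and c):

```python
def run_query(a, b, c):
    """JOIN a-b on id, JOIN a-c on itemId, then per-state COUNT(*) and AVG(price)."""
    b_ids = {row["id"] for row in b}                        # join key index for b
    prices = {row["itemId"]: row["price"] for row in c}     # join key index for c
    groups = {}
    for row in a:
        if row["id"] in b_ids and row["itemId"] in prices:  # both joins match
            cnt, total = groups.get(row["state"], (0, 0.0))
            groups[row["state"]] = (cnt + 1, total + prices[row["itemId"]])
    return {state: (cnt, total / cnt) for state, (cnt, total) in groups.items()}

a = [{"id": 1, "itemId": 10, "state": "CA"},
     {"id": 2, "itemId": 11, "state": "CA"},
     {"id": 3, "itemId": 10, "state": "OR"}]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]
result = run_query(a, b, c)
```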
FastQuery: Beyond Batch with YARN
• Tez generalizes Map-Reduce: simplified execution plans process data
  more efficiently
• Always-on Tez service: low-latency processing for all Hadoop data
  processing
Knox – Single Sign On




Today’s Access Options
• Direct Access
   – Access services via REST (WebHDFS, WebHCat)
   – Needs knowledge of and access to the whole cluster
   – Security handled by each component in the cluster
   – Kerberos details exposed to users

         User --{REST}--> Hadoop Cluster

• Gateway / Portal Nodes
   – Dedicated nodes behind the firewall
   – Users SSH to a node to access Hadoop services

         User --SSH--> GW Node --> Hadoop Cluster
Knox Design Goals
• Operators can firewall cluster without end user access to
  “gateway node”
• Users see one cluster end-point that aggregates
  capabilities for data access, metadata and job control
• Provide perimeter security to make Hadoop security setup
  easier
• Enable integration with enterprise and cloud identity
  management environments




Perimeter Verification & Authentication
• Authentication: establish identity at the gateway by authenticating
  against an identity provider (KDC, AD, or LDAP)
• Verification: the cluster verifies the identity token the gateway
  forwards (SAML, propagation of identity)

[Diagram: Client --{REST}--> Knox Gateway --> Hadoop Cluster; the gateway authenticates users against the user store (KDC, AD, LDAP), then forwards requests to WebHDFS (NameNode, DataNodes) and WebHCat (JobTracker, Hive, HCatalog), where the propagated identity is verified]
Thank You





Editor's Notes

  • #3 This is how we tend to think of Big Data.
  • #6 Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up, and it is hard to apply different types of processing without moving data around.
  • #7 Hive is the only SQL-based app in this pile. The other apps are still in the picture; it’s not like Hadoop is displacing everything.