1. Hadoop workshop
Cloud Connect Shanghai
Sep 15, 2013
Ari Flink – Operations Architect
Mac Fang – Manager, Hadoop development
Dean Zhu – Hadoop Developer
2. Agenda
1. Introductions (5 minutes)
2. Hadoop and Big Data Concepts (20 minutes)
3. Cisco Webex Hadoop architecture (10 minutes)
4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)
5. Exercise 1 (30 minutes)
– Configure a Hadoop single node VM on a laptop
6. Hive and Impala concepts (15 minutes)
7. Exercise 2 (30 minutes)
– Analytics using Apache Hive and Cloudera Impala
8. Q & A
3. Hadoop and Big Data Overview
– Enterprise data management and big data
– Problems, Opportunities and Use case examples
– Hadoop architecture concepts
4. What is Big Data?
For our purposes, big data refers to distributed computing
architectures specifically aimed at the “3 V’s” of data: Volume,
Velocity, and Variety
5. Traditional Enterprise Data Management
Operational (OLTP) systems → ETL → EDW → BI/Reports
– OLTP: Online Transaction Processing (multiple operational systems feed the pipeline)
– ETL: Extract, Transform, and Load
– EDW: Enterprise Data Warehouse
– BI: Business Intelligence (batch processing)
6. Traditional Business Intelligence Questions
Transactional Data (e.g. OLTP): real-time, but limited reporting/analytics
• What are the top 5 most active stocks traded in the last hour?
• How many new purchase orders have we received since noon?
Enterprise Data Warehouse: high value, structured, indexed, cleansed
• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
• What were the top 10 most frequently backordered products over the past year?
7. So what has changed?
The Explosion of Unstructured Data
1.8 trillion gigabytes of data was created in 2011…
• Approx. 500 quadrillion files
• More than 90% is unstructured data
• Quantity doubles every 2 years
• Most unstructured data is neither stored nor analyzed!
[Chart: GB of data (in billions), 2005-2015; unstructured data dwarfs structured data]
Source: Cloudera
8. Enterprise Data Management with Big Data
The traditional pipeline remains: Operational (OLTP) systems → ETL → MPP EDW → BI/Reports and in-memory analytics.
New sources (web and machine data) flow into a big data tier (Hadoop, etc.), which feeds web dashboards directly and exchanges data with the MPP EDW via ETL.
9. Traditional Business Intelligence Questions
Transactional Data (e.g. OLTP): fast data, real-time
• What are the top 5 most active stocks traded in the last hour?
• How many new purchase orders have we received since noon?
Enterprise Data Warehouse: high value, structured, indexed, cleansed
• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
• What were the top 10 most frequently backordered products over the past year?
Big Data: lower value, semi-structured, multi-source, raw/"dirty"
• Which products do customers click on the most and/or spend the most time browsing without buying?
• How do we optimally set pricing for each product in each store for individual customers every day?
• Did the recent marketing launch generate the expected online buzz, and did that translate to sales?
10. Example: Web and Location Analytics
An iPhone searches Amazon for Vizio TVs in Electronics, leaving a proxy log entry:
1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET
http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone;
CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko)
Version/5.1 Mobile/9A405 Safari/7534.48.3"
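Log lines like this are exactly what ends up in HDFS for analysis. As a minimal, hypothetical Java sketch (not part of the deck; the field layout is inferred from the sample above), the interesting fields can be pulled out like so:

import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy parser for proxy log lines shaped like the sample above:
// epoch-seconds client-ip result/status bytes method url ... "user-agent"
public class ProxyLogParser {
    private static final Pattern LINE =
        Pattern.compile("^(\\S+) (\\S+) (\\S+) (\\d+) (\\S+) (\\S+).*\"(.*)\"\\s*$");

    public static void main(String[] args) {
        String line = "1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET "
                + "http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv "
                + "\"Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X)\"";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            // The leading field is epoch seconds, with milliseconds
            Date when = new Date((long) (Double.parseDouble(m.group(1)) * 1000));
            System.out.println("time=" + when + " client=" + m.group(2)
                    + " status=" + m.group(3) + " bytes=" + m.group(4)
                    + " url=" + m.group(6) + " userAgent=" + m.group(7));
        }
    }
}

In a mapper, the same extraction would run once per line, emitting pairs such as (URL, 1) or (client IP, bytes).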
11. Big Data and Key Infrastructure Attributes
(What big data isn't)
Usually not blade servers (not enough local storage)
Usually not virtualized (hypervisor only adds overhead)
Usually not highly oversubscribed (significant east-west traffic)
Usually not SAN/NAS ($$$)
Instead: a low-cost, DAS-based, scale-out clustered filesystem
Move the compute to the storage
12. Cost, Performance, and Capacity
Structured data, relational enterprise database: ~$20K/TB (HW:SW $ split 30:70)
Massive scale-out column store: ~$10K/TB
Unstructured data, Hadoop/NoSQL: ~$300-$1K/TB (HW:SW $ split 70:30)
Typical unstructured sources: machine logs, web click stream, call data records, satellite feeds, GPS data, sensor readings, sales data, blogs, emails, video
13. Big Data Software Architectures
14. Three basic big data software architectures
Real-time NoSQL: fast key-value store/retrieve
• HBase (part of Apache Hadoop)*
• DataStax (Cassandra)*
• Oracle NoSQL*
• Amazon Dynamo
Batch-oriented Hadoop: heavy lifting, processing
• Cloudera*
• MapR*
• Intel Hadoop*
• Pivotal HD*
MPP Relational Database: scale-out BI/DW
• Greenplum DB (Pivotal DB)*
• ParAccel*
• Vertica
• Netezza
• Teradata
*Cisco Partners
15. What Is Hadoop?
Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.
Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.
16. Hadoop Components and Operations
Hadoop Distributed File System (HDFS): scalable & fault tolerant
– The filesystem is distributed, stored across all data nodes in the cluster
– Files are divided into multiple large blocks: 64MB default, typically 128MB-512MB
– Data is stored reliably; each block is replicated 3 times by default
Types of node functions
– Name Node: manages HDFS
– Job Tracker: manages MapReduce jobs
– Data Node/Task Tracker: stores blocks / does the work
[Diagram: a file split into blocks 1-6 and distributed across data nodes 1-13 under ToR FEX/switches, alongside the Name Node and Job Tracker]
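A minimal sketch of these concepts from the HDFS client API (the NameNode address is a placeholder, and the property names shown are the Hadoop 1.x-era ones; they vary by version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020");            // placeholder NameNode
        conf.set("dfs.block.size", String.valueOf(128L * 1024 * 1024)); // 128MB blocks
        conf.set("dfs.replication", "3");                               // default replication
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/foo.txt");
        FSDataOutputStream out = fs.create(file);  // NameNode chooses the data nodes
        out.writeBytes("hello hdfs\n");
        out.close();

        FileStatus st = fs.getFileStatus(file);
        System.out.println("block size = " + st.getBlockSize()
                + ", replication = " + st.getReplication());
    }
}

Note that the client never writes through the Name Node: it asks it which data nodes to use, then streams blocks to those nodes directly.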
17. HDFS Architecture
[Diagram: blocks 1-4, each replicated three times, spread across data nodes 1-15 under three ToR FEX/switches and a core switch]
The Name Node tracks which blocks make up each file and which data nodes hold each block, e.g.:
/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4
Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3
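As a toy Java sketch (illustrative names, not Hadoop internals), the Name Node's bookkeeping amounts to two maps: file → blocks, which is persisted, and block → data nodes, which is rebuilt from data node block reports:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the Name Node metadata on this slide.
public class NameNodeMetadata {
    public static void main(String[] args) {
        Map<String, List<String>> fileToBlocks = new HashMap<String, List<String>>();
        fileToBlocks.put("/usr/sean/foo.txt", Arrays.asList("blk_1", "blk_2"));
        fileToBlocks.put("/usr/jacob/bar.txt", Arrays.asList("blk_3", "blk_4"));

        Map<String, List<String>> blockToNodes = new HashMap<String, List<String>>();
        blockToNodes.put("blk_1", Arrays.asList("Data node 1"));
        blockToNodes.put("blk_2", Arrays.asList("Data node 2"));
        blockToNodes.put("blk_3", Arrays.asList("Data node 2", "Data node 3"));

        // A client reading a file asks for its blocks, then for each block's
        // locations, and finally reads from those data nodes directly.
        for (String blk : fileToBlocks.get("/usr/sean/foo.txt")) {
            List<String> nodes = blockToNodes.containsKey(blk)
                    ? blockToNodes.get(blk) : Collections.<String>emptyList();
            System.out.println(blk + " -> " + nodes);
        }
    }
}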
18. Rack Awareness
[Diagram: data nodes 1-15 grouped into logical "racks" 1-3, with each block's replicas spread across racks]
Rack Awareness gives Hadoop the optional ability to group nodes together in logical "racks" (i.e. failure domains)
Logical "racks" may or may not correspond to physical data center racks
Distributes block replicas across different "racks" to avoid the failure domain of a single "rack"
It can also lessen block movement between "racks"
(A toy sketch of the placement rule this enables follows below.)
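A minimal sketch, assuming the default HDFS placement rule (first replica on the writer's node, second in a different "rack", third on another node in the second replica's "rack"); the class and node names are invented for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy model of rack-aware replica placement.
public class ReplicaPlacement {
    static class Node {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        public String toString() { return name + "@" + rack; }
    }

    static List<Node> place(Node writer, List<Node> cluster, Random rnd) {
        List<Node> replicas = new ArrayList<Node>();
        replicas.add(writer);                                      // replica 1: local node
        List<Node> offRack = new ArrayList<Node>();
        for (Node n : cluster) if (!n.rack.equals(writer.rack)) offRack.add(n);
        Node second = offRack.get(rnd.nextInt(offRack.size()));
        replicas.add(second);                                      // replica 2: different rack
        List<Node> sameRack = new ArrayList<Node>();
        for (Node n : cluster) if (n.rack.equals(second.rack) && n != second) sameRack.add(n);
        replicas.add(sameRack.get(rnd.nextInt(sameRack.size()))); // replica 3: near replica 2
        return replicas;
    }

    public static void main(String[] args) {
        List<Node> cluster = new ArrayList<Node>();
        for (int r = 1; r <= 3; r++)
            for (int n = 1; n <= 5; n++)
                cluster.add(new Node("node" + r + "-" + n, "rack" + r));
        System.out.println(place(cluster.get(0), cluster, new Random()));
    }
}

Losing one "rack" can then cost at most two of a block's three replicas, so every block survives a single-rack failure.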
19. MapReduce Example: Word Count
Input (one line per mapper):
  the quick brown fox
  the fox ate the mouse
  how now brown cow
Map: each mapper emits a (word, 1) pair per word, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1)
Shuffle & Sort: pairs are grouped by key, and all pairs for a given word are routed to the same reducer
Reduce: each reducer sums the counts for its words
Output:
  Reducer 1: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
  Reducer 2: (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
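The same word count as Hadoop MapReduce code, a minimal sketch along the lines of the standard Apache example (Hadoop 2 API):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                ctx.write(word, ONE);               // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // partial sums on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework handles the shuffle and sort between map and reduce; reusing the reducer as a combiner computes partial sums on the map side, which cuts shuffle traffic on the network.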
20. MapReduce Architecture
[Diagram: the Job Tracker assigns map tasks (M1-M3) and reduce tasks (R1-R2) to Task Trackers 1-15, co-located with the data nodes under the ToR FEX/switches]
Job Tracker bookkeeping, e.g.:
Job1:TT1:Mapper1,Mapper2
Job1:TT4:Mapper3,Reducer1
Job2:TT6:Reducer2
Job2:TT7:Mapper1,Mapper3
21. Cisco Webex Cloud and Hadoop Architecture
22. Global Scale
• 13 datacenters & iPoPs around the globe
• Dedicated network: dual-path 10G circuits between DCs
• Multi-tenant: 95k sites
• Real-time collaboration: voice, desktop sharing, video, chat
[World map; legend: Datacenter / PoP, Leased network link]
23. People make mistakes
Hardware fails
Software fails
Even failovers sometimes fail
25. Cisco UCS and Big Data
Building a big data cluster with the UCS Common Platform Architecture (CPA)
CPA Networking
CPA Sizing and Scaling
26. The evolution of big data deployments
General Purpose IT Data Center: big data runs alongside generic IT servers (SAP, VMware, web, x86)
• Experimental use of Big Data; "skunk works"
• Deployed into IT Ops mandated infrastructures
• Small to medium clusters
Dedicated "Pod" for Big Data: infrastructure purpose built for Big Data
• Big Data has established business value
• App team mandated infrastructure
• Performance matters
• Large or small clusters
27. Hadoop Hardware Evolving in the Enterprise
Typical 2009 Hadoop node: $
• 1RU server
• 4 x 1TB 3.5" spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache Hadoop
Economics favor "fat" nodes:
• 6x-9x more data/node
• 3x-6x more IOPS/node
• Saturated gigabit, 10GE on the rise
• Fewer total nodes lowers licensing/support costs
• Increased significance of node and switch failure
Typical 2013 Hadoop node: $$$
• 2RU server
• 12 x 3TB 3.5" or 24 x 1TB 2.5" spindles
• 2 x 8-core CPU
• 1-2 x 10GE
• 128 GB RAM
• Dual PSU
• Running commercial/licensed distribution
28. Cisco UCS Common Platform Architecture (CPA)
Building Blocks for Big Data
• UCS Manager
• UCS 6200 Series Fabric Interconnects
• Nexus 2232 Fabric Extenders
• LAN, SAN, Management
• UCS C240 M3 Servers
29. CPA Network Design for Big Data
30. CPA: Topology
Single wire for data and management
• 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port-channel (static pinning)
• 2 x 10GE links per server for all traffic, data and management
31. CPA Recommended FEX Connectivity
2 FEXs and 2 FIs
• The 2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
• Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks
32. Can Hadoop really push 10GE?
It can, depending on workload, so tune for it!
• Analytic workloads tend to be lighter on the network
• Transform workloads tend to be heavier on the network
• Hadoop has numerous parameters which affect the network
Take advantage of 10GE CPA by tuning, for example (see the sketch below):
– mapred.reduce.slowstart.completed.maps
– dfs.balance.bandwidthPerSec
– mapred.reduce.parallel.copies
– mapred.reduce.tasks
– mapred.tasktracker.reduce.tasks.maximum
– mapred.compress.map.output
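A minimal Java sketch of setting these parameters programmatically (Hadoop 1.x names as listed on the slide; every value below is an illustrative placeholder, not a recommendation):

import org.apache.hadoop.conf.Configuration;

public class NetworkTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Start reducers (and their copy phase) only after most maps finish
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.8f);
        // Cap HDFS balancer bandwidth (bytes/sec) so it doesn't crowd job traffic
        conf.setLong("dfs.balance.bandwidthPerSec", 10L * 1024 * 1024);
        // Parallel fetch threads per reducer during the shuffle
        conf.setInt("mapred.reduce.parallel.copies", 20);
        // Number of reduce tasks for the job
        conf.setInt("mapred.reduce.tasks", 32);
        // Reduce slots per TaskTracker
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);
        // Compress map output to cut shuffle traffic on the wire
        conf.setBoolean("mapred.compress.map.output", true);
        return conf;
    }
}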
33. CPA Sizing and Scaling for Big Data
34. Cisco UCS Reference Configurations for Big Data
Full Rack UCS Solution Bundle for Hadoop/NoSQL Performance:
• 2 x UCS 6296, 2 x Nexus 2232 PP, 16 x C240 M3 (SFF)
• Per server: 2 x E5-2665 (16 cores), 256GB RAM, 24 x 1TB 7.2K SAS
Full Rack UCS Solution Bundle for Hadoop Capacity:
• 2 x UCS 6296, 2 x Nexus 2232 PP, 16 x C240 M3 (LFF)
• Per server: 2 x E5-2640 (12 cores), 128GB RAM, 12 x 3TB 7.2K SATA
35. Sizing
Part science, part art
Start with the current storage requirement:
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in average daily/weekly data ingest rate
– Factor in expected growth rate (i.e. increase in ingest rate over time)
If the I/O requirement is known, use the next table for guidance
Most big data architectures are very linear, so more nodes = more capacity and better performance
Strike a balance between price/performance of individual nodes vs. the total number of nodes
(A worked example of these rules of thumb follows below.)
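A small sketch applying those rules of thumb; the 100TB starting point, 2:1 compression, and 12 x 3TB node profile are assumptions chosen only for illustration:

// Hypothetical sizing walk-through using the rules of thumb above.
public class ClusterSizing {
    public static void main(String[] args) {
        double sourceTB = 100.0;    // assumed current storage requirement
        double compression = 2.0;   // assumed 2:1 compression (varies by data set)
        double replication = 3.0;   // typical HDFS replication factor
        double tempReserve = 0.25;  // keep 20-30% free for temp/intermediate data

        double onDiskTB = sourceTB / compression * replication;  // 150 TB on disk
        double rawTB = onDiskTB / (1.0 - tempReserve);            // 200 TB raw needed
        double perNodeTB = 12 * 3.0;                              // 12 x 3TB = 36 TB/node
        int nodes = (int) Math.ceil(rawTB / perNodeTB);           // 6 nodes

        System.out.printf("raw capacity needed: %.0f TB -> %d nodes "
                + "(before ingest and growth headroom)%n", rawTB, nodes);
    }
}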
36. CPA sizing and application guidelines
Server
  CPU:                    2 x E5-2690      2 x E5-2665      2 x E5-2640
  Memory (GB):            256              256              128
  Disk drives:            24 x 600GB 10K   24 x 1TB 7.2K    12 x 3TB 7.2K
  IO bandwidth (GB/sec):  2.6              2.0              1.1
Rack-level (16 servers)
  Cores:                  256              256              192
  Memory (TB):            4                4                2
  Capacity (TB):          225              384              576
  IO bandwidth (GB/sec):  41.3             31.9             16.9
Applications:             MPP DB           NoSQL, Hadoop    NoSQL, Hadoop
(Left: best performance; right: best price/TB)
37. Scaling the CPA
• Single rack: 16 servers
• Single domain: up to 10 racks, 160 servers
• Multiple domains, joined by L2/L3 switching
38. Scaling the Common Platform Architecture
Multiple domains based on 16 servers per rack and 2 x 2232 FEXs
Consider intra- and inter-domain bandwidth (all figures per fabric):

Servers per domain (pair of FIs):         160    144    128
Available north-bound 10GE ports:         16     24     32
Southbound oversubscription:              2:1    2:1    2:1
Northbound oversubscription:              5:1    3:1    2:1
Intra-domain server-to-server Gbits/sec:  5      5      5
Inter-domain server-to-server Gbits/sec:  1      1.67   2.5
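As a worked check of the first column (assuming 96-port UCS 6296 fabric interconnects, as in the reference bundles): 10 racks x 8 FEX uplinks consume 80 FI ports per fabric, leaving 16 for northbound, so northbound oversubscription is 80:16 = 5:1; and 16 x 10GE of northbound capacity shared by 160 servers works out to 1 Gbit/sec of inter-domain bandwidth per server per fabric.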
39. Multi-Domain CPA Customer Example
• 10 Gbits/sec intra-domain server-to-server network bandwidth
• 5 Gbits/sec inter-domain server-to-server network bandwidth
• Static pinning from FEX to FI (no port-channel)
40. Recommendations: UCS Domains and Racks
Single-domain recommendation: turn Rack Awareness off, or enable it at the physical rack level
• For simplicity and ease of use, leave Rack Awareness off
• Consider turning it on to limit the physical rack-level fault domain (e.g. localized failures due to physical data center issues: water, power, cooling, etc.)
Multi-domain recommendation: create one Hadoop rack per UCS domain
• With multiple domains, enable Rack Awareness such that each UCS domain is its own Hadoop rack
• Provides HDFS data protection across domains
• Helps minimize cross-domain traffic
41. Exercise 1
Set up a single node VM cluster on the laptop
– Step 1: copy files from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc
43. Hive
An SQL-like interface to Hadoop
Top-level Apache project: http://hive.apache.org/
Hive history:
– Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
– Currently used at many companies for log processing, business intelligence and analytics
44. Hive Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
45. Data Model
Tables
– Typed columns (int, float, string, boolean)
– Also, list: map (for JSON-like data)
Partitions
– For example, range-partition tables by date
Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
46. Hive

               DBMS                       Hive
Language       SQL-92 standard            Subset of SQL-92 plus Hive extensions
Updates        INSERT, UPDATE, DELETE     INSERT OVERWRITE; no UPDATE or DELETE
Transactions   Yes                        No
Latency        Sub-second                 Minutes to hours
Indexes        Any number of indexes,     No indexes; data is always
               important to performance   scanned in parallel
Dataset size   TBs                        PBs
47. Metastore
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and other relational databases
Source: cc-licensed slide by Cloudera
50. Hive Physical Layout
Warehouse directory in HDFS
– E.g., /user/hive/warehouse
Tables stored in subdirectories of the warehouse
– Partitions form subdirectories of tables
Actual data stored in HDFS files
– E.g. text, SequenceFile, RCFile, Avro
– Arbitrary format with a custom SerDe
51. External and Hive managed tables
Hive managed tables:
– Data is moved to a location under /user/hive/warehouse
– Can be stored in a more efficient format than text, e.g. RCFile
– If you drop the table, the raw data is lost

hive> CREATE TABLE test (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;

External tables:
– Can overlay multiple tables all pointing to the same raw data
– To create an external table, add the EXTERNAL keyword and point to the location of the data while creating the table

hive> CREATE EXTERNAL TABLE test (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/home/test/data';
52. Hive: Example
Hive looks similar to an SQL database
Relational join on two tables:
– Table of word counts from the Shakespeare collection
– Table of word counts from the Bible

SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 5;

the    25848    62394
I      23031    8854
and    19671    38985
to     18038    13526
of     16700    34654
54. Impala
General-purpose MPP SQL query engine for Hadoop
– Query latency from milliseconds to hours; interactive data exploration
– Runs on the existing Hadoop cluster, on existing HDFS files and hardware
High performance
– Written in C++
– Direct access to HDFS and HBase data; no MapReduce
Unified platform
– Uses existing Hive metadata and query language (HiveQL)
– Submit queries via ODBC or the Thrift API (see the sketch below)
Performance
– Disk throughput limited by hardware to ~100MB/sec
– 3x to 90x faster than Hive, depending on the type of query
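As one concrete path, a HiveQL query like the one on slide 52 can be submitted from Java through the HiveServer2 JDBC driver; a minimal sketch follows (host, port, and credentials are placeholders), and the slide's ODBC route into Impala is analogous:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryClient {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host and port below are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT s.word, s.freq, k.freq FROM shakespeare s "
                 + "JOIN bible k ON (s.word = k.word) "
                 + "WHERE s.freq >= 1 AND k.freq >= 1 "
                 + "ORDER BY s.freq DESC LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t"
                        + rs.getLong(2) + "\t" + rs.getLong(3));
            }
        }
    }
}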
55. Impala Details
Unified metadata, HiveQL interface
[Diagram: an SQL app connects via ODBC; the Hive Metastore and HDFS NameNode supply unified metadata; a statestored daemon tracks cluster state; each cluster node runs an impalad (query planner, query coordinator, query exec engine) co-located with an HDFS DataNode and HBase]
56. Impala Details
[Same diagram] Each impalad keeps contact with statestored to update its state and to receive metadata for query planning.
57. Impala Details
[Same diagram] The query coordinator initiates execution on the remote impalads.
58. Impala Details
[Same diagram] Intermediate results are streamed between impalads, and query results are streamed back to the client.
59. Exercise 2
Analytics with Hive and Impala
– Step 1: copy test dataset from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc
Editor's Notes
• Summary slides after each model (Hadoop, NoSQL and MPP): 3 bullets on actual implementation to tie back to a later section. Sean to 27.
• Hadoop is optimized for large streaming reads, not for low latency or fast writes. HDFS is optimized for fewer, larger files (> 100 MB), with a 128MB block size or higher. Files are currently write-once (append support is available in 0.21, but mainly for HBase; otherwise not recommended). Blocks are replicated 3x by default, on three different data nodes. The NameNode stores file metadata in fsimage (/usr/sean/foo.txt: blk_1, blk_2, blk_3), but it doesn't know which data nodes own those blocks until they report in. Blocks are just files on the underlying filesystem (ext3, etc.), e.g. blk_1234; no metadata on the slave node describes the data contained on that slave (or any other). When the NameNode starts up, it starts in safe mode and won't leave it until it knows where at least one copy of 99.999% of blocks is (configurable), based on block reports; it then waits 30 seconds and exits safe mode. The NameNode block map is based solely on slave block reports, always cached in memory, nothing persistent. All data nodes heartbeat into the NameNode every 3 seconds; the NameNode will evict a node after 5 minutes without a heartbeat and re-replicate its "lost" blocks after 10 minutes. As blocks are written, checksums are calculated and stored with the block (blk_1234.meta); upon read, the calculated checksum is compared with the stored one. To avoid bit rot, a daemon re-checks each block's checksum every 3 weeks after the block was written.
• The JobTracker assigns map or reduce tasks to TaskTracker slaves (data nodes) with available "slots". For map tasks, the JobTracker attempts to assign work on local blocks to avoid expensive shipping of blocks across the network. Each task (mapper or reducer) runs in its own child JVM on the slave node. The TaskTracker process kicks off its child tasks based on a preconfigured number of task slots; each child task JVM eats up a chunk of RAM, placing a limit on the total number of slots. Rule of thumb: set aside 25-30% of space for temp storage, outside of HDFS, to hold intermediate map output before it is sent to reducers. If a child JVM dies, the TaskTracker removes it and reports the death to the JobTracker, which attempts to reassign the task to a different TaskTracker. If any specific task fails 4 times, the whole job fails. If a TaskTracker reports a high number of failed tasks, it gets blacklisted for that job; if it gets blacklisted for multiple jobs, it is put on a global blacklist for 24 hours.
• As of Feb 2013.
• CEP: Complex Event Processing.
• Big data projects often start out co-mingled within existing general-purpose data center infrastructure, but eventually outgrow it and need to move to a dedicated "pod". This is usually where we come in.