Hadoop
in a
Nutshell
Siva Pandeti
Cloudera Certified Developer for Apache Hadoop (CCDH)
Overview
• Why Hadoop?
o Data Growth
o What is Big Data?
o Hadoop usage
• What is Hadoop?
o Components
o NoSQL
o Cluster
o Vendors
o Tool Comparison
• How to Hadoop?
o Typical Implementation
o Data Analysis with Pig & Hive
o Opportunities
• Examples
o MapReduce deep dive
o Wordcount
o Search index
o Recommendation Engine
Why Hadoop?
Data Growth
• OLTP: databases for operations
o Throw away historical data
o Relational (Oracle, DB2)
• OLAP: data warehouses for analytics
o Cheaper centralized storage -> data warehouses (ETL tools)
o Relational/MPP appliances
o Less than a few hundred TB
• Big Data
o Data explosion (social media, etc.)
o Petabyte scale
o Network speeds haven't kept pace with data volume, so processing needs data locality
o Distributed processing on commodity hardware (Hadoop)
o Non-relational
What is Big Data?
• Volume: petabyte scale
• Variety: structured, semi-structured, unstructured
• Velocity: social and sensor data; high throughput
• Veracity: unclean, imprecise, unclear
Where is Hadoop Used?
Use cases by industry:
• Technology: search, "people you may know", movie recommendations
• Banks: fraud detection, regulatory compliance, risk management
• Media & Retail: marketing analytics, customer service, product recommendations
• Manufacturing: preventive maintenance
What is Hadoop?
An open source distributed computing framework for storage and processing.
• HDFS (distributed storage)
o Economical: commodity hardware
o Scalable: rebalances data onto new nodes
o Fault tolerant: detects faults and auto-recovers
o Reliable: maintains multiple copies of data
o High throughput: because data is distributed
• MapReduce (distributed processing)
o Data locality: process where the data resides
o Fault tolerant: auto-recovers from job failures
o Scalable: add nodes to increase parallelism
o Economical: commodity hardware
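For a feel of how HDFS is used day to day, here are a few basic shell interactions (the file and directory names are illustrative; the same put/cat commands reappear in the word count examples later):
hadoop fs -mkdir input
hadoop fs -put file1.txt input
hadoop fs -ls input
hadoop fs -cat input/file1.txt | head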
NoSQL DBs - HBase
• Unlike an RDBMS:
o De-normalized
o No secondary indexes
o No transactions
• Modeled after Google's Bigtable
• Random, real-time read/write access to Big Data
• Billions of rows x millions of columns
• Commodity hardware
• Open source, distributed, versioned, column-oriented
• Integrates with MapReduce; has Java/REST APIs
• Automatic sharding
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
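HBase is usually driven from Java or its REST API, but as a quick illustration of random real-time reads and writes, here is a minimal Python sketch using the third-party happybase client. It assumes an HBase Thrift server running on localhost and a pre-created 'customer' table with an 'info' column family; the table, row, and column names are hypothetical.

import happybase

# Connect through the HBase Thrift gateway (assumed to listen on localhost:9090).
connection = happybase.Connection('localhost', port=9090)
table = connection.table('customer')  # hypothetical table with column family 'info'

# Random real-time write: columns within a family are created on the fly (no fixed schema).
table.put(b'row-42', {b'info:name': b'Alice', b'info:city': b'Austin'})

# Random real-time read of a single row by key.
row = table.row(b'row-42')
print(row[b'info:name'])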
How Does Hadoop Work? (Cluster)
• Master node: runs the JobTracker (MapReduce) and the NameNode (HDFS)
• Slave nodes: each runs a TaskTracker and a DataNode
Vendors
• Hadoop distributors: Apache Hadoop, Cloudera, HortonWorks, MapR, Amazon EMR
• ETL/BI connectors: Pentaho, Informatica, Talend, Clover, MicroStrategy, Tableau, SAS, Ab Initio
Comparison: Traditional ETL/BI vs. Hadoop
• Cost: expensive licenses and hardware vs. open source on cheap commodity hardware
• Volume: < 100 TB in central storage vs. petabyte scale in distributed storage
• Speed: quick response on small data but not as fast on large data vs. even the smallest job taking ~15 seconds, yet super fast on large data
• Throughput: thousands of reads/writes per minute vs. millions of reads/writes per minute
How to Hadoop?
Hadoop Implementation
• Ingest into HDFS: Flume (data feeds), Sqoop (RDBMS; example below), put/get (files), ETL tools
• Process: MapReduce, Pig, Hive, Mahout
• Output: reports, machine learning, analytics, visualization (SAS, R)
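As an example of the ingest step, pulling a relational table into HDFS with Sqoop might look like the sketch below; the JDBC URL, credentials, and table/directory names are hypothetical.

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table customers \
  --target-dir /data/customers \
  -m 4

The -m 4 flag runs the import with four parallel map tasks; -P prompts for the password instead of putting it on the command line.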
Data Analysis: Pig & Hive
Both are abstractions on top of MapReduce: they generate MapReduce jobs in the backend and are useful for analysts who are not programmers.
• Pig: a data flow language; no schema; better with less structured data
• Hive: a SQL-like language; schemas, tables, and joins are stored in a metastore
Pig example (an illustrative script built from the operators on the original slide; relation names and schema are assumed):
data = LOAD 'file' USING PigStorage('\t') AS (id:int, name:chararray);
filtered = FILTER data BY id < 100;
grouped = GROUP filtered BY name;
counts = FOREACH grouped GENERATE group, COUNT(filtered);
ordered = ORDER counts BY group;
STORE ordered INTO 'output';
Hive example (loading the HDFS file from the earlier examples):
CREATE TABLE customer (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH 'input/file1.txt' INTO TABLE customer;
SELECT * FROM customer WHERE id < 100 LIMIT 10;
MapReduce
The word count data flow: map each line into (word, 1) pairs, shuffle/sort by word, then reduce by summing the counts per word. Diagram source: http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
Examples
Word count - Java
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create driver
o Set configuration variables and the mapper and reducer class names
• Create mapper
o Read input and emit key-value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar WordCount.jar WordCount input output
• Analyze output
o hadoop fs -cat output/* | head
Word count - Streaming
• Hadoop is written in Java. I don't know Java. What do I do?
o Hadoop Streaming (Python, Ruby, R, etc.)
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create mapper (sketched below)
o Read the input stream (stdin) and emit (print) key-value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
  -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py \
  -input input -output output
• Analyze output
o hadoop fs -cat output/* | head
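Here is a minimal sketch of what mapper.py and reducer.py might look like for word count. It relies only on what streaming guarantees: the mapper reads raw lines on stdin, and the reducer receives the mapper output sorted by key, so it can count consecutive runs of the same word.

#!/usr/bin/env python
# mapper.py: read lines from stdin and emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so sum consecutive runs of a word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

You can test the pair without a cluster: cat file1.txt | ./mapper.py | sort | ./reducer.py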
Hadoop for R
# Point RHadoop at the local Hadoop installation.
Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
library(rmr2)    # MapReduce from R
library(rhdfs)   # HDFS access from R
setwd("/home/istvan/rhadoop/blogs/")
gdp <- read.csv("GDP_converted.csv")
head(gdp)
hdfs.init()
gdp.values <- to.dfs(gdp)   # write the data frame to HDFS
# AAPL revenue in 2012 in millions USD
aaplRevenue = 156508
# Map: bucket each country by whether its GDP (column 4) is less
# or greater than Apple's revenue.
gdp.map.fn <- function(k,v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}
# Reduce: count the countries in each bucket.
count.reduce.fn <- function(k,v) {
  keyval(k, length(v))
}
count <- mapreduce(input=gdp.values,
                   map = gdp.map.fn,
                   reduce = count.reduce.fn)
from.dfs(count)   # read the result back from HDFS
• RHadoop packages
o rmr
o rhdfs
o rhbase
• Uses Hadoop Streaming under the hood
• The example above determines how many countries have a greater GDP than Apple's annual revenue
Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
Search index example
• Crawl web
o Crawl and save websites to a local directory
• Ingest files to HDFS
• Map (sketched after this list)
o Split out the words and associate each word with the file name it came from
• Reduce
o Build an index of words to files, with a count of occurrences per file
• Search
o Look the word up in the index to get the files it appears in; display the file listing in descending order of the word's occurrence count in each file
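Below is a hedged streaming sketch of the map and reduce steps. The one Hadoop-specific piece is that streaming exposes the current input file to the task through an environment variable (map_input_file in MRv1, mapreduce_map_input_file in MRv2); the script names and output format are illustrative.

#!/usr/bin/env python
# index_mapper.py: emit "word<TAB>filename" for every word in the crawled pages.
import os
import sys

# Hadoop streaming publishes the current split's file name in the environment.
filename = os.environ.get('map_input_file',
                          os.environ.get('mapreduce_map_input_file', 'unknown'))

for line in sys.stdin:
    for word in line.strip().lower().split():
        print('%s\t%s' % (word, filename))

#!/usr/bin/env python
# index_reducer.py: for each word, count occurrences per file and emit the
# file list sorted by descending occurrence count.
import sys
from collections import Counter
from itertools import groupby

def pairs(stream):
    for line in stream:
        word, filename = line.strip().split('\t', 1)
        yield word, filename

for word, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    counts = Counter(filename for _, filename in group)
    postings = ' '.join('%s:%d' % (f, n) for f, n in counts.most_common())
    print('%s\t%s' % (word, postings))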
Recommender example
• Use web server logs with user ratings info for items
• Create Hive tables to build structure on top of this log data
• Generate a Mahout-specific CSV input file of (user, item, rating) triples; a conversion sketch follows this list
• Run Mahout to build item recommendations for users
o mahout recommenditembased \
  --input /user/hive/warehouse/mahout_input \
  --output recommendations \
  -s SIMILARITY_PEARSON_CORRELATION -n 20
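The CSV-generation step could be a small filter over the Hive table's files. The sketch below assumes the Hive table stores tab-delimited (user, item, rating) rows; that layout, and the script name, are illustrative.

#!/usr/bin/env python
# to_mahout_csv.py: reshape tab-delimited (user, item, rating) records into
# the comma-separated triples Mahout's item-based recommender expects.
import csv
import sys

writer = csv.writer(sys.stdout)
for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) >= 3:
        writer.writerow(fields[:3])  # user, item, rating

Example use: hadoop fs -cat /user/hive/warehouse/mahout_input/* | ./to_mahout_csv.py > ratings.csv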
Recap
• Why Hadoop?
o Data Growth
o What is Big Data?
o Hadoop usage
• What is Hadoop?
o Components
o NoSQL
o Cluster
o Vendors
o Tool Comparison
• How to Hadoop?
o Typical Implementation
o Data Analysis with Pig & Hive
o Opportunities
• Demo
o MapReduce deep dive
o Wordcount
o Search index
o Recommendation Engine
Q & A
Contact Siva Pandeti:
Email: siva@pandeti.com
LinkedIn: www.linkedin.com/in/SivaPandeti
Twitter: @SivaPandeti
http://pandeti.com/blog
