This document provides an overview of Apache Hadoop and its components. It discusses what big data is and how Hadoop uses MapReduce and HDFS to process large datasets across clusters. Example use cases are presented, including logging massive amounts of data from devices. Hadoop installations and configurations are covered. The document also demonstrates how to use Pig Latin to analyze Hadoop data, with examples of common Pig statements like LOAD, FILTER, and STORE.
3. What is Big Data?
"Big data" is a technology term for data that has grown too large to be managed by the approaches previously known to work.
4. Apache Hadoop
Hadoop is an open source, top-level Apache project based on Google's MapReduce whitepaper. It is a popular project used by several large companies to process massive amounts of data.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
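To make that "simple programming model" concrete, here is a minimal single-process Python sketch of the MapReduce idea on the classic word-count problem. The three phases (map, shuffle by key, reduce) are exactly what Hadoop runs, except Hadoop distributes them across the cluster; everything else here is an illustrative simplification.

```python
from collections import defaultdict

def map_phase(documents):
    # map: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key, as Hadoop does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In real Hadoop, each map task processes one HDFS block and the shuffle moves data over the network, but the programmer only writes the map and reduce functions.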
5. Hadoop
Is designed to scale!
Uses commodity hardware and can be scaled out using several boxes
Processes data in batches
Can process petabyte-scale data
6. Before you consider Hadoop, define your use case
Before using Hadoop or any other "big data" or "NoSQL" solution, consider defining your use case well.
Regular databases can solve most common use cases if scaled well. For example, MySQL can handle hundreds of millions of records efficiently if scaled well.
7. Hadoop is best suited when
You have massive data - terabytes of data generated (or being generated every day) that you would like to process for several group queries, in a scalable manner.
You have billions of rows of data that need to be dissected for several different reports.
8. Hadoop, Components and Related Projects
HDFS - Hadoop Distributed File System
MapReduce - Hadoop distributed execution framework
Apache Pig - functional dataflow language for Hadoop M/R (Yahoo)
Hive - SQL-like language for Hadoop M/R (Facebook)
ZooKeeper
Etc.
9. Our use case and previous solution
Our use case: store and process 100s of millions of records a day, and run 100s of queries - summing and aggregating data.
Along the line, we had evaluated different NoSQL technologies for various use cases (not necessarily for the one above). MongoDB, Cassandra, Memcached, etc. were implemented for various use cases.
11. Issues faced
1. Not a known / preferred design approach
2. Scalability issues that we had to address ourselves
3. Needed a solution that can handle billions of records per day instead of 100s of millions
4. Needed a truly proven, scalable solution
12. Why Hadoop
A proven, open source, highly reliable, distributed data processing platform.
Met our use case of processing 100s of millions of logs perfectly.
We tuned the deployment to process all data with a maximum of 30 minutes latency.
14. Installation
Requirements:
Linux
Hadoop 0.20.203.X (current stable version)
Java 1.6.x
Installed using RPM
Configuration: single node or multiple nodes
15. Installation
Modes:
Local (standalone) - running on a single node with a single Java process
Pseudo-distributed - running on a single node with separate Java processes
Fully distributed - running on a cluster of nodes with separate Java processes
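For the pseudo-distributed mode above, a classic Hadoop 0.20-era configuration puts the filesystem and JobTracker addresses into the XML files under conf/. The sketch below is illustrative, not a complete setup; the endpoint values match the hdfs://localhost:8020 and localhost:9001 addresses that appear in the Pig Grunt shell output later in this deck.

```xml
<!-- conf/core-site.xml: where the HDFS namenode listens -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: one copy of each block is enough on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```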
19. Sample Physical Architecture
[Diagram: stream agents on app nodes 1..N feed a Collector; the Collector writes into the Hadoop cluster (Namenode + Datanodes); a Glue + Pig node and a ZooKeeper node (VMs) sit alongside, with a DB for results.]
20. Sample Logical Architecture
[Diagram: stream agents on app nodes 1..N send logs to the Collector, which writes to the Datanodes; the Namenode, Glue, Pig, ZooKeeper, and a DB complete the cluster.]
21. Implementation of a Hadoop System
BigStreams - logging framework
Streams is a high-availability, extremely fast, low-resource-usage, real-time log collection framework for terabytes of data.
- Key author is Gerrit, our architect
http://code.google.com/p/bigstreams/
Google Protobuf
http://code.google.com/p/protobuf/
Used, together with LZO compression, on the data before transferring it from the application nodes to the Hadoop nodes.
22. Implementation
Data logs are compressed by the stream agents and sent to the Collector.
The Collector informs the Namenode of new file arrivals.
The Namenode replies with the file sizes, how many blocks are needed, and which Datanodes each block will be stored on.
The Collector then sends the blocks of the file directly to the Datanodes.
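A toy, single-process Python sketch of the namenode's side of the write path described above. The 64 MB block size was the HDFS default in this era; the round-robin placement is a deliberate simplification (real HDFS placement is replication- and rack-aware).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size of this era (64 MB)

def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE):
    """Toy 'namenode': split a file into blocks and assign each to a datanode."""
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    # round-robin placement - a simplification of the real HDFS placement policy
    return [(i, datanodes[i % len(datanodes)]) for i in range(num_blocks)]

# A 200 MB file needs 4 blocks, spread over three datanodes:
plan = plan_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3"])
print(plan)  # [(0, 'dn1'), (1, 'dn2'), (2, 'dn3'), (3, 'dn1')]
```

The client (here, the Collector) then streams each block directly to its assigned datanode, so the namenode never sits in the data path.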
23. Data Processing with Pig
Once data is saved in the HDFS cluster it can be processed using Java programs or by using Apache Pig: http://pig.apache.org
1. Apache Pig is a platform for analyzing large data sets
2. Pig Latin is its language, which presents a simplified manner to run queries
3. The Pig platform has a compiler which translates Pig queries to MapReduce programs for Hadoop
25. Requirements
Unix and Windows users need the following:
1. Hadoop 0.20.2 - http://hadoop.apache.org/common/releases.html
2. Java 1.6 - http://java.sun.com/javase/downloads/index.jsp (set JAVA_HOME to the root of your Java installation)
3. Ant 1.7 - http://ant.apache.org/ (optional, for builds)
4. JUnit 4.5 - http://junit.sourceforge.net/ (optional, for unit tests)
Windows users need to install Cygwin and the Perl package: http://www.cygwin.com/
Download link: http://www.gtlib.gatech.edu/pub/apache//pig/pig-0.8.1/
26. Pig Installing Commands
To install Pig on Red Hat systems:
$ rpm -ivh --nodeps pig-0.8.0-1x86_64
To start the Grunt shell:
$ pig
0 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to Hadoop file system at: hdfs://localhost:8020
352 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt>
27. Pig Statements
LOAD
Loads data from the file system.
Usage
Use the LOAD operator to load data from the file system.
Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
1	2	3
4	2	1
8	3	4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
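As a rough Python analogue of what PigStorage does with each record (a sketch only): split the line on the delimiter and keep the fields untyped, mirroring the bytearray default when no schema is given.

```python
def pig_storage_load(lines, delimiter="\t"):
    """Toy PigStorage('\t'): split each record into fields, left as raw strings."""
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]

myfile = ["1\t2\t3\n", "4\t2\t1\n", "8\t3\t4\n"]
A = pig_storage_load(myfile)
print(A)  # [('1', '2', '3'), ('4', '2', '1'), ('8', '3', '4')]
```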
28. Pig Statements
FOREACH
Generates data transformations based on columns of data.
Syntax: alias = FOREACH { gen_blk | nested_gen_blk };
Usage
Use the FOREACH...GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).
FOREACH...GENERATE works with relations (outer bags) as well as inner bags:
● If A is a relation (outer bag), a FOREACH statement could look like this:
X = FOREACH A GENERATE f1;
● If A is an inner bag, a FOREACH statement could look like this:
X = FOREACH B {
    S = FILTER A BY 'xyz';
    GENERATE COUNT(S.$0);
}
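In Python terms, the outer-bag form of FOREACH...GENERATE is simply a per-tuple projection (a sketch with a hypothetical relation A of fields f1, f2, f3):

```python
# Hypothetical relation A with fields (f1, f2, f3)
A = [("r1", 1, 2), ("r2", 2, 1)]

# X = FOREACH A GENERATE f1;  -> keep only the first column of every tuple
X = [(f1,) for (f1, f2, f3) in A]
print(X)  # [('r1',), ('r2',)]
```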
29. Pig Statements
GROUP
Groups the data in one or more relations.
Syntax
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression ...] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
Usage
The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:
● The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.
● The second field takes the name of the original relation and is type bag.
● The names of both fields are generated by the system as shown in the example below.
Note the following about the GROUP/COGROUP and JOIN operators:
● The GROUP and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.
30. Pig Statements
Example For GROUP
Suppose we have relation A.
A = LOAD 'data' AS (f1:chararray, f2:int, f3:int);
DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)
In this example the tuples are grouped using an expression, f2*f3.
X = GROUP A BY f2*f3;
DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
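The GROUP semantics can be mimicked in plain Python (a sketch only): each output tuple pairs a key with the bag of all input tuples sharing it, mirroring the (group, bag) layout shown in DUMP X.

```python
from collections import defaultdict

# Relation A from the slide: (f1:chararray, f2:int, f3:int)
A = [("r1", 1, 2), ("r2", 2, 1), ("r3", 2, 8), ("r4", 4, 4)]

def group_by(relation, key_fn):
    """Mimic Pig's GROUP: one output tuple per key; the second field is the bag."""
    bags = defaultdict(list)
    for t in relation:
        bags[key_fn(t)].append(t)
    return sorted(bags.items())

# X = GROUP A BY f2*f3;
X = group_by(A, lambda t: t[1] * t[2])
print(X)  # [(2, [('r1', 1, 2), ('r2', 2, 1)]), (16, [('r3', 2, 8), ('r4', 4, 4)])]
```

Note how r1 and r2 both evaluate f2*f3 to 2, and r3 and r4 to 16, reproducing the two groups in the slide's output.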
31. Pig Statements
FILTER
Selects tuples from a relation based on some condition.
Syntax: alias = FILTER alias BY expression;
Usage
Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH...GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
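The predicate can be checked in Python, assuming relation A holds the same six tuples used in the STORE example later in the deck; the comprehension reproduces the DUMP X output above.

```python
# Relation A as in the STORE example slide: six tuples of three ints
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

def keep(t):
    # Pig predicate: (f1 == 8) OR (NOT (f2 + f3 > f1))
    f1, f2, f3 = t
    return f1 == 8 or not (f2 + f3 > f1)

X = [t for t in A if keep(t)]
print(X)  # [(4, 2, 1), (8, 3, 4), (7, 2, 5), (8, 4, 3)]
```

For instance (1,2,3) is dropped because f1 is not 8 and 2+3 > 1, while (7,2,5) is kept because 2+5 is not greater than 7.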
32. Pig Statements
STORE
Stores or saves results to the file system.
Syntax
STORE alias INTO 'directory' [USING function];
Usage
Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for production scripts and batch mode processing.
Note: To debug scripts during development, you can use DUMP to check intermediate results.
33. Pig Statements
Examples For STORE
In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter.
A = LOAD 'data' AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
34. Pig Latin Example
ads = LOAD '/log/raw/old/ads/month=09/day=01,/log/raw/old/clicks/month=09/day=01'
    using com.twitter.elephantbird.pig.proto.LzoProtobuffB64LinePigStore('ad_data');
r1 = foreach ads generate
    ad.id as id,
    device.carrier_name as carrier_name,
    device.device_name as device_name,
    device.mobile_model as mobile_model,
    ipinfo.city as city_code,
    ipinfo.country as country,
    ipinfo.ipaddress as ipaddress,
    ipinfo.region_code as wifi_Gprs,
    site.client_id as client_id,
    ad.metrics as metrics,
    impressions,
    clicks;
g = group r1 by (id, carrier_name, device_name, mobile_model, city_code, country, client_id, wifi_Gprs, metrics);
r = foreach g generate FLATTEN(group), SUM($1.impressions) as imp, SUM($1.clicks) as cl;
rmf /tmp/predicttesnul;
store r into '/tmp/predicttesnul' using PigStorage('\t');
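The group-and-sum at the heart of this script can be sketched in Python. The records below are hypothetical miniatures with only two key fields (the real script groups on nine), but the shape of the result - flattened group key followed by the two sums - matches the FLATTEN(group), SUM(...) line above.

```python
from collections import defaultdict

# Hypothetical miniature ad-log rows: (carrier, country, impressions, clicks)
rows = [
    ("carrierA", "IN", 10, 1),
    ("carrierA", "IN", 5, 0),
    ("carrierB", "US", 7, 2),
]

# g = group rows by (carrier, country); then sum impressions and clicks per group
totals = defaultdict(lambda: [0, 0])
for carrier, country, imp, cl in rows:
    totals[(carrier, country)][0] += imp
    totals[(carrier, country)][1] += cl

# FLATTEN(group) means the key fields come back as plain columns in the output
r = sorted(k + (imp, cl) for k, (imp, cl) in totals.items())
print(r)  # [('carrierA', 'IN', 15, 1), ('carrierB', 'US', 7, 2)]
```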
35. Demos / Contact us
Some hands-on demos follow.
Need more information? Find us on Twitter: twitter.com/azifali