An Overview of Hadoop

Contributors: Hussain, Barak, Durai, Gerrit, and Asif
Introductions
Target audience:
Programmers and developers who want to learn
about Big Data and Hadoop
What is Big Data?

               A technology term for data
               that has grown too large to be
               managed with approaches that
               previously worked well.
Apache Hadoop

Is an open-source, top-level Apache project based on
Google's MapReduce whitepaper

Is a popular project that is used by several large companies to
process massive amounts of data

The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using a simple
programming model.
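The "simple programming model" is MapReduce: a map step emits key/value pairs, the framework shuffles them by key, and a reduce step aggregates each group. The classic word-count example can be sketched in plain Python (an illustration of the model only, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In Hadoop, the map and reduce functions run in parallel across the cluster while HDFS supplies the input splits; only the two user-written functions change per job.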
Hadoop

Is designed to Scale!

Uses commodity hardware and can be
scaled out using several boxes

Processes data in batches

Can process Petabyte scale data
Before you consider Hadoop,
define your use case
Before using Hadoop or any other "big data" or
"NoSQL" solution, define your use case well.

Regular databases can solve most common
use cases if scaled well.

For example, MySQL can handle hundreds of
millions of records efficiently if scaled well.
Hadoop is best suited when
You have massive data - terabytes of data
generated (or being generated) every day - and would
like to process this data for several group
queries, in a scalable manner.

You have billions of rows of data that need to
be dissected for several different reports
Hadoop: Components and Related
Projects

HDFS - Hadoop Distributed File System
MapReduce - Hadoop's distributed execution
framework
Apache Pig - functional dataflow language
for Hadoop M/R (Yahoo)
Hive - SQL-like language for Hadoop M/R
(Facebook)
ZooKeeper
etc.
Our use case and previous solution
Our use case: Store and process 100s of
millions of records a day. Run 100s of queries -
summing, aggregating data.

Along the way, we had evaluated different
NoSQL technologies for various use cases (not
necessarily the one above)

MongoDB, Cassandra, Memcached etc were
implemented for various use cases
Original solution: a sharded MySQL system
that could process hundreds of millions of records
Issues faced
1. Not a known / preferred design approach.
2. Scalability issues that we had to address
   ourselves
3. Needed a solution that can handle billions of
   records per day instead of hundreds of millions
4. Needed a truly proven, scalable solution
Why Hadoop
A proven, open source, highly reliable,
distributed data processing platform.

Met our use case of processing hundreds of millions
of logs perfectly

We tuned the deployment to process all data
with a maximum latency of 30 minutes
Getting Started
with Hadoop
Installation
HDFS
Map/Reduce
Result
Installation
Requirements
  Linux
  Hadoop 0.20.203.x (current stable version)
  Java 1.6.x
Using RPM
Configuration
  Single Node
  Multiple Node
Installation
Modes
   Local (Standalone)
     Runs on a single node as a single Java
process
   Pseudo-distributed
     Runs on a single node with separate
Java processes
   Fully distributed
     Runs on a cluster of nodes with separate
Java processes
Related links
http://hadoop.apache.org/common/releases.html#News
http://pig.apache.org/docs/r0.7.0/setup.html
http://code.google.com/p/bigstreams/
Configurations
/etc/hadoop/conf/hdfs-site.xml
/etc/hadoop/conf/core-site.xml
/etc/hadoop/conf/mapred-site.xml
/etc/hadoop/conf/slaves
Commands
yum install hadoop-0.20
bin/start-all.sh
bin/stop-all.sh
jps
bin/hadoop namenode -format
hadoop fs -mkdir /user/demo
hadoop fs -ls /user/demo
bin/hadoop fs -put conf input
bin/hadoop fs -cat output/*
hadoop fs -cat /user/demo/sample.txt
Sample Physical Architecture

[Diagram: datanodes and a namenode form the HDFS cluster; a collector node feeds data in and connects to a DB node; a Glue + Pig node and ZooKeeper (on VMs) sit alongside; Streams agents run on app nodes 1 through N.]
Sample Logical Architecture

[Diagram: datanodes and a namenode, with Glue, Pig, a collector, ZooKeeper, and a DB as logical components; Streams agents feed in from apps 1 through N.]
Implementation of a
Hadoop System
BigStreams - Logging Framework
Streams is a high-availability, extremely fast, low-resource-usage, real-time log
collection framework for terabytes of data.

- Key author is Gerrit, our Architect
http://code.google.com/p/bigstreams/

Google Protobuf
http://code.google.com/p/protobuf/
used, together with LZO, to compress the data before transferring it from the
application nodes to the Hadoop nodes
Implementation
Data logs are compressed by the Streams agents and sent
to the Collector
The Collector informs the Namenode of new file arrivals
The Namenode replies with the file sizes, the number of blocks,
and the datanodes each block will be stored on
The Collector then sends the file blocks directly to the
datanodes
Data Processing with Pig
Once data is saved in the HDFS cluster it can
be processed using Java programs or by using
Apache Pig
http://pig.apache.org

1. Apache Pig is a platform for analyzing large data
sets
2. Pig Latin is its language, which offers a simplified
way to write queries
3. The Pig platform includes a compiler that translates Pig queries
into MapReduce programs for Hadoop
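The compilation idea can be shown in miniature: a Pig query such as `G = GROUP A BY key; R = FOREACH G GENERATE group, SUM(A.val);` boils down to a map step that emits (key, val) pairs, a shuffle that groups them, and a reduce step that sums each bag. A Python sketch of that idea (with a hypothetical relation; this is not Pig's actual compiler output):

```python
from collections import defaultdict

# Hypothetical relation A of (key, val) tuples
A = [("ad1", 10), ("ad2", 5), ("ad1", 7)]

# Map + shuffle: emit (key, val) and group by key
grouped = defaultdict(list)
for key, val in A:
    grouped[key].append(val)

# Reduce: SUM over each group's bag of values
R = {key: sum(vals) for key, vals in grouped.items()}
print(R)  # {'ad1': 17, 'ad2': 5}
```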
Pig Queries
Interactive Mode (Grunt shell)
Batch Mode (Pig Script)

Also
Local Mode
Map/Reduce Mode
Requirements
Unix and Windows users need the following:

1.  Hadoop 0.20.2 - http://hadoop.apache.org/common/releases.html
2.  Java 1.6 - http://java.sun.com/javase/downloads/index.jsp (set
    JAVA_HOME to the root of your Java installation)
3. Ant 1.7 - http://ant.apache.org/ (optional, for builds)
4. JUnit 4.5 - http://junit.sourceforge.net/ (optional, for unit tests)
Windows users need to install Cygwin and the Perl package: http://www.cygwin.com/


Download Link:
http://www.gtlib.gatech.edu/pub/apache//pig/pig-0.8.1/
Pig Installing Commands

To install Pig on Red Hat systems:
     $ rpm -ivh --nodeps pig-0.8.0-1x86_64

To start the Grunt Shell:
$ pig
0 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to Hadoop file system at: hdfs://localhost:8020
352 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt>
Pig Statements

LOAD
Loads data from the file system.
Usage
Use the LOAD operator to load data from the file system.

Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited; the records are newline-separated.
1	2	3
4	2	1
8	3	4

In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD
statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type
bytearray.
A = LOAD 'myfile.txt';

A = LOAD 'myfile.txt' USING PigStorage('\t');

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
Pig Statements
 FOREACH
 Generates data transformations based on columns
 of data.
 Syntax : alias = FOREACH { gen_blk | nested_gen_blk };

 Usage
 Use the FOREACH…GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use
 the FILTER operation).
 FOREACH...GENERATE works with relations (outer bags) as well as inner bags:
●    If A is a relation (outer bag), a FOREACH statement could look like this.
●    X = FOREACH A GENERATE f1;

●   If A is an inner bag, a FOREACH statement could look like this.
●   X = FOREACH B {
           S = FILTER A BY 'xyz';
           GENERATE COUNT (S.$0);
    }
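The outer-bag case above, `X = FOREACH A GENERATE f1;`, is a per-tuple column projection. Treating the relation as a list of Python tuples (a stand-in for the Pig examples on these slides), it behaves like:

```python
# Relation A as (f1, f2, f3) tuples, mirroring the LOAD example earlier
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4)]

# X = FOREACH A GENERATE f1;  -- project the first field of every tuple
X = [(f1,) for (f1, f2, f3) in A]
print(X)  # [(1,), (4,), (8,)]
```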
Pig Statements

 GROUP
 Groups the data in one or more relations.
 Syntax

 alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY
 partitioner] [PARALLEL n];

 Usage
 The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the
 group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP
 operation is a relation that includes one tuple per group. This tuple contains two fields:
●   The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.
●   The second field takes the name of the original relation and is type bag.
●   The names of both fields are generated by the system as shown in the example below.
 Note the following about the GROUP/COGROUP and JOIN operators:
●   The GROUP and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN
    creates a flat set of output tuples
Pig Statements

Example For Group
Suppose we have relation A.
A = LOAD 'data' as (f1:chararray, f2:int, f3:int);

DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)

In this example the tuples are grouped using an expression, f2*f3.
X = GROUP A BY f2*f3;

DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
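The grouping above can be reproduced in Python to make the two-field output shape concrete: each output tuple pairs the group key with the bag of matching input tuples (an illustrative sketch, not Pig internals):

```python
from collections import defaultdict

# Relation A from the GROUP example, as (f1, f2, f3) tuples
A = [("r1", 1, 2), ("r2", 2, 1), ("r3", 2, 8), ("r4", 4, 4)]

# X = GROUP A BY f2*f3;  -- one entry per distinct key,
# mapping the "group" key to the bag of matching tuples
X = defaultdict(list)
for t in A:
    X[t[1] * t[2]].append(t)

print(dict(X))
# {2: [('r1', 1, 2), ('r2', 2, 1)], 16: [('r3', 2, 8), ('r4', 4, 4)]}
```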
Pig Statements

FILTER
Selects tuples from a relation based on some
condition.
Syntax: alias = FILTER alias BY expression;

Usage
Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH...GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.

A = LOAD 'data' AS (f1:int, f2:int, f3:int);

X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));

DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
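The filter condition translates directly into a Python predicate over the tuples; running it against a six-row relation matching the DUMP output above confirms which rows survive:

```python
# Relation A with six (f1, f2, f3) tuples, matching the STORE example's data
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
X = [(f1, f2, f3) for (f1, f2, f3) in A
     if f1 == 8 or not (f2 + f3 > f1)]
print(X)  # [(4, 2, 1), (8, 3, 4), (7, 2, 5), (8, 4, 3)]
```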
Pig Statements

STORE
Stores or saves results to the file system.
Syntax

STORE alias INTO 'directory' [USING function];

Usage
Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for
production scripts and batch mode processing.
Note: To debug scripts during development, you can use DUMP to check intermediate results.
Pig Statements
Examples For STORE
In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

STORE A INTO 'myoutput' USING PigStorage('*');

CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
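PigStorage's output format is simply each tuple's fields joined by the chosen delimiter, one tuple per line. Simulating just that formatting step in Python (without the actual file write):

```python
# Relation A from the STORE example
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# STORE A INTO 'myoutput' USING PigStorage('*');
# -- join each tuple's fields with '*', one tuple per output line
lines = ["*".join(str(f) for f in t) for t in A]
print("\n".join(lines))
```

The first line is `1*2*3` and the last is `8*4*3`, matching the CAT output above.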
Pig Latin Example
ads = LOAD '/log/raw/old/ads/month=09/day=01,/log/raw/old/clicks/month=09/day=01' using
com.twitter.elephantbird.pig.proto.LzoProtobuffB64LinePigStore('ad_data');
r1 = foreach ads generate
ad.id as id,
device.carrier_name as carrier_name,
device.device_name as device_name,
device.mobile_model as mobile_model,
ipinfo.city as city_code,
ipinfo.country as country,
ipinfo.ipaddress as ipaddress,
ipinfo.region_code as wifi_Gprs,
site.client_id as client_id,
ad.metrics as metrics,
impressions,
clicks;
g = group r1 by (id,carrier_name,device_name,mobile_model,city_code,country,client_id,wifi_Gprs,metrics);
r = foreach g generate FLATTEN(group), SUM($1.impressions) as imp, SUM($1.clicks) as cl;
rmf /tmp/predicttesnul;
store r into '/tmp/predicttesnul' using PigStorage('\t');
Demos / Contact us
Some hands-on demos follow

Need more information?
Find us on Twitter

twitter.com/azifali

 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

An Overview of Hadoop

  • 1. An Overview of Hadoop Contributors: Hussain, Barak, Durai, Gerrit and Asif
  • 2. Introductions Target audience: programmers and developers who want to learn about Big Data and Hadoop
  • 3. What is Big Data? A technology term for data that has grown too large to be managed with approaches that previously worked well.
  • 4. Apache Hadoop Is an open-source, top-level Apache project based on Google's MapReduce whitepaper. Is a popular project used by several large companies to process massive amounts of data. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
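To make the MapReduce programming model concrete, here is a minimal, illustrative word-count sketch in plain Python (this is not Hadoop code; the function names are our own): a map phase emits (word, 1) pairs, a shuffle groups pairs by key, and a reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big hadoop"])))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In a real cluster the map and reduce phases run in parallel across many machines and the shuffle moves data over the network; the logic per record, however, is no more complicated than this.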
  • 5. Hadoop Is designed to scale! Uses commodity hardware and can be scaled out across several boxes. Processes data in batches. Can process petabyte-scale data.
  • 6. Before you consider Hadoop, define your use case Before using Hadoop or any other "big data" or "NoSQL" solution, consider defining your use case well. Regular databases can solve most common use cases if scaled well. For example, MySQL can handle 100s of millions of records efficiently when scaled properly.
  • 7. Hadoop is best suited when You have massive data - terabytes of data generated every day - and would like to process this data for several group queries in a scalable manner. You have billions of rows of data that need to be dissected for several different reports.
  • 8. Hadoop, Components and Related Projects HDFS - Hadoop Distributed File System. MapReduce - Hadoop's distributed execution framework. Apache Pig - functional dataflow language for Hadoop M/R (Yahoo). Hive - SQL-like language for Hadoop M/R (Facebook). ZooKeeper, etc.
  • 9. Our use case and previous solution Our use case: store and process 100s of millions of records a day. Run 100s of queries - summing and aggregating data. Along the way we had evaluated different NoSQL technologies for various use cases (not necessarily for the one above); MongoDB, Cassandra, Memcached, etc. were implemented for various use cases.
  • 10. Original solution: Sharded MySQL system that could actually process 100s of millions
  • 11. Issues faced 1. Not a known / preferred design approach. 2. Scalability issues which we had to address ourselves. 3. Needed a solution that can handle billions of records per day instead of 100s of millions. 4. Needed a truly proven, scalable solution.
  • 12. Why Hadoop A proven, open-source, highly reliable, distributed data processing platform. Met our use case of processing 100s of millions of logs perfectly. We tuned the deployment to process all data with a maximum latency of 30 minutes.
  • 14. Installation Requirements: Linux; Hadoop 0.20.203.X (current stable version); Java 1.6.x. Installed using RPM. Configuration: single node or multiple nodes.
  • 15. Installation Modes Local (standalone): running on a single node in a single Java process. Pseudo-distributed: running on a single node with separate Java processes. Fully distributed: running on clusters of nodes with separate Java processes.
  • 18. Commands yum install hadoop-0.20 bin/start-all.sh bin/stop-all.sh jps bin/hadoop namenode -format hadoop fs -mkdir /user/demo hadoop fs -ls /user/demo bin/hadoop fs -put conf input bin/hadoop fs -cat output/* hadoop fs -cat /user/demo/sample.txt
  • 19. Sample Physical Architecture (diagram: stream agents on app nodes 1..N feed a collector node; the collector writes to the Hadoop namenode and datanodes; Glue + Pig and a ZooKeeper node run on VMs; results land in a DB)
  • 20. Sample Logical Architecture (diagram: stream agents on app nodes 1..N feed the collector; data flows to multiple datanodes coordinated by the namenode; Glue, Pig and ZooKeeper sit alongside; results land in a DB)
  • 21. Implementation of a Hadoop System BigStreams - Logging Framework Streams is a high-availability, extremely fast, low-resource-usage, real-time log collection framework for terabytes of data. - Key author is Gerrit, our Architect. http://code.google.com/p/bigstreams/ Google Protobuf http://code.google.com/p/protobuf/ for compressing (LZO) the data before transferring it from the application node to the Hadoop node
  • 22. Implementation Data logs are compressed by stream agents and sent to the collector. The collector informs the namenode of new file arrivals. The namenode replies with file sizes, how many blocks there are, and which datanodes each block will be stored on. The collector then sends the blocks of the file directly to the datanodes.
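The block-placement step above can be sketched as a small simulation. This is an illustrative model only, not the real HDFS API: the block size, node names, and round-robin placement are our own simplifications (real HDFS uses large blocks, e.g. 64 MB, and rack-aware replica placement).

```python
# Illustrative sketch of HDFS-style block placement (not the real HDFS API).
BLOCK_SIZE = 4  # bytes; tiny for demonstration only
DATANODES = ["dn1", "dn2", "dn3"]

def plan_blocks(data, replication=2):
    # Split the data into fixed-size blocks and assign each block
    # to `replication` datanodes, round-robin across the cluster.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    plan = []
    for i, block in enumerate(blocks):
        targets = [DATANODES[(i + r) % len(DATANODES)] for r in range(replication)]
        plan.append((block, targets))
    return plan

for block, targets in plan_blocks(b"hello hdfs!"):
    print(block, targets)
```

The collector's job is then simply to stream each block to its target datanodes, which is why writes scale with the number of datanodes rather than bottlenecking on the namenode.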
  • 23. Data Processing with Pig Once data is saved in the HDFS cluster it can be processed using Java programs or by using Apache Pig http://pig.apache.org 1. Apache Pig is a platform for analyzing large data sets. 2. Pig Latin is its language, which presents a simplified manner to run queries. 3. The Pig platform has a compiler which translates Pig queries into MapReduce programs for Hadoop.
  • 24. Pig Queries Interactive mode (Grunt shell) or batch mode (Pig script); each can run in local mode or Map/Reduce mode.
  • 25. Requirements Unix and Windows users need the following: 1. Hadoop 0.20.2 - http://hadoop.apache.org/common/releases.html 2. Java 1.6 - http://java.sun.com/javase/downloads/index.jsp (set JAVA_HOME to the root of your Java installation) 3. Ant 1.7 - http://ant.apache.org/ (optional, for builds) 4. JUnit 4.5 - http://junit.sourceforge.net/ (optional, for unit tests) Windows users need to install Cygwin and the Perl package: http://www.cygwin.com/ Download Link: http://www.gtlib.gatech.edu/pub/apache/pig/pig-0.8.1/
  • 26. Pig Installing Commands To install Pig on Red Hat systems: $ rpm -ivh --nodeps pig-0.8.0-1x86_64 To start the Grunt shell: $ pig 0 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to Hadoop file system at: hdfs://localhost:8020 352 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001 grunt>
  • 27. Pig Statements LOAD Loads data from the file system. Usage Use the LOAD operator to load data from the file system. Examples Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated. 1 2 3 / 4 2 1 / 8 3 4 In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray. A = LOAD 'myfile.txt'; A = LOAD 'myfile.txt' USING PigStorage('\t'); DUMP A; (1,2,3) (4,2,1) (8,3,4)
  • 28. Pig Statements FOREACH Generates data transformations based on columns of data. Syntax : alias = FOREACH { gen_blk | nested_gen_blk }; Usage Use the FOREACH…GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation). FOREACH...GENERATE works with relations (outer bags) as well as inner bags: ● If A is a relation (outer bag), a FOREACH statement could look like this. ● X = FOREACH A GENERATE f1; ● If A is an inner bag, a FOREACH statement could look like this. ● X = FOREACH B { S = FILTER A BY 'xyz'; GENERATE COUNT (S.$0); }
  • 29. Pig Statements GROUP Groups the data in one or more relations. Syntax alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n]; Usage The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields: ● The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key. ● The second field takes the name of the original relation and is type bag. ● The names of both fields are generated by the system as shown in the example below. Note the following about the GROUP/COGROUP and JOIN operators: ● The GROUP and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples
  • 30. Pig Statements Example For Group Suppose we have relation A. A = LOAD 'data' as (f1:chararray, f2:int, f3:int); DUMP A; (r1,1,2) (r2,2,1) (r3,2,8) (r4,4,4) In this example the tuples are grouped using an expression, f2*f3. X = GROUP A BY f2*f3; DUMP X; (2,{(r1,1,2),(r2,2,1)}) (16,{(r3,2,8),(r4,4,4)})
  • 31. Pig Statements FILTER Selects tuples from a relation based on some condition. Syntax: alias = FILTER alias BY expression; Usage Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH... GENERATE operation). FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want. X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1)); DUMP X; (4,2,1) (8,3,4) (7,2,5) (8,4,3)
  • 32. Pig Statements STORE Stores or saves results to the file system. Syntax STORE alias INTO 'directory' [USING function]; Usage Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for production scripts and batch mode processing. Note: To debug scripts during development, you can use DUMP to check intermediate results.
  • 33. Pig Statements Examples For STORE In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter. A = LOAD 'data' AS (a1:int,a2:int,a3:int); DUMP A; (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3) STORE A INTO 'myoutput' USING PigStorage ('*'); CAT myoutput; 1*2*3 4*2*1 8*3*4 4*3*3 7*2*5 8*4*3
  • 34. Pig Latin Example ads = LOAD '/log/raw/old/ads/month=09/day=01,/log/raw/old/clicks/month=09/day=01' using com.twitter.elephantbird.pig.proto.LzoProtobuffB64LinePigStore('ad_data'); r1 = foreach ads generate ad.id as id, device.carrier_name as carrier_name, device.device_name as device_name, device.mobile_model as mobile_model, ipinfo.city as city_code, ipinfo.country as country, ipinfo.ipaddress as ipaddress, ipinfo.region_code as wifi_Gprs, site.client_id as client_id, ad.metrics as metrics, impressions, clicks; g = group r1 by (id,carrier_name,device_name,mobile_model,city_code,country,client_id,wifi_Gprs,metrics); r = foreach g generate FLATTEN(group), SUM($1.impressions) as imp, SUM($1.clicks) as cl; rmf /tmp/predicttesnul; store r into '/tmp/predicttesnul' using PigStorage('\t');
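For readers less familiar with Pig Latin, the aggregation above (group by a compound key, then SUM impressions and clicks per group) is equivalent to the following plain-Python sketch. The rows and the reduced two-field key are hypothetical, made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (ad_id, country, impressions, clicks) -- a reduced
# version of the compound group key used in the Pig script above.
rows = [
    ("ad1", "US", 100, 3),
    ("ad1", "US", 50, 1),
    ("ad2", "IN", 80, 2),
]

# GROUP BY (ad_id, country), then SUM impressions and clicks per group.
totals = defaultdict(lambda: [0, 0])
for ad_id, country, imps, clicks in rows:
    totals[(ad_id, country)][0] += imps
    totals[(ad_id, country)][1] += clicks

for key, (imp, cl) in sorted(totals.items()):
    print(key, imp, cl)
```

The difference, of course, is that Pig compiles its version into MapReduce jobs that perform this group-and-sum across billions of rows spread over the cluster.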
  • 35. Demos / Contact us Some hands-on demos follow. Need more information? Find us on Twitter: twitter.com/azifali