Hadoop and Big Data - My Presentation to a Select Audience
1. HADOOP
BIG DATA
Presented by Chandra Sekhar
2. What is Hadoop?
Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters
of commodity computers using a simple programming
model.
3. PRESENTATION FLOW
1. How Hadoop STORES Data
2. How Hadoop PROCESSES Data
3. Architecture of Hadoop
4. ROI
5. Resources
5. CHALLENGES LIKE THESE
OPPORTUNITIES:
● Out of all the people who sailed between 1997 and 2005, should I target those
who purchased the alcohol package or the spa package?
● Based on the onboard spending of adult men from New York who
have ever sailed with us, who can be targeted to sail on Azamara?
● Which first-time guest will be a high roller?
COST SAVINGS:
● On a sailing, who (and how many) will have genuine complaints vs.
whining?
● Which propulsion system will break next?
PRODUCTIVITY:
● Which employee will quit next?
We have answers to most of these questions somewhere in our warehouses.
7. What is so great about Hadoop?
● Why all this buzz?
● Is it hype?
● Is it another dot-com?
● How does Hadoop handle it?
The next slide is a good example.
10. Hadoop is ideal for
● Write-once, read-many-times operations.
● No edits, no updates.
● Movie files, music files, flight data recorder output, logs, and XML files
are all fine (DB records as well).
11. HOW HADOOP STORES DATA
● Hadoop uses blocks to store files.
● The default block size is 64 MB.
● Every block gets replicated three times.
● A 100 MB file will take up 2 blocks (with a replication factor of 3, that is 6 blocks).
● A 1 GB file is not a problem … 48 blocks (the arithmetic is sketched below).
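A hypothetical helper (plain Java, not a Hadoop API) that just restates the block arithmetic above, assuming the 64 MB default block size and a replication factor of 3:

public class BlockMath {
    static final long BLOCK_SIZE_MB = 64;  // default HDFS block size assumed above
    static final int REPLICATION = 3;      // default replication factor

    // Total block replicas a file of the given size would occupy.
    static long blocksFor(long fileSizeMb) {
        long dataBlocks = (fileSizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB; // round up
        return dataBlocks * REPLICATION;
    }

    public static void main(String[] args) {
        System.out.println(blocksFor(100));  // 2 data blocks x 3 replicas = 6
        System.out.println(blocksFor(1024)); // 16 data blocks x 3 replicas = 48
    }
}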
12. OLD VS NEW
● You can set the replication for older files to 2, and for new files to 3
or even 4 (a sketch follows below).
● You can compress the files.
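A minimal sketch of dialing replication per file with the HDFS Java API, assuming the Hadoop client libraries and a reachable cluster; the paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplicationForOldFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Keep only 2 replicas of an older, rarely read file...
        fs.setReplication(new Path("/archive/logs/2011/app.log"), (short) 2);

        // ...while a hot, frequently read file keeps 3 or even 4 replicas.
        fs.setReplication(new Path("/current/logs/app.log"), (short) 4);
    }
}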
13. More on Blocks..
Because the unit of storage is the block, it does not
really matter how many files there are, or how big
the files are..
But.
Hadoop prefers large files instead of
many small files. Why?
14. Why Large Files?
When a block gets created, the address of the
block's location gets stored in the NameNode's
memory for faster retrieval.
It is not mandated, but it is efficient to have
fewer entries. Usually multiple small files get
merged into a single file (ex: all Assignment
Manager logs for a day into a single huge file), as sketched below.
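A minimal sketch of that daily merge using the HDFS client API; the directory layout and file names are hypothetical assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeDailyLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path("/logs/assignment-manager/2014-05-01");
        Path merged = new Path("/logs/assignment-manager/2014-05-01.merged");

        // Concatenate every small file in the directory into one large file,
        // so the NameNode tracks one set of blocks instead of one entry per tiny file.
        try (FSDataOutputStream out = fs.create(merged)) {
            for (FileStatus status : fs.listStatus(srcDir)) {
                if (!status.isFile()) continue;
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, conf, false); // append this small file
                }
            }
        }
    }
}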
17. MAP REDUCE
Map Function
● Reads the data.
● Usually does the preprocessing.
● Hands the records over to the Reduce function
for further processing.
(Ex: eliminate all records where the age is
less than 18; see the sketch below.)
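A hedged sketch of that map-side filter using the org.apache.hadoop.mapreduce API; the input layout (comma-separated records with the age in the second column) is an assumption for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdultFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        int age = Integer.parseInt(fields[1].trim()); // assumed column layout
        if (age >= 18) {
            // Hand the surviving record on (to a reducer, or straight to
            // output in a map-only job).
            context.write(value, NullWritable.get());
        }
    }
}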
18. More about Processing
● A single huge file (ex: 1 GB) could be processed by
several mappers (usually one block = one mapper,
so about 16 map tasks).
● If the logic is simple, you can disable the reduce
function and the map job alone can apply the logic
(a map-only driver is sketched below).
● A MapReduce job can pick up a web log from our
website, join it to a Siebel table, and write the output
to a TIBCO queue feeding the AS/400 (or to MongoDB
directly).
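A driver sketch for the "disable reduce" case, reusing the hypothetical AdultFilterMapper above and assuming the Hadoop 2 mapreduce API: setting the number of reducers to 0 sends map output straight to HDFS. Paths and class names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyFilterJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "adult-filter");
        job.setJarByClass(MapOnlyFilterJob.class);
        job.setMapperClass(AdultFilterMapper.class);
        job.setNumReduceTasks(0);                 // no reduce phase at all
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/guests.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/data/guests-adults"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}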
27. Same Code in PIG
A = load '/home/cloudera/wordcountproblem' using TextLoader as (data:chararray);
B = foreach A generate FLATTEN(TOKENIZE(data)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into '/home/cloudera/Chandra7' using PigStorage(',');
28. Same Code in HIVE
SELECT word, COUNT(*) FROM input
LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
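The Pig and Hive snippets above are the classic word count. The Java MapReduce version they are being compared to is not in this transcript; a minimal sketch of what it might look like, with illustrative paths and class names:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);            // emit (word, 1) per token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/home/cloudera/wordcountproblem"));
        FileOutputFormat.setOutputPath(job, new Path("/home/cloudera/wordcount-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}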
29. More on data processing
● Map function output is always sorted by
the key.
● Map output is intermediate data, so it is not
saved in HDFS; it lives only on the local node and
gets deleted after the reducer finishes.
34. ROI
One study: storing and processing 1 TB
Traditional RDBMS: $37,000 / year
Data appliance: $5,000 / year
Hadoop cluster: $2,000 / year
Source: HBR, Big Data @ Work, page 60
35. Wikibon Study
BREAK-EVEN TIMEFRAME
Big data approach: 4 months
Traditional DW appliance approach: 26 months