Hadoop and Big Data Training
Lessons learned
0
What’s Cloudera?
 Leading company in the NoSQL and cloud computing space
 Most popular Hadoop distribution
 Founded and staffed by former employees of Google, Facebook, Oracle
and other leading tech companies
 Sample client list of billion-dollar companies:
eBay, JPMorgan Chase, Experian, Groupon, Morgan Stanley, Nokia,
Orbitz, National Cancer Institute, RIM, The Walt Disney Company
 Consulting and training services
1
Why this training?
 MongoDB is great for OLTP
 Not an OLAP DB, not really aspiring to become one
 Big Data coming in, need for more advanced analysis
processes
2
Intended audience
 Software engineers and friends 
3
What’s Hadoop?
 The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models
 Modules:
 Hadoop Common
 Hadoop Distributed File System (HDFS™)
 Hadoop YARN
 Hadoop MapReduce
4
How does it fit in our Big Goal?
 MongoDB for OLTP
 RDBMS (MySQL) for config data
 Hadoop for OLAP
5
What’s MapReduce?
 MapReduce is a programming model for processing large data
sets, and the name of an implementation of the model by
Google. MapReduce is typically used to do distributed
computing on clusters of computers. — Wikipedia
 Practically?
 Can perform computations in a distributed fashion
 Highly scalable
 Inherently highly available
 By design fault tolerant
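The model itself fits in a few lines. Below is a minimal, single-machine Python sketch of MapReduce-style word count — a conceptual illustration only, none of these functions are Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # map: <offset, line> -> stream of <word, 1> pairs
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

def reduce_phase(groups):
    # reduce: <word, [counts]> -> <word, total>
    return [(k, sum(vs)) for k, vs in groups]

counts = reduce_phase(shuffle(map_phase(["the cat", "the hat"])))
# counts == [('cat', 1), ('hat', 1), ('the', 2)]
```

In Hadoop the same three steps run distributed: map tasks near the data, a cluster-wide shuffle, and parallel reduce tasks.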
6
Bindings
 Native Java
 Any language, even scripting ones, via Hadoop Streaming
7
MapReduce framework vs. MapReduce functionality
 Several NoSQL technologies provide MR functionality
8
MR functionality
 A compromise…
 e.g. MongoDB
 CouchDB select * from foo; ;;
9
MapReduce V1 vs. MapReduce V2
 MR V1 cannot scale past 4,000 nodes per cluster
 More important to our goals, MR V1 is monolithic
10
MR V2 YARN
 Pluggable implementations on top of Hadoop
 Whole new set of problems can be solved:
 Graph processing
 MPI
11
MR V1 Architecture
12
MR V1 daemons
 client
 NameNode (HDFS)
 JobTracker
 DataNode(HDFS) + TaskTracker
13
MR V2 Architecture
14
MR V2 daemons
 Client
 ResourceManager / ApplicationsManager
 NodeManager
 Application Master (resource containers)
15
Data Locality in Hadoop
 First replica placed on the client’s node (or a random node if the
client is off-cluster)
 Second replica placed off-rack
 Third replica placed on the same rack as the second, but on a different node
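The placement policy above can be sketched as a small function. Names and topology are hypothetical — this is not HDFS's actual BlockPlacementPolicy code:

```python
import random

def place_replicas(client, topology):
    # topology maps rack -> list of nodes; client is a node in some rack
    client_rack = next(r for r, nodes in topology.items() if client in nodes)
    # 1st replica: the client's own node
    first = client
    # 2nd replica: a node on a different rack
    second_rack = random.choice([r for r in topology if r != client_rack])
    second = random.choice(topology[second_rack])
    # 3rd replica: same rack as the 2nd, but a different node
    third = random.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third]

racks = {'rack1': ['n1', 'n2'], 'rack2': ['n3', 'n4'], 'rack3': ['n5', 'n6']}
replicas = place_replicas('n1', racks)
# replicas[0] is 'n1'; replicas[1] and replicas[2] share some other rack
```

The design trades write bandwidth (one off-rack hop) for durability against a whole-rack failure.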
16
HDFS - Architecture
 Hot
 Very large files
 Streaming data access (seek time ~<1% transfer time)
 Commodity hardware (no iPhones…)
 Not
 Low-latency data access
 Lots of small files
 Multiple writers, arbitrary file modification
17
HDFS – NameNode
 NameNode (master)
 Filesystem tree
 Metadata for all files and directories
 Namespace image and edit log
 Secondary Namenode
 Not a backup node!
 Periodically merges edit log into namespace image
 Could take 30 mins to come back online
18
HDFS HA - NameNode
 2.x Hadoop brings in HDFS HA
 Active-standby config for NameNodes
 Gotchas:
 Shared storage for edit log
 Datanodes send block reports to both NameNodes
 Failover needs to be transparent to clients
19
HDFS – Read
20
HDFS - Read
 Client asks the NameNode for the file (locations of the first 10 blocks)
 NameNode returns the addresses of the datanodes holding each block
 Client reads from those datanodes directly
 Blocks are read in order
21
HDFS - Write
22
HDFS - Write
 Initial RPC call to create the file
 NameNode checks permissions, whether the file already exists, etc.
 As data is written it is buffered in a data queue on the client, which
asks the NameNode for datanodes to store it
 The chosen datanodes form a replication pipeline
 An ack queue verifies that all replicas have been written
 Close file
23
Job Configuration
 setInputFormatClass
 setOutputFormatClass
 setMapperClass
 setReducerClass
 set(Map)OutputKeyClass
 set(Map)OutputValueClass
 setNumReduceTasks
24
Job Configuration
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
OR job.submit();
25
Job Configuration
 Simple to invoke:
 bin/hadoop jar WordCount inputPath outputPath
26
Map Reduce phases
27
Mapper – Life cycle
 Mapper inputs <K1,V1> outputs <K2,V2>
28
Shuffle and Sort
 All same keys are guaranteed to end up in the same reducer,
sorted by key
 Mapper output <K2,V2>: <‘the’,1>, <‘the’,2>, <‘cat’,1>
 Reducer input <K2,[V2]>: <‘cat’,[1]>, <‘the’,[1,2]>
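The grouping guarantee can be checked with the slide's own data in a few lines of Python — a simulation of what the framework's shuffle does, not Hadoop code:

```python
from collections import defaultdict

# Simulated mapper output from the slide
mapper_output = [('the', 1), ('the', 2), ('cat', 1)]

# Group every value under its key, as the shuffle does
groups = defaultdict(list)
for key, value in mapper_output:
    groups[key].append(value)

reducer_input = sorted(groups.items())  # keys reach each reducer in sorted order
# reducer_input == [('cat', [1]), ('the', [1, 2])]
```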
29
Reducer – Life cycle
 Reducer inputs <K2, [V2]> outputs <K3, V3>
30
Hadoop interfaces and classes
 >=0.23 new API favoring abstract classes
 <0.23 old API with interfaces
 Packages mapred.* OLD API, mapreduce.* NEW API
31
Speculative execution
 At least one minute into a mapper or reducer, the JobTracker will
decide based on the progress of the task
 Each task’s progress is compared to the average progress, against a
configurable threshold
 The task is relaunched on a different node and the copies race…
 Sometimes not wanted
 Cluster utilization
 Non idempotent partial output (OutputCollector)
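A hedged sketch of the decision rule above — the function name, threshold value, and signature are illustrative, not Hadoop's actual scheduler code:

```python
def should_speculate(task_progress, avg_progress, runtime_secs,
                     threshold=0.2, min_runtime_secs=60):
    """Launch a speculative copy only for tasks that have run for at
    least a minute and lag the average progress of their peers by more
    than a (configurable) threshold."""
    if runtime_secs < min_runtime_secs:
        return False
    return (avg_progress - task_progress) > threshold

should_speculate(0.30, 0.80, 120)   # a straggler: worth racing
should_speculate(0.75, 0.80, 120)   # close to average: leave it alone
```

This also shows why speculation is sometimes disabled: every speculative copy burns cluster capacity, and a non-idempotent task may emit partial output twice.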
32
Input Output Formats
 InputFormat<K,V> → FileInputFormat<K,V> → TextInputFormat,
KeyValueTextInputFormat, SequenceFileInputFormat
 Default TextInputFormat: key = byte offset, value = line
 KeyValueTextInputFormat (key \t value)
 Binary splittable format
 Corresponding Output formats
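What the default format emits is easy to mimic: the key is the byte offset of each line, the value the line itself. A Python sketch — not the Hadoop class, and it ignores split boundaries:

```python
def text_input_records(data: bytes):
    # Mimics TextInputFormat's <LongWritable, Text> records:
    # key = byte offset of the line, value = the line contents
    offset = 0
    for line in data.split(b"\n"):
        yield (offset, line.decode())
        offset += len(line) + 1  # +1 for the newline delimiter

records = list(text_input_records(b"the cat\nthe hat"))
# records == [(0, 'the cat'), (8, 'the hat')]
```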
33
Compression
 The billion files problem
 300 B/file × 10^9 files → 300 GB of RAM
 Big Data storage
 Solutions:
 Containers
 Compression
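The arithmetic behind the problem above — the NameNode keeps every file's metadata in RAM, so small files add up fast:

```python
metadata_per_file = 300        # approximate bytes of NameNode metadata per file
files = 10 ** 9                # a billion small files
ram_bytes = metadata_per_file * files
# ram_bytes == 300_000_000_000, i.e. ~300 GB of RAM just for metadata
```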
34
Containers
 HAR (splittable)
 Sequence Files, RC files, Avro files (splittable, compressible)
35
Compression codecs
 LZO, LZ4 and Snappy codecs are the best value for money in compression speed
 Bzip2 offers native splitting but can be slow
36
Long story short
 Compression + sequence files
 Compression that supports splitting
 Split file into chunks in application layer with chunk size
aligned to HDFS block size
 Don’t bother
37
Partitioner
 Default is HashPartitioner
 Why implement our own partitioner?
 Sample case: Total ordering
 1 reducer
 Multiple reducers?
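The default can be mirrored in Python. Hadoop's HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; here Python's `hash` stands in for `hashCode`:

```python
def hash_partition(key, num_reducers):
    # Mask to a non-negative value, then take it modulo the reducer count,
    # as HashPartitioner does in Java
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Every occurrence of a key lands in the same reducer bucket...
p = hash_partition('the', 4)
assert p == hash_partition('the', 4) and 0 <= p < 4
# ...but buckets are unordered, which is why total ordering needs either
# a single reducer or a smarter partitioner.
```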
38
Partitioner
 TotalOrderPartitioner
 Samples the input to determine partition boundaries, spreading load
evenly across reducers for maximum performance
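A sketch of the idea with hypothetical helper names — in Hadoop this is InputSampler writing a partition file that TotalOrderPartitioner then reads:

```python
import bisect

def split_points(sample, num_reducers):
    # Choose num_reducers - 1 boundary keys from a sorted sample so each
    # reducer receives a roughly equal share of the key space
    s = sorted(sample)
    step = len(s) // num_reducers
    return [s[(i + 1) * step] for i in range(num_reducers - 1)]

def partition(key, boundaries):
    # Reducer i handles the i-th key range; concatenating the sorted
    # outputs of reducers 0..n-1 then yields a totally ordered result
    return bisect.bisect_left(boundaries, key)

bounds = split_points(list('hgfedcba'), 2)
partition('b', bounds)   # goes to the first reducer
partition('g', bounds)   # goes to the second reducer
```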
39
Hadoop Ecosystem
 Pig
 Apache Pig is a platform for analyzing large data sets. Pig's
language, Pig Latin, lets you specify a sequence of data
transformations such as merging data sets, filtering them, and
applying functions to records or groups of records.
 Procedural language, lazy evaluated, pipeline split support
 Closer to developers (or relational algebra aficionados) than
not
40
Hadoop Ecosystem
 Hive
 Access to Hadoop clusters for non-developers
 Data analysts, data scientists, statisticians, SDMs etc
 Subset of SQL-92 plus Hive extensions
 Insert overwrite, no update or delete
 No transactions
 No indexes, parallel scanning
 “Near” real time
 Only equality joins
41
Hadoop Ecosystem
 Mahout
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier
42
Hadoop ecosystem
 Algorithmic categories:
 Classification
 Clustering
 Pattern mining
 Regression
 Dimension reduction
 Recommendation engines
 Vector similarity
…
43
Reporting Services
 Pentaho, MicroStrategy and Jasper can all hook up to a Hadoop
cluster
44
References
 Hadoop: The Definitive Guide, 3rd edition
 hadoop.apache.org
 Hadoop in practice
 Cloudera Custom training slides
45


Editor's Notes

  • #10 Combiners are invoked by design in MongoDB
  • #39 One reducer is the default configuration