2. What’s Cloudera?
Leading company in the NoSQL and cloud computing space
Most popular Hadoop distribution
Staff includes former employees of Google, Facebook, Oracle and
other leading tech companies
Sample client list (billion-dollar companies):
eBay, JPMorgan Chase, Experian, Groupon, Morgan Stanley, Nokia,
Orbitz, National Cancer Institute, RIM, The Walt Disney Company
Consulting and training services
3. Why this training?
MongoDB is great for OLTP
Not an OLAP DB, not really aspiring to become one
Big Data coming in, need for more advanced analysis
processes
5. What’s Hadoop?
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models
Modules:
Hadoop Common
Hadoop Distributed File System (HDFS™)
Hadoop YARN
Hadoop MapReduce
6. How does it fit in our Big Goal?
MongoDB for OLTP
RDBMS (MySQL) for config data
Hadoop for OLAP
17. Data Locality in Hadoop
First replica placed on the client's node (or on a random node if
the client is off-cluster)
Second replica placed off-rack
Third replica placed in the same rack as the second, but on a different node
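The placement rules above can be sketched in a few lines of Python. This is a toy model, not Hadoop code: the `topology` dict, the `(rack, node)` tuples and the `place_replicas` name are all hypothetical, and the sketch assumes at least two racks with two or more nodes each.

```python
import random

def place_replicas(topology, client_node=None, rng=random):
    """Pick three (rack, node) targets following the placement rules.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    client_node: (rack, node) of the writer, or None for an off-cluster client.
    """
    all_nodes = [(rack, node) for rack, nodes in topology.items() for node in nodes]

    # First replica: the client's own node, or a random node for off-cluster clients.
    first = client_node if client_node in all_nodes else rng.choice(all_nodes)

    # Second replica: any node on a different rack from the first.
    off_rack = [n for n in all_nodes if n[0] != first[0]]
    second = rng.choice(off_rack)

    # Third replica: same rack as the second, but a different node.
    same_rack = [n for n in all_nodes if n[0] == second[0] and n != second]
    third = rng.choice(same_rack)

    return [first, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(topology, client_node=("rack1", "n1")))
```

The result always spans exactly two racks, which is the trade-off the policy aims for: one copy survives a rack failure, while only one write crosses racks.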
18. HDFS - Architecture
Hot
Very large files
Streaming data access (seek time kept to roughly 1% of transfer time)
Commodity hardware (no iphones…)
Not
Low-latency data access
Lots of small files
Multiple writers, arbitrary file modification
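The 1% seek-time figure in the list above translates into a rough amount of data to read per seek. A small worked example follows; the 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def block_size_for_seek_ratio(seek_seconds, transfer_bytes_per_second, seek_ratio=0.01):
    """Return how many bytes to read per seek so that the seek costs
    `seek_ratio` of the transfer time."""
    transfer_seconds = seek_seconds / seek_ratio
    return transfer_seconds * transfer_bytes_per_second

# Assumed hardware figures: 10 ms per seek, 100 MB/s sustained transfer.
size = block_size_for_seek_ratio(seek_seconds=0.010,
                                 transfer_bytes_per_second=100 * 10**6)
print(f"{size / 10**6:.0f} MB per seek")  # 100 MB per seek
```

Under these assumptions each seek should be followed by about 100 MB of sequential reading, which is why HDFS favors very large files and large blocks.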
19. HDFS – NameNode
NameNode (master)
Filesystem tree
Metadata for all files and directories
Namespace image and edit log
Secondary NameNode
Not a backup node!
Periodically merges edit log into namespace image
After a failure, the NameNode could take 30 mins to come back online
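The merge of the edit log into the namespace image can be pictured with a toy model. The dict-based image and the `(op, path, meta)` edit tuples below are hypothetical stand-ins chosen for illustration, not the real on-disk formats.

```python
def checkpoint(namespace_image, edit_log):
    """Apply every logged edit to a copy of the image.

    Returns (new_image, empty_log). The original image is left untouched.
    """
    image = dict(namespace_image)
    for op, path, meta in edit_log:
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown edit: {op}")
    return image, []

image = {"/a": {"replication": 3}}
log = [("create", "/b", {"replication": 2}), ("delete", "/a", None)]
new_image, new_log = checkpoint(image, log)
print(new_image)  # {'/b': {'replication': 2}}
```

Keeping the log short this way is what bounds restart time: a restarting NameNode only has to replay the edits made since the last checkpoint.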
20. HDFS HA - NameNode
Hadoop 2.x introduces HDFS HA
Active-standby config for NameNodes
Gotchas:
Shared storage for edit log
Datanodes send block reports to both NameNodes
NameNode failover needs to be transparent to clients
24. HDFS - Write
Initial RPC call to the NameNode to create the file
Permission and file-exists checks in the NameNode, etc.
As we write data, the client keeps a data queue and asks the
NameNode for datanodes to store the data
The list of datanodes forms a pipeline
Ack queue to verify all replicas have been written
Close file
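The data queue, pipeline and ack queue can be sketched as a small simulation. This is a heavily simplified model under stated assumptions: datanodes are plain dicts, there are no failures or network calls, and `write_file` is a hypothetical name rather than anything in the HDFS client.

```python
from collections import deque

def write_file(packets, pipeline):
    """Simulate a client pushing packets through a datanode pipeline.

    pipeline: list of dicts standing in for datanodes; each dict maps a
    packet sequence number to the stored bytes.
    Returns True when every packet was acknowledged by every datanode.
    """
    data_queue = deque(enumerate(packets))
    ack_queue = deque()

    while data_queue:
        seq, payload = data_queue.popleft()
        ack_queue.append((seq, payload))   # held until every replica confirms
        for datanode in pipeline:          # forwarded from node to node
            datanode[seq] = payload
        if all(seq in datanode for datanode in pipeline):
            ack_queue.popleft()            # all replicas acknowledged

    return len(ack_queue) == 0             # safe to close the file

nodes = [{}, {}, {}]
print(write_file([b"ab", b"cd"], nodes))  # True
```

A packet stays in the ack queue until every datanode in the pipeline has it, so after a failure the client still holds everything that is not yet fully replicated.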
30. Shuffle and Sort
All values for the same key are guaranteed to end up in the same
reducer, and each reducer sees its keys in sorted order
Mapper output <K2,V2>: <‘the’,1>, <‘the’,2>, <‘cat’,1>
Reducer input <K2,[V2]>: <‘cat’,[1]>, <‘the’,[1,2]>
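The grouping above can be reproduced with a minimal in-memory sketch. It only illustrates the group-by-key and sort-by-key contract on a single machine; the function name is hypothetical and nothing here reflects how Hadoop moves data between nodes.

```python
from collections import defaultdict

def shuffle_and_sort(mapper_output):
    """Group values by key, then return (key, values) pairs sorted by key."""
    grouped = defaultdict(list)
    for key, value in mapper_output:
        grouped[key].append(value)
    return sorted(grouped.items())

mapper_output = [("the", 1), ("the", 2), ("cat", 1)]
print(shuffle_and_sort(mapper_output))  # [('cat', [1]), ('the', [1, 2])]
```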
32. Hadoop interfaces and classes
>=0.23 new API favoring abstract classes
<0.23 old API with interfaces
Packages: org.apache.hadoop.mapred.* is the OLD API, org.apache.hadoop.mapreduce.* is the NEW API
33. Speculative execution
At least one minute into a mapper or reducer, the JobTracker
decides whether to speculate based on the progress of the task
Each task's progress is compared against the average progress
(threshold is configurable)
Relaunch the task on a different node and have the two attempts race
Sometimes not wanted
Cluster utilization
Non idempotent partial output (OutputCollector)
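The decision rule described above can be sketched as follows. The 60-second minimum runtime mirrors the "one minute" on the slide; the `progress_gap` of 0.2 and the task dict layout are assumptions made for illustration, not values taken from a real cluster configuration.

```python
def tasks_to_speculate(tasks, now, min_runtime=60.0, progress_gap=0.2):
    """Return ids of tasks that lag far enough behind the average progress.

    tasks: dict task_id -> (start_time_seconds, progress in [0, 1]).
    A task qualifies only after running for at least `min_runtime` seconds
    and when it trails the average progress by more than `progress_gap`.
    """
    if not tasks:
        return []
    average = sum(progress for _, progress in tasks.values()) / len(tasks)
    return sorted(
        task_id
        for task_id, (start, progress) in tasks.items()
        if now - start >= min_runtime and average - progress > progress_gap
    )

tasks = {"m1": (0, 0.9), "m2": (0, 0.8), "m3": (0, 0.2), "m4": (100, 0.0)}
print(tasks_to_speculate(tasks, now=120))  # ['m3']
```

Here `m4` is also behind, but it has only run for 20 seconds, so it is not yet a candidate.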
37. Compression codecs
LZO, LZ4 and Snappy codecs offer the best value for money in compression speed
bzip2 supports splitting natively but can be slow
38. Long story short
Compression + sequence files
Compression that supports splitting
Split file into chunks in application layer with chunk size
aligned to HDFS block size
Don’t bother
39. Partitioner
Default is HashPartitioner
Why implement our own partitioner?
Sample case: Total ordering
1 reducer
Multiple reducers?
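One way to picture the multiple-reducer case is to contrast hash partitioning with range partitioning. This is a sketch only: `zlib.crc32` stands in for a key hash, the split points are hand-picked rather than sampled from the data, and both function names are hypothetical.

```python
import bisect
import zlib

def hash_partition(key, num_reducers):
    """Stand-in for a hash partitioner: stable hash modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def range_partition(key, split_points):
    """Range partitioning: reducer 0 gets keys below split_points[0],
    reducer i gets keys from split_points[i-1] up to (not including)
    split_points[i], and the last reducer gets everything else."""
    return bisect.bisect_right(split_points, key)

keys = ["the", "cat", "mouse", "apple", "zebra", "house"]
split_points = ["h", "p"]  # assumed boundaries, giving three reducers

reducers = [[] for _ in range(len(split_points) + 1)]
for key in keys:
    reducers[range_partition(key, split_points)].append(key)

# Each reducer sorts its own keys; concatenating the outputs in reducer
# order yields a globally sorted result.
globally_sorted = [k for part in reducers for k in sorted(part)]
print(globally_sorted)
```

With hash partitioning each reducer's output is sorted but the files cannot simply be concatenated. With range partitioning they can, provided the split points divide the key space evenly enough to keep the reducers balanced.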
41. Hadoop Ecosystem
Pig
Apache Pig is a platform for analyzing large data sets. Pig's
language, Pig Latin, lets you specify a sequence of data
transformations such as merging data sets, filtering them, and
applying functions to records or groups of records.
Procedural language, lazy evaluated, pipeline split support
Closer to developers (or relational algebra aficionados) than
not
42. Hadoop Ecosystem
Hive
Access to Hadoop clusters for non-developers
Data analysts, data scientists, statisticians, SDMs etc
Subset of SQL-92 plus Hive extensions
Insert overwrite, no update or delete
No transactions
No indexes, parallel scanning
“Near” real time
Only equality joins
43. Hadoop Ecosystem
Mahout
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier