Hadoop and Big Data Training
Lessons learned
What’s Cloudera?
 Leading company in the NoSQL and cloud computing space
 Most popular Hadoop distribution
 Staff drawn from Google, Facebook, Oracle and other leading tech
companies
 Sample client list (billion-dollar companies):
eBay, JPMorgan Chase, Experian, Groupon, Morgan Stanley, Nokia,
Orbitz, National Cancer Institute, RIM, The Walt Disney Company
 Consulting and training services
Why this training?
 MongoDB is great for OLTP
 Not an OLAP DB, not really aspiring to become one
 Big Data coming in, need for more advanced analysis
processes
Intended audience
 Software engineers and friends 
What’s Hadoop?
 The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models
 Modules:
 Hadoop Common
 Hadoop Distributed File System (HDFS™)
 Hadoop YARN
 Hadoop MapReduce
How does it fit in our Big Goal?
 MongoDB for OLTP
 RDBMS (MySQL) for config data
 Hadoop for OLAP
What’s MapReduce?
 MapReduce is a programming model for processing large data
sets, and the name of an implementation of the model by
Google. MapReduce is typically used to do distributed
computing on clusters of computers. (Source: Wikipedia)
 Practically?
 Can perform computations in a distributed fashion
 Highly scalable
 Inherently highly available
 Fault tolerant by design
Bindings
 Native Java
 Any language, even scripting ones, using Hadoop Streaming
MapReduce framework vs. MapReduce functionality
 Several NoSQL technologies provide MR functionality
MR functionality
 A compromise…
 e.g. MongoDB
 CouchDB: select * from foo;
MapReduce V1 vs. MapReduce V2
 MR V1 cannot scale past 4k nodes per cluster
 More important to our goals, MR V1 is monolithic
MR V2 YARN
 Pluggable implementations on top of Hadoop
 Whole new set of problems can be solved:
 Graph processing
 MPI
MR V1 Architecture
MR V1 daemons
 Client
 NameNode (HDFS)
 JobTracker
 DataNode (HDFS) + TaskTracker
MR V2 Architecture
MR V2 daemons
 Client
 ResourceManager / ApplicationsManager
 NodeManager
 ApplicationMaster (resource containers)
Data Locality in Hadoop
 First replica placed on the client’s node (or a random node if the
client is off-cluster)
 Second replica on a different rack
 Third replica on the same rack as the second, but on a different node
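The placement rule above can be sketched in plain Java. This is an illustration only, not HDFS code: the `Node` class, `placeReplicas`, and the rack names are hypothetical, and a real cluster also weighs things like node load and free space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class ReplicaPlacementSketch {
    // Hypothetical model of a cluster node; not an HDFS class.
    static final class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    // Picks three nodes following the rule on the slide: first on the writer's node,
    // second on another rack, third on the second's rack but on a different node.
    static List<Node> placeReplicas(Node writer, List<Node> cluster, Random rnd) {
        List<Node> chosen = new ArrayList<>();
        // An off-cluster client (writer == null) has no local node, so pick at random.
        Node first = (writer != null) ? writer : cluster.get(rnd.nextInt(cluster.size()));
        chosen.add(first);
        Node second = pick(cluster, rnd, chosen, first.rack, false);
        chosen.add(second);
        chosen.add(pick(cluster, rnd, chosen, second.rack, true));
        return chosen;
    }

    // Returns a random unused node whose rack matches (sameRack = true)
    // or differs from (sameRack = false) the given rack.
    private static Node pick(List<Node> cluster, Random rnd, List<Node> used,
                             String rack, boolean sameRack) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) {
            if (!used.contains(n) && n.rack.equals(rack) == sameRack) {
                candidates.add(n);
            }
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no node satisfies the placement rule");
        }
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
            new Node("n1", "rackA"), new Node("n2", "rackA"),
            new Node("n3", "rackB"), new Node("n4", "rackB"));
        for (Node n : placeReplicas(cluster.get(0), cluster, new Random())) {
            System.out.println(n.name + " on " + n.rack);
        }
    }
}
```

With the writer on n1, the second and third replicas always land on the two rackB nodes, so losing a whole rack never loses every copy.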
HDFS - Architecture
 Hot
 Very large files
 Streaming data access (seek time ~<1% transfer time)
 Commodity hardware (no iphones…)
 Not
 Low-latency data access
 Lots of small files
 Multiple writers, arbitrary file modification
HDFS – NameNode
 NameNode (master)
 Filesystem tree
 Metadata for all files and directories
 Namespace image and edit log
 Secondary NameNode
 Not a backup node!
 Periodically merges edit log into namespace image
 A failed NameNode could take 30 mins to come back online
HDFS HA - NameNode
 Hadoop 2.x brings in HDFS HA
 Active-standby config for NameNodes
 Gotchas:
 Shared storage for edit log
 DataNodes send block reports to both NameNodes
 NameNode failover needs to be transparent to clients
HDFS - Read
 Client requests file from NameNode (for first 10 blocks)
 NameNode returns addresses of DataNodes
 Client contacts DataNodes directly
 Blocks are read in order
HDFS - Write
 Initial RPC call to create the file
 Permissions/file-exists checks in the NameNode etc.
 As we write, data goes into a queue in the client, which asks the
NameNode for DataNodes to store it
 The list of DataNodes forms a pipeline
 An ack queue verifies that all replicas have been written
 Close file
Job Configuration
 setInputFormatClass
 setOutputFormatClass
 setMapperClass
 setReducerClass
 set(Map)OutputKeyClass
 set(Map)OutputValueClass
 setNumReduceTasks
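Job Configuration
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount"); // new Job(conf, name) is deprecated in 2.x
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);      // Map and Reduce are the job's own Mapper/Reducer classes
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);        // blocks until done; use job.submit() to return immediately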
Job Configuration
 Simple to invoke:
 bin/hadoop jar WordCount inputPath outputPath
Map Reduce phases
Mapper – Life cycle
 Mapper inputs <K1,V1> outputs <K2,V2>
Shuffle and Sort
 All values for the same key are guaranteed to end up in the same
reducer, sorted by key
 Mapper output <K2,V2>: <‘the’,1>, <‘the’,2>, <‘cat’,1>
 Reducer input <K2,[V2]>: <‘cat’,[1]>, <‘the’,[1,2]>
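The grouping step can be imitated in a few lines of plain Java. This is a single-process sketch, not Hadoop code: `ShuffleSortSketch` and `shuffle` are hypothetical names, and a `TreeMap` stands in for the framework's sort. In a real job the order of values within one key is not guaranteed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleSortSketch {
    // Groups mapper output pairs by key. The TreeMap keeps keys sorted,
    // mirroring the "sorted by key" view a reducer gets.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("the", 1), Map.entry("the", 2), Map.entry("cat", 1));
        System.out.println(shuffle(mapperOutput)); // prints {cat=[1], the=[1, 2]}
    }
}
```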
Reducer – Life cycle
 Reducer inputs <K2, [V2]> outputs <K3, V3>
Hadoop interfaces and classes
 >=0.23 new API favoring abstract classes
 <0.23 old API with interfaces
 Packages: mapred.* is the old API, mapreduce.* is the new API
Speculative execution
 At least one minute into a mapper or reducer, the JobTracker
decides whether to speculate based on the task’s progress
 Each task’s progress is compared against the average progress
(configurable threshold)
 Relaunch the task on a different node and have the two attempts race
 Sometimes not wanted
 Cluster utilization
 Non-idempotent partial output (OutputCollector)
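A rough sketch of the decision described above, in plain Java. The one-minute minimum comes from the slide; the 0.2 progress gap is an assumed value for illustration, and `SpeculationSketch` and `shouldSpeculate` are hypothetical names, not the JobTracker's actual code.

```java
final class SpeculationSketch {
    static final long MIN_RUNTIME_MS = 60_000; // "at least one minute" from the slide
    static final double GAP = 0.2;             // assumed threshold; configurable in practice

    // A task becomes a speculation candidate once it has run long enough and its
    // progress (0.0 to 1.0) trails the average progress of its peers by more than GAP.
    static boolean shouldSpeculate(long runtimeMs, double taskProgress, double avgProgress) {
        return runtimeMs >= MIN_RUNTIME_MS && taskProgress < avgProgress - GAP;
    }

    public static void main(String[] args) {
        System.out.println(shouldSpeculate(90_000, 0.30, 0.80)); // true: a straggler
        System.out.println(shouldSpeculate(30_000, 0.10, 0.80)); // false: too early to judge
        System.out.println(shouldSpeculate(90_000, 0.70, 0.80)); // false: within the gap
    }
}
```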
Input Output Formats
 InputFormat<K,V> -> FileInputFormat<K,V> -> TextInputFormat,
KeyValueTextInputFormat, SequenceFileInputFormat
 Default TextInputFormat: key=byte offset, value=line
 KeyValueTextInputFormat: key \t value
 SequenceFileInputFormat: binary splittable format
 Corresponding output formats
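To make the key/value split concrete, here is a plain Java sketch of how a tab-separated line maps to a record. `KeyValueLineSketch` and `split` are hypothetical names; the sketch only imitates the behaviour and does not use the Hadoop classes.

```java
final class KeyValueLineSketch {
    // Splits a line at the first tab: the text before it is the key, the rest is the value.
    // A line without a tab becomes the key, with an empty value.
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("user42\tlogin\tok");
        System.out.println("key=" + kv[0]);   // prints key=user42
        System.out.println("value=" + kv[1]); // value keeps the second tab: login<TAB>ok
    }
}
```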
Compression
 The billion files problem
 300 B/file * 10^9 files -> 300 GB of NameNode RAM
 Big Data storage
 Solutions:
 Containers
 Compression
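The arithmetic behind the billion files problem, spelled out in Java. The 300 bytes per file is the slide's estimate; `NameNodeMemorySketch` and `heapBytes` are hypothetical names used only for this illustration.

```java
final class NameNodeMemorySketch {
    // Estimated metadata footprint in bytes for a given file count and per-file cost.
    // multiplyExact throws instead of silently overflowing on absurd inputs.
    static long heapBytes(long files, long bytesPerFile) {
        return Math.multiplyExact(files, bytesPerFile);
    }

    public static void main(String[] args) {
        long bytes = heapBytes(1_000_000_000L, 300);
        System.out.println(bytes / 1_000_000_000L + " GB"); // prints 300 GB
    }
}
```

Packing many small files into one container file lowers the file count, which is the only term in this product we control.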
Containers
 HAR (splittable)
 Sequence files, RC files, Avro files (splittable, compressible)
Compression codecs
 LZO, LZ4 and Snappy offer the best value for money in compression speed
 Bzip2 offers native splitting but can be slow
Long story short
 Compression + sequence files
 Compression that supports splitting
 Split file into chunks in application layer with chunk size
aligned to HDFS block size
 Don’t bother
Partitioner
 Default is HashPartitioner
 Why implement our own partitioner?
 Sample case: Total ordering
 1 reducer
 Multiple reducers?
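The default partitioner boils down to one line of arithmetic, shown here as a plain Java sketch. `HashPartitionSketch` is a hypothetical name, and `String` keys are used for convenience; Hadoop key types such as `Text` compute their own hash codes, so the actual partition numbers would differ.

```java
final class HashPartitionSketch {
    // Mask off the sign bit so the result is never negative, then take the remainder.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] { "the", "cat", "sat" }) {
            System.out.println(key + " -> reducer " + partition(key, 4));
        }
    }
}
```

Equal keys always reach the same reducer, but neighbouring keys scatter across reducers, so concatenating the reducer outputs does not give a globally sorted result. That is the gap a custom partitioner fills.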
Partitioner
 TotalOrderPartitioner
 Sample the input to pick partition boundaries that balance the
reducers for maximum performance
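The idea behind a total-order partitioner can be sketched as a range lookup in plain Java. `RangePartitionSketch` and the split points "g" and "p" are hypothetical; in practice the boundaries would come from sampling the input rather than being hard-coded.

```java
import java.util.Arrays;

final class RangePartitionSketch {
    // splitPoints must be sorted. Keys below splitPoints[0] go to reducer 0, keys from
    // splitPoints[i-1] up to (but excluding) splitPoints[i] go to reducer i, and keys
    // at or above the last split point go to the last reducer.
    static int partition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not present.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splits = { "g", "p" }; // two boundaries imply three reducers
        System.out.println(partition("cat", splits));   // prints 0
        System.out.println(partition("house", splits)); // prints 1
        System.out.println(partition("the", splits));   // prints 2
    }
}
```

Because every key in reducer 0 sorts before every key in reducer 1, and so on, concatenating the reducer outputs in order gives a globally sorted result.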
Hadoop Ecosystem
 Pig
 Apache Pig is a platform for analyzing large data sets. Pig's
language, Pig Latin, lets you specify a sequence of data
transformations such as merging data sets, filtering them, and
applying functions to records or groups of records.
 Procedural language, lazy evaluated, pipeline split support
 Closer to developers (or relational algebra aficionados) than
not
Hadoop Ecosystem
 Hive
 Access to Hadoop clusters for non-developers
 Data analysts, data scientists, statisticians, SDMs etc.
 Subset of SQL-92 plus Hive extensions
 INSERT OVERWRITE, no UPDATE or DELETE
 No transactions
 No indexes, parallel scanning
 “Near” real time
 Only equality joins
Hadoop Ecosystem
 Mahout
 Collaborative Filtering
 User and Item based recommenders
 K-Means, Fuzzy K-Means clustering
 Mean Shift clustering
 Dirichlet process clustering
 Latent Dirichlet Allocation
 Singular value decomposition
 Parallel Frequent Pattern mining
 Complementary Naive Bayes classifier
 Random forest decision tree based classifier
Hadoop ecosystem
 Algorithmic categories:
 Classification
 Clustering
 Pattern mining
 Regression
 Dimension reduction
 Recommendation engines
 Vector similarity
…
Reporting Services
 Pentaho, MicroStrategy and Jasper can all hook up to a Hadoop
cluster
References
 Hadoop: The Definitive Guide, 3rd edition
 hadoop.apache.org
 Hadoop in Practice
 Cloudera custom training slides
Hadoop and big data training

  • 1. Hadoop and Big Data Training Lessons learned 0
  • 2. What’s Cloudera?  Leading company in the NoSQL and cloud computing space  Most popular Hadoop distribution  Ex-es from Google, Facebook, Oracle and other leading tech companies  Sample Bn$ companies client list: eBay,JPMorganChase,Experian,Groupon,MorganStanley,Nokia ,Orbitz,NationalCancerInstitute,RIM,TheWaltDisney Company  Consulting and training services 1
  • 3. Why this training?  MongoDB is great for OLTP  Not an OLAP DB, not really aspiring to become one  Big Data coming in, need for more advanced analysis processes 2
  • 4. Intended audience  Software engineers and friends  3
  • 5.  The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models  Modules:  HadoopCommon  Hadoop Distributed File System (HDFS™)  HadoopYARN  HadoopMapReduce 4 What’s Hadoop?
  • 6. How does it fit in our Big Goal?  MongoDB for OLTP  RDBMS (MySQL) for config data  Hadoop for OLAP 5
  • 7. What’s Map Reduce?  MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. © Wiki  Practically?  Can perform computations in a distributed fashion  Highly scalable  Inherently highly available  By design fault tolerant 6
  • 8. Bindings  Native Java  any language, even scripting ones, using Streaming 7
  • 9. MapReduce framework vs. MapReduce functionality  Several NoSQL technologies provide MR functionality 8
  • 10. MR functionality  Compromise….  i.e. MongoDB  CouchDB select * from foo; ;; 9
  • 11. MapReduce V1 vsMapReduce V2  MR V1 can not scale past 4k nodes per cluster  More important to our goals, MR V1 is monolithic 10
  • 12. MR V2 YARN  Pluggable implementations on top of Hadoop  Whole new set of problems can be solved:  Graph processing  MPI 11
  • 14. MR V1 daemons  client  NameNode (HDFS)  JobTracker  DataNode(HDFS) + TaskTracker 13
  • 16. MR V2 daemons  Client  Resource manager/Application manager  NodeManager  Application Master (resource containers) 15
  • 17. Data Locality in Hadoop  First replica placed in client node (or random if off cluster client)  Second off-rack  Third in same rack as second but different node 16
  • 18. HDFS - Architecture  Hot  Very large files  Streaming data access (seek time ~<1% transfer time)  Commodity hardware (no iphones…)  Not  Low-latency data access  Lots of small files  Multiple writers, arbitrary file modification 17
  • 19. HDFS – NameNode  Namenode Master  Filesystem tree  Metadata for all files and directories  Namespace image and edit log  Secondary Namenode  Not a backup node!  Periodically merges edit log into namespace image  Could take 30 mins to come back online 18
  • 20. HDFS HA - NameNode  2.x Hadoop brings in HDFS HA  Active-standby config for NameNodes  Gotchas:  Shared storage for edit log  Datanodes send block reports to both NameNodes  NameNode needs to be transparent to clients 19
  • 22. HDFS - Read  Client requests file from namenode (for first 10 blocks)  Namenode returns addresses of datanodes  Client contacts directly datanodes  Blocks are read in order 21
  • 24. HDFS - Write  RPC initial call to create the file  Permissions/file exists checks in NameNode etc  As we write data, data queue in client which asks the NameNode for datanode to store data  List of datanodes form a pipeline  ack queue to verify all replicas have been written  Close file 23
  • 25. Job Configuration  setInputFormatClass  setOutputFormatClass  setMapperClass  setReducerClass  set(Map)OutputKeyClass  set(Map)OutputValueClass  setNumReduceTasks
  • 26. Job Configuration Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); OR job.submit();
  • 27. Job Configuration  Simple to invoke:  bin/hadoop jar WordCount inputPath outputPath
  • 29. Mapper – Life cycle  Mapper inputs <K1,V1>, outputs <K2,V2>
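
To make the <K1,V1> to <K2,V2> contract concrete, here is a minimal plain-Java sketch of a word-count style map step. It deliberately does not use the Hadoop `Mapper` class; `MapperSketch` and its `map` method are hypothetical names, with K1 assumed to be the byte offset of the line, V1 the line text, K2 a word and V2 the count 1.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapperSketch {
    // Takes one <offset, line> record and emits one <word, 1> pair per word.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) { // a blank line produces no output
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "the cat the"));
    }
}
```

Note that the input key (the offset) is ignored here, which is common for text input: the mapper is free to choose a completely different output key type.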
  • 30. Shuffle and Sort  All same keys are guaranteed to end up in the same reducer, sorted by key  Mapper output <K2,V2>: <‘the’,1>, <‘the’,2>, <‘cat’,1>  Reducer input <K2,[V2]>: <‘cat’,[1]>, <‘the’,[1,2]>
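
The grouping on this slide can be reproduced in memory with a sorted map. This is a sketch of the idea only, assuming a single reducer; `ShuffleSketch` and `shuffle` are hypothetical names, and the real framework does this across the network with spills and merges.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    // Groups <K2,V2> pairs into <K2,[V2]>, with keys kept in sorted order by the TreeMap.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("the", 1));
        mapOutput.add(new SimpleEntry<>("the", 2));
        mapOutput.add(new SimpleEntry<>("cat", 1));
        System.out.println(shuffle(mapOutput)); // prints {cat=[1], the=[1, 2]}
    }
}
```

The sketch keeps values in arrival order; in Hadoop only the key order is guaranteed, so reducers should not rely on the order of values within a key.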
  • 31. Reducer – Life cycle  Reducer inputs <K2,[V2]>, outputs <K3,V3>
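
A reduce step for the word-count case can be sketched as a function from one key and its list of values to a single result. As before this is plain Java rather than the Hadoop `Reducer` class, and `ReducerSketch` is a hypothetical name; K3 is assumed to be the same word as K2 and V3 the summed count.

```java
import java.util.Arrays;
import java.util.List;

class ReducerSketch {
    // Receives <K2,[V2]> and returns V3, the sum of the values for that key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("the -> " + reduce("the", Arrays.asList(1, 2)));
    }
}
```

Because addition is associative and commutative, the same function could also serve as a combiner on the map side.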
  • 32. Hadoop interfaces and classes  >=0.23 new API favoring abstract classes  <0.23 old API with interfaces  Packages: mapred.* is the old API, mapreduce.* is the new API
  • 33. Speculative execution  At least one minute into a mapper or reducer, the JobTracker will decide based on the progress of a task  Threshold of each task's progress compared to average progress (configurable)  Relaunch the task on a different node and have them race..  Sometimes not wanted  Cluster utilization  Non-idempotent partial output (OutputCollector)
  • 34. Input Output Formats  InputFormat<K,V> -> FileInputFormat<K,V> -> TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat  Default TextInputFormat: key=byte offset, value=line  KeyValueTextInputFormat (key \t value)  SequenceFileInputFormat: binary splittable format  Corresponding output formats
  • 35. Compression  The billion files problem  300 B/file * 10^9 files = 300 GB of NameNode RAM  Big Data storage  Solutions:  Containers  Compression
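
The arithmetic behind the billion files problem is worth spelling out. The sketch below just multiplies the slide's figures; the 300 bytes of metadata per file is the deck's own estimate, and `NameNodeMemoryEstimate` is a hypothetical helper name.

```java
class NameNodeMemoryEstimate {
    // Bytes of metadata the NameNode would hold in memory for the given number of files.
    static long estimateBytes(long files, long bytesPerFile) {
        return Math.multiplyExact(files, bytesPerFile); // throws on overflow instead of wrapping
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1_000_000_000L, 300L);
        System.out.println(bytes / 1_000_000_000L + " GB of metadata in RAM");
    }
}
```

Packing many small files into one container file shrinks the first factor, which is why containers appear among the solutions.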
  • 36. Containers  HAR (splittable)  Sequence files, RC files, Avro files (splittable, compressible)
  • 37. Compression codecs  LZO, LZ4 and Snappy codecs are the best value for money in compression speed  bzip2 offers native splitting but can be slow
  • 38. Long story short  Compression + sequence files  Compression that supports splitting  Split file into chunks in application layer with chunk size aligned to HDFS block size  Don’t bother
  • 39. Partitioner  Default is HashPartitioner  Why implement our own partitioner?  Sample case: Total ordering  1 reducer  Multiple reducers?
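
The default hash partitioning boils down to one expression: mask off the sign bit of the key's hash code and take it modulo the number of reducers. The plain-Java sketch below mirrors that formula without depending on Hadoop; `HashPartitionSketch` is a hypothetical name.

```java
class HashPartitionSketch {
    // Masking with Integer.MAX_VALUE keeps the result non-negative even for negative hash codes.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"the", "cat", "the"}) {
            System.out.println(key + " -> reducer " + getPartition(key, 4));
        }
    }
}
```

Equal keys always land on the same reducer, but neighbouring keys are scattered, so concatenating the reducer outputs does not give a globally sorted result. That is the motivation for the total ordering case mentioned on the slide.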
  • 40. Partitioner  TotalOrderPartitioner  Sample input to determine number of reducers for maximum performance
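
The idea behind total-order partitioning is range partitioning: pick sorted split points (typically from a sample of the input) and send each key to the range it falls in, so reducer i only sees keys smaller than those of reducer i+1. The sketch below illustrates that lookup in plain Java; it is not the Hadoop `TotalOrderPartitioner` code, and `RangePartitionSketch` and the split points "h" and "p" are assumptions made for the example.

```java
import java.util.Arrays;

class RangePartitionSketch {
    // splitPoints must be sorted; n split points define n + 1 partitions.
    // Keys below splitPoints[0] go to partition 0, keys >= splitPoints[i] go past it.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not found.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "p"}; // hypothetical boundaries taken from a sample
        for (String key : new String[] {"cat", "mouse", "the"}) {
            System.out.println(key + " -> reducer " + getPartition(key, splitPoints));
        }
    }
}
```

How evenly the work spreads across reducers depends on how well the sampled split points reflect the real key distribution.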
  • 41. Hadoop Ecosystem  Pig  Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records.  Procedural language, lazily evaluated, pipeline split support  Closer to developers (or relational algebra aficionados) than not
  • 42. Hadoop Ecosystem  Hive  Access to Hadoop clusters for non-developers  Data analysts, data scientists, statisticians, SDMs etc.  Subset of SQL-92 plus Hive extensions  Insert overwrite, no update or delete  No transactions  No indexes, parallel scanning  “Near” real time  Only equality joins
  • 43. Hadoop Ecosystem  Mahout  Collaborative filtering  User and item based recommenders  K-Means, Fuzzy K-Means clustering  Mean Shift clustering  Dirichlet process clustering  Latent Dirichlet Allocation  Singular value decomposition  Parallel frequent pattern mining  Complementary Naive Bayes classifier  Random forest decision tree based classifier
  • 44. Hadoop ecosystem  Algorithmic categories:  Classification  Clustering  Pattern mining  Regression  Dimension reduction  Recommendation engines  Vector similarity …
  • 45. Reporting Services  Pentaho, MicroStrategy and Jasper can all hook up to a Hadoop cluster
  • 46. References  Hadoop: The Definitive Guide, 3rd edition  hadoop.apache.org  Hadoop in Practice  Cloudera custom training slides

Editor's Notes

  1. Combiners are invoked by design in MongoDB
  2. 1 reducer is the default config