Hadoop Fundamentals
Satish Mittal
InMobi
Why Hadoop?
Big Data
• Sources: Server logs, clickstream, machine, sensor, social…
• Use-cases: batch/interactive/real-time
What is needed from a Distributed Platform?
Scalable
o Petabytes of data
Economical
o Use commodity hardware
o Share clusters among many applications
Reliable
o Failure is common when you run thousands of machines. Handle it well in the software layer.
Simple programming model
o Applications must be simple to write and maintain
What is Hadoop?
Hadoop is a petabyte-scale infrastructure for distributed data storage and data processing
• Based on Google's GFS and MapReduce papers
• Contributed mostly by Yahoo! in the initial years; it now has a more widespread developer and user base
• 1000s of nodes, PBs of data in storage
Hadoop Basics
• Cheap JBODs for storage
• Move processing to where the data is
o Location awareness (topology)
• Assume hardware failures to be the norm
• Map & Reduce primitives are fairly simple yet powerful
o Most set operations can be performed using these primitives
• Isolation
Hadoop Distributed File System (HDFS)
HDFS
Goals:
• Fault-tolerant, scalable, distributed storage system
• Designed to reliably store very large files across machines in a large cluster
Assumptions:
• Files are written once and read several times
• Applications perform large sequential streaming reads
• Not a Unix-like, POSIX file system
• Access via command line or Java API
HDFS – Data Model
• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• The filesystem keeps checksums of data for corruption detection and recovery
• HDFS exposes block placement so that computation can be moved to the data
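A minimal sketch of the block model described above, written in Python rather than Hadoop's Java. The 128-byte block size, the node names, and the round-robin placement are assumptions made for illustration only; HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks; only the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Toy round-robin placement (an assumption), not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement


# Hypothetical 300-byte file and four hypothetical datanodes.
data = bytes(300)
blocks = split_into_blocks(data, block_size=128)
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
```

Because the placement map is exposed, a scheduler can look up which nodes hold a block and run the computation there.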
HDFS - Architecture
Namenode
• The Namenode is a single point of failure (SPOF); HA for the Namenode is now available in 2.0 Alpha
• Responsible for managing the list of all active datanodes and the FS name system (files, directories, blocks and their locations)
• Block placement policy
• Ensuring adequate replicas
• Writing edit logs durably
Datanode
• Service that allows data to be streamed in and out
• The block is the unit of data that a datanode understands
• Sends block reports to the Namenode periodically
• Checksum checks and disk usage stats are managed by the datanode
• Clients talk to datanodes for the actual data
• Datanode failures can be tolerated, albeit at lower performance, as long as at least one datanode holding each file block remains available
HDFS – Write pipeline
[Diagram: a DFS Client, the Namenode, and Data nodes 1, 2 and 3 spread across Rack 1 and Rack 2. Labeled steps: Create file, get Block Loc (1); DN 1, 2 & 3 (2); Stream file (5); Ack (5a), Ack (4a), Ack (3a); Complete file (3b).]
HDFS – Block placement
• Default is 3 replicas, but configurable
• Blocks are placed as follows (writes are pipelined):
o First replica on the same node as the writer
o Second replica on a node in a different rack
o Third replica on another node in that other rack
• Clients read from the closest replica
• If the replication for a block drops below the target, the block is automatically re-replicated.
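The placement rule above can be sketched as follows. This is an illustrative Python model, not Hadoop code: the topology dictionary, node names, and function names are hypothetical, and the choice of node is deterministic here where HDFS chooses among eligible nodes at random. It assumes the third replica goes to a second node on the same remote rack as the second replica.

```python
def choose_replicas(topology: dict[str, list[str]], writer: str) -> list[str]:
    """Pick three replica targets: writer's node, then two nodes on one other rack."""
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Candidate remote racks must be able to hold two replicas on distinct nodes.
    remote_racks = [r for r in sorted(topology)
                    if r != local_rack and len(topology[r]) >= 2]
    if not remote_racks:
        raise ValueError("sketch assumes a second rack with at least two datanodes")
    second, third = sorted(topology[remote_racks[0]])[:2]
    return [writer, second, third]


def under_replicated(block_locations: dict[str, list[str]],
                     live_nodes: set[str], target: int = 3) -> list[str]:
    """Return blocks whose live replica count has dropped below the target."""
    return sorted(block for block, nodes in block_locations.items()
                  if sum(1 for n in nodes if n in live_nodes) < target)


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
targets = choose_replicas(topology, writer="dn1")

# After dn1 fails, block b1 has only two live replicas and needs re-replication.
locations = {"b1": ["dn1", "dn3", "dn4"], "b2": ["dn2", "dn3", "dn4"]}
needs_copy = under_replicated(locations, live_nodes={"dn2", "dn3", "dn4"})
```

Spreading replicas over two racks survives the loss of a whole rack while keeping two of the three copies one rack-local hop apart.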
HDFS – Data correctness
• Data is checked with CRC32
• File creation
‣ The client computes a checksum per block
‣ The DataNode stores the checksum
• File access
‣ The client retrieves the data and checksum from a DataNode
‣ If validation fails, the client tries other replicas
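A small sketch of the checksum flow above, using Python's standard `zlib.crc32`. The function names and the in-memory "replicas" are hypothetical stand-ins for the client and DataNodes; this is not how the HDFS client is implemented.

```python
import zlib


def write_block(data: bytes) -> tuple[bytes, int]:
    """File creation: the client computes a CRC32 that is stored with the data."""
    return data, zlib.crc32(data)


def read_block(replicas: list[tuple[bytes, int]]) -> bytes:
    """File access: return data from the first replica whose checksum validates."""
    for data, stored_crc in replicas:
        if zlib.crc32(data) == stored_crc:
            return data
        # Validation failed: fall through and try the next replica.
    raise IOError("all replicas failed checksum validation")


good = write_block(b"hello hdfs")
corrupt = (b"hellO hdfs", good[1])  # simulated corruption: data changed, checksum did not
result = read_block([corrupt, good])
```

The read succeeds as long as one replica is intact, which is why checksums and replication work together.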
Interacting with HDFS
Simple commands
• hadoop fs -ls, -du, -rm, -rmr, -chown, -chmod
Uploading files
• hadoop fs -put foo mydata/foo
• cat ReallyBigFile | hadoop fs -put - mydata/ReallyBigFile
Downloading files
• hadoop fs -get mydata/foo foo
• hadoop fs -cat mydata/ReallyBigFile | grep "the answer is"
• hadoop fs -cat mydata/foo
Admin
• hadoop dfsadmin -report
• hadoop fsck /
Map-Reduce
Map-Reduce Application
Say we have 100s of machines available to us. How do we write applications on them?
As an example, consider the problem of creating an index for search.
‣ Input: Hundreds of documents
‣ Output: A mapping of word to document IDs
‣ Resources: A few machines
The problem: Inverted Index
Documents:
Farmer1 has the following animals: bees, cows, goats.
Some other animals …
Index:
Animals: 1, 2, 3, 4, 12
Bees: 1, 2, 23, 34
Dog: 3, 9
Farmer1: 1, 7
…
Building an inverted index
Map output (Machine1, Machine2, Machine3):
Animals: 1, 3
Dog: 3
Animals: 2, 12
Bees: 23
Dog: 9
Farmer1: 7
Reduce input:
Machine4
Animals: 1, 3
Animals: 2, 12
Bees: 23
Machine5
Dog: 3
Dog: 9
Farmer1: 7
Reduce output:
Machine4
Animals: 1, 2, 3, 12
Bees: 23
Machine5
Dog: 3, 9
Farmer1: 7
This is Map-Reduce
In our example
‣ Map: (doc-num, text) ➝ [(word, doc-num)]
‣ Reduce: (word, [doc1, doc3, ...]) ➝ [(word, "doc1, doc3, …")]
General form:
‣ Two functions: Map and Reduce
‣ Operate on key and value pairs
‣ Map: (K1, V1) ➝ list(K2, V2)
‣ Reduce: (K2, list(V2)) ➝ (K3, V3)
‣ Primitives present in Lisp and other functional languages
Same principle extended to distributed computing
‣ Map and Reduce tasks run on distributed sets of machines
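The inverted-index example can be sketched on a single machine as below. This is a Python illustration of the Map and Reduce signatures, not Hadoop's Java API; the three documents and the helper `map_reduce` are hypothetical, and the grouping step stands in for what the framework does between the two phases.

```python
from collections import defaultdict


def map_fn(doc_num, text):
    """Map: (doc-num, text) -> [(word, doc-num)], one pair per distinct word."""
    words = {w.strip(".,:;").lower() for w in text.split()}
    return [(w, doc_num) for w in sorted(words) if w]


def reduce_fn(word, doc_nums):
    """Reduce: (word, [doc-nums]) -> (word, sorted unique doc-nums)."""
    return word, sorted(set(doc_nums))


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map over every input, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical documents, loosely modelled on the farm example.
docs = [(1, "Farmer1 has bees, cows."), (2, "Bees and goats."), (3, "A dog.")]
index = map_reduce(docs, map_fn, reduce_fn)
```

Neither function knows how many machines exist, which is what lets a framework spread the same two functions over a cluster.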
Map-Reduce Framework
Abstracts functionality common to all Map/Reduce applications
‣ Distributes tasks to multiple machines
‣ Sorts, transfers and merges intermediate data from all machines from the Map phase to the Reduce phase
‣ Monitors task progress
‣ Handles faulty machines and faulty tasks transparently
Provides pluggable APIs and configuration mechanisms for writing applications
‣ Map and Reduce functions
‣ Input formats and splits
‣ Number of tasks, data types, etc…
Provides status about jobs to users
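The sort, transfer and merge step can be illustrated with a toy shuffle. This Python sketch is an assumption-laden model: it partitions keys with `zlib.crc32` purely to get a stable hash, and the function names are hypothetical. The map outputs reuse the values from the inverted-index slide.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Stable hash partitioner: the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route (key, value) pairs from all map tasks to reducers, sorted by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for task_output in map_outputs:
        for key, value in task_output:
            buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, each with a merged value list.
    return [sorted(bucket.items()) for bucket in buckets]


map_outputs = [
    [("Animals", 1), ("Animals", 3), ("Dog", 3)],
    [("Animals", 2), ("Animals", 12), ("Bees", 23)],
    [("Dog", 9), ("Farmer1", 7)],
]
reducer_inputs = shuffle(map_outputs, num_reducers=2)
```

Because partitioning depends only on the key, all values for one word meet at a single reducer regardless of which map task produced them.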
MR – Architecture
[Diagram: Job Client, Job Tracker, Task Trackers, DFS Clients and HDFS. Labeled interactions: Submit, Heartbeat, Task Assignment, Progress, Shuffle.]
Map-Reduce
• All user code runs in an isolated JVM
• The client computes the splits
• The JT (JobTracker) just schedules these splits (one mapper per split)
• Mapper, Reducer, Partitioner, Combiner and any custom Input/OutputFormat run in the user JVM
• Idempotence
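A rough sketch of two of the points above: splits computed up front with one map task per split, and tasks that can safely be re-run after a failure. The fixed split size, the retry limit, and the simulated failure are assumptions for illustration; real input formats and schedulers apply further rules.

```python
def compute_splits(file_size: int, split_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs covering the file; one map task per pair."""
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]


def run_task(task, split, max_attempts: int = 3):
    """Re-run a failed task; safe only if the task is idempotent."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(split, attempt)
        except RuntimeError:
            continue  # simulated node or task failure: schedule another attempt
    raise RuntimeError(f"split {split} failed after {max_attempts} attempts")


def flaky_mapper(split, attempt):
    """Hypothetical mapper: counts the bytes in its split, failing once on one split."""
    offset, length = split
    if offset == 128 and attempt == 1:
        raise RuntimeError("simulated failure")
    return length


splits = compute_splits(file_size=300, split_size=128)
results = [run_task(flaky_mapper, s) for s in splits]
```

The retried attempt returns the same value as a first-time success would, which is the property that makes transparent re-execution possible.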
Hadoop HDFS + MR cluster
[Diagram: machines with Datanodes (D) and Tasktrackers (T), the JobTracker, the Namenode and a Client. Labeled interactions: Submit Job, Get Block Locations, HTTP Monitoring UI.]
WordCount: Hello World of Hadoop
• Input: A bunch of large text files
• Desired Output: Frequencies of Words
Hadoop – Two services in one
Word Count Example
Mapper
‣ Input: value: lines of text of input
‣ Output: key: word, value: 1
Reducer
‣ Input: key: word, value: set of counts
‣ Output: key: word, value: sum
Launching program
‣ Defines the job
‣ Submits job to cluster
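The three pieces above can be sketched in a few lines of Python. This runs in a single process and only mirrors the roles of mapper, reducer and launching program; it does not use Hadoop's Java API, and `run_job` and the sample lines are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Mapper: for each word in a line of text, emit (word, 1)."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Launching program: define the job, run map, group by key, run reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(w, c) for w, c in sorted(grouped.items()))


counts = run_job(["hello hadoop", "hello world of hadoop"])
```

On a cluster, the grouping inside `run_job` is what the framework's shuffle performs between the map and reduce phases.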
Questions ?
Thank You!
mailto: satish.mittal@inmobi.com
