Hadoop
                          A Hands-on Introduction

                          Claudio Martella
                          Elia Bruni


                                9 November 2011




Tuesday, November 8, 11
Outline
                    • What is Hadoop
                    • Why is Hadoop
                    • How is Hadoop
                    • Hadoop & Python
                    • Some NLP code
                    • A more complicated problem: Eva
A bit of Context
                    • 2003: first MapReduce library @ Google
                    • 2003: GFS paper
                    • 2004: MapReduce paper
                    • 2005: Apache Nutch uses MapReduce
                    • 2006: Hadoop was born
                    • 2007: first 1000-node cluster at Yahoo!
An Ecosystem
            • HDFS & MapReduce
            • Zookeeper
            • HBase
            • Pig & Hive
            • Mahout
            • Giraph
            • Nutch
Traditional way
                    • Design a high-level Schema
                    • You store data in an RDBMS
                    • Which has very poor write throughput
                    • And doesn’t scale very much
                    • When you talk about terabytes of data
                    • Expensive Data Warehouse
BigData & NoSQL

                    • Store first, think later
                    • Schema-less storage
                    • Analytics
                    • Petabyte scale
                    • Offline processing
Vertical Scalability

                    • Extremely expensive
                    • Requires expertise in distributed systems
                          and concurrent programming
                    • Lacks real fault-tolerance

Horizontal Scalability

                    • Built on top of commodity hardware
                    • Easy to use programming paradigms
                    • Fault-tolerance through replication


1st Assumptions
                    • Data to process does not fit on one node.
                    • Each node is commodity hardware.
                    • Failure happens.
                   Spread your data among your nodes
                            and replicate it.

2nd Assumptions
                    • Moving computation is cheap.
                    • Moving data is expensive.
                    • Distributed computing is hard.
                          Move computation to data,
                            with a simple paradigm.

3rd Assumptions
                    • Systems run on spinning hard disks.
                    • Disk seek >> disk scan.
                    • Many small files are expensive.

          Base the paradigm on scanning large files.
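
A back-of-envelope calculation shows why seeks dominate. The disk figures (10 ms per seek, 100 MB/s sequential transfer) and the dataset shape (10^10 records of 100 bytes each) are assumptions chosen for illustration, not numbers from the slides.

```python
# Back-of-envelope: random access vs. sequential scan on a spinning disk.
# All figures below are illustrative assumptions, not measurements.
SEEK_SECONDS = 0.010             # assumed average seek + rotational latency
TRANSFER_BYTES_PER_SEC = 100e6   # assumed sequential throughput (100 MB/s)
RECORD_BYTES = 100               # assumed record size
NUM_RECORDS = 10**10             # 1 TB of 100-byte records


def scan_seconds(num_records, record_bytes):
    """Time to read every record sequentially, paying no seeks."""
    return num_records * record_bytes / TRANSFER_BYTES_PER_SEC


def random_read_seconds(num_records, record_bytes):
    """Time to read records one by one, paying one seek per record."""
    per_record = SEEK_SECONDS + record_bytes / TRANSFER_BYTES_PER_SEC
    return num_records * per_record


full_scan = scan_seconds(NUM_RECORDS, RECORD_BYTES)
one_percent_random = random_read_seconds(NUM_RECORDS // 100, RECORD_BYTES)

print(f"full scan:          {full_scan / 3600:.1f} hours")
print(f"1% random accesses: {one_percent_random / 86400:.1f} days")
```

Under these assumptions, touching only 1% of the records by random access takes far longer than scanning the whole file, which is the motivation for a scan-based paradigm.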


Typical Problem

                    • Collect and iterate over many records
                    • Filter and extract something from each
                    • Shuffle & sort these intermediate results
                    • Group-by and aggregate them
                    • Produce final output set
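
The five steps above can be sketched on a single machine in plain Python. The visit records and the choice to count HTML page views are hypothetical; the point is only the shape of the pipeline.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: (user, page) visit records.
records = [
    ("frank", "index.html"),
    ("alice", "about.html"),
    ("frank", "about.html"),
    ("bob", "index.html"),
    ("alice", "favicon.ico"),
]

# 1. Collect and iterate over many records.
# 2. Filter and extract something from each: keep HTML pages, emit (page, 1).
intermediate = [(page, 1) for _user, page in records if page.endswith(".html")]

# 3. Shuffle & sort the intermediate results by key.
intermediate.sort(key=itemgetter(0))

# 4. Group by key and aggregate the values.
# 5. Produce the final output set.
output = {
    page: sum(count for _page, count in group)
    for page, group in groupby(intermediate, key=itemgetter(0))
}

print(output)  # {'about.html': 2, 'index.html': 2}
```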
Typical Problem

                    • Collect and iterate over many records (MAP)
                    • Filter and extract something from each (MAP)
                    • Shuffle & sort these intermediate results (REDUCE)
                    • Group-by and aggregate them (REDUCE)
                    • Produce final output set (REDUCE)

Quick example
          127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"




                    •     (frank, index.html)

                    •     (index.html, 10/Oct/2000)

                    •     (index.html, http://www.example.com/start.html)
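
A minimal sketch of how the three pairs above could be extracted from the log line. The regular expression and the function name `extract_pairs` are illustrative assumptions; the pattern targets the Apache Combined Log Format and is not meant to cover every malformed line.

```python
import re

# Regex for a Combined Log Format line like the one on the slide.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) /?(?P<page>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def extract_pairs(line):
    """Return the (key, value) pairs a mapper might emit for one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # skip lines that do not parse
    user = match.group("user")
    page = match.group("page")
    return [
        (user, page),                     # who visited which page
        (page, match.group("day")),       # when the page was visited
        (page, match.group("referrer")),  # where the visitor came from
    ]


LINE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

pairs = extract_pairs(LINE)
for pair in pairs:
    print(pair)
```

Each pair answers a different question (pages per user, visits per day, referrers per page), so a real job would usually emit only the one it needs.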


MapReduce
                    • Programmers define two functions:
                          ★   map (key, value) → (key’, value’)*
                          ★   reduce (key’, [value’+]) → (key”, value”)*

                    • Can also define:
                          ★   combine (key, value) → (key’, value’)*
                          ★   partitioner: k’ → partition
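
The four functions can be illustrated with a word count simulated in memory. This is a sketch, not the Hadoop API: the names `map_fn`, `combine_fn`, `reduce_fn`, `partition_fn` and `run_job` are hypothetical, each input record is treated as its own map task, and the combiner takes a key with its locally grouped values, mirroring the reducer.

```python
from collections import defaultdict


def map_fn(key, value):
    """map (key, value) -> (key', value')*: emit (word, 1) for each word."""
    for word in value.split():
        yield word.lower(), 1


def combine_fn(key, values):
    """Optional local pre-aggregation on the map side, before the shuffle."""
    yield key, sum(values)


def reduce_fn(key, values):
    """reduce (key', [value'+]) -> (key'', value'')*: total count per word."""
    yield key, sum(values)


def partition_fn(key, num_partitions):
    """partitioner: key' -> partition index (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_partitions


def run_job(inputs, num_partitions=2):
    """Run map, combine, partition, sort and reduce in a single process."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in inputs:
        local = defaultdict(list)          # output of one simulated map task
        for k, v in map_fn(key, value):
            local[k].append(v)
        for k, vs in local.items():        # combine before "sending"
            for ck, cv in combine_fn(k, vs):
                partitions[partition_fn(ck, num_partitions)][ck].append(cv)
    results = {}
    for part in partitions:                # one simulated reduce task each
        for k in sorted(part):
            for rk, rv in reduce_fn(k, part[k]):
                results[rk] = rv
    return results


docs = [(0, "the quick brown fox"), (1, "the lazy dog and the fox")]
counts = run_job(docs)
print(counts["the"], counts["fox"])  # 3 2
```

The partitioner guarantees that every value for a given key reaches the same reduce task, which is what makes the per-key aggregation correct.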


Input:             (k1, v1)  (k2, v2)  (k3, v3)  (k4, v4)  (k5, v5)  (k6, v6)

map outputs:       map 1: (a, 1) (b, 2)
                   map 2: (c, 3) (c, 6)
                   map 3: (a, 5) (c, 2)
                   map 4: (b, 7) (c, 9)

Shuffle and Sort: aggregate values by keys
                   a → [1, 5]    b → [2, 7]    c → [2, 3, 6, 9]

reduce outputs:    (r1, s1)  (r2, s2)  (r3, s3)
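
The shuffle-and-sort step of this diagram can be reproduced in a few lines, using the same map outputs. The function name `shuffle_and_sort` is hypothetical, and the whole step runs in one process here, whereas the framework performs it across the network.

```python
from collections import defaultdict

# Output of the four map tasks in the diagram.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 9)],
]


def shuffle_and_sort(outputs_per_mapper):
    """Group all values by key and return the groups in key order."""
    grouped = defaultdict(list)
    for mapper_output in outputs_per_mapper:
        for key, value in mapper_output:
            grouped[key].append(value)
    # Values are sorted here only to match the figure; this sketch makes
    # no claim about the order in which a real reducer sees them.
    return [(key, sorted(grouped[key])) for key in sorted(grouped)]


shuffled = shuffle_and_sort(map_outputs)
print(shuffled)  # [('a', [1, 5]), ('b', [2, 7]), ('c', [2, 3, 6, 9])]
```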



MapReduce daemons
                    • JobTracker: it’s the Master; it schedules
                          jobs, assigns tasks to nodes, collects
                          heartbeats from workers, and reschedules
                          failed tasks for fault tolerance.
                    • TaskTracker: it’s the Worker; it runs on
                          each slave and runs (multiple) Mappers and
                          Reducers, each in its own JVM.


[Figure: MapReduce execution overview.
  (1) The user program forks the master and the workers.
  (2) The master assigns map tasks and reduce tasks to workers.
  (3) Map workers read the input splits (split 0 to split 4).
  (4) Map workers write intermediate files to local disk.
  (5) Reduce workers read the intermediate files remotely.
  (6) Reduce workers write the output files (output file 0, output file 1).
  Stages: input files, map phase, intermediate files (on local disk), reduce phase, output files.]
Redrawn from (Dean and Ghemawat, OSDI 2004)
HDFS daemons
                    • NameNode: it’s the Master; it keeps the
                          filesystem metadata (in memory) and the
                          file-block-node mapping, decides replication
                          and block placement, and collects heartbeats
                          from nodes.
                    • DataNode: it’s the Slave; it stores the
                          blocks (64MB) of the files and serves
                          reads and writes directly.
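
A toy sketch of the file-block-node mapping a NameNode keeps. The 64 MB block size comes from the slide; the replication factor of 3, the round-robin placement, the function `place_blocks` and the node names are assumptions for illustration only. Real HDFS placement also takes racks and node load into account.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size quoted on the slide
REPLICATION = 3                 # assumed replication factor


def place_blocks(file_size, datanodes, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using round-robin placement."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    num_blocks = math.ceil(file_size / BLOCK_SIZE)
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes, wrapping around, so replicas are distinct.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]              # hypothetical DataNode names
block_map = place_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file
for block, replicas in block_map.items():
    print(block, replicas)
```

With this mapping, losing any single node still leaves two copies of every block, which is the sense in which replication provides fault tolerance.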

[Figure: GFS architecture.
  The application's GFS client sends (file name, chunk index) to the GFS master.
  The GFS master holds the file namespace (e.g. /foo/bar → chunk 2ef0) and replies with (chunk handle, chunk location).
  The client sends (chunk handle, byte range) to a GFS chunkserver and receives chunk data.
  The master sends instructions to the chunkservers and receives chunkserver state.
  Each GFS chunkserver stores its chunks on a Linux file system.]
Redrawn from (Ghemawat et al., SOSP 2003)
Transparent to the programmer

                    • Workers to data assignment
                    • Map / Reduce assignment to nodes
                    • Management of synchronization
                    • Management of communication
                    • Fault-tolerance and restarts
Take home recipe

                    • Scan-based computation (no random I/O)
                    • Big datasets
                    • Divide-and-conquer class algorithms
                    • No communication between tasks

Not good for

                    • Real-time / Stream processing
                    • Graph processing
                    • Computation without locality
                    • Small datasets

Questions?



Baseline solution




What we attacked

                    • You don’t want to parse the file many times
                    • You don’t want to re-calculate the norm
                    • You don’t want to calculate 0*n

Our solution
                    dense row                       non-zero values
                    0   1.3 0   0   7.1 1.1    →    1.3  7.1  1.1
                    1.2 0   0   0   0   3.4    →    1.2  3.4
                    0   5.7 0   0   1.1 2      →    5.7  1.1  2
                    5.1 0   0   4.6 0   10     →    5.1  4.6  10
                    0   0   0   1.6 0   0      →    1.6


                      line format: <string><norm>[<col><value>]*
                        for example: cat 12.1313 0 5.1 3 4.6 5 10
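
A minimal sketch of reading this line format and reusing the stored norm. It assumes whitespace-separated fields and that the task is cosine similarity between rows, which the mentions of the norm and of 0*n suggest but the slides do not state. The names `parse_line` and `cosine` and the second vector are hypothetical.

```python
import math


def parse_line(line):
    """Parse '<string> <norm> [<col> <value>]*' into (word, norm, {col: value})."""
    fields = line.split()
    word, norm = fields[0], float(fields[1])
    cols = fields[2::2]      # every other field, starting at the first column
    values = fields[3::2]    # the value that follows each column index
    vector = {int(c): float(v) for c, v in zip(cols, values)}
    return word, norm, vector


def cosine(norm_a, vec_a, norm_b, vec_b):
    """Cosine similarity from precomputed norms, skipping all 0*n products."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a   # iterate over the shorter vector
    dot = sum(value * vec_b[col] for col, value in vec_a.items() if col in vec_b)
    return dot / (norm_a * norm_b)


word, norm, vec = parse_line("cat 12.1313 0 5.1 3 4.6 5 10")
print(word, norm, vec)  # cat 12.1313 {0: 5.1, 3: 4.6, 5: 10.0}

# Hypothetical second row; its norm is computed once and then reused.
other = {1: 5.7, 4: 1.1, 5: 2.0}
other_norm = math.sqrt(sum(v * v for v in other.values()))
similarity = cosine(norm, vec, other_norm, other)
print(round(similarity, 3))
```

Storing the norm on the line means each file is parsed once and no norm is recomputed per comparison, and the sparse layout means only columns present in both rows are multiplied.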
Benchmarking

                    • serial python (single-core): 7 minutes
                    • java+hadoop (single-core): 2 minutes
                    • serial python (big file): 18 days
                    • java+hadoop (parallel, big file): 8 hours
                    • it makes sense: 18 d / 3.5 ≈ 5.14 d, and 5.14 d / 14 ≈ 8.8 h, close to the measured 8 h (3.5 is the 7 min / 2 min single-core speedup)
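
The sanity check in the last bullet can be reproduced step by step. The timings are taken from the slide; the slide does not say what the divisor 14 stands for (presumably the degree of parallelism), so the variable name `parallel_factor` is an assumption.

```python
serial_small_min = 7    # serial Python, small file (from the slide)
hadoop_small_min = 2    # Java + Hadoop on one core, small file (from the slide)
serial_big_days = 18    # serial Python, big file (from the slide)
parallel_factor = 14    # divisor used on the slide; its meaning is assumed

speedup = serial_small_min / hadoop_small_min             # single-core speedup
single_core_days = serial_big_days / speedup              # big file, one core
parallel_hours = single_core_days / parallel_factor * 24  # big file, parallel

print(f"{speedup:.1f}x, {single_core_days:.2f} days, {parallel_hours:.1f} hours")
# 3.5x, 5.14 days, 8.8 hours
```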