Hadoop
                          A Hands-on Introduction

                          Claudio Martella
                          Elia Bruni


                                9 November 2011




Tuesday, November 8, 11
Outline
                    • What is Hadoop
                    • Why is Hadoop
                    • How is Hadoop
                    • Hadoop & Python
                    • Some NLP code
                    • A more complicated problem: Eva
A bit of Context
                    • 2003: first MapReduce library @ Google
                    • 2003: GFS paper
                    • 2004: MapReduce paper
                    • 2005: Apache Nutch uses MapReduce
                    • 2006: Hadoop was born
                    • 2007: first 1000-node cluster at Yahoo!
An Ecosystem
            • HDFS & MapReduce
            • Zookeeper
            • HBase
            • Pig & Hive
            • Mahout
            • Giraph
            • Nutch
Traditional way
                    • Design a high-level Schema
                    • You store data in an RDBMS
                    • Which has very poor write throughput
                    • And doesn’t scale well
                    • Once you reach terabytes of data
                    • Expensive Data Warehouse
BigData & NoSQL

                    • Store first, think later
                    • Schema-less storage
                    • Analytics
                    • Petabyte scale
                    • Offline processing
Vertical Scalability

                    • Extremely expensive
                    • Requires expertise in distributed systems
                          and concurrent programming
                    • Lacks real fault tolerance

Horizontal Scalability

                    • Built on top of commodity hardware
                    • Easy to use programming paradigms
                    • Fault-tolerance through replication


1st Assumptions
                    • Data to process does not fit on one node.
                    • Each node is commodity hardware.
                    • Failure happens.
                   Spread your data among your nodes
                            and replicate it.

2nd Assumptions
                    • Moving computation is cheap.
                    • Moving data is expensive.
                    • Distributed computing is hard.
                          Move computation to data,
                           with a simple paradigm.

3rd Assumptions
                    • Systems run on spinning hard disks.
                    • Disk seeks cost far more than sequential scans (seek >> scan).
                    • Many small files are expensive.

          Base the paradigm on scanning large files.
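
A rough back-of-the-envelope sketch of why scanning large files wins. The seek time and transfer rate below are assumed, illustrative figures for a commodity spinning disk; they are not taken from the slides.

```python
# Assumed, illustrative figures (not from the slides).
SEEK_TIME_S = 0.010            # about 10 ms per disk seek
TRANSFER_RATE_BPS = 100e6      # about 100 MB/s sequential transfer

def read_time_seconds(total_bytes, chunk_bytes):
    """Time to read total_bytes when each chunk costs one seek plus its transfer."""
    chunks = -(-total_bytes // chunk_bytes)   # ceiling division
    return chunks * SEEK_TIME_S + total_bytes / TRANSFER_RATE_BPS

ONE_GB = 10**9
small_chunks = read_time_seconds(ONE_GB, 4 * 1024)          # many 4 KB reads
large_chunks = read_time_seconds(ONE_GB, 64 * 1024 * 1024)  # a few 64 MB reads

print(f"4 KB chunks:  {small_chunks:8.1f} s")
print(f"64 MB chunks: {large_chunks:8.1f} s")
```

Under these assumptions the seeks dominate the small-chunk case, while with large chunks almost all of the time is spent transferring data.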


Typical Problem

                    • Collect and iterate over many records
                    • Filter and extract something from each
                    • Shuffle & sort these intermediate results
                    • Group-by and aggregate them
                    • Produce final output set
Typical Problem

                    • Collect and iterate over many records [MAP]
                    • Filter and extract something from each [MAP]
                    • Shuffle & sort these intermediate results [REDUCE]
                    • Group-by and aggregate them [REDUCE]
                    • Produce final output set [REDUCE]


Quick example
          127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"




                    •     (frank, index.html)

                    •     (index.html, 10/Oct/2000)

                    •     (index.html, http://www.example.com/start.html)
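
A minimal sketch of how a mapper might extract these three pairs from the log line. The regular expression and the function name `extract_pairs` are assumptions made for illustration and are only checked against this one line.

```python
import re

# Pattern for an Apache-style "combined" log line (written for this example only).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

LINE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

def extract_pairs(line):
    """Return (user, page), (page, day) and (page, referrer) for one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []                      # skip lines that do not parse
    page = match.group("path").lstrip("/")
    return [
        (match.group("user"), page),
        (page, match.group("day")),
        (page, match.group("referrer")),
    ]

for pair in extract_pairs(LINE):
    print(pair)
```

Each pair answers a different question once grouped by key: which pages a user visited, on which days a page was hit, and where its visitors came from.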


MapReduce
                    • Programmers define two functions:
                          ★   map (key, value) → (key', value')*
                          ★   reduce (key', [value'+]) → (key'', value'')*

                    • Can also define:
                          ★   combine (key, value) → (key', value')*
                          ★   partitioner: k' → partition
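
The signatures above can be sketched as a tiny in-memory word count. This is a single-process illustration, not Hadoop's API: `run_job` and the toy partitioner are hypothetical names and choices made for this example.

```python
from collections import defaultdict

def map_fn(key, value):
    """map(key, value) -> (key', value')*: emit (word, 1) for each word in a line."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """reduce(key', [value'+]) -> (key'', value'')*: sum the counts of one word."""
    yield key, sum(values)

def partition(key, num_partitions):
    """partitioner: k' -> partition index (toy, deterministic choice)."""
    return sum(map(ord, key)) % num_partitions

def run_job(records, num_reducers=2):
    # Map phase: every input record may emit any number of intermediate pairs.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            # Shuffle: route the pair to one reducer and group its values by key.
            partitions[partition(k2, num_reducers)][k2].append(v2)
    # Reduce phase: each reducer processes its keys in sorted order.
    output = []
    for groups in partitions:
        for k2 in sorted(groups):
            output.extend(reduce_fn(k2, groups[k2]))
    return output

records = [(0, "a b a"), (1, "b c")]
print(dict(run_job(records)))
```

Because summing is associative and commutative, `reduce_fn` could also serve as the combiner here, pre-aggregating counts on the map side before the shuffle.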


        k1 v1    k2 v2    k3 v3    k4 v4    k5 v5    k6 v6

          map            map            map            map

        a 1  b 2       c 3  c 6       a 5  c 2       b 7  c 9

             Shuffle and Sort: aggregate values by keys

           a  1 5          b  2 7          c  2 3 6 9

           reduce          reduce          reduce

           r1 s1           r2 s2           r3 s3
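
The shuffle-and-sort step of the figure can be reproduced with the same intermediate pairs. This is only a sketch: sorting whole (key, value) tuples is a choice made here to match the figure, whereas a real shuffle groups by key and does not generally promise any order among the values.

```python
from itertools import groupby
from operator import itemgetter

# Output of the four map tasks, as drawn in the figure.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 9)],
]

def shuffle_and_sort(outputs):
    """Merge all map outputs, sort them, and group the values by key."""
    pairs = sorted(pair for output in outputs for pair in output)
    return {
        key: [value for _, value in group]
        for key, group in groupby(pairs, key=itemgetter(0))
    }

grouped = shuffle_and_sort(map_outputs)
print(grouped)
```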



MapReduce daemons
                    • JobTracker: the master. It schedules the
                          jobs, assigns tasks to nodes, collects
                          heartbeats from workers, and reschedules
                          tasks for fault tolerance.
                    • TaskTracker: the worker. It runs on
                          each slave and runs (multiple) Mappers and
                          Reducers, each in its own JVM.


Figure: MapReduce execution overview
(input files → map phase → intermediate files on local disk → reduce phase → output files)

                    (1) The user program forks the master and the workers.
                    (2) The master assigns map and reduce tasks to workers.
                    (3) Map workers read the input splits (split 0 to split 4).
                    (4) Map workers write intermediate files locally.
                    (5) Reduce workers read the intermediate files remotely.
                    (6) Reduce workers write the output files (output file 0, output file 1).

Redrawn from (Dean and Ghemawat, OSDI 2004)
HDFS daemons
                    • NameNode: the master. It keeps the
                          filesystem metadata (in memory) and the
                          file-block-node mapping, decides replication
                          and block placement, and collects heartbeats
                          from nodes.
                    • DataNode: the slave. It stores the
                          blocks (64 MB) of the files and serves
                          reads and writes directly.
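
A toy sketch of the file-block-node mapping a NameNode keeps. The 64 MB block size comes from the slide; the replication factor of 3, the round-robin placement, and the node names are assumptions for illustration only and do not reflect the real placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as on the slide
REPLICATION = 3                 # assumed for this sketch; not stated on the slide

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_blocks(path, file_size, datanodes):
    """Build a (file, block index) -> replica nodes table with round-robin placement."""
    table = {}
    for index, _size in enumerate(split_into_blocks(file_size)):
        replicas = [datanodes[(index + r) % len(datanodes)] for r in range(REPLICATION)]
        table[(path, index)] = replicas
    return table

nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
namenode_table = place_blocks("/foo/bar", 200 * 1024 * 1024, nodes)

for (path, index), replicas in namenode_table.items():
    print(path, index, replicas)
```

A 200 MB file ends up as three full blocks plus one 8 MB block, and losing any single node still leaves two replicas of every block.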

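Figure: GFS architecture

                    • The application's GFS client sends (file name, chunk index) to the GFS master.
                    • The GFS master holds the file namespace (e.g. /foo/bar → chunk 2ef0)
                          and replies with (chunk handle, chunk location).
                    • The client sends (chunk handle, byte range) to a GFS chunkserver
                          and receives the chunk data.
                    • The master sends instructions to the chunkservers and receives chunkserver state.
                    • Each GFS chunkserver stores its chunks on a Linux file system.

Redrawn from (Ghemawat et al., SOSP 2003)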

Transparent to the programmer

                    • Assignment of workers to data
                    • Assignment of Map / Reduce tasks to nodes
                    • Management of synchronization
                    • Management of communication
                    • Fault tolerance and restarts
Take home recipe

                    • Scan-based computation (no random I/O)
                    • Big datasets
                    • Divide-and-conquer class algorithms
                    • No communication between tasks

Not good for

                    • Real-time / Stream processing
                    • Graph processing
                    • Computation without locality
                    • Small datasets

Questions?



Baseline solution




What we attacked

                    • You don’t want to parse the file many times
                    • You don’t want to re-calculate the norm
                    • You don’t want to calculate 0*n

Our solution
                    dense rows                      non-zero values only

                    0    1.3  0    0    7.1  1.1   →   1.3  7.1  1.1
                    1.2  0    0    0    0    3.4   →   1.2  3.4
                    0    5.7  0    0    1.1  2     →   5.7  1.1  2
                    5.1  0    0    4.6  0    10    →   5.1  4.6  10
                    0    0    0    1.6  0    0     →   1.6


                      line format: <string> <norm> [<col> <value>]*
                        for example: cat 12.1313 0 5.1 3 4.6 5 10
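
A minimal sketch of reading this line format. It assumes the stored norm is the Euclidean norm (which matches the example, since sqrt(5.1² + 4.6² + 10²) ≈ 12.1313) and that the end goal is cosine similarity, which the slides do not state explicitly. The second word, `dog`, is a hypothetical entry built from the last row of the matrix.

```python
import math

def parse_line(line):
    """Parse '<string> <norm> [<col> <value>]*' into (word, norm, {col: value})."""
    fields = line.split()
    word, norm = fields[0], float(fields[1])
    vector = {int(col): float(val) for col, val in zip(fields[2::2], fields[3::2])}
    return word, norm, vector

def cosine(a, b):
    """Cosine similarity that only touches columns non-zero in both vectors."""
    _, norm_a, vec_a = a
    _, norm_b, vec_b = b
    if len(vec_a) > len(vec_b):          # iterate over the shorter vector
        vec_a, vec_b = vec_b, vec_a
    dot = sum(val * vec_b[col] for col, val in vec_a.items() if col in vec_b)
    return dot / (norm_a * norm_b)       # norms are read, never recomputed

cat = parse_line("cat 12.1313 0 5.1 3 4.6 5 10")
dog = parse_line("dog 1.6 3 1.6")        # hypothetical word for the row 0 0 0 1.6 0 0

print(cat)
print(cosine(cat, dog))
```

The file is parsed once, the norm comes precomputed with each line, and products with zero entries are never evaluated, which lines up with the three points listed under "What we attacked".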
Benchmarking

                    • serial python (single-core): 7 minutes
                    • java+hadoop (single-core): 2 minutes
                    • serial python (big file): 18 days
                    • java+hadoop (parallel, big file): 8 hours
                    • it makes sense: 18 d / 3.5 ≈ 5.14 d, and 5.14 d / 14 ≈ 8.8 h, close to the measured 8 h
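
The sanity check spelled out as arithmetic. The factor 3.5 is the single-core ratio from the first two bullets (7 minutes / 2 minutes); the slide does not say what the divisor 14 stands for, so treating it as the degree of parallelism is an assumption.

```python
speedup = 7 / 2                  # serial Python vs. Java+Hadoop on one core
serial_big_file_days = 18

hadoop_one_core_days = serial_big_file_days / speedup   # about 5.14 days
parallelism = 14                 # divisor from the slide; its meaning is assumed
estimated_hours = hadoop_one_core_days * 24 / parallelism

print(f"{hadoop_one_core_days:.2f} days on one core")
print(f"{estimated_hours:.1f} hours in parallel")
```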
