Hadoop
                          A Hands-on Introduction

                          Claudio Martella
                          Elia Bruni


                                9 November 2011




Tuesday, November 8, 11
Outline
                    • What is Hadoop
                    • Why Hadoop
                    • How Hadoop works
                    • Hadoop & Python
                    • Some NLP code
                    • A more complicated problem: Eva
A bit of Context
                    • 2003: first MapReduce library @ Google
                    • 2003: GFS paper
                    • 2004: MapReduce paper
                    • 2005: Apache Nutch uses MapReduce
                    • 2006: Hadoop was born
                    • 2007: first 1,000-node cluster at Yahoo!
An Ecosystem
            • HDFS & MapReduce
            • Zookeeper
            • HBase
            • Pig & Hive
            • Mahout
            • Giraph
            • Nutch
Traditional way
                    • Design a high-level Schema
                    • You store data in an RDBMS
                    • Which has very poor write throughput
                    • And doesn’t scale well
                    • Once you reach terabytes of data
                    • Expensive Data Warehouse
BigData & NoSQL

                    • Store first, think later
                    • Schema-less storage
                    • Analytics
                    • Petabyte scale
                    • Offline processing
Vertical Scalability

                    • Extremely expensive
                    • Requires expertise in distributed systems
                          and concurrent programming
                    • Lacks real fault tolerance

Horizontal Scalability

                    • Built on top of commodity hardware
                    • Easy to use programming paradigms
                    • Fault-tolerance through replication


1st Assumptions
                    • Data to process does not fit on one node.
                    • Each node is commodity hardware.
                    • Failure happens.
                   Spread your data among your nodes
                            and replicate it.

2nd Assumptions
                    • Moving computation is cheap.
                    • Moving data is expensive.
                    • Distributed computing is hard.
                          Move computation to data,
                            with a simple paradigm.

3rd Assumptions
                    • Systems run on spinning hard disks.
                    • Disk seek >> disk scan.
                    • Many small files are expensive.

          Base the paradigm on scanning large files.
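
A rough back-of-the-envelope calculation illustrates why "disk seek >> disk scan" pushes the paradigm toward scanning. The hardware figures below (10 ms per seek, 100 MB/s sequential throughput, a 1 TB dataset of 100-byte records) are illustrative assumptions, not numbers from the slides.

```python
# Back-of-the-envelope comparison of random access vs. sequential scan.
# Every figure here is an assumption chosen for illustration.
SEEK_TIME_S = 0.010          # assumed average seek time: 10 ms
TRANSFER_RATE_BPS = 100e6    # assumed sequential throughput: 100 MB/s
DATASET_BYTES = 1e12         # assumed dataset size: 1 TB
RECORD_BYTES = 100           # assumed record size


def scan_time_s(total_bytes, rate_bps=TRANSFER_RATE_BPS):
    """Time to read the whole dataset sequentially."""
    return total_bytes / rate_bps


def random_access_time_s(total_bytes, record_bytes, fraction, seek_s=SEEK_TIME_S):
    """Time to touch a fraction of the records, paying one seek per record."""
    records = total_bytes / record_bytes
    return records * fraction * seek_s


full_scan = scan_time_s(DATASET_BYTES)
one_percent_random = random_access_time_s(DATASET_BYTES, RECORD_BYTES, 0.01)
print(f"full sequential scan:    {full_scan / 3600:.1f} hours")
print(f"1% of records via seeks: {one_percent_random / 86400:.1f} days")
```

Under these assumptions, reading everything sequentially takes a few hours, while seeking to just 1% of the records takes more than a week.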


Typical Problem

                    • Collect and iterate over many records
                    • Filter and extract something from each
                    • Shuffle & sort these intermediate results
                    • Group-by and aggregate them
                    • Produce final output set
Typical Problem

                    • Collect and iterate over many records         [MAP]
                    • Filter and extract something from each        [MAP]
                    • Shuffle & sort these intermediate results     [REDUCE]
                    • Group-by and aggregate them                   [REDUCE]
                    • Produce final output set                      [REDUCE]
Quick example
          127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"




                    •     (frank, index.html)

                    •     (index.html, 10/Oct/2000)

                    •     (index.html, http://www.example.com/start.html)
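
The three pairs above can be produced by three small map functions over the same log line. The following is a minimal sketch; the regular expression, the function names, and the choice to strip the leading slash from the page are assumptions made for illustration, not code from the talk.

```python
import re

# Assumed pattern for the Apache "combined" log format shown in the example.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) /?(?P<page>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return the named fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def map_user_page(line):
    """Emit (user, page): which pages did each user request?"""
    record = parse(line)
    if record:
        yield record["user"], record["page"]


def map_page_date(line):
    """Emit (page, date): on which days was each page requested?"""
    record = parse(line)
    if record:
        yield record["page"], record["date"]


def map_page_referrer(line):
    """Emit (page, referrer): where did the visitors of a page come from?"""
    record = parse(line)
    if record:
        yield record["page"], record["referrer"]


LINE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" '
        '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

for mapper in (map_user_page, map_page_date, map_page_referrer):
    print(list(mapper(LINE)))
```

Each mapper silently skips lines that do not match, which is one common way to tolerate malformed records in a large log.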


MapReduce
                    • Programmers define two functions:
                          ★   map (key, value) → (key’, value’)*
                          ★   reduce (key’, [value’+]) → (key”, value”)*

                    • Can also define:
                          ★   combine (key, value) → (key’, value’)*
                          ★   partitioner: key’ → partition
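
To make the four functions concrete, here is a minimal in-memory sketch of a word count. It is not the Hadoop API: `run_job`, `group_by_key`, the byte-sum partitioner, and the choice to hand the combiner a list of values per key are all assumptions made for illustration.

```python
from collections import defaultdict


def mapper(key, value):
    """map(key, value) -> (key', value')*: emit (word, 1) for every word."""
    for word in value.split():
        yield word.lower(), 1


def combiner(key, values):
    """Local pre-aggregation of a single map task's output for one key."""
    yield key, sum(values)


def reducer(key, values):
    """reduce(key', [value'+]) -> (key'', value'')*: total count per word."""
    yield key, sum(values)


def partitioner(key, num_partitions):
    """partitioner: key' -> partition index (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_partitions


def group_by_key(pairs):
    """Collect the values of identical keys into lists."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(records, num_partitions=2):
    """Run map, combine, partition, shuffle/sort and reduce in one process."""
    partitions = [[] for _ in range(num_partitions)]
    # Map phase: each record plays the role of one map task, whose output
    # is combined locally before being assigned to a partition.
    for key, value in records:
        for k, values in group_by_key(mapper(key, value)).items():
            for ck, cv in combiner(k, values):
                partitions[partitioner(ck, num_partitions)].append((ck, cv))
    # Reduce phase: each partition is grouped and sorted by key, then reduced.
    output = {}
    for partition in partitions:
        grouped = group_by_key(partition)
        for k in sorted(grouped):
            for rk, rv in reducer(k, grouped[k]):
                output[rk] = rv
    return output


records = [(0, "to be or not to be"), (1, "to do is to be")]
result = run_job(records)
print(result)
```

Because the partitioner sends every occurrence of a key to the same partition, each reducer sees all values for its keys, which is what makes the final counts correct.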


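[Figure: MapReduce data flow]
  Input:            (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
  map outputs:      map 1: (a, 1) (b, 2)   map 2: (c, 3) (c, 6)   map 3: (a, 5) (c, 2)   map 4: (b, 7) (c, 9)
  Shuffle and Sort (aggregate values by keys):   a → [1, 5]   b → [2, 7]   c → [2, 3, 6, 9]
  reduce outputs:   (r1, s1) (r2, s2) (r3, s3)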
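
The shuffle-and-sort step of the figure can be reproduced with a sort followed by a group-by. This is a single-process sketch using the figure's values; the function name `shuffle_and_sort` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Output of the four map tasks in the figure.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 9)],
]


def shuffle_and_sort(task_outputs):
    """Merge all map outputs, sort them, and group the values by key.

    Sorting whole (key, value) pairs also orders the values, which matches
    the figure; Hadoop itself only guarantees the order of the keys.
    """
    merged = sorted(pair for task in task_outputs for pair in task)
    return {
        key: [value for _, value in group]
        for key, group in groupby(merged, key=itemgetter(0))
    }


grouped = shuffle_and_sort(map_outputs)
print(grouped)
```

Each entry of `grouped` is what one reduce call would receive: a key and the list of all values emitted for it.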



MapReduce daemons
                    • JobTracker: the master. It schedules jobs,
                          assigns tasks to nodes, collects heartbeats
                          from workers, and reschedules failed tasks
                          for fault tolerance.
                    • TaskTracker: the worker. It runs on each
                          slave node and runs (multiple) Mappers and
                          Reducers, each in its own JVM.


[Figure: MapReduce execution overview. (1) The user program forks the master and the workers. (2) The master assigns map and reduce tasks. (3) Map workers read the input splits (split 0 to split 4) and (4) write intermediate files to local disk. (5) Reduce workers read them remotely and (6) write the output files (output file 0, output file 1). Stages: input files, map phase, intermediate files (on local disk), reduce phase, output files.]
Redrawn from (Dean and Ghemawat, OSDI 2004)
HDFS daemons
                    • NameNode: the master. It keeps the
                          filesystem metadata (in memory) and the
                          file-block-node mapping, decides replication
                          and block placement, and collects heartbeats
                          from nodes.
                    • DataNode: the slave. It stores the
                          blocks (64MB) of the files and serves
                          reads and writes directly.
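
The file-block-node mapping kept by the NameNode can be pictured with a toy model. The 64 MB block size comes from the slide; the replication factor of 3, the DataNode names, and the round-robin placement are assumptions for illustration (actual HDFS placement is rack-aware).

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as on the slide
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)   # a hypothetical 200 MB file
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in block_map.items():
    print(f"block {block}: {blocks[block] // (1024 * 1024)} MB on {nodes}")
```

Losing any single DataNode in this model still leaves at least two copies of every block, which is the point of replicating across nodes.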

[Figure: GFS architecture. The application's GFS client sends (file name, chunk index) to the GFS master, which holds the file namespace (e.g. /foo/bar → chunk 2ef0) and replies with (chunk handle, chunk location). The master sends instructions to the chunkservers and receives chunkserver state. The client then sends (chunk handle, byte range) to a GFS chunkserver, which stores chunks on its Linux file system and returns the chunk data.]
Redrawn from (Ghemawat et al., SOSP 2003)
Transparent to

                    • Assignment of workers to data
                    • Map / Reduce assignment to nodes
                    • Management of synchronization
                    • Management of communication
                    • Fault-tolerance and restarts
Take home recipe

                    • Scan-based computation (no random I/O)
                    • Big datasets
                    • Divide-and-conquer class algorithms
                    • No communication between tasks

Not good for

                    • Real-time / Stream processing
                    • Graph processing
                    • Computation without locality
                    • Small datasets

Questions?



Baseline solution




What we attacked

                    • You don’t want to parse the file many times
                    • You don’t want to re-calculate the norm
                    • You don’t want to calculate 0*n

Our solution
                    dense row                  stored non-zero values
                    0   1.3 0   0   7.1 1.1    1.3   7.1   1.1
                    1.2 0   0   0   0   3.4    1.2   3.4
                    0   5.7 0   0   1.1 2      5.7   1.1   2
                    5.1 0   0   4.6 0   10     5.1   4.6   10
                    0   0   0   1.6 0   0      1.6

                      line format: <string><norm>[<col><value>]*
                        for example: cat 12.1313 0 5.1 3 4.6 5 10
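
The line format can be parsed in a single pass, and the stored norm and column indices let a similarity be computed without re-deriving the norm or multiplying by zeros. The sketch below assumes whitespace-separated fields and assumes the target computation is a cosine similarity between rows, which the slides do not state explicitly; the function names and the second vector, `dog`, are hypothetical.

```python
import math


def parse_line(line):
    """Parse '<string> <norm> [<col> <value>]*' into (name, norm, {col: value})."""
    fields = line.split()
    name, norm = fields[0], float(fields[1])
    pairs = fields[2:]
    vector = {int(pairs[i]): float(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return name, norm, vector


def cosine(norm_a, vec_a, norm_b, vec_b):
    """Cosine similarity using stored norms and only the shared non-zero columns."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a      # iterate over the shorter vector
    dot = sum(value * vec_b[col] for col, value in vec_a.items() if col in vec_b)
    return dot / (norm_a * norm_b)


# The example line from the slide (fourth row of the matrix).
name, norm, vec = parse_line("cat 12.1313 0 5.1 3 4.6 5 10")

# Hypothetical second word built from the first row of the matrix.
dog_vec = {1: 1.3, 4: 7.1, 5: 1.1}
dog_norm = math.sqrt(sum(value * value for value in dog_vec.values()))

print(name, norm, vec)
print(cosine(norm, vec, dog_norm, dog_vec))
```

Only column 5 is non-zero in both vectors, so the dot product reduces to a single multiplication.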
Benchmarking

                    • serial python (single-core): 7 minutes
                    • java+hadoop (single-core): 2 minutes
                    • serial python (big file): 18 days
                    • java+hadoop (parallel, big file): 8 hours
                    • it makes sense: 18 d / 3.5 ≈ 5.14 d, and 5.14 d / 14 ≈ 8.8 h, close to the measured 8 h
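
The sanity check can be spelled out step by step. The inputs are the numbers on the slide; reading 3.5 as the 7 min / 2 min speedup and 14 as the degree of parallelism is an assumption, since the slide does not label them.

```python
# Numbers taken from the benchmark slide.
python_small_min = 7     # serial Python, small input
hadoop_small_min = 2     # Java + Hadoop on a single core, small input
python_big_days = 18     # serial Python, big file
parallelism = 14         # divisor on the slide, assumed to be the degree of parallelism

language_speedup = python_small_min / hadoop_small_min       # 7 / 2
single_core_days = python_big_days / language_speedup        # Java on one core
parallel_hours = single_core_days / parallelism * 24         # spread over 14 workers

print(f"speedup {language_speedup:.1f}x, "
      f"single core {single_core_days:.2f} days, "
      f"parallel {parallel_hours:.1f} hours")
```

The estimate lands slightly above the measured 8 hours, so the measured run is consistent with near-linear scaling.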
