Large-Scale Data Storage and Processing for Scientists in The Netherlands

Evert.Lammerts@SARA.nl
January 21, 2011
NBIC BioAssist Programmers' Day
●   Supercomputing
●   Cloud computing
●   Grid computing
●   Cluster computing
●   GPU computing
Status Quo: Storage separate from Compute
Case Study: Virtual Knowledge Studio
http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment

●   How do categories in Wikipedia evolve over time? (And how do they
    relate to internal links?)
●   2.7 TB of raw text, in a single file
●   A Java application that searches the wiki markup for categories,
    like [[Category:NAME]]
●   Executed on the Grid
1.1) Copy the file from a local machine to Grid storage.
2.1) Stream the file from Grid storage to a single machine.
2.2) Cut it into pieces of 10 GB.
2.3) Stream the pieces back to Grid storage.
3.1) Process all files in parallel: N machines run the Java application,
     each fetching a 10 GB file as input, processing it, and putting the
     result back.
Status Quo: Arrange your own (data-)parallelism
●   Cut the dataset up into “processable chunks”:
    ●   The size of a chunk depends on the local space on the processing node...
    ●   … on the total processing capacity available …
    ●   … on the smallest unit of work (“largest grade of parallelism”)...
    ●   … on the substance (sometimes you don't want many output files, e.g.
        when building a search index).
●   Submit the number of jobs you consider necessary:
    ●   To a cluster close to your data (270x10 GB over the WAN is a bad idea)
    ●   The number might depend on cluster capacity, the number of chunks, the
        smallest unit of work, the substance...

         When dealing with large data, let's say 100 GB+, this is
        ERROR PRONE, TIME CONSUMING AND NOT FOR NEWBIES!
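To make this manual route concrete, here is a sketch of steps 2.2 through 3.1
under assumed filenames; the copy and submit commands depend on your grid
middleware, so they are left as comments:

    # Sketch only: filenames are illustrative, middleware commands omitted.
    # 2.2) cut the 2.7 TB dump into 10 GB pieces on a staging machine
    split --bytes=10G enwiki-pages.xml chunk-

    # 2.3) stream each chunk-* back to Grid storage (middleware-specific client)
    # 3.1) submit ~270 jobs, one per chunk, to a cluster close to the data;
    #      each job fetches its chunk, runs the Java app, stores the output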
2002: Nutch*        2004: MapReduce / GFS**        2006: Hadoop

*  http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
   http://labs.google.com/papers/gfs.html
2010/2011: A Hype in Production
http://wiki.apache.org/hadoop/PoweredBy
Hadoop Distributed File System (HDFS)
●   Very large DFS. Order of magnitude:
    –   10k nodes
    –   millions of files
    –   petabytes of storage
●   Assumes “commodity hardware”:
    –   redundancy through replication
    –   failure handling and recovery
●   Optimized for batch processing:
    –   locations of data exposed to computation
    –   high aggregate bandwidth

Source: http://www.slideshare.net/jhammerb/hdfs-architecture
HDFS Continued...
●   Single namespace for the entire cluster
●   Data coherency
    –   Write-once-read-many model
    –   Only appending is supported for existing files
●   Files are broken up into chunks (“blocks”)
    –   Block sizes range from 64 to 256 MB, depending on configuration
    –   Blocks are distributed over nodes (a single file, consisting of N
        blocks, is stored on M nodes)
    –   Blocks are replicated, and the replicas are distributed
●   Clients access the blocks of a file at the nodes directly
    –   This creates high aggregate bandwidth!

Source: http://www.slideshare.net/jhammerb/hdfs-architecture
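For reference, block size and replication factor are per-cluster configuration
(and can be overridden per file). A sketch of the relevant hdfs-site.xml keys,
using the r0.20 names; the values are illustrative:

    <!-- hdfs-site.xml (r0.20 key names); values are illustrative -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>   <!-- 128 MB blocks, in bytes -->
    </property>
    <property>
      <name>dfs.replication</name>
      <value>3</value>           <!-- each block stored on 3 DataNodes -->
    </property>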
HDFS NameNode & DataNodes

NameNode
●   Manages the File System Namespace
    –   Maps filenames to blocks
    –   Maps blocks to DataNodes
●   Cluster configuration
●   Replication management

DataNode
●   A “Block Server”
    –   Stores data in the local FS
    –   Stores metadata of a block (e.g. a hash)
    –   Serves (meta)data to clients
●   Facilitates a pipeline to other DN's
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html

Metadata operations
●   Communicate with the NameNode (NN) only
    –   ls, lsr, df, du, chmod, chown... etc.

R/W (block) operations
●   Communicate with the NN and the DataNodes (DN's)
    –   put, copyFromLocal, copyToLocal, tail... etc.
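To make the two categories concrete, a minimal fs shell session; the paths are
illustrative:

    hadoop fs -ls /user/alice              # metadata: talks to the NN only
    hadoop fs -du /user/alice/data         # metadata: talks to the NN only
    hadoop fs -put wiki.xml /user/alice/   # R/W: streams blocks to DN's
    hadoop fs -tail /user/alice/wiki.xml   # R/W: reads trailing bytes from a DN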
HDFS Application Programming Interface (API)
●   Enables programmers to access any HDFS from their code
●   Described at
    http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
●   Written in (and thus available for) Java, but...
●   Also exposed through Apache Thrift, so it can be accessed from:
    ●   C++, Python, PHP, Ruby, and others
    ●   See http://wiki.apache.org/hadoop/HDFS-APIs
●   Has a separate C API (libhdfs)

So: you can enable your services to work with HDFS
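A minimal Java sketch against the r0.20 FileSystem API; the cluster address and
path are illustrative. Note how getFileBlockLocations exposes the locations of
data to computation, as promised earlier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // picked up from core-site.xml, or set explicitly (address illustrative)
        conf.set("fs.default.name", "hdfs://namenode.example.org:8020");
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/user/alice/wiki.xml");           // illustrative path
        FileStatus stat = fs.getFileStatus(p);

        // Metadata operation: which DataNodes hold which blocks? (NN only)
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        System.out.println(p + " has " + blocks.length + " blocks");

        // Read operation: the client streams block data from DataNodes directly
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[4096];
        int n = in.read(buf);
        in.close();
        System.out.println("read " + n + " bytes");
      }
    }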
MapReduce
●   Is a framework for distributed (parallel) processing of large datasets
●   Provides a programming model
●   Lets users plug in their own code
●   Uses a common pattern:
      cat   | grep | sort    | unique > file
      input | map  | shuffle | reduce > output
●   Is useful for large-scale data analytics and processing
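To make the pattern concrete, a minimal sketch in the r0.20 “new” Java API
(org.apache.hadoop.mapreduce): a mapper that greps wiki markup for
[[Category:NAME]] and a reducer that counts occurrences, echoing the case
study. Class names and regex are illustrative, not the actual VKS application:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: emit (category, 1) for every [[Category:NAME]] in a line of markup
    class CategoryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final Pattern CAT = Pattern.compile("\\[\\[Category:([^\\]|]+)");
      private static final IntWritable ONE = new IntWritable(1);

      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        Matcher m = CAT.matcher(value.toString());
        while (m.find()) {
          ctx.write(new Text(m.group(1)), ONE);   // the "grep" stage
        }
      }
    }

    // reduce: sum the counts per category (like "sort | uniq -c")
    class CategoryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
      }
    }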
MapReduce Continued...
●   Is great for processing large datasets!
    –   Sends computation to the data, so little data goes over the lines
    –   Uses the blocks stored in the DFS, so no splitting is required
        (this is a bit more sophisticated depending on your input)
●   Handles parallelism for you
    –   One map per block, if possible
●   Scales basically linearly
    –   time_on_cluster = time_on_single_core / total_cores
●   Java, but streaming is possible (plus others, see later)
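The streaming route mentioned in the last bullet wraps any stdin/stdout
executables as map and reduce. This invocation mirrors the example from the
Hadoop streaming documentation; the input and output paths are illustrative:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -input /user/alice/in -output /user/alice/out \
        -mapper /bin/cat -reducer /usr/bin/wc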
MapReduce JobTracker & TaskTrackers

JobTracker
●   Holds job metadata
    –   Status of the job
    –   Status of the Tasks running on the TTs
●   Decides on scheduling
●   Delegates creation of 'InputSplits'

TaskTracker
●   Requests work from the JT
    –   Fetches the code to execute from the DFS
    –   Applies job-specific configuration
●   Communicates with the JT on tasks:
    –   Sending output, killing tasks, task updates, etc.
MapReduce client
MapReduce Application Programming Interface (API)
●   Enables programmers to write MapReduce jobs
    ●   More info on MR jobs:
        http://www.slideshare.net/evertlammerts/infodoc-6107350
●   Enables programmers to communicate with a JobTracker
    ●   Submitting jobs, getting statuses, cancelling jobs, etc.
●   Described at
    http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
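A minimal driver to go with the mapper and reducer sketched earlier: it
configures a Job and submits it to the JobTracker, blocking until completion.
The class names, again, are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CategoryCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "category-count");     // r0.20-style constructor
        job.setJarByClass(CategoryCount.class);
        job.setMapperClass(CategoryMapper.class);      // sketched earlier
        job.setCombinerClass(CategoryReducer.class);
        job.setReducerClass(CategoryReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submits to the JobTracker and polls for status until done
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }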
Case Study: Virtual Knowledge Studio

1) Load the file into HDFS
2) Submit the code to MR
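With Hadoop, the three-stage Grid workflow from the earlier case-study slide
collapses to two commands; the file and jar names are illustrative:

    hadoop fs -put enwiki-pages.xml /user/alice/        # 1) load into HDFS
    hadoop jar categorycount.jar CategoryCount \
        /user/alice/enwiki-pages.xml /user/alice/out    # 2) submit to MR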
What's more on Hadoop?

Lots!
●   Apache Pig http://pig.apache.org
     –   Analyze datasets in a high-level language, “Pig Latin”
     –   Simple! SQL-like. Extremely fast experiments.
     –   N-stage jobs (MR chaining!)
●   Apache Hive http://hive.apache.org
     –   Data warehousing
     –   Hive QL
●   Apache HBase http://hbase.apache.org
     –   An implementation of Google's BigTable
     –   In-memory operation
     –   Performance good enough for websites (Facebook built its Messaging
         Platform on top of it)
●   Yahoo! Oozie http://yahoo.github.com/oozie/
     –   Hadoop workflow engine
●   Apache [Avro | Chukwa | Hama | Mahout] and so on
●   3rd party:
     –   ElephantBird
     –   Cloudera's Distribution for Hadoop
     –   Hue
     –   Yahoo!'s Distribution of Hadoop
Hadoop @ SARA

A prototype cluster
●   Since December 2010
●   20 cores for MR (TT's)
●   110 TB gross for HDFS (DN's) (55 TB net)
●   Hue web interface for job submission & management
●   SFTP interface to HDFS
●   Pig 0.8
●   Hive
●   Available for scientists / scientific programmers until May / June 2011

Towards a production infrastructure?
●   Depending on results

It's open for you all as well: ask me for an account!
Hadoop for:
●   Large-scale data storage and processing
●   Fundamental difference: data locality!
●   Small files? Don't, but... Hadoop Archives (HAR)
●   Archival? Don't. Use tape storage. (We have lots!)
●   Very fast analytics (Pig!)
●   Data-parallel applications (not good at cross products; use Huygens
    or Lisa!)
●   Legacy applications possible through piping / streaming (Weird
    dependencies? Use Cloud!)

We'll do another Hackathon on Hadoop. Interested? Send me a mail!
