Large-Scale Data Storage and Processing for Scientists with Hadoop

A presentation for the NBIC BioAssist Programmers' day on Friday January 21st, 2011

  1. Large-Scale Data Storage and Processing for Scientists in The Netherlands
     Evert.Lammerts@SARA.nl
     January 21, 2011, NBIC BioAssist Programmers' Day
  2. Supercomputing
     Cloud computing
     Grid computing
     Cluster computing
     GPU computing
  3. Status Quo: Storage separate from Compute
  4. Case Study: Virtual Knowledge Studio
     http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
     ● How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
     ● 2.7 TB of raw text, in a single file
     ● A Java application that searches for categories in wiki markup, like [[Category:NAME]] (see the sketch below)
     ● Executed on the Grid
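A minimal sketch of the kind of category extraction the case study describes. The regular expression, class name, and example input are assumptions for illustration; this is not the VKS code.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch: extract [[Category:NAME]] tags from wiki markup.
    public class CategoryExtractor {
        // Capture everything after "[[Category:" up to a ']' or '|'
        private static final Pattern CATEGORY =
                Pattern.compile("\\[\\[Category:([^\\]|]+)");

        public static void printCategories(String wikiMarkup) {
            Matcher m = CATEGORY.matcher(wikiMarkup);
            while (m.find()) {
                System.out.println(m.group(1).trim());
            }
        }

        public static void main(String[] args) {
            printCategories("Some text [[Category:Distributed computing]] more text");
        }
    }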
  5. The Grid workflow:
     1.1) Copy the file from the local machine to Grid storage
     2.1) Stream the file from Grid storage to a single machine
     2.2) Cut it into pieces of 10 GB
     2.3) Stream the pieces back to Grid storage
     3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
  6. Status Quo: Arrange your own (data-)parallelism
     ● Cut the dataset up into “processable chunks”. The chunk size depends on:
        ● the local disk space available on the processing node...
        ● ... the total processing capacity available...
        ● ... the smallest unit of work (the “largest grade of parallelism”)...
        ● ... the substance (sometimes you don't want many output files, e.g. when building a search index).
     ● Submit the number of jobs you consider necessary:
        ● to a cluster close to your data (270 x 10 GB over the WAN is a bad idea)
        ● the number may depend on cluster capacity, the number of chunks, the smallest unit of work, the substance...
     When dealing with large data, say 100 GB+, this is ERROR-PRONE, TIME-CONSUMING, AND NOT FOR NEWBIES!
  7. 2002: Nutch*    2004: MapReduce / GFS**    2006: Hadoop
     *  http://nutch.apache.org/
     ** http://labs.google.com/papers/mapreduce.html
        http://labs.google.com/papers/gfs.html
  8. 2010/2011: A Hype in Production
     http://wiki.apache.org/hadoop/PoweredBy
  9. (image-only slide)
  10. Hadoop Distributed File System (HDFS)
      ● A very large DFS. Order of magnitude:
         – 10k nodes
         – millions of files
         – petabytes of storage
      ● Assumes “commodity hardware”:
         – redundancy through replication
         – failure handling and recovery
      ● Optimized for batch processing:
         – locations of data exposed to computation
         – high aggregate bandwidth
      http://www.slideshare.net/jhammerb/hdfs-architecture
  11. HDFS continued...
      ● A single namespace for the entire cluster
      ● Data coherency
         – Write-once-read-many model
         – Only appending is supported for existing files
      ● Files are broken up into chunks (“blocks”)
         – Block sizes range from 64 to 256 MB, depending on configuration
         – Blocks are distributed over nodes (a single file, consisting of N blocks, is stored on M nodes)
         – Blocks are replicated, and the replicas are distributed
      ● Clients access the blocks of a file at the nodes directly
         – This creates high aggregate bandwidth!
      http://www.slideshare.net/jhammerb/hdfs-architecture
  12. HDFS NameNode & DataNodes NameNode DataNode ● Manages File System Namespace ● A “Block Server” – Mapping filename to blocks – Stores data in local FS – Mapping blocks to DataNode – Stores metadata of a block (e.g. hash) ● Cluster Configuration – Serve (meta)data to clients ● Replication Management ● Facilitates pipeline to other DN's BioAssist Programmers' Day, January 21, 2011
  13. The HDFS shell
      http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
      Metadata operations
      ● Communicate with the NameNode only
         – ls, lsr, df, du, chmod, chown... etc.
      Read/write (block) operations
      ● Communicate with the NameNode and DataNodes
         – put, copyFromLocal, copyToLocal, tail... etc.
  14. HDFS Application Programming Interface (API)
      ● Enables programmers to access any HDFS from their code
      ● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
      ● Written in (and thus available for) Java, but...
      ● ... also exposed through Apache Thrift, so it can be accessed from:
         ● C++, Python, PHP, Ruby, and others
         ● see http://wiki.apache.org/hadoop/HDFS-APIs
      ● Has a separate C API (libhdfs)
      So: you can enable your services to work with HDFS
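As a rough illustration of the Java API mentioned above, a minimal sketch (Hadoop 0.20-era) that writes a file to HDFS and reads it back. The NameNode URI and paths are assumptions for illustration.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal HDFS API sketch; the NameNode address and paths are assumed.
    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode.example.org:8020"); // assumed address

            FileSystem fs = FileSystem.get(conf);

            // Write a small file
            Path path = new Path("/user/evert/hello.txt");
            FSDataOutputStream out = fs.create(path);
            out.writeBytes("Hello, HDFS!\n");
            out.close();

            // Read it back
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
            System.out.println(in.readLine());
            in.close();
        }
    }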
  15. MapReduce
      ● A framework for distributed (parallel) processing of large datasets
      ● Provides a programming model
      ● Lets users plug in their own code
      ● Uses a common pattern:
         cat   | grep | sort    | unique > file
         input | map  | shuffle | reduce > output
      ● Useful for large-scale data analytics and processing
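To make the map and reduce steps concrete, a minimal word-count sketch against the Hadoop 0.20 Java API. The class names are illustrative assumptions, not code from the presentation.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Minimal word-count sketch; illustrative only.
    public class WordCount {

        // map: emit (word, 1) for every word in the input line
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // reduce: sum the counts for each word after the shuffle
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }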
  16. MapReduce continued...
      ● Great for processing large datasets!
         – Computation is sent to the data, so little data goes over the lines
         – Uses the blocks stored in the DFS, so no splitting required (this is a bit more sophisticated depending on your input)
      ● Handles parallelism for you
         – One map per block, if possible
      ● Scales basically linearly
         – time_on_cluster = time_on_single_core / total_cores
      ● Java, but streaming is possible (plus others, see later)
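A rough worked illustration of that scaling rule, with assumed numbers and ignoring scheduling and I/O overhead: a job that needs 40 hours on a single core would take roughly 40 / 20 = 2 hours on the 20 MapReduce cores of the prototype cluster described on slide 22.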
  17. MapReduce JobTracker & TaskTrackers JobTracker TaskTracker ● Holds job metadata ● Requests work from JT – Status of job – Fetch the code to execute from the DFS – Status of Tasks running on TTs – Apply job-specific configuration ● Decides on scheduling ● Communicate with JT on tasks: ● Delegates creation of 'InputSplits' – Sending output, Killing tasks, Task updates, etc BioAssist Programmers' Day, January 21, 2011
  18. MapReduce client (diagram slide)
  19. MapReduce Application Programming Interface (API)
      ● Enables programmers to write MapReduce jobs
         ● More info on MR jobs: http://www.slideshare.net/evertlammerts/infodoc-6107350
      ● Enables programmers to communicate with a JobTracker
         ● Submitting jobs, getting statuses, cancelling jobs, etc.
      ● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
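A minimal driver sketch showing how such a job could be configured and submitted with the Hadoop 0.20 Java API. It reuses the hypothetical WordCount classes from the earlier sketch; the input and output paths are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal job driver sketch; paths and class names are illustrative.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenMapper.class);
            job.setReducerClass(WordCount.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/user/evert/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/evert/output"));

            // Submit to the JobTracker and wait for completion
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }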
  20. Case Study: Virtual Knowledge Studio
      1) Load the file into HDFS
      2) Submit the code to MapReduce
  21. What's more on Hadoop? Lots!
      ● Apache Pig http://pig.apache.org
         – Analyze datasets in a high-level language, “Pig Latin”
         – Simple! SQL-like. Extremely fast experiments.
         – N-stage jobs (MR chaining!)
      ● Apache Hive http://hive.apache.org
         – Data warehousing
         – Hive QL
      ● Apache HBase http://hbase.apache.org
         – BigTable implementation (Google)
         – In-memory operation
         – Performance good enough for websites (Facebook built its Messaging Platform on top of it)
      ● Yahoo! Oozie http://yahoo.github.com/oozie/
         – Hadoop workflow engine
      ● Apache [Avro | Chukwa | Hama | Mahout] and so on
      ● 3rd party:
         – ElephantBird
         – Cloudera's Distribution for Hadoop
         – Hue
         – Yahoo's Distribution for Hadoop
  22. Hadoop @ SARA
      A prototype cluster
      ● Since December 2010
      ● 20 cores for MR (TTs)
      ● 110 TB gross for HDFS (DNs) (55 TB net)
      ● Hue web interface for job submission & management
      ● SFTP interface to HDFS
      ● Pig 0.8
      ● Hive
      ● Available for scientists / scientific programmers until May / June 2011
      Towards a production infrastructure?
      ● Depends on results
      It's open for you all as well: ask me for an account!
  23. (image-only slide)
  24. Hadoop for:
      ● Large-scale data storage and processing
      ● The fundamental difference: data locality!
      ● Small files? Don't, but... Hadoop Archives (HAR)
      ● Archival? Don't. Use tape storage. (We have lots!)
      ● Very fast analytics (Pig!)
      ● Data-parallel applications (not good at cross products: use Huygens or Lisa!)
      ● Legacy applications are possible through piping / streaming (Weird dependencies? Use Cloud!)
      We'll do another Hackathon on Hadoop. Interested? Send me a mail!