Large-Scale Data Storage and Processing for Scientists with Hadoop
 


A presentation for the NBIC BioAssist Programmers' Day, Friday, January 21, 2011.


Usage Rights

CC Attribution-ShareAlike License


    Presentation Transcript

    • Large-Scale Data Storage and Processing for Scientists in The Netherlands
      Evert.Lammerts@SARA.nl
      January 21, 2011
      NBIC BioAssist Programmers Day
    • Supercomputing, cloud computing, grid computing, cluster computing, GPU computing
    • Status Quo: Storage Separate from Compute
    • Case Study: Virtual Knowledge Studio
      http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
      ● How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
      ● 2.7 TB of raw text, in a single file
      ● A Java application searches for categories in wiki markup, like [[Category:NAME]] (see the sketch below)
      ● Executed on the Grid
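      The deck doesn't include the application's source. As a rough illustration, the core of such a scan might look like the following Java sketch; the class name, method, and sample input are made up, and only the [[Category:NAME]] pattern comes from the slide:

          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          public class CategoryScanner {
              // Matches wiki markup such as [[Category:Bioinformatics]] or
              // [[Category:Proteins|sort key]]; group 1 captures the category name.
              private static final Pattern CATEGORY =
                      Pattern.compile("\\[\\[Category:([^\\]|]+)");

              public static void scanLine(String line) {
                  Matcher m = CATEGORY.matcher(line);
                  while (m.find()) {
                      System.out.println(m.group(1).trim());
                  }
              }

              public static void main(String[] args) {
                  // Prints "Bioinformatics" and "Genomics", one per line
                  scanLine("Intro text... [[Category:Bioinformatics]] [[Category:Genomics|G]]");
              }
          }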
    • 1.1) Copy the file from the local machine to Grid storage
      2.1) Stream the file from Grid storage to a single machine
      2.2) Cut it into pieces of 10 GB
      2.3) Stream the pieces back to Grid storage
      3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
    • Status Quo: Arrange Your Own (Data-)Parallelism
      ● Cut the dataset up into “processable chunks”:
        – The size of a chunk depends on the local space on the processing node...
        – ... on the total processing capacity available...
        – ... on the smallest unit of work (the “largest grade of parallelism”)...
        – ... on the substance (sometimes you don't want many output files, e.g. when building a search index).
      ● Submit the number of jobs you consider necessary:
        – To a cluster close to your data (270 x 10 GB over a WAN is a bad idea)
        – The number might depend on cluster capacity, the number of chunks, the smallest unit of work, the substance...
      When dealing with large data, let's say 100 GB+, this is ERROR PRONE, TIME CONSUMING, AND NOT FOR NEWBIES!
    • 2002: Nutch*   2004: MapReduce / GFS**   2006: Hadoop*
      *  http://nutch.apache.org/
      ** http://labs.google.com/papers/mapreduce.html
         http://labs.google.com/papers/gfs.html
    • 2010/2011: A Hype in Production
      http://wiki.apache.org/hadoop/PoweredBy
    • Hadoop Distributed File System (HDFS)
      ● A very large DFS. Order of magnitude:
        – 10k nodes
        – millions of files
        – petabytes of storage
      ● Assumes “commodity hardware”:
        – redundancy through replication
        – failure handling and recovery
      ● Optimized for batch processing:
        – locations of data are exposed to computation
        – high aggregate bandwidth
      http://www.slideshare.net/jhammerb/hdfs-architecture
    • HDFS Continued...
      ● A single namespace for the entire cluster
      ● Data coherency
        – Write-once-read-many model
        – Only appending is supported for existing files
      ● Files are broken up into chunks (“blocks”)
        – Block sizes range from 64 to 256 MB, depending on configuration
        – Blocks are distributed over nodes (a single FILE, consisting of N blocks, is stored on M nodes)
        – Blocks are replicated, and the replicas are distributed
      ● Clients access the blocks of a file at the nodes directly
        – This creates high aggregate bandwidth!
      http://www.slideshare.net/jhammerb/hdfs-architecture
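      Block size and replication factor come from the cluster configuration (and can be overridden per file). A minimal hdfs-site.xml sketch using the 0.20-era property names; the values are examples, not recommendations:

          <?xml version="1.0"?>
          <configuration>
            <!-- Block size in bytes: 128 MB per block -->
            <property>
              <name>dfs.block.size</name>
              <value>134217728</value>
            </property>
            <!-- Store each block on three DataNodes -->
            <property>
              <name>dfs.replication</name>
              <value>3</value>
            </property>
          </configuration>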
    • HDFS NameNode & DataNodes
      NameNode
      ● Manages the file system namespace
        – Maps filenames to blocks
        – Maps blocks to DataNodes
      ● Cluster configuration
      ● Replication management
      DataNode
      ● A “block server”
        – Stores data in the local FS
        – Stores metadata of a block (e.g. a hash)
        – Serves (meta)data to clients
      ● Facilitates the pipeline to other DataNodes
    • The HDFS Shell
      http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
      ● Metadata operations – communicate with the NameNode only
        – ls, lsr, df, du, chmod, chown... etc.
      ● R/W (block) operations – communicate with the NameNode and DataNodes
        – put, copyFromLocal, copyToLocal, tail... etc.
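      A typical session with the shell might look like this (the user and path names are hypothetical):

          $ hadoop fs -ls /user/alice
          $ hadoop fs -put reads.fasta /user/alice/reads.fasta
          $ hadoop fs -du /user/alice
          $ hadoop fs -copyToLocal /user/alice/out/part-00000 ./result.txt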
    • HDFS Application Programming Interface (API)
      ● Enables programmers to access any HDFS from their code
      ● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
      ● Written in (and thus available for) Java, but...
      ● ... is also exposed through Apache Thrift, so it can be accessed from:
        – C++, Python, PHP, Ruby, and others
        – See http://wiki.apache.org/hadoop/HDFS-APIs
      ● Has a separate C API (libhdfs)
      So: you can enable your services to work with HDFS
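      A minimal Java sketch of the API in use; the path and file contents are made up, while Configuration, FileSystem, and Path are the actual classes:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class HdfsHello {
              public static void main(String[] args) throws Exception {
                  // Picks up fs.default.name from the Hadoop config on the classpath
                  Configuration conf = new Configuration();
                  FileSystem fs = FileSystem.get(conf);

                  // Write a small file to HDFS (hypothetical path)
                  Path path = new Path("/user/alice/hello.txt");
                  FSDataOutputStream out = fs.create(path);
                  out.writeUTF("Hello, HDFS!");
                  out.close();

                  // List the parent directory
                  for (FileStatus status : fs.listStatus(path.getParent())) {
                      System.out.println(status.getPath());
                  }
              }
          }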
    • MapReduce
      ● Is a framework for distributed (parallel) processing of large datasets
      ● Provides a programming model
      ● Lets users plug in their own code
      ● Uses a common pattern:
          cat   | grep | sort    | unique > file
          input | map  | shuffle | reduce > output
      ● Is useful for large-scale data analytics and processing
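      To make the pattern concrete, here is a sketch of the map and reduce sides of a category-counting job in the 0.20 Java API, reusing the regex idea from the case study; the class names are made up, while Mapper and Reducer are the actual base classes:

          import java.io.IOException;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.apache.hadoop.mapreduce.Reducer;

          public class CategoryCount {

              // "map": emit (category, 1) for every [[Category:...]] tag in a line
              public static class CategoryMapper
                      extends Mapper<LongWritable, Text, Text, IntWritable> {
                  private static final Pattern CATEGORY =
                          Pattern.compile("\\[\\[Category:([^\\]|]+)");
                  private static final IntWritable ONE = new IntWritable(1);
                  private final Text category = new Text();

                  @Override
                  protected void map(LongWritable key, Text value, Context context)
                          throws IOException, InterruptedException {
                      Matcher m = CATEGORY.matcher(value.toString());
                      while (m.find()) {
                          category.set(m.group(1).trim());
                          context.write(category, ONE);
                      }
                  }
              }

              // "reduce": after the shuffle has grouped by category, sum the counts
              public static class SumReducer
                      extends Reducer<Text, IntWritable, Text, IntWritable> {
                  @Override
                  protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
                      int sum = 0;
                      for (IntWritable v : values) {
                          sum += v.get();
                      }
                      context.write(key, new IntWritable(sum));
                  }
              }
          }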
    • MapReduce Continued...
      ● Is great for processing large datasets!
        – Sends the computation to the data, so little data goes over the lines
        – Uses the blocks stored in the DFS, so no splitting is required (this is a bit more sophisticated, depending on your input)
      ● Handles parallelism for you
        – One map per block, if possible
      ● Scales basically linearly
        – time_on_cluster = time_on_single_core / total_cores
      ● Java, but streaming is possible (plus others, see later)
    • MapReduce JobTracker & TaskTrackers
      JobTracker
      ● Holds job metadata
        – Status of the job
        – Status of the tasks running on the TaskTrackers
      ● Decides on scheduling
      ● Delegates creation of InputSplits
      TaskTracker
      ● Requests work from the JobTracker
        – Fetches the code to execute from the DFS
        – Applies job-specific configuration
      ● Communicates with the JobTracker on tasks:
        – sending output, killing tasks, task updates, etc.
    • MapReduce Client
    • MapReduce Application Programming Interface (API)
      ● Enables programmers to write MapReduce jobs
        – More info on MR jobs: http://www.slideshare.net/evertlammerts/infodoc-6107350
      ● Enables programmers to communicate with a JobTracker
        – Submitting jobs, getting statuses, cancelling jobs, etc.
      ● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
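      A driver that submits the CategoryCount job sketched above could look like this; the input and output paths are hypothetical, while Job, FileInputFormat, and FileOutputFormat are the actual 0.20 classes:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

          public class CategoryCountDriver {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  Job job = new Job(conf, "category-count");
                  job.setJarByClass(CategoryCountDriver.class);

                  job.setMapperClass(CategoryCount.CategoryMapper.class);
                  job.setCombinerClass(CategoryCount.SumReducer.class);
                  job.setReducerClass(CategoryCount.SumReducer.class);
                  job.setOutputKeyClass(Text.class);
                  job.setOutputValueClass(IntWritable.class);

                  FileInputFormat.addInputPath(job, new Path("/user/alice/wikipedia"));
                  FileOutputFormat.setOutputPath(job, new Path("/user/alice/category-counts"));

                  // Submits the job to the JobTracker and polls until it finishes
                  System.exit(job.waitForCompletion(true) ? 0 : 1);
              }
          }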
    • Case Study: Virtual Knowledge Studio
      1) Load the file into HDFS
      2) Submit the code to MapReduce
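      In terms of commands, the whole workflow collapses to something like this (the jar name and paths are hypothetical):

          $ hadoop fs -put wikipedia-dump.txt /user/alice/wikipedia
          $ hadoop jar category-count.jar CategoryCountDriver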
    • What's more on Hadoop? Lots!
      ● Apache Pig – http://pig.apache.org
        – Analyze datasets in a high-level language, “Pig Latin”
        – Simple! SQL-like. Extremely fast experiments.
        – N-stage jobs (MR chaining!)
      ● Apache Hive – http://hive.apache.org
        – Data warehousing
        – Hive QL
      ● Apache HBase – http://hbase.apache.org
        – BigTable implementation (Google)
        – In-memory operation
        – Performance good enough for websites (Facebook built its Messaging Platform on top of it)
      ● Yahoo! Oozie – http://yahoo.github.com/oozie/
        – A Hadoop workflow engine
      ● Apache [Avro | Chukwa | Hama | Mahout] and so on
      ● Third party:
        – ElephantBird
        – Cloudera's Distribution for Hadoop
        – Hue
        – Yahoo's Distribution for Hadoop
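      To give a flavor of Pig Latin (a sketch, not taken from the deck; the field and path names are made up), the category count above shrinks to a few lines:

          raw     = LOAD '/user/alice/categories' AS (category:chararray);
          grouped = GROUP raw BY category;
          counts  = FOREACH grouped GENERATE group, COUNT(raw);
          STORE counts INTO '/user/alice/category-counts-pig';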
    • Hadoop @ SARA
      A prototype cluster:
      ● Since December 2010
      ● 20 cores for MR (TaskTrackers)
      ● 110 TB gross for HDFS (DataNodes), 55 TB net
      ● Hue web interface for job submission & management
      ● SFTP interface to HDFS
      ● Pig 0.8
      ● Hive
      ● Available for scientists / scientific programmers until May / June 2011
      Towards a production infrastructure? Depending on the results.
      It's open for you all as well: ask me for an account!
    • Hadoop for:
      ● Large-scale data storage and processing
      ● Fundamental difference: data locality!
      ● Small files? Don't, but... Hadoop Archives (HAR)
      ● Archival? Don't. Use tape storage. (We have lots!)
      ● Very fast analytics (Pig!)
      ● Data-parallel applications (not good at cross products – use Huygens or Lisa!)
      ● Legacy applications are possible through piping / streaming (Weird dependencies? Use the Cloud!)
      We'll do another hackathon on Hadoop. Interested? Send me a mail!