Large-Scale Data Storage and Processing for Scientists with Hadoop

Large-Scale Data Storage and Processing
for Scientists in The Netherlands

Evert.Lammerts@SARA.nl
January 21, 2011
NBIC BioAssist Programmers' Day

Super computing

Cloud computing Grid computing

Cluster computing GPU computing

BioAssist Programmers' Day, January 21, 2011

Status Quo:
Storage separate from Compute


Case Study:
Virtual Knowledge Studio
http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment

● How do categories in WikiPedia
evolve over time? (And how do
they relate to internal links?)

● 2.7 TB raw text, single file

● Java application, searches for
categories in Wiki markup,
like [[Category:NAME]]

● Executed on the Grid


1.1) Copy file from local 2.1) Stream file from Grid 3.1) Process all files in
Machine to Grid storage Storage to single machine parallel: N machines
2.2) Cut into pieces of 10 GB run the Java application,
2.3) Stream back to Grid fetch a 10GB file as
Storage input, processing it, and
putting the result back

Status Quo:
Arrange your own (Data-)parallelism
● Cut the dataset up in “processable chunks”:
● Size of chunk depending on local space on processing node...
● … on the total processing capacity available …
● … on the smallest unit of work (“largest grade of parallelism”)...
● … on the substance (sometimes you don't want many output files, e.g.
when building a search index).
● Submit the amount of jobs you consider necessary:
● To a cluster close to your data (270x10GB over WAN is a bad idea)
● Amount might depend on cluster capacity, amount of chunks, smallest
unit of work, substance...

When dealing with large data, let's say 100GB+, this is
ERROR PRONE, TIME CONSUMING AND NOT FOR NEWBIES!


2002 2004 2006

Nutch* MR/GFS** Hadoop

* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/gfs.html


2010/2011: A Hype in Production

http://wiki.apache.org/hadoop/PoweredBy

Hadoop Distributed File System (HDFS)
● Very large DFS. Order of magnitude:
– 10k nodes
– millions of files
– PetaBytes of storage
● Assumes “commodity hardware”:
– redundancy through replication
– failure handling and recovery
● Optimized for batch processing:
– locations of data exposed to computation
– high aggregate bandwidth
http://www.slideshare.net/jhammerb/hdfs-architecture

HDFS Continued...
● Single Namespace for the entire cluster
● Data coherency
– Write-once-read-many model
– Only appending is supported for existing files
● Files are broken up in chunks (“blocks”)
– Blocksizes ranging from 64 to 256 MB, depending on configuration
– Blocks are distributed over nodes (a single FILE, existing of N
blocks, is stored on M nodes)
– Blocks are replicated and replications are distributed
● Client accesses the blocks of a file at the nodes directly
– This creates high aggregate bandwidth!
http://www.slideshare.net/jhammerb/hdfs-architecture

HDFS NameNode & DataNodes

NameNode DataNode
● Manages File System Namespace ● A “Block Server”
– Mapping filename to blocks – Stores data in local FS
– Mapping blocks to DataNode – Stores metadata of a block (e.g. hash)
● Cluster Configuration – Serve (meta)data to clients
● Replication Management ● Facilitates pipeline to other DN's

http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html

Metadata operations
● Communicate with NN only
– ls (see above), lsr, df, du, chmod,
chown... etc.

R/W (block) operations
● Communicate with NN and DN's
– put, copyFromLocal, CopyToLocal,
tail... etc.


HDFS
Application Programming Interface (API)
● Enables programmers to access any HDFS from their code
● Described at
http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
● Written in (and thus available for) Java, but...
● Is also exposed through Apache Thrift, so can be accessed
from:
● C++, Python, PHP, Ruby, and others
● See http://wiki.apache.org/hadoop/HDFS-APIs
● Has a separate C-API (libhdfs)

So: you can enable your services to work with HDFS

MapReduce Continued...
● Is great for processing large datasets!
– Send computation to data, so little data over lines
– Uses blocks stored in the DFS, so no splitting required
(this is a bit more sophisticated depending on your input)
● Handles parallelism for you
– One map per block, if possible
● Scales basically linearly
– time_on_cluster = time_on_single_core / total_cores
● Java, but streaming possible (plus others, see later)

MapReduce
JobTracker & TaskTrackers

JobTracker TaskTracker
● Holds job metadata ● Requests work from JT
– Status of job – Fetch the code to execute from the DFS
– Status of Tasks running on TTs – Apply job-specific configuration
● Decides on scheduling ● Communicate with JT on tasks:
● Delegates creation of 'InputSplits' – Sending output, Killing tasks, Task updates, etc


MapReduce client


MapReduce
Application Programming Interface (API)
● Enables programmers to write MapReduce jobs
● More info on MR jobs:
http://www.slideshare.net/evertlammerts/infodoc-6107350
● Enables programmers to communicate with a
JobTracker
● Submitting jobs, getting statuses, cancelling jobs,
etc
● Described at
http://hadoop.apache.org/common/docs/r0.20.0/api/index.html


Case Study:
Virtual Knowledge Studio

1) Load file into 2) Submit code
HDFS to MR

What's more on Hadoop?

Lots!
● Apache Pig http://pig.apache.org
– Analyze datasets in a high level language, “Pig Latin”
– Simple! SQL like. Extremely fast experiments.
– N-stage jobs (MR chaining!)
● Apache Hive http://hive.apache.org
– Data Warehousing
– Hive QL
● Apache Hbase http://hbase.apache.org
– BigTable implementation (Google)
– In-memory operation
– Performance good enough for websites (Facebook built its Messaging Platform on top of it)
● Yahoo! Oozie http://yahoo.github.com/oozie/
– Hadoop workflow engine
● Apache [AVRO | Chukwa | Hama | Mahout] and so on
● 3rd Party:
– ElephantBird
– Cloudera's Distribution for Hadoop
– Hue
– Yahoo's Distribution for Hadoop


Hadoop @ SARA

A prototype cluster
● Since December 2010
● 20 cores for MR (TT's)
● 110 TB gross for HDFS (DN) (55TB net)
● Hue web-interface for job submission & management
● SFTP interface to HDFS
● Pig 0.8
● Hive
● Available for scientists / scientific programmers until May / June 2011

Towards a production infrastructure?
● Depending on results

It's open for you all as well: ask me for an account!

Hadoop for:
● Large-scale data storage and processing
● Fundamental difference: data locality!
● Small files? Don't, but... Hadoop Archives (HAR)
● Archival? Don't. Use tape storage. (We have lots!)
● Very fast analytics (Pig!)
● For data-parallel applications (not good at crossproducts – use
Huygens or Lisa!)
● Legacy applications possible through piping / streaming (Weird
dependencies? Use Cloud!)

We'll do another Hackathon on Hadoop. Interested? Send me a mail!

Large-Scale Data Storage and Processing for Scientists with Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Large-Scale Data Storage and Processing for Scientists with Hadoop

Similar to Large-Scale Data Storage and Processing for Scientists with Hadoop (20)

Large-Scale Data Storage and Processing for Scientists with Hadoop