Hadoop @ Sara & BiG Grid

Large-scale data processing
at SARA and BiG Grid
with Apache Hadoop

Evert Lammerts
April 10, 2012, SZTAKI

First off...

About me
Consultant for SARA's eScience & Cloud Services
Technical lead for LifeWatch Netherlands
Lead Hadoop infrastructure

About you
Who uses large-scale computing as a supporting tool?
For whom is large-scale computing core business?

In this talk
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid


Three observations

I: Data is easier to collect

(Jimmy Lin, University of Maryland / Twitter, 2011)

More business is done on-line
Mobile devices are more sophisticated
Governments collect more data
Sensing devices are becoming a commodity
Technology has advanced: DNA sequencers!
Enormous funding for research infrastructures
And so on...

Lesson: everybody collects data

Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016

Three observations

II: Data is easier to store

Storage price decreases

http://www.mkomo.com/cost-per-gigabyte

Storage capacity increases

http://en.wikipedia.org/wiki/File:Hard_drive_capacity_over_time.svg

Three observations

III: Quantity beats quality

(IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)

s/knowledge/data/g

Jimmy Lin, University of Maryland / Twitter, 2011

How are these observations addressed?

We collect data, we store data, we have the
knowledge to interpret data. What tools do we
have that bring these together?

Pioneers: HPC centers, universities, and in recent
years, Internet companies. (Lots of knowledge
exchange, by the way.)

Some background (bear with me...) 1/3

Amdahl's Law


(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
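As a reminder of what the law states (the standard formulation, not taken from the slide itself): if a fraction p of a job can be spread over n workers and the rest stays serial, the speedup is at most 1 / ((1 - p) + p / n). A minimal sketch in Python; the 95% parallel fraction and the worker counts are illustrative values only.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the work
    is spread over `workers` and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative values only: 95% parallel work on a growing number of workers.
    for n in (1, 10, 100, 2000):
        print(f"{n:>5} workers -> speedup {amdahl_speedup(0.95, n):.2f}")
```

However many workers are added, the serial part caps the speedup at 1 / (1 - p).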


Nodes (x2000):
8GB DRAM
4 x 1TB disks

Rack:
40 nodes
1Gbps switch

Datacenter:
8Gbps rack-to-cluster
switch connection

(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
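The figures above imply some aggregate numbers worth working out. A rough sketch in Python, using decimal units and assuming each node has a 1 Gbps link into its rack switch (the slide only says "1Gbps switch", so the per-node link speed is an assumption):

```python
# Figures from the slide; everything else is derived arithmetic.
NODES = 2000
DRAM_PER_NODE_GB = 8
DISKS_PER_NODE = 4
DISK_SIZE_TB = 1
NODES_PER_RACK = 40
RACK_UPLINK_GBPS = 8
NODE_LINK_GBPS = 1  # assumption: one 1 Gbps link per node into the rack switch


def cluster_totals() -> dict:
    """Aggregate capacity and rack oversubscription for the example datacenter."""
    racks = NODES // NODES_PER_RACK
    total_dram_tb = NODES * DRAM_PER_NODE_GB / 1000               # GB -> TB
    total_disk_pb = NODES * DISKS_PER_NODE * DISK_SIZE_TB / 1000  # TB -> PB
    # Bandwidth the nodes in one rack could demand versus what the uplink offers.
    oversubscription = NODES_PER_RACK * NODE_LINK_GBPS / RACK_UPLINK_GBPS
    return {
        "racks": racks,
        "total_dram_tb": total_dram_tb,
        "total_disk_pb": total_disk_pb,
        "oversubscription": oversubscription,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions a rack's nodes can ask for five times more bandwidth than its uplink provides, which is one reason to keep computation close to the data.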

SARA
the national center for scientific computing

Facilitating Science in The Netherlands with Equipment for
and Expertise on Large-Scale Computing, Large-Scale
Data Storage, High-Performance Networking,
eScience, and Visualization

Case Study: Virtual Knowledge Studio

How do categories in Wikipedia
evolve over time? (And how do
they relate to internal links?)

2.7 TB raw text, single file

Java application, searches for
categories in Wiki markup,
like [[Category:NAME]]

Executed on the Grid

http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment

Method
Take an article, including history, as input
Extract categories and links for each revision
Output all links for each category, per revision
Aggregate all links for each category, per revision
Generate graph linking all categories on links, per revision
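The first four steps of the method can be sketched as follows. This is an illustration in Python, not the original Java application: the regular expressions, the function name links_per_category and the sample revisions are all assumptions, and the graph generation step is left out.

```python
import re
from collections import defaultdict

# Assumed patterns: [[Category:NAME]] or [[Category:NAME|sort key]] for categories,
# and [[Target]] or [[Target|label]] (no namespace colon) for internal links.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")
LINK_RE = re.compile(r"\[\[([^\]|:]+)(?:\|[^\]]*)?\]\]")


def links_per_category(revisions):
    """Map (revision_id, category) to the set of internal links in that revision.

    `revisions` is an iterable of (revision_id, wikitext) pairs.
    """
    result = defaultdict(set)
    for revision_id, wikitext in revisions:
        categories = [name.strip() for name in CATEGORY_RE.findall(wikitext)]
        links = {target.strip() for target in LINK_RE.findall(wikitext)}
        for category in categories:
            result[(revision_id, category)] |= links
    return dict(result)


if __name__ == "__main__":
    # Made-up revisions of a single article, for illustration only.
    history = [
        (1, "[[Amsterdam]] is a city. [[Category:Cities]]"),
        (2, "[[Amsterdam]] and [[Rotterdam|R'dam]]. "
            "[[Category:Cities]] [[Category:Netherlands|NL]]"),
    ]
    for key, links in sorted(links_per_category(history).items()):
        print(key, sorted(links))
```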


1.1) Copy file from local machine to Grid storage

2.1) Stream file from Grid storage to single machine
2.2) Cut into pieces of 10 GB
2.3) Stream back to Grid storage

3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back

A bit of history

2002: Nutch*
2004: MR/GFS**
2006: Hadoop

* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/gfs.html

2010 - 2012: A Hype in Production

http://wiki.apache.org/hadoop/PoweredBy

What's different about Hadoop?

No more do-it-yourself parallelism – it's hard!
But rather linearly scalable data parallelism

Separating the what from the how

(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)

Core principles
Scale out, not up
Move processing to the data
Process data sequentially, avoid random reads
Seamless scalability


A typical data-parallel problem in abstraction
1. Iterate over a large number of records
2. Extract something of interest
3. Create an ordering in intermediate results
4. Aggregate intermediate results
5. Generate output

MapReduce: functional abstraction of step 2 & step 4


MapReduce
Programmer specifies two functions
map(k, v) → <k', v'>*
reduce(k', <v'>*) → <k', v'>*
All values associated with a single key are sent to the same reducer

The framework handles the rest

The rest?

Scheduling, data distribution, ordering,
synchronization, error handling...
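To make the division of labour concrete, here is a toy, single-process sketch in Python. The programmer writes only map_fn and reduce_fn (word count is used as the example); run_mapreduce stands in for "the rest" by grouping and ordering intermediate results. It is not Hadoop's API and does no distribution, scheduling or error handling; all names are made up for illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(k, v) -> <k', v'>*: emit (word, 1) for every word in a line."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """reduce(k', <v'>*) -> <k', v'>*: sum all counts for one word."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, sort, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # all values for a key end up together
    output = []
    for out_key in sorted(groups):             # ordering of intermediate results
        output.extend(reducer(out_key, groups[out_key]))
    return output


if __name__ == "__main__":
    # Input records are (line number, line text) pairs.
    lines = [(0, "to be or not to be"), (1, "to do is to be")]
    for word, count in run_mapreduce(lines, map_fn, reduce_fn):
        print(word, count)
```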

This is how it would be done with Hadoop

1) Load file into HDFS
2) Submit code to MR

Automatic distribution of data,
Parallelism based on data,
Automatic ordering of intermediate results

The ecosystem

The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012

Timeline
2009: Piloting Hadoop on Cloud
2010: Test cluster available for scientists
6 machines * 4 cores / 24 TB storage / 16GB RAM
Just me!
2011: Funding granted for production service
2012: Production cluster available (~March)
72 machines * 8 cores / 8 TB storage / 64GB RAM
Integration with Kerberos for secure multi-tenancy

Components

Hadoop, Hive, Pig, HBase, HCatalog - others?

What are scientists doing?
Information Retrieval
Natural Language Processing
Machine Learning
Econometrics
Bioinformatics
Computational Ecology / Ecoinformatics

Machine learning: Infrawatch, Hollandse Brug

Structural health monitoring

145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data

(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
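Working out the product gives a feel for "large". A short sketch in Python; the sensor count and sampling rate come from the slide, while the 4 bytes per sample used for the size estimate is purely an assumption.

```python
SENSORS = 145
SAMPLE_RATE_HZ = 100
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
BYTES_PER_SAMPLE = 4  # assumption for illustration; the slide gives no sample size


def samples_per_year() -> int:
    """Number of sensor readings collected in one year."""
    return SENSORS * SAMPLE_RATE_HZ * SECONDS_PER_YEAR


def raw_size_tb(bytes_per_sample: int = BYTES_PER_SAMPLE) -> float:
    """Rough raw data volume per year in decimal terabytes."""
    return samples_per_year() * bytes_per_sample / 1e12


if __name__ == "__main__":
    print(f"samples per year: {samples_per_year():,}")
    print(f"raw size at {BYTES_PER_SAMPLE} bytes/sample: {raw_size_tb():.2f} TB")
```

That is roughly 457 billion readings a year before any metadata or derived data is stored.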

And others: NLP & IR
e.g. ClueWeb: a ~13.4 TB webcrawl
e.g. Twitter gardenhose data
e.g. Wikipedia dumps
e.g. del.icio.us & flickr tags
Finding named entities: [person company place] names
Creating inverted indexes
Piloting real-time search
Personalization
Semantic web

Interest from industry

We're opening up shop.

Experiences: Data Science
DevOps / Programming algorithms / Domain knowledge

Experience: How we embrace Hadoop
Parallelism has never been easy… so we teach!
December 2010: hackathon (~50 participants - full)
April 2011: Workshop for Bioinformaticians
November 2011: 2-day PhD course (~60 participants - full)
June 2012: 1-day PhD course

The data scientist is still in school... so we fill the gap!
DevOps maintain the system, fix bugs, develop new functionality
Technical consultants learn how to efficiently implement algorithms

Final thoughts
Hadoop is the first to provide commodity computing
Hadoop is not the only one
Hadoop is probably not the best
Hadoop has momentum
What degree of diversification of infrastructure should we embrace?
MapReduce fits surprisingly well as a programming model for data-parallelism
Where is the data scientist?
Teach. A lot. And work together.
