Hadoop @ SARA & BiG Grid
Presentation Transcript

  • Large-scale data processing at SARA and BiG Grid with Apache Hadoop. Evert Lammerts, April 10, 2012, SZTAKI
  • First off... About me: Consultant for SARA's eScience & Cloud Services; technical lead for LifeWatch Netherlands; lead of the Hadoop infrastructure. About you: Who uses large-scale computing as a supporting tool? For whom is large-scale computing core business?
  • In this talk: Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  • Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  • Three observations, I: Data is easier to collect
  • (Jimmy Lin, University of Maryland / Twitter, 2011)
  • More business is done online. Mobile devices are more sophisticated. Governments collect more data. Sensing devices are becoming a commodity. Technology has advanced: DNA sequencers! Enormous funding for research infrastructures. And so on... Lesson: everybody collects data. (Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016)
  • Three observations, II: Data is easier to store
  • Storage price decreases http://www.mkomo.com/cost-per-gigabyte
  • Storage capacity increases http://en.wikipedia.org/wiki/File:Hard_drive_capacity_over_time.svg
  • Three observations, III: Quantity beats quality
  • (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  • s/knowledge/data/g Jimmy Lin, University of Maryland / Twitter, 2011
  • How are these observations addressed? We collect data, we store data, we have the knowledge to interpret data. What tools do we have that bring these together? Pioneers: HPC centers, universities, and in recent years, Internet companies. (Lots of knowledge exchange, by the way.)
  • Some background (bear with me...) 1/3: Amdahl's Law
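    For reference (the formula itself does not appear in the transcript): Amdahl's Law bounds the speedup S of a program on N processors when only a fraction p of the work can be parallelized:

        S(N) = 1 / ((1 - p) + p/N)

    As N grows, S(N) approaches 1/(1 - p), which is why hand-rolled parallelism over mostly serial code pays off so poorly.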
  • Some background (bear with me...) 2/3 (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
  • Some background (bear with me...) 3/3: Nodes (x2000): 8 GB DRAM, 4 x 1 TB disks. Rack: 40 nodes, 1 Gbps switch. Datacenter: 8 Gbps rack-to-cluster switch connection. (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
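    Multiplying out the figures above: 2000 nodes x 8 GB is roughly 16 TB of aggregate DRAM, and 2000 nodes x 4 x 1 TB is 8 PB of raw disk in a single datacenter.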
  • (NYT, 14/06/2006)
  • Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  • SARA, the national center for scientific computing: Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
  • Large-scale data != new
  • Compute @ SARA
  • Case Study: Virtual Knowledge Studio. How do categories in Wikipedia evolve over time? (And how do they relate to internal links?) 2.7 TB raw text, single file. Java application, searches for categories in Wiki markup, like [[Category:NAME]]. Executed on the Grid. http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
  • Case Study: Virtual Knowledge Studio. Method: Take an article, including history, as input; extract categories and links for each revision; output all links for each category, per revision; aggregate all links for each category, per revision; generate a graph linking all categories on links, per revision.
  • Case Study: Virtual Knowledge Studio. 1.1) Copy file from local machine to Grid storage. 2.1) Stream file from Grid Storage to single machine. 2.2) Cut into pieces of 10 GB. 2.3) Stream back to Grid Storage. 3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back.
  • Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  • A bit of history: 2002 Nutch*, 2004 MR/GFS**, 2006 Hadoop. * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html and http://labs.google.com/papers/gfs.html
  • 2010 - 2012: A Hype in Production. http://wiki.apache.org/hadoop/PoweredBy
  • What's different about Hadoop? No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism: separating the what from the how. (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
  • Core principles: Scale out, not up. Move processing to the data. Process data sequentially, avoid random reads. Seamless scalability. (Jimmy Lin, University of Maryland / Twitter, 2011)
  • A typical data-parallel problem in abstraction: 1) Iterate over a large number of records; 2) extract something of interest; 3) create an ordering in intermediate results; 4) aggregate intermediate results; 5) generate output. MapReduce: functional abstraction of step 2 & step 4. (Jimmy Lin, University of Maryland / Twitter, 2011)
  • MapReduce: the programmer specifies two functions, map(k, v) → <k, v>* and reduce(k, v) → <k, v>*. All values associated with a single key are sent to the same reducer. The framework handles the rest.
  • The rest? Scheduling, data distribution, ordering, synchronization, error handling...
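    To make the abstraction concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It illustrates the map(k, v) → <k, v>* and reduce signatures above; it is not code from the talk, the class names TokenizerMapper and SumReducer are made up here, and the job-submission wiring is omitted:

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class WordCount {
          // map(k, v) -> <k, v>*: for every word in an input line, emit <word, 1>
          public static class TokenizerMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
              for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                  word.set(token);
                  context.write(word, ONE);  // intermediate pair: <word, 1>
                }
              }
            }
          }

          // reduce(k, v*) -> <k, v>: all counts for one word arrive at one reducer
          public static class SumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable c : counts) sum += c.get();
              context.write(word, new IntWritable(sum));  // final pair: <word, total>
            }
          }
        }

    Everything on the slide above (scheduling, data distribution, ordering, synchronization, error handling) is done by the framework once such a job is submitted.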
  • Case Study: Virtual Knowledge Studio. This is how it would be done with Hadoop: 1) Load file into HDFS. 2) Submit code to MR. Automatic distribution of data, parallelism based on data, automatic ordering of intermediate results.
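    A hypothetical sketch of the map step for this case study: scan the wiki markup for the [[Category:NAME]] tags named on the earlier case-study slide and emit one <category, revision> pair per hit. This is an illustration, not the Virtual Knowledge Studio's actual code; a real job would parse revision ids from the dump format rather than use byte offsets:

        import java.io.IOException;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class CategoryMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
          // Matches [[Category:NAME]] in wiki markup; capture stops at ']' or '|'
          private static final Pattern CATEGORY =
              Pattern.compile("\\[\\[Category:([^\\]|]+)");
          private final Text category = new Text();

          @Override
          protected void map(LongWritable offset, Text markup, Context context)
              throws IOException, InterruptedException {
            Matcher m = CATEGORY.matcher(markup.toString());
            while (m.find()) {
              category.set(m.group(1).trim());
              // Byte offset stands in for a revision id (assumption; see lead-in)
              context.write(category, offset);
            }
          }
        }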
  • The ecosystem. The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012
  • Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  • Timeline: 2009: piloting Hadoop on Cloud. 2010: test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16 GB RAM; just me!). 2011: funding granted for production service. 2012: production cluster available (~March): 72 machines * 8 cores / 8 TB storage / 64 GB RAM, integration with Kerberos for secure multi-tenancy.
  • Architecture
  • Components: Hadoop, Hive, Pig, HBase, HCatalog - others?
  • What are scientists doing? Information Retrieval, Natural Language Processing, Machine Learning, Econometrics, Bioinformatics, Computational Ecology / Ecoinformatics
  • Machine learning: Infrawatch, Hollandse Brug
  • Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data. (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
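    Working out the product on the slide: 145 x 100 x 60 x 60 x 24 x 365 ≈ 4.6 x 10^11 sensor readings per year, from a single bridge.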
  • And others: NLP & IR. E.g. ClueWeb: a ~13.4 TB web crawl; Twitter gardenhose data; Wikipedia dumps; del.icio.us & flickr tags. Finding named entities: [person company place] names. Creating inverted indexes. Piloting real-time search. Personalization. Semantic web.
  • Interest from industry: We're opening up shop.
  • Experiences: Data Science: DevOps, programming algorithms, domain knowledge
  • Experience: How we embrace Hadoop. Parallelism has never been easy... so we teach! December 2010: hackathon (~50 participants - full). April 2011: workshop for bioinformaticians. November 2011: 2-day PhD course (~60 participants - full). June 2012: 1-day PhD course. The data scientist is still in school... so we fill the gap! DevOps maintain the system, fix bugs, develop new functionality. Technical consultants learn how to efficiently implement algorithms.
  • http://www.nlhug.org/
  • Final thoughts: Hadoop is the first to provide commodity computing. Hadoop is not the only one. Hadoop is probably not the best. Hadoop has momentum. What degree of diversification of infrastructure should we embrace? MapReduce fits surprisingly well as a programming model for data-parallelism. Where is the data scientist? Teach. A lot. And work together.