Hadoop @ SARA & BiG Grid




  1. Large-scale data processing at SARA and BiG Grid with Apache Hadoop. Evert Lammerts, April 10, 2012, SZTAKI
  2. First off... About me: consultant for SARA's eScience & Cloud Services; technical lead for LifeWatch Netherlands; lead of the Hadoop infrastructure. About you: who uses large-scale computing as a supporting tool? For whom is large-scale computing core business?
  3. In this talk: Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  4. Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  5. Three observations, I: Data is easier to collect
  6. (Jimmy Lin, University of Maryland / Twitter, 2011)
  7. More business is done online; mobile devices are more sophisticated; governments collect more data; sensing devices are becoming a commodity; technology has advanced: DNA sequencers! Enormous funding for research infrastructures. And so on... Lesson: everybody collects data. (Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016)
  8. Three observations, II: Data is easier to store
  9. Storage price decreases. http://www.mkomo.com/cost-per-gigabyte
  10. Storage capacity increases. http://en.wikipedia.org/wiki/File:Hard_drive_capacity_over_time.svg
  11. Three observations, III: Quantity beats quality
  12. (IEEE Intelligent Systems, vol. 24, issue 2, March/April 2009, pp. 8–12)
  13. s/knowledge/data/g (Jimmy Lin, University of Maryland / Twitter, 2011)
  14. How are these observations addressed? We collect data, we store data, we have the knowledge to interpret data. What tools do we have that bring these together? Pioneers: HPC centers, universities, and in recent years, Internet companies. (Lots of knowledge exchange, by the way.)
  15. Some background (bear with me...) 1/3: Amdahl's Law
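The slide names Amdahl's Law without stating it; as a reminder, the speedup on n processors with parallelizable fraction p is 1 / ((1 - p) + p / n). A minimal sketch (the numbers are illustrative, not from the deck):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 95%-parallel job tops out below 20x, no matter how many nodes:
print(amdahl_speedup(0.95, 2000))   # ~19.8
print(amdahl_speedup(0.95, 10**6))  # still under 20
```

This is why the serial fraction, not the node count, dominates at datacenter scale.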
  16. Some background (bear with me...) 2/3 (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
  17. Some background (bear with me...) 3/3. Nodes (x2000): 8 GB DRAM, 4 x 1 TB disks. Rack: 40 nodes, 1 Gbps switch. Datacenter: 8 Gbps rack-to-cluster switch connection. (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
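To put the example datacenter in perspective, the per-node figures above multiply out to cluster totals (simple arithmetic on the slide's numbers, using decimal units as an assumption):

```python
nodes = 2000        # node count from the slide
dram_gb = 8         # DRAM per node, in GB
disks_per_node = 4  # 1 TB disks per node

total_dram_tb = nodes * dram_gb / 1000          # aggregate DRAM in TB
total_disk_pb = nodes * disks_per_node / 1000   # aggregate raw disk in PB
print(total_dram_tb, total_disk_pb)  # 16.0 TB of DRAM, 8.0 PB of disk
```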
  18. (The New York Times, June 14, 2006)
  19. Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  20. SARA, the national center for scientific computing. Facilitating science in the Netherlands with equipment for, and expertise on, large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization.
  21. Large-scale data != new
  22. Compute @ SARA
  23. Case study: Virtual Knowledge Studio. How do categories in Wikipedia evolve over time? (And how do they relate to internal links?) 2.7 TB of raw text in a single file; a Java application searches for categories in wiki markup, like [[Category:NAME]]; executed on the Grid. http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
  24. Case study: Virtual Knowledge Studio. Method: take an article, including history, as input; extract categories and links for each revision; output all links for each category, per revision; aggregate all links for each category, per revision; generate a graph linking all categories on links, per revision.
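The extraction step above boils down to pattern matching on wiki markup. A toy sketch of that step (the regexes and helper are mine for illustration, not the project's actual Java code):

```python
import re

# Category tags look like [[Category:NAME]]; internal links like [[Target]].
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")
LINK_RE = re.compile(r"\[\[(?!Category:)([^\]|#]+)")

def extract(revision_text):
    """Return (categories, links) found in one revision's wiki markup."""
    categories = CATEGORY_RE.findall(revision_text)
    links = LINK_RE.findall(revision_text)
    return categories, links

text = "See [[Amsterdam]] and [[Category:Cities in the Netherlands]]"
print(extract(text))  # (['Cities in the Netherlands'], ['Amsterdam'])
```

Running this per revision yields exactly the (category, links) pairs the later aggregation steps need.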
  25. Case study: Virtual Knowledge Studio. 1.1) Copy the file from the local machine to Grid storage. 2.1) Stream the file from Grid storage to a single machine; 2.2) cut it into pieces of 10 GB; 2.3) stream them back to Grid storage. 3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back.
  26. Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  27. A bit of history: 2002, Nutch*; 2004, MapReduce/GFS**; 2006, Hadoop. * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html and http://labs.google.com/papers/gfs.html
  28. 2010–2012: a hype in production. http://wiki.apache.org/hadoop/PoweredBy
  29. What's different about Hadoop? No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism: separating the what from the how. (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
  30. Core principles: scale out, not up; move processing to the data; process data sequentially, avoid random reads; seamless scalability. (Jimmy Lin, University of Maryland / Twitter, 2011)
  31. A typical data-parallel problem, in abstraction: (1) iterate over a large number of records; (2) extract something of interest; (3) create an ordering in intermediate results; (4) aggregate intermediate results; (5) generate output. MapReduce is a functional abstraction of step 2 & step 4. (Jimmy Lin, University of Maryland / Twitter, 2011)
  32. MapReduce. The programmer specifies two functions: map(k, v) → <k, v>* and reduce(k, v) → <k, v>*. All values associated with a single key are sent to the same reducer; the framework handles the rest.
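The contract above can be sketched in a few lines. This toy driver simulates the framework's shuffle step in-process, with word count as the classic example; it mimics the programming model only, not Hadoop's actual API:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document id, value: document text; emit (word, 1) pairs
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # all values for one key arrive at the same reducer
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # The "rest" the framework handles: group intermediate values by key
    # (shuffle/sort), then hand each group to a reducer.
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    out = {}
    for ik in sorted(groups):
        for ok, ov in reduce_fn(ik, groups[ik]):
            out[ok] = ov
    return out

docs = [("d1", "big data big grid"), ("d2", "big hadoop")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'big': 3, 'data': 1, 'grid': 1, 'hadoop': 1}
```

Note that the programmer wrote only map_fn and reduce_fn; everything in run_mapreduce is the framework's job.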
  33. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  34. Case study: Virtual Knowledge Studio. This is how it would be done with Hadoop: 1) load the file into HDFS; 2) submit the code to MapReduce. Automatic distribution of data, parallelism based on data, automatic ordering of intermediate results.
  35. The ecosystem. (The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012)
  36. Large-scale data processing? / Large-scale @ SARA & BiG Grid / An introduction to Hadoop & MapReduce / Hadoop @ SARA & BiG Grid
  37. Timeline. 2009: piloting Hadoop on cloud. 2010: test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16 GB RAM; just me!). 2011: funding granted for a production service. 2012: production cluster available (~March): 72 machines * 8 cores / 8 TB storage / 64 GB RAM, with Kerberos integration for secure multi-tenancy.
  38. Architecture
  39. Components: Hadoop, Hive, Pig, HBase, HCatalog - others?
  40. What are scientists doing? Information retrieval; natural language processing; machine learning; econometrics; bioinformatics; computational ecology / ecoinformatics.
  41. Machine learning: Infrawatch, Hollandse Brug
  42. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data. (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
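Multiplying the slide's factors out shows what "large data" means here: 145 sensors sampling at 100 Hz for a year produce roughly 4.6 * 10^11 measurements:

```python
sensors, hz = 145, 100
seconds, minutes, hours, days = 60, 60, 24, 365

samples_per_year = sensors * hz * seconds * minutes * hours * days
print(samples_per_year)  # 457272000000, i.e. ~4.6e11 measurements per year
```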
  43. And others: NLP & IR. E.g. ClueWeb (a ~13.4 TB web crawl), Twitter gardenhose data, Wikipedia dumps, del.icio.us & Flickr tags. Finding named entities: [person, company, place] names; creating inverted indexes; piloting real-time search; personalization; semantic web.
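One of the IR tasks mentioned, building an inverted index, is small enough to sketch here: map each token to the set of documents containing it (toy data and naive tokenization, mine, not from the deck):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of document id -> text. Returns token -> {doc ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {"d1": "grid computing at SARA", "d2": "Hadoop at SARA"}
index = build_inverted_index(docs)
print(sorted(index["sara"]))  # ['d1', 'd2']
```

At web-crawl scale this becomes the textbook MapReduce job: map emits (token, doc_id) pairs, reduce collects the posting list per token.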
  44. Interest from industry: we're opening up shop.
  45. Experiences: data science sits at the intersection of DevOps, programming algorithms, and domain knowledge.
  46. Experience: how we embrace Hadoop. Parallelism has never been easy... so we teach! December 2010: hackathon (~50 participants, full). April 2011: workshop for bioinformaticians. November 2011: 2-day PhD course (~60 participants, full). June 2012: 1-day PhD course. The data scientist is still in school... so we fill the gap! DevOps maintain the system, fix bugs, and develop new functionality; technical consultants learn how to efficiently implement algorithms.
  47. http://www.nlhug.org/
  48. Final thoughts. Hadoop is the first to provide commodity computing; Hadoop is not the only one; Hadoop is probably not the best; Hadoop has momentum. What degree of diversification of infrastructure should we embrace? MapReduce fits surprisingly well as a programming model for data parallelism. Where is the data scientist? Teach. A lot. And work together.