Transcript of "Hadoop @ Sara & BiG Grid"

  1. Large-scale data processing at SARA and BiG Grid with Apache Hadoop. Evert Lammerts. April 10, 2012, SZTAKI
  2. First off... About me: Consultant for SARA's eScience & Cloud Services. Technical lead for LifeWatch Netherlands. Lead, Hadoop infrastructure. About you: Who uses large-scale computing as a supporting tool? For whom is large-scale computing core business?
  3. In this talk: Large-scale data processing? Large-scale @ SARA & BiG Grid. An introduction to Hadoop & MapReduce. Hadoop @ SARA & BiG Grid
  4. Large-scale data processing? Large-scale @ SARA & BiG Grid. An introduction to Hadoop & MapReduce. Hadoop @ SARA & BiG Grid
  5. Three observations, I: Data is easier to collect
  6. (Jimmy Lin, University of Maryland / Twitter, 2011)
  7. More business is done on-line. Mobile devices are more sophisticated. Governments collect more data. Sensing devices are becoming a commodity. Technology advanced: DNA sequencers! Enormous funding for research infrastructures. And so on... Lesson: everybody collects data. (Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016)
  8. Three observations, II: Data is easier to store
  9. Storage price decreases: http://www.mkomo.com/cost-per-gigabyte
  10. Storage capacity increases: http://en.wikipedia.org/wiki/File:Hard_drive_capacity_over_time.svg
  11. Three observations, III: Quantity beats quality
  12. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  13. s/knowledge/data/g (Jimmy Lin, University of Maryland / Twitter, 2011)
  14. How are these observations addressed? We collect data, we store data, and we have the knowledge to interpret data. What tools do we have that bring these together? Pioneers: HPC centers, universities, and in recent years, Internet companies. (Lots of knowledge exchange, by the way.)
  15. Some background (bear with me...) 1/3: Amdahl's Law
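As a quick refresher on the law this slide names: overall speedup is bounded by the serial fraction of the work. A minimal sketch; the parallel fraction p = 0.95 below is an arbitrary illustration, not a number from the talk:

```python
# Amdahl's Law: speedup on n workers when a fraction p of the work
# parallelizes perfectly and the remaining (1 - p) stays serial.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel work, 100 workers give only ~16.8x speedup;
# the ceiling as n grows is 1 / (1 - p) = 20x.
s = amdahl_speedup(0.95, 100)
```

This is why the following slides stress cluster design: adding nodes only helps as long as the serial fraction (coordination, data movement) stays small.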
  16. Some background (bear with me...) 2/3 (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
  17. Some background (bear with me...) 3/3. Nodes (x2000): 8 GB DRAM, 4 x 1 TB disks. Rack: 40 nodes, 1 Gbps switch. Datacenter: 8 Gbps rack-to-cluster switch connection. (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
  18. (NYT, 14/06/2006)
  19. Large-scale data processing? Large-scale @ SARA & BiG Grid. An introduction to Hadoop & MapReduce. Hadoop @ SARA & BiG Grid
  20. SARA, the national center for scientific computing. Facilitating science in The Netherlands with equipment for and expertise on large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization.
  21. Large-scale data != new
  22. Compute @ SARA
  23. Case Study: Virtual Knowledge Studio. How do categories in Wikipedia evolve over time? (And how do they relate to internal links?) 2.7 TB raw text, single file. Java application, searches for categories in wiki markup, like [[Category:NAME]]. Executed on the Grid. http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
  24. Case Study: Virtual Knowledge Studio. Method: Take an article, including history, as input. Extract categories and links for each revision. Output all links for each category, per revision. Aggregate all links for each category, per revision. Generate a graph linking all categories on links, per revision.
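The per-revision extraction step can be sketched with simple pattern matching on the [[Category:NAME]] markup shown two slides earlier. The regexes and the `extract` function below are illustrative, not the VKS team's actual Java code:

```python
import re

# [[Category:NAME]] or [[Category:NAME|sortkey]] -> capture NAME.
CATEGORY = re.compile(r"\[\[Category:([^\]|]+)")
# Internal links [[Target]] / [[Target|label]], excluding category tags.
LINK = re.compile(r"\[\[(?!Category:)([^\]|#]+)")

def extract(revision_text):
    """Return (categories, links) found in one revision's wiki markup."""
    return CATEGORY.findall(revision_text), LINK.findall(revision_text)

cats, links = extract("See [[Amsterdam]] and [[Category:Cities]]")
```

Running this over every revision of every article is exactly the embarrassingly parallel workload the next slide maps onto the Grid.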
  25. Case Study: Virtual Knowledge Studio. 1.1) Copy file from local machine to Grid storage. 2.1) Stream file from Grid storage to a single machine. 2.2) Cut into pieces of 10 GB. 2.3) Stream back to Grid storage. 3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back.
  26. Large-scale data processing? Large-scale @ SARA & BiG Grid. An introduction to Hadoop & MapReduce. Hadoop @ SARA & BiG Grid
  27. A bit of history. 2002: Nutch*. 2004: MR/GFS**. 2006: Hadoop. * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html, http://labs.google.com/papers/gfs.html
  28. 2010 - 2012: A hype in production. http://wiki.apache.org/hadoop/PoweredBy
  29. What's different about Hadoop? No more do-it-yourself parallelism – it's hard! Rather, linearly scalable data parallelism. Separating the what from the how. (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
  30. Core principles: Scale out, not up. Move processing to the data. Process data sequentially, avoid random reads. Seamless scalability. (Jimmy Lin, University of Maryland / Twitter, 2011)
  31. A typical data-parallel problem in abstraction: Iterate over a large number of records. Extract something of interest. Create an ordering in intermediate results. Aggregate intermediate results. Generate output. MapReduce: functional abstraction of step 2 & step 4. (Jimmy Lin, University of Maryland / Twitter, 2011)
  32. MapReduce. The programmer specifies two functions: map(k, v) → <k, v>* and reduce(k, v) → <k, v>*. All values associated with a single key are sent to the same reducer. The framework handles the rest.
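The contract on this slide can be illustrated with the classic word-count example, simulated entirely in memory. `map_fn`, `reduce_fn`, and `run` are hypothetical names for this sketch, not a Hadoop API:

```python
from collections import defaultdict

def map_fn(_, line):
    # map(k, v) -> <k, v>*: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce(k, v*) -> <k, v>*: all counts for one word arrive together.
    yield word, sum(counts)

def run(records):
    """Tiny in-memory stand-in for a MapReduce run over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)           # "shuffle": group values by key
    out = {}
    for k, vs in sorted(groups.items()):  # reducers see keys in sorted order
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

result = run([(0, "to be or not to be")])
```

The grouping-by-key step in the middle is exactly the part the programmer never writes: the framework's shuffle and sort deliver all values for one key to one reducer.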
  33. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  34. Case Study: Virtual Knowledge Studio. This is how it would be done with Hadoop: 1) Load file into HDFS. 2) Submit code to MR. Automatic distribution of data, parallelism based on data, automatic ordering of intermediate results.
  35. The ecosystem. The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012
  36. Large-scale data processing? Large-scale @ SARA & BiG Grid. An introduction to Hadoop & MapReduce. Hadoop @ SARA & BiG Grid
  37. Timeline. 2009: Piloting Hadoop on Cloud. 2010: Test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16 GB RAM; just me!). 2011: Funding granted for production service. 2012: Production cluster available (~March): 72 machines * 8 cores / 8 TB storage / 64 GB RAM. Integration with Kerberos for secure multi-tenancy.
  38. Architecture
  39. Components: Hadoop, Hive, Pig, HBase, HCatalog - others?
  40. What are scientists doing? Information retrieval. Natural language processing. Machine learning. Econometrics. Bioinformatics. Computational ecology / ecoinformatics.
  41. Machine learning: Infrawatch, Hollandse Brug
  42. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data. (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
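The slide's multiplication can be checked back-of-the-envelope. The 4 bytes per sample used for the size estimate below is an assumption for illustration, not a figure from the talk:

```python
# 145 sensors sampling at 100 Hz, around the clock for a year.
sensors, hz = 145, 100
samples_per_year = sensors * hz * 60 * 60 * 24 * 365  # 457,272,000,000

# Assuming ~4 bytes per sample (an illustrative guess), that is
# roughly 1.8 TB of raw measurements per year.
bytes_per_year = samples_per_year * 4
```

Roughly half a trillion samples per year from a single bridge: "large data" indeed for a dataset that never stops growing.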
  43. And others: NLP & IR. E.g. ClueWeb: a ~13.4 TB webcrawl. E.g. Twitter gardenhose data. E.g. Wikipedia dumps. E.g. del.icio.us & flickr tags. Finding named entities: [person company place] names. Creating inverted indexes. Piloting real-time search. Personalization. Semantic web.
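"Creating inverted indexes", one of the IR tasks listed above, means mapping each term to the set of documents that contain it. A minimal in-memory sketch with made-up documents (a real job would express this as a MapReduce over the corpus):

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index({1: "hadoop at SARA", 2: "hadoop and MapReduce"})
```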
  44. Interest from industry: We're opening up shop.
  45. Experiences: Data science = DevOps + programming algorithms + domain knowledge
  46. Experience: How we embrace Hadoop. Parallelism has never been easy… so we teach! December 2010: hackathon (~50 participants - full). April 2011: workshop for bioinformaticians. November 2011: 2-day PhD course (~60 participants – full). June 2012: 1-day PhD course. The data scientist is still in school... so we fill the gap! DevOps maintain the system, fix bugs, develop new functionality. Technical consultants learn how to efficiently implement algorithms.
  47. http://www.nlhug.org/
  48. Final thoughts. Hadoop is the first to provide commodity computing. Hadoop is not the only one. Hadoop is probably not the best. Hadoop has momentum. What degree of diversification of infrastructure should we embrace? MapReduce fits surprisingly well as a programming model for data-parallelism. Where is the data scientist? Teach. A lot. And work together.