First NL-HUG: Large-scale data processing at SARA with Apache Hadoop

An introduction to large-scale data processing, and a short overview of SARA's Hadoop related activities.

Published in: Technology, Education

  1. 1. Large-scale data processing [at SARA] [with Apache Hadoop]. Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group
  2. 2. Who's who?
  3. 3. Who's who? Who has worked at scale (e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes)? >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
  4. 4. In this talk: Why large-scale data processing?
  5. 5. An introduction to scale @ SARA
  6. 6. An introduction to Hadoop & MapReduce
  7. 7. Hadoop @ SARA
  8. 8. Why large-scale data processing? / An introduction to scale @ SARA / An introduction to Hadoop & MapReduce / Hadoop @ SARA
  9. 9. (Jimmy Lin, University of Maryland / Twitter, 2011)
  10. 10. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  11. 11. s/knowledge/data/g*: HTTP logs, click data, query logs, CRM data, financial data, social networks, archives, crawls, and many more. You already have your data. (*Jimmy Lin, University of Maryland / Twitter, 2011)
  12. 12. Data-processing as a commodity: Cheap clusters
  13. 13. Simple programming models
  14. 14. Easy-to-learn scripting
  15. 15. Anybody with the know-how can generate insights!
  16. 16. Note: “the know-how” = Data Science: DevOps, programming, algorithms, domain knowledge
  17. 17. Why large-scale data processing? / An introduction to scale @ SARA / An introduction to Hadoop & MapReduce / Hadoop @ SARA
  18. 18. SARA, the national center for scientific computing. Facilitating science in the Netherlands with equipment for and expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
  19. 19. Large-scale data != new
  20. 20. Different types of computing. Parallelism: Data parallelism
  21. 21. Task parallelism. Architectures: SIMD (Single Instruction, Multiple Data)
  22. 22. MIMD (Multiple Instruction, Multiple Data)
  23. 23. MISD (Multiple Instruction, Single Data)
  24. 24. SISD (Single Instruction, Single Data; Von Neumann)
  25. 25. Parallelism: Amdahl's law
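Amdahl's law bounds the speedup of a job in which only a fraction p of the work can run in parallel: speedup = 1 / ((1 - p) + p / n) for n workers. A minimal sketch follows; the function name and the 95% figure in the demo are illustrative assumptions, not numbers from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of the number of workers;
    # only the parallel part is divided among them.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed example: a job that is 95% parallelizable.
    for n in (1, 8, 72, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

However many workers are added, the speedup stays below 1 / (1 - p), which is why the serial part of a job matters so much at scale.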
  26. 26. Data parallelism
  27. 27. Compute @ SARA
  28. 28. What's different about Hadoop? No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism. Separating the what from the how. (NYT, 14/06/2006)
  29. 29. Why large-scale data processing? / An introduction to scale @ SARA / An introduction to Hadoop & MapReduce / Hadoop @ SARA
  30. 30. A bit of history: Nutch* (2002), MR/GFS** (2004), Hadoop (2006). * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html, http://labs.google.com/papers/gfs.html
  31. 31. 2010 - 2012: A Hype in Production (http://wiki.apache.org/hadoop/PoweredBy)
  32. 32. Core principles: Scale out, not up
  33. 33. Move processing to the data
  34. 34. Process data sequentially, avoid random reads
  35. 35. Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)
  36. 36. A typical data-parallel problem in abstraction: (1) Iterate over a large number of records
  37. 37. (2) Extract something of interest
  38. 38. (3) Create an ordering in intermediate results
  39. 39. (4) Aggregate intermediate results
  40. 40. (5) Generate output. MapReduce: a functional abstraction of step 2 (map) and step 4 (reduce). (Jimmy Lin, University of Maryland / Twitter, 2011)
  41. 41. MapReduce: the programmer specifies two functions: map (k, v) -> <k', v'>*
  42. 42. reduce (k', v') -> <k', v'>*. All values associated with a single key are sent to the same reducer. The framework handles the rest.
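The two functions can be illustrated with a word count simulated in a single process. This is only a sketch of the programming model, not Hadoop's API: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the in-memory grouping stands in for the shuffle and sort that the framework performs across machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_fn(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """map (k, v) -> <k', v'>*: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """reduce (k', v') -> <k', v'>*: sum all counts seen for one word."""
    yield word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[int, str]],
    mapper: Callable[[int, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Iterator[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # All values for one key end up in the same group (same "reducer").
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):  # keys reach the reducer in sorted order
        output.extend(reducer(out_key, groups[out_key]))
    return output


if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the framework would take over on a real cluster.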
  43. 43. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  44. 44. An overview of a Hadoop cluster
  45. 45. The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephant Bird, and many others...
  46. 46. Why large-scale data processing? / An introduction to scale @ SARA / An introduction to Hadoop & MapReduce / Hadoop @ SARA
  47. 47. Timeline. 2009: Piloting Hadoop on Cloud. 2010: Test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16 GB RAM); just me! 2011: Funding granted for production service. 2012: Production cluster available (~March): 72 machines * 8 cores / 8 TB storage / 64 GB RAM; integration with Kerberos for secure multi-tenancy; 3 devops, team of consultants.
  48. 48. Architecture
  49. 49. Components: Hadoop, Hive, Pig, HBase, HCatalog. Others?
  50. 50. What are scientists doing? Information Retrieval
  51. 51. Natural Language Processing
  52. 52. Machine Learning
  53. 53. Econometrics
  54. 54. Bioinformatics
  55. 55. Computational Ecology / Ecoinformatics
  56. 56. Machine learning: Infrawatch, Hollandse Brug
  57. 57. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
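The product on this slide can be worked out directly. The sketch below uses the sensor count and sample rate from the slide; the 4 bytes per sample in the demo is an assumption made only to get a rough storage figure, since the slide does not state a sample size.

```python
SENSORS = 145          # from the slide
SAMPLE_RATE_HZ = 100   # from the slide
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def samples_per_year(sensors: int = SENSORS, rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of individual sensor readings collected in one year."""
    return sensors * rate_hz * SECONDS_PER_YEAR


def raw_terabytes(samples: int, bytes_per_sample: int) -> float:
    """Raw volume in TB (10**12 bytes) for a given, assumed, sample size."""
    return samples * bytes_per_sample / 1e12


if __name__ == "__main__":
    total = samples_per_year()
    print(f"{total:,} samples per year")
    # Assumption: 4 bytes per reading, without timestamps or other metadata.
    print(f"{raw_terabytes(total, 4):.2f} TB per year at 4 bytes per sample")
```

That is roughly 457 billion readings per year before any timestamps, metadata, or derived features are stored.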
  58. 58. And others: NLP & IR. E.g. ClueWeb, a ~13.4 TB web crawl
  59. 59. e.g. Twitter gardenhose data
  60. 60. e.g. Wikipedia dumps
  61. 61. e.g. del.icio.us & flickr tags
  62. 62. Finding named entities: [person | company | place] names
  63. 63. Creating inverted indexes
  64. 64. Piloting real-time search
  65. 65. Personalization
  66. 66. Semantic web
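One of the tasks listed above, creating inverted indexes, maps naturally onto the map and reduce functions. The following is a single-process sketch under stated assumptions: the function names and the tiny document set are hypothetical, tokenization is plain whitespace splitting, and nothing here reflects the code actually run at SARA.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Emit (term, doc_id) once for each distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_fn(term: str, doc_ids: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Collect the sorted postings list of documents containing the term."""
    yield term, sorted(set(doc_ids))


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """In-memory stand-in for the framework: map, group by term, reduce."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, emitted_id in map_fn(doc_id, text):
            postings[term].append(emitted_id)
    index = {}
    for term in sorted(postings):
        for out_term, ids in reduce_fn(term, postings[term]):
            index[out_term] = ids
    return index


if __name__ == "__main__":
    docs = {"d1": "hadoop at sara", "d2": "hadoop and pig", "d3": "pig latin"}
    for term, ids in build_inverted_index(docs).items():
        print(term, ids)
```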
  67. 67. Interest from industry: We're opening shop. Come and pilot.
  68. 68. Final thoughts: The tide is rising and data is not getting any less, so let's ride that wave!
  69. 69. Hadoop is the first to provide commodity computing. Hadoop is not the only option
  70. 70. Hadoop is probably not the best
  71. 71. Hadoop has momentum
  72. 72. And how many infrastructures do we need? MapReduce fits surprisingly well as a programming model for data parallelism
  73. 73. The data center is your computer
  74. 74. Where is the data scientist? Much to learn & teach!
  75. 75. Any questions? [email_address] @eevrt @sara_nl
