Large-scale data processing [at SARA] [with Apache Hadoop] Evert Lammerts February 9, 2012, Netherlands Hadoop User Group

First NL-HUG: Large-scale data processing at SARA with Apache Hadoop

An introduction to large-scale data processing, and a short overview of SARA's Hadoop related activities.

Published in: Technology, Education

Transcript of "First NL-HUG: Large-scale data processing at SARA with Apache Hadoop"

  1. Large-scale data processing [at SARA] [with Apache Hadoop]. Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group
  2. Who's who?
  3. Who's who?
       • Who has worked on scale?
           • e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?
       • >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
  4. In this talk
       • Why large-scale data processing?
       • An introduction to scale @ SARA
       • An introduction to Hadoop & MapReduce
       • Hadoop @ SARA
  8. Why large-scale data processing? | An introduction to scale @ SARA | An introduction to Hadoop & MapReduce | Hadoop @ SARA
  9. (Jimmy Lin, University of Maryland / Twitter, 2011)
  10. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  11. s/knowledge/data/g*
       • HTTP logs, click data, query logs, CRM data, financial data, social networks, archives, crawls, and many more
       • You already have your data
       (*Jimmy Lin, University of Maryland / Twitter, 2011)
  12. Data-processing as a commodity
       • Cheap clusters
       • Simple programming models
       • Easy-to-learn scripting
       • Anybody with the know-how can generate insights!
  16. Note: “the know-how” = Data Science: DevOps, programming, algorithms, domain knowledge
  17. Why large-scale data processing? | An introduction to scale @ SARA | An introduction to Hadoop & MapReduce | Hadoop @ SARA
  18. SARA, the national center for scientific computing. Facilitating science in The Netherlands with equipment for and expertise on large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization
  19. Large-scale data != new
  20. Different types of computing
       Parallelism
         • Data parallelism
         • Task parallelism
       Architectures
         • SIMD: Single Instruction Multiple Data
         • MIMD: Multiple Instruction Multiple Data
         • MISD: Multiple Instruction Single Data
         • SISD: Single Instruction Single Data (Von Neumann)
  25. Parallelism: Amdahl's law
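
The transcript keeps only the title of the Amdahl's law slide. As a reminder of the standard formulation (not taken from the slides): if a fraction p of a program can be parallelized and it runs on n workers, the speedup is bounded by 1 / ((1 - p) + p / n). A minimal sketch, with a hypothetical function name and illustrative sample values:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a program can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative values only: 95% parallel code on a growing number of workers.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

With 95% of the work parallelizable, the speedup can never exceed 20x no matter how many workers are added, because the serial 5% dominates.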
  26. Data parallelism
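
Data parallelism means applying the same operation to different partitions of the data and then combining the partial results. The sketch below shows only that structure, using a thread pool from the Python standard library; the function names and the chunking scheme are hypothetical, and CPython threads will not speed up CPU-bound work like this, so treat it as an illustration rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Cut a list into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(records) // parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def sum_of_squares(chunk):
    """The single operation that every worker applies to its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(records, workers=4):
    """Run the same function on every chunk, then combine the partial sums."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)
```

For example, `parallel_sum_of_squares(list(range(10)))` returns 285, the same value a sequential loop would produce.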
  27. Compute @ SARA
  28. What's different about Hadoop?
       • No more do-it-yourself parallelism (it's hard!)
       • But rather linearly scalable data parallelism
       • Separating the what from the how
       (NYT, 14/06/2006)
  29. Why large-scale data processing? | An introduction to scale @ SARA | An introduction to Hadoop & MapReduce | Hadoop @ SARA
  30. A bit of history: Nutch*, MR/GFS**, Hadoop (years on the slide: 2002, 2004, 2006, 2004)
       * http://nutch.apache.org/
       ** http://labs.google.com/papers/mapreduce.html, http://labs.google.com/papers/gfs.html
  31. 2010 - 2012: A Hype in Production (http://wiki.apache.org/hadoop/PoweredBy)
  32. Core principles
       • Scale out, not up
       • Move processing to the data
       • Process data sequentially, avoid random reads
       • Seamless scalability
       (Jimmy Lin, University of Maryland / Twitter, 2011)
  36. A typical data-parallel problem in abstraction
       1. Iterate over a large number of records
       2. Extract something of interest
       3. Create an ordering in intermediate results
       4. Aggregate intermediate results
       5. Generate output
       MapReduce: functional abstraction of step 2 & step 4
       (Jimmy Lin, University of Maryland / Twitter, 2011)
  41. MapReduce
       Programmer specifies two functions
         • map (k, v) -> <k', v'>*
         • reduce (k', v') -> <k', v'>*
       All values associated with a single key are sent to the same reducer
       The framework handles the rest
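
The two functions can be illustrated with word counting. This is a single-process toy in plain Python, not Hadoop's API: `run_mapreduce` is a hypothetical stand-in for the framework, and it only mimics the grouping of values by key and the ordering of intermediate keys.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(k, v) -> <k', v'>*: emit (word, 1) for every word in a line."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """reduce(k', values) -> <k', v'>*: sum the counts collected for one word."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # all values for one key end up together
    output = []
    for out_key in sorted(groups):  # ordering of intermediate keys
        output.extend(reducer(out_key, groups[out_key]))
    return output
```

Calling `run_mapreduce([(0, "the quick brown fox"), (1, "the lazy dog")], map_fn, reduce_fn)` returns `[('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]`. In a real cluster the grouping step moves data between machines, which is part of "the rest" that the framework handles.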
  43. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  44. An overview of a Hadoop cluster
  45. The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephant Bird, and many others...
  46. Why large-scale data processing? | An introduction to scale @ SARA | An introduction to Hadoop & MapReduce | Hadoop @ SARA
  47. Timeline
       • 2009: Piloting Hadoop on Cloud
       • 2010: Test cluster available for scientists
           • 6 machines * 4 cores / 24 TB storage / 16GB RAM
           • Just me!
       • 2011: Funding granted for production service
       • 2012: Production cluster available (~March)
           • 72 machines * 8 cores / 8 TB storage / 64GB RAM
           • Integration with Kerberos for secure multi-tenancy
           • 3 devops, team of consultants
  48. Architecture
  49. Components: Hadoop, Hive, Pig, HBase, HCatalog - others?
  50. What are scientists doing?
       • Information Retrieval
       • Natural Language Processing
       • Machine Learning
       • Econometrics
       • Bioinformatics
       • Computational Ecology / Ecoinformatics
  56. Machine learning: Infrawatch, Hollandse Brug
  57. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
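
The multiplication on the structural health monitoring slide can be worked out directly. The sensor count and sampling rate come from the slide; the size of a single sample is not stated there, so `bytes_per_sample` below is a hypothetical parameter for illustration only.

```python
SENSORS = 145                    # from the slide
SAMPLES_PER_SECOND = 100         # 100 Hz, from the slide
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def samples_per_year(sensors=SENSORS, rate_hz=SAMPLES_PER_SECOND):
    """Number of sensor readings collected in one (non-leap) year."""
    return sensors * rate_hz * SECONDS_PER_YEAR


def bytes_per_year(bytes_per_sample):
    """Raw volume per year; bytes_per_sample is an assumption, not a slide value."""
    return samples_per_year() * bytes_per_sample
```

`samples_per_year()` gives 457,272,000,000 readings. If each reading took 4 bytes (an assumed figure), that would be roughly 1.8 TB of raw data per year before any metadata or replication.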
  58. And others: NLP & IR
       • e.g. ClueWeb: a ~13.4 TB webcrawl
       • e.g. Twitter gardenhose data
       • e.g. Wikipedia dumps
       • e.g. del.icio.us & flickr tags
       • Finding named entities: [person company place] names
       • Creating inverted indexes
       • Piloting real-time search
       • Personalization
       • Semantic web
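
Of the tasks listed above, building an inverted index maps most directly onto the map and reduce steps: emit (term, document id) pairs, group them by term, and collect the ids into a postings list. A minimal in-memory sketch, assuming whitespace tokenization and a hypothetical `documents` dictionary; it does not reflect how the SARA users implemented it.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():  # "map": emit (term, doc_id)
            postings[term].add(doc_id)     # group by term
    # "reduce": turn each group into a sorted postings list
    return {term: sorted(ids) for term, ids in postings.items()}
```

For `{"d1": "Hadoop scales out", "d2": "Hadoop moves processing"}` the entry for `"hadoop"` is `["d1", "d2"]`, while `"scales"` maps to `["d1"]` only.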
  67. Interest from industry: We're opening shop. Come and pilot.
  68. Final thoughts
       • The tide rises, data is not getting less, let's ride that wave!
       • Hadoop is the first to provide commodity computing
           • Hadoop is not the only one
           • Hadoop is probably not the best
           • Hadoop has momentum
           • And how many infrastructures do we need?
       • MapReduce fits surprisingly well as a programming model for data-parallelism
       • The data center is your computer
       • Where is the data scientist? Much to learn & teach!
  75. Any questions? [email_address] @eevrt @sara_nl