First NL-HUG: Large-scale data processing at SARA with Apache Hadoop

An introduction to large-scale data processing, and a short overview of SARA's Hadoop related activities.

Published in: Technology, Education

Transcript

  • 1. Large-scale data processing [at SARA] [with Apache Hadoop]. Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group
  • 2. Who's who?
  • 3. Who's who? Who has worked on scale (e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes)? On >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
  • 4. In this talk: Why large-scale data processing?
  • 5. An introduction to scale @ SARA
  • 6. An introduction to Hadoop & MapReduce
  • 7. Hadoop @ SARA
  • 8. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 9. (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 10. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  • 11. s/knowledge/data/g*: HTTP logs, click data, query logs, CRM data, financial data, social networks, archives, crawls, and many more. You already have your data. (*Jimmy Lin, University of Maryland / Twitter, 2011)
  • 12. Data-processing as a commodity: Cheap clusters
  • 13. Simple programming models
  • 14. Easy-to-learn scripting
  • 15. Anybody with the know-how can generate insights!
  • 16. Note: “the know-how” = Data Science: DevOps, programming, algorithms, domain knowledge
  • 17. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 18. SARA, the national center for scientific computing: facilitating science in The Netherlands with equipment for and expertise on large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization
  • 19. Large-scale data != new
  • 20. Different types of computing. Parallelism: Data parallelism
  • 21. Task parallelism. Architectures: SIMD (Single Instruction, Multiple Data)
  • 22. MIMD (Multiple Instruction, Multiple Data)
  • 23. MISD (Multiple Instruction, Single Data)
  • 24. SISD (Single Instruction, Single Data; Von Neumann)
  • 25. Parallelism: Amdahl's law
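
Amdahl's law bounds the speedup gained from adding workers by the fraction of the program that stays serial: speedup = 1 / ((1 - p) + p / n), with p the parallelizable fraction and n the number of workers. A minimal sketch of that formula follows; the function name and the 95% example are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the runtime is
    spread perfectly over `workers` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job in which 95% of the work parallelizes.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

However many workers are added, the speedup cannot exceed 1 / (1 - p), which is 20 for the 95% case.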
  • 26. Data parallelism
  • 27. Compute @ SARA
  • 28. What's different about Hadoop? No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism. Separating the what from the how. (NYT, 14/06/2006)
  • 29. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 30. A bit of history: Nutch* (2002), MR/GFS** (2004), Hadoop (2006). * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html
  • 31. 2010 - 2012: A Hype in Production (http://wiki.apache.org/hadoop/PoweredBy)
  • 32. Core principles: Scale out, not up
  • 33. Move processing to the data
  • 34. Process data sequentially, avoid random reads
  • 35. Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 36. A typical data-parallel problem in abstraction: (1) Iterate over a large number of records
  • 37. (2) Extract something of interest
  • 38. (3) Create an ordering in intermediate results
  • 39. (4) Aggregate intermediate results
  • 40. (5) Generate output. MapReduce: functional abstraction of step 2 & step 4 (Jimmy Lin, University of Maryland / Twitter, 2011)
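
The five steps above can be sketched in a single process. This is a minimal illustration, assuming hypothetical HTTP log lines of the form `<path> <status>` and a made-up `status_counts` helper; it mirrors the shape of the problem only, not how Hadoop executes it.

```python
from itertools import groupby
from operator import itemgetter


def status_counts(log_lines):
    """Count requests per HTTP status code, following the five steps."""
    # Steps 1 and 2: iterate over the records and extract (status, 1)
    # pairs (the "map" part). Malformed lines are skipped.
    extracted = []
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 2:
            extracted.append((fields[1], 1))

    # Step 3: order intermediate results so equal keys are adjacent.
    extracted.sort(key=itemgetter(0))

    # Step 4: aggregate per key (the "reduce" part).
    # Step 5: generate output as a list of (status, count) pairs.
    return [
        (status, sum(count for _, count in group))
        for status, group in groupby(extracted, key=itemgetter(0))
    ]


if __name__ == "__main__":
    # Hypothetical log format: "<path> <status>".
    sample = ["/index.html 200", "/missing 404", "/about.html 200"]
    for status, count in status_counts(sample):
        print(status, count)
```

Only the extraction and the aggregation are specific to the problem; iteration, ordering, and output are generic, which is why they can be left to a framework.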
  • 41. MapReduce: the programmer specifies two functions: map (k, v) -> <k', v'>*
  • 42. and reduce (k', v') -> <k', v'>*. All values associated with a single key are sent to the same reducer. The framework handles the rest.
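
One way to picture the guarantee that all values for a key reach the same reducer is a hash partitioner placed in front of the reducers. The sketch below is a single-process toy and not Hadoop's API: `run_mapreduce`, `partition`, and the word-count functions are hypothetical names, and CRC32 merely stands in for whatever hash a real framework uses.

```python
import zlib
from collections import defaultdict


def wc_map(_key, line):
    """map (k, v) -> <k', v'>*: emit (word, 1) for each word."""
    for word in line.lower().split():
        yield word, 1


def wc_reduce(word, counts):
    """reduce: receives one key together with all of its values."""
    yield word, sum(counts)


def partition(key, num_reducers):
    """Deterministically pick a reducer index for a string key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=3):
    # One bucket per reducer, each mapping key -> list of values.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            target = partition(out_key, num_reducers)
            buckets[target][out_key].append(out_value)

    output = []
    for bucket in buckets:
        # Each reducer processes its keys in sorted order.
        for out_key in sorted(bucket):
            output.extend(reduce_fn(out_key, bucket[out_key]))
    return output


if __name__ == "__main__":
    lines = [(0, "the data center is your computer"), (1, "the data is yours")]
    print(sorted(run_mapreduce(lines, wc_map, wc_reduce)))
```

Because the reducer index depends only on the key, changing `num_reducers` moves keys between buckets but never splits one key's values across two reducers.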
  • 43. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  • 44. An overview of a Hadoop cluster
  • 45. The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephant Bird, and many others...
  • 46. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 47. Timeline. 2009: Piloting Hadoop on Cloud. 2010: Test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16 GB RAM; just me!). 2011: Funding granted for production service. 2012: Production cluster available (~March) (72 machines * 8 cores / 8 TB storage / 64 GB RAM; integration with Kerberos for secure multi-tenancy; 3 devops, team of consultants).
  • 48. Architecture
  • 49. Components: Hadoop, Hive, Pig, HBase, HCatalog - others?
  • 50. What are scientists doing? Information Retrieval
  • 51. Natural Language Processing
  • 52. Machine Learning
  • 53. Econometrics
  • 54. Bioinformatics
  • 55. Computational Ecology / Ecoinformatics
  • 56. Machine learning: Infrawatch, Hollandse Brug
  • 57. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
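
Multiplying the factors on the slide gives the number of sensor readings per year. A quick check follows; the 4 bytes per reading is purely an assumption for illustration and not a figure from the talk.

```python
SENSORS = 145
SAMPLES_PER_SECOND = 100  # 100 Hz
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

readings_per_year = SENSORS * SAMPLES_PER_SECOND * SECONDS_PER_YEAR

# Assumption for illustration only: each reading is stored as 4 bytes.
ASSUMED_BYTES_PER_READING = 4
terabytes_per_year = readings_per_year * ASSUMED_BYTES_PER_READING / 1e12

if __name__ == "__main__":
    print(f"{readings_per_year:,} readings per year")
    print(f"{terabytes_per_year:.2f} TB per year at 4 bytes per reading")
```

That comes to 457,272,000,000 readings per year, roughly 4.6 * 10^11.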
  • 58. And others: NLP & IR. e.g. ClueWeb, a ~13.4 TB webcrawl
  • 59. e.g. Twitter gardenhose data
  • 60. e.g. Wikipedia dumps
  • 61. e.g. del.icio.us & flickr tags
  • 62. Finding named entities: [person company place] names
  • 63. Creating inverted indexes
  • 64. Piloting real-time search
  • 65. Personalization
  • 66. Semantic web
  • 67. Interest from industry: we're opening shop. Come and pilot.
  • 68. Final thoughts: The tide rises, data is not getting less, let's ride that wave!
  • 69. Hadoop is the first to provide commodity computing. Hadoop is not the only one
  • 70. Hadoop is probably not the best
  • 71. Hadoop has momentum
  • 72. And how many infrastructures do we need? MapReduce fits surprisingly well as a programming model for data parallelism
  • 73. The data center is your computer
  • 74. Where is the data scientist? Much to learn & teach!
  • 75. Any questions? [email_address] @eevrt @sara_nl
