First NL-HUG: Large-scale data processing at SARA with Apache Hadoop

An introduction to large-scale data processing, and a short overview of SARA's Hadoop related activities.

Published in: Technology, Education

Transcript

  • 1. Large-scale data processing [at SARA] [with Apache Hadoop] Evert Lammerts February 9, 2012, Netherlands Hadoop User Group
  • 2. Who's who?
  • 3. Who's who?
    • Who has worked on scale?
      • e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?
    • >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
  • 4. In this talk
    • Why large-scale data processing?
    • An introduction to scale @ SARA
    • An introduction to Hadoop & MapReduce
    • Hadoop @ SARA
  • 8. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 9. (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 10. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  • 11. s/knowledge/data/g* HTTP logs, Click data, Query logs, CRM data, Financial data, Social networks, Archives, Crawls, and many more You already have your data (*Jimmy Lin, University of Maryland / Twitter, 2011)
  • 12. Data-processing as a commodity
    • Cheap Clusters
    • Simple programming models
    • Easy-to-learn scripting
    • Anybody with the know-how can generate insights!
  • 16. Note: “the know-how” = Data Science DevOps Programming algorithms Domain knowledge
  • 17. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 18. SARA: the national center for scientific computing. Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
  • 19. Large-scale data != new
  • 20. Different types of computing
    Parallelism
    • Data parallelism
    • Task parallelism
    Architectures
    • SIMD: Single Instruction Multiple Data
    • MIMD: Multiple Instruction Multiple Data
    • MISD: Multiple Instruction Single Data
    • SISD: Single Instruction Single Data (Von Neumann)
  • 25. Parallelism: Amdahl's law
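
The transcript keeps only the title of the Amdahl's law slide. As a reminder of what the law states, here is a minimal Python sketch: if a fraction p of a job can be parallelized over n workers, the speedup is bounded by 1 / ((1 - p) + p / n). The function name and the 95% example fraction are illustrative assumptions, not taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup according to Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; the parallel part is divided over the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job in which 95% of the work can run in parallel.
    for n in (5, 10, 50, 100):
        print(n, "workers ->", round(amdahl_speedup(0.95, n), 2), "x speedup")
```

Even with a small serial fraction, adding workers gives diminishing returns, which is why the serial part of a job matters so much at scale.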
  • 26. Data parallelism
  • 27. Compute @ SARA
  • 28. What's different about Hadoop? No more do-it-yourself parallelism – it's hard! But rather linearly scalable data parallelism Separating the what from the how (NYT, 14/06/2006)
  • 29. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 30. A bit of history Nutch* 2002 2004 MR/GFS** 2006 2004 Hadoop * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html
  • 31. http://wiki.apache.org/hadoop/PoweredBy 2010 - 2012: A Hype in Production
  • 32. Core principles
    • Scale out, not up
    • Move processing to the data
    • Process data sequentially, avoid random reads
    • Seamless scalability
    (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 36. A typical data-parallel problem in abstraction
    • Iterate over a large number of records
    • Extract something of interest
    • Create an ordering in intermediate results
    • Aggregate intermediate results
    • Generate output
    MapReduce: functional abstraction of step 2 & step 4 (Jimmy Lin, University of Maryland / Twitter, 2011)
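
The five steps above can be sketched on a single machine in plain Python, using word counting as a hypothetical example problem. This is not Hadoop code; the function names are assumptions chosen for illustration. The point is only to show where the map function (step 2) and the reduce function (step 4) sit in the pipeline.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record):
    """Step 2: extract something of interest, here (word, 1) pairs."""
    for word in record.lower().split():
        yield word, 1


def reduce_group(key, values):
    """Step 4: aggregate the intermediate values of one key."""
    return key, sum(values)


def run(records):
    # Steps 1 and 2: iterate over all records and apply the map function.
    intermediate = [pair for record in records for pair in map_record(record)]
    # Step 3: create an ordering so that equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Step 4: aggregate per key.
    results = [
        reduce_group(key, (value for _, value in group))
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]
    # Step 5: generate output (here simply returned as a list).
    return results


if __name__ == "__main__":
    print(run(["a b a", "b c"]))
```

In a real framework steps 1, 3 and 5 are handled for you and run across many machines; only the two small functions are problem-specific.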
  • 41. MapReduce
    Programmer specifies two functions
    • map (k, v) -> <k', v'>*
    • reduce (k', v') -> <k', v'>*
    All values associated with a single key are sent to the same reducer
    The framework handles the rest
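
One common way to make sure all values of a key reach the same reducer is to hash the key and take the result modulo the number of reducers. The sketch below illustrates that idea in Python; it is an assumption-laden illustration, not Hadoop's own partitioner code. `zlib.crc32` is used instead of Python's built-in `hash` because string hashes in Python can differ between processes.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index; equal keys always give the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group (key, value) pairs per reducer, then per key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


if __name__ == "__main__":
    pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
    for index, groups in enumerate(shuffle(pairs, 3)):
        print("reducer", index, dict(groups))
```

Because the reducer index depends only on the key, no coordination between mappers is needed to route a key consistently.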
  • 43. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  • 44. An overview of a Hadoop cluster
  • 45. The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephantbird, and many others...
  • 46. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 47. Timeline
    • 2009: Piloting Hadoop on Cloud
    • 2010: Test cluster available for scientists
      • 6 machines * 4 cores / 24 TB storage / 16GB RAM
      • Just me!
    • 2011: Funding granted for production service
    • 2012: Production cluster available (~March)
      • 72 machines * 8 cores / 8 TB storage / 64GB RAM
      • Integration with Kerberos for secure multi-tenancy
      • 3 devops, team of consultants
  • 48. Architecture
  • 49. Components: Hadoop, Hive, Pig, HBase, HCatalog - others?
  • 50. What are scientists doing?
    • Information Retrieval
    • Natural Language Processing
    • Machine Learning
    • Econometrics
    • Bioinformatics
    • Computational Ecology / Ecoinformatics
  • 56. Machine learning: Infrawatch, Hollandse Brug
  • 57. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
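
The product on this slide can be worked out directly. The Python sketch below multiplies the factors from the slide to get the number of sensor readings per year. The 4 bytes per reading is purely an assumption for illustration; the slides do not state a sample size.

```python
SENSORS = 145
SAMPLE_RATE_HZ = 100
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Number of individual readings collected in one year.
samples_per_year = SENSORS * SAMPLE_RATE_HZ * SECONDS_PER_YEAR

# Assumption, not from the slides: each reading is stored in 4 bytes.
ASSUMED_BYTES_PER_SAMPLE = 4
bytes_per_year = samples_per_year * ASSUMED_BYTES_PER_SAMPLE

if __name__ == "__main__":
    print(f"{samples_per_year:,} samples per year")
    print(f"{bytes_per_year / 1e12:.2f} TB per year at 4 bytes per sample")
```

That is roughly 457 billion readings per year before any metadata, timestamps or derived features are added.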
  • 58. And others: NLP & IR
    • e.g. ClueWeb: a ~13.4 TB webcrawl
    • e.g. Twitter gardenhose data
    • e.g. Wikipedia dumps
    • e.g. del.icio.us & flickr tags
    • Finding named entities: [person company place] names
    • Creating inverted indexes
    • Piloting real-time search
    • Personalization
    • Semantic web
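
Creating an inverted index, one of the tasks listed above, maps naturally onto the map and reduce functions. The following is a small single-machine Python sketch under stated assumptions: documents are given as a dictionary of hypothetical document ids to text, tokenization is a plain whitespace split, and the function names are illustrative rather than taken from any project mentioned in the talk.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map: emit (term, doc_id) once for every distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Reduce: turn all document ids of a term into a sorted postings list."""
    return term, sorted(doc_ids)


def build_inverted_index(documents):
    # Stand-in for the framework's shuffle: group emitted doc ids by term.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())


if __name__ == "__main__":
    docs = {1: "hadoop at sara", 2: "hadoop and pig"}
    index = build_inverted_index(docs)
    print(index["hadoop"])
```

On a crawl the size of ClueWeb the grouping step is exactly what the framework distributes; the two small functions stay the same.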
  • 67. Interest from industry We're opening shop. Come and pilot.
  • 68. Final thoughts
    • The tide rises, data is not getting less, let's ride that wave!
    • Hadoop is the first to provide commodity computing
      • Hadoop is not the only
      • Hadoop is probably not the best
      • Hadoop has momentum
      • And how many infrastructures do we need?
    • MapReduce fits surprisingly well as a programming model for data-parallelism
    • The data center is your computer
    • Where is the data scientist? Much to learn & teach!
  • 75. Any questions? [email_address] @eevrt @sara_nl
