First NL-HUG: Large-scale data processing at SARA with Apache Hadoop

An introduction to large-scale data processing, and a short overview of SARA's Hadoop related activities.

Transcript

  • 1. Large-scale data processing [at SARA] [with Apache Hadoop] Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group
  • 2. Who's who?
  • 3. Who's who?
    • Who has worked on scale?
      • e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?
    • >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
  • 4. In this talk
    • Why large-scale data processing?
    • An introduction to scale @ SARA
    • An introduction to Hadoop & MapReduce
    • Hadoop @ SARA
  • 8. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 9. (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 10. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  • 11. s/knowledge/data/g*: HTTP logs, click data, query logs, CRM data, financial data, social networks, archives, crawls, and many more. You already have your data. (*Jimmy Lin, University of Maryland / Twitter, 2011)
  • 12. Data-processing as a commodity
    • Cheap clusters
    • Simple programming models
    • Easy-to-learn scripting
    • Anybody with the know-how can generate insights!
  • 16. Note: “the know-how” = Data Science: DevOps, programming, algorithms, domain knowledge
  • 17. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 18. SARA, the national center for scientific computing: facilitating science in The Netherlands with equipment for, and expertise on, Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
  • 19. Large-scale data != new
  • 20. Different types of computing
    Parallelism
    • Data parallelism
    • Task parallelism
    Architectures
    • SIMD: Single Instruction Multiple Data
    • MIMD: Multiple Instruction Multiple Data
    • MISD: Multiple Instruction Single Data
    • SISD: Single Instruction Single Data (Von Neumann)
  • 25. Parallelism: Amdahl's law
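    The slide only names Amdahl's law; as a reminder, it bounds the speedup of a partly-serial program by its serial fraction. A minimal sketch (the example fractions and node counts are illustrative, not from the deck):

    ```python
    def amdahl_speedup(p, n):
        """Speedup from Amdahl's law: p is the parallelizable
        fraction of the work, n the number of processors."""
        return 1.0 / ((1.0 - p) + p / n)

    # Even with 95% of the work parallelizable, 100 nodes
    # give less than a 17x speedup, and 1000 nodes under 20x:
    print(amdahl_speedup(0.95, 100))   # ~16.8
    print(amdahl_speedup(0.95, 1000))  # ~19.6
    ```

    This is why the embarrassingly data-parallel workloads of the next slide scale so much better: their serial fraction is close to zero.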
  • 26. Data parallelism
  • 27. Compute @ SARA
  • 28. What's different about Hadoop? No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism: separating the what from the how (NYT, 14/06/2006)
  • 29. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 30. A bit of history: Nutch* (2002), MR/GFS** (2004), Hadoop (2006) * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html and http://labs.google.com/papers/gfs.html
  • 31. http://wiki.apache.org/hadoop/PoweredBy 2010 - 2012: A Hype in Production
  • 32. Core principles
    • Scale out, not up
    • Move processing to the data
    • Process data sequentially, avoid random reads
    • Seamless scalability
    (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 36. A typical data-parallel problem in abstraction
    • Iterate over a large number of records
    • Extract something of interest
    • Create an ordering in intermediate results
    • Aggregate intermediate results
    • Generate output
    MapReduce: a functional abstraction of steps 2 & 4 (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 41. MapReduce
    The programmer specifies two functions:
    • map (k, v) -> <k', v'>*
    • reduce (k', v') -> <k', v'>*
    All values associated with a single key are sent to the same reducer. The framework handles the rest.
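    The map/reduce contract above can be sketched in plain Python with the classic word-count example (no Hadoop involved; the sort-and-group step stands in for the framework's shuffle, which the deck does not show in code):

    ```python
    from itertools import groupby
    from operator import itemgetter

    def map_fn(doc_id, text):
        # map(k, v) -> <k', v'>*: emit one (word, 1) pair per occurrence
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # reduce(k', v'*) -> <k', v'>*: all values for one key arrive together
        yield (word, sum(counts))

    def run_job(records):
        # "The framework handles the rest": this sort/group stands in for
        # the shuffle, which routes all values for a key to one reducer
        intermediate = sorted(
            (pair for doc_id, text in records for pair in map_fn(doc_id, text)),
            key=itemgetter(0))
        out = {}
        for word, group in groupby(intermediate, key=itemgetter(0)):
            out.update(reduce_fn(word, (count for _, count in group)))
        return out

    print(run_job([("d1", "to be or not to be"), ("d2", "to do")]))
    ```

    In real Hadoop the same two functions run distributed over many nodes, which is exactly the "separating the what from the how" point from slide 28.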
  • 43. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  • 44. An overview of a Hadoop cluster
  • 45. The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephantbird, and many others...
  • 46. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
  • 47. Timeline
    2009: Piloting Hadoop on Cloud
    2010: Test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16 GB RAM; just me!)
    2011: Funding granted for production service
    2012: Production cluster available (~March): 72 machines * 8 cores / 8 TB storage / 64 GB RAM; integration with Kerberos for secure multi-tenancy; 3 devops, team of consultants
  • 48. Architecture
  • 49. Components: Hadoop, Hive, Pig, HBase, HCatalog - others?
  • 50. What are scientists doing?
    • Information Retrieval
    • Natural Language Processing
    • Machine Learning
    • Econometrics
    • Bioinformatics
    • Computational Ecology / Ecoinformatics
  • 56. Machine learning: Infrawatch, Hollandse Brug
  • 57. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
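    The multiplication on the slide works out as follows; the bytes-per-sample figure is an assumption of this sketch (the slide gives only the raw sample counts):

    ```python
    sensors = 145
    rate_hz = 100
    seconds_per_year = 60 * 60 * 24 * 365  # the 60 x 60 x 24 x 365 on the slide

    samples_per_year = sensors * rate_hz * seconds_per_year
    # 457,272,000,000 samples per year from a single bridge

    # At an assumed 4 bytes per sample this is on the order of 1.8 TB/year:
    terabytes = samples_per_year * 4 / 1e12
    print(f"{samples_per_year:,} samples, about {terabytes:.1f} TB/year")
    ```

    Roughly half a trillion sensor readings per year from one structure is well past what a single machine handles comfortably, hence "= large data".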
  • 58. And others: NLP & IR
    • e.g. ClueWeb: a ~13.4 TB web crawl
    • e.g. Twitter gardenhose data
    • e.g. Wikipedia dumps
    • e.g. del.icio.us & flickr tags
    • Finding named entities: [person company place] names
    • Creating inverted indexes
    • Piloting real-time search
    • Personalization
    • Semantic web
  • 67. Interest from industry We're opening shop. Come and pilot.
  • 68. Final thoughts
    • The tide rises and data is not getting less; let's ride that wave!
    • Hadoop is the first to provide commodity computing
      • Hadoop is not the only one
      • Hadoop is probably not the best
      • Hadoop has momentum
      • And how many infrastructures do we need?
    • MapReduce fits surprisingly well as a programming model for data-parallelism
    • The data center is your computer
    • Where is the data scientist? Much to learn & teach!
  • 75. Any questions? [email_address] @eevrt @sara_nl