First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
An introduction to large-scale data processing, and a short overview of SARA's Hadoop related activities.

    Presentation Transcript

    • Large-scale data processing [at SARA] [with Apache Hadoop]. Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group
    • Who's who?
      • Who has worked on scale?
        • e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?
      • >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
    • In this talk
      • Why large-scale data processing?
      • An introduction to scale @ SARA
      • An introduction to Hadoop & MapReduce
      • Hadoop @ SARA
    • Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
    • (Jimmy Lin, University of Maryland / Twitter, 2011)
    • (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
    • s/knowledge/data/g*
      • HTTP logs, Click data, Query logs, CRM data, Financial data, Social networks, Archives, Crawls, and many more
      • You already have your data
      (*Jimmy Lin, University of Maryland / Twitter, 2011)
    • Data-processing as a commodity
      • Cheap Clusters
      • Simple programming models
      • Easy-to-learn scripting
      • Anybody with the know-how can generate insights!
    • Note: “the know-how” = Data Science (DevOps, Programming algorithms, Domain knowledge)
    • Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
    • SARA, the national center for scientific computing: Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
    • Large-scale data != new
    • Different types of computing
      • Parallelism
        • Data parallelism
        • Task parallelism
      • Architectures
        • SIMD: Single Instruction Multiple Data
        • MIMD: Multiple Instruction Multiple Data
        • MISD: Multiple Instruction Single Data
        • SISD: Single Instruction Single Data (Von Neumann)
    • Parallelism: Amdahl's law
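
The slide above only names Amdahl's law, so here is a minimal sketch of it: if a fraction p of a job can be parallelized over n workers, the theoretical speedup is 1 / ((1 - p) + p / n). The function name and the 95% figure are illustrative assumptions, not values from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the work runs on `workers` nodes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; only the parallel part is divided over the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed example: 95% of the job is parallelizable.
    for nodes in (5, 10, 50, 100):
        print(nodes, round(amdahl_speedup(0.95, nodes), 2))
```

Even with 95% of the work parallelized, the speedup can never exceed 20x in this model, which is why the serial part of a job matters so much at scale.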
    • Data parallelism
    • Compute @ SARA
    • What's different about Hadoop?
      • No more do-it-yourself parallelism – it's hard!
      • But rather linearly scalable data parallelism
      • Separating the what from the how
      (NYT, 14/06/2006)
    • Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
    • A bit of history
      • 2002: Nutch*
      • 2004: MR/GFS**
      • 2006: Hadoop
      * http://nutch.apache.org/
      ** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html
    • 2010 - 2012: A Hype in Production (http://wiki.apache.org/hadoop/PoweredBy)
    • Core principles
      • Scale out, not up
      • Move processing to the data
      • Process data sequentially, avoid random reads
      • Seamless scalability
      (Jimmy Lin, University of Maryland / Twitter, 2011)
    • A typical data-parallel problem in abstraction
      • Iterate over a large number of records
      • Extract something of interest
      • Create an ordering in intermediate results
      • Aggregate intermediate results
      • Generate output
      MapReduce: functional abstraction of step 2 & step 4 (Jimmy Lin, University of Maryland / Twitter, 2011)
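
The five steps above can be sketched on a single machine in plain Python. This is an illustration only: the two-field log format ("path status") and the function name are hypothetical, and nothing here is distributed.

```python
from itertools import groupby
from operator import itemgetter


def count_by_status(log_lines):
    """Count requests per HTTP status in lines shaped like '/index.html 200' (assumed format)."""
    # Steps 1 and 2: iterate over the records and extract something of interest.
    extracted = []
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 2:
            extracted.append((fields[1], 1))

    # Step 3: create an ordering in the intermediate results (sort by key).
    extracted.sort(key=itemgetter(0))

    # Step 4: aggregate the intermediate results per key.
    totals = {
        status: sum(count for _, count in group)
        for status, group in groupby(extracted, key=itemgetter(0))
    }

    # Step 5: generate output.
    return totals


if __name__ == "__main__":
    print(count_by_status(["/a 200", "/b 404", "/c 200"]))
```

Steps 2 and 4 are the only parts that depend on the problem at hand, which is what makes them natural candidates for user-supplied functions.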
    • MapReduce: the programmer specifies two functions
      • map (k, v) -> <k', v'>*
      • reduce (k', [v']) -> <k'', v''>*
      All values associated with a single key are sent to the same reducer. The framework handles the rest.
    • The rest? Scheduling, data distribution, ordering, synchronization, error handling...
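
To make the contract concrete, here is a toy, in-memory sketch of what a framework does between the two user-supplied functions. It is not Hadoop's API: `run_mapreduce`, the CRC-based partitioning, and the word-count functions are all assumptions made for illustration, and scheduling, data distribution, and error handling are left out entirely.

```python
import zlib
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, num_reducers=2):
    """Run mapper and reducer over (key, value) records, simulating the reducers in-process."""
    # Map phase: each emitted key is assigned to one partition (one simulated reducer).
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # A deterministic hash guarantees that equal keys land in the same partition.
            index = zlib.crc32(str(out_key).encode("utf-8")) % num_reducers
            partitions[index][out_key].append(out_value)

    # Reduce phase: every reducer sees its keys in sorted order, each with all of its values.
    results = []
    for partition in partitions:
        for out_key in sorted(partition):
            results.extend(reducer(out_key, partition[out_key]))
    return results


def word_count_mapper(_offset, line):
    """map (k, v) -> <k', v'>*: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """reduce (k', [v']) -> <k'', v''>*: sum the counts collected for one word."""
    yield word, sum(counts)


if __name__ == "__main__":
    lines = [(0, "to be or not to be")]
    print(dict(run_mapreduce(lines, word_count_mapper, word_count_reducer)))
```

Changing `num_reducers` changes how keys are spread over partitions but not the final counts, because all values for a key always end up at the same reducer.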
    • An overview of a Hadoop cluster
    • The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephant Bird, and many others...
    • Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA
    • Timeline
      • 2009: Piloting Hadoop on Cloud
      • 2010: Test cluster available for scientists
        • 6 machines * 4 cores / 24 TB storage / 16GB RAM
        • Just me!
      • 2011: Funding granted for production service
      • 2012: Production cluster available (~March)
        • 72 machines * 8 cores / 8 TB storage / 64GB RAM
        • Integration with Kerberos for secure multi-tenancy
        • 3 devops, team of consultants
    • Architecture
    • Components: Hadoop, Hive, Pig, HBase, HCatalog; others?
    • What are scientists doing?
      • Information Retrieval
      • Natural Language Processing
      • Machine Learning
      • Econometrics
      • Bioinformatics
      • Computational Ecology / Ecoinformatics
    • Machine learning: Infrawatch, Hollandse Brug
    • Structural health monitoring: 145 sensors × 100 Hz × 60 seconds × 60 minutes × 24 hours × 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
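
The product on this slide can be worked out directly. The sensor count and sample rate come from the slide; the 4 bytes per sample used for the storage estimate is purely an assumption, since the talk does not state a sample size.

```python
SENSORS = 145            # from the slide
SAMPLE_RATE_HZ = 100     # from the slide
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def samples_per_year(sensors=SENSORS, rate_hz=SAMPLE_RATE_HZ):
    """Number of individual measurements collected in one year."""
    return sensors * rate_hz * SECONDS_PER_YEAR


def bytes_per_year(bytes_per_sample=4):
    """Raw storage per year; bytes_per_sample is an assumed value, not from the talk."""
    return samples_per_year() * bytes_per_sample


if __name__ == "__main__":
    print(f"{samples_per_year():,} samples per year")
    print(f"{bytes_per_year() / 1e12:.2f} TB per year at an assumed 4 bytes per sample")
```

That is roughly 457 billion measurements per year before any metadata or timestamps are counted.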
    • And others: NLP & IR
      • e.g. ClueWeb: a ~13.4 TB webcrawl
      • e.g. Twitter gardenhose data
      • e.g. Wikipedia dumps
      • e.g. del.icio.us & flickr tags
      • Finding named entities: [person company place] names
      • Creating inverted indexes
      • Piloting real-time search
      • Personalization
      • Semantic web
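
Of the tasks listed above, creating an inverted index maps especially cleanly onto map and reduce. The sketch below runs in memory on a hypothetical two-document corpus; the function names and the whitespace tokenizer are assumptions, not the approach used on the SARA cluster.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, doc_id) once for each distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Reduce step: turn all document ids seen for a term into a sorted postings list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Group the mapper output by term, then reduce each group to a postings list."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())


if __name__ == "__main__":
    corpus = {"d1": "hadoop at sara", "d2": "hadoop and pig"}  # hypothetical documents
    print(build_inverted_index(corpus)["hadoop"])
```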
    • Interest from industry We're opening shop. Come and pilot.
    • Final thoughts
      • The tide is rising and the amount of data is not getting smaller: let's ride that wave!
      • Hadoop is the first to provide commodity computing
        • Hadoop is not the only option
        • Hadoop is probably not the best option
        • Hadoop has momentum
        • And how many infrastructures do we need?
      • MapReduce fits surprisingly well as a programming model for data-parallelism
      • The data center is your computer
      • Where is the data scientist? Much to learn & teach!
    • Any questions? [email_address] @eevrt @sara_nl