First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
An introduction to large-scale data processing, and a short overview of SARA's Hadoop related activities.

Presentation Transcript

  • Large-scale data processing [at SARA] [with Apache Hadoop]. Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group
  • Who's who?
    • Who has worked on scale?
      • e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?
    • >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
  • In this talk
    • Why large-scale data processing?
    • An introduction to scale @ SARA
    • An introduction to Hadoop & MapReduce
    • Hadoop @ SARA
  • Why large-scale data processing?
  • (Jimmy Lin, University of Maryland / Twitter, 2011)
  • (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
  • s/knowledge/data/g*
    • HTTP logs, click data, query logs, CRM data, financial data, social networks, archives, crawls, and many more
    • You already have your data
    (*Jimmy Lin, University of Maryland / Twitter, 2011)
  • Data-processing as a commodity
    • Cheap Clusters
    • Simple programming models
    • Easy-to-learn scripting
    • Anybody with the know-how can generate insights!
  • Note: "the know-how" = Data Science: DevOps, programming algorithms, domain knowledge
  • An introduction to scale @ SARA
  • SARA, the national center for scientific computing: facilitating science in The Netherlands with equipment for and expertise on large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization
  • Large-scale data != new
  • Different types of computing
    Parallelism
    • Data parallelism
    • Task parallelism
    Architectures
    • SIMD: Single Instruction Multiple Data
    • MIMD: Multiple Instruction Multiple Data
    • MISD: Multiple Instruction Single Data
    • SISD: Single Instruction Single Data (Von Neumann)
  • Parallelism: Amdahl's law
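Amdahl's law bounds the speedup of a program by the fraction of its work that can run in parallel: S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of workers. The following is a minimal sketch of that formula; the function name `amdahl_speedup` and the sample fraction of 0.95 are illustrative choices, not taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup according to Amdahl's law.

    parallel_fraction: share of the runtime that can be parallelized (0..1).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload in which 95% of the runtime is parallelizable.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

Under this assumption the speedup can never exceed 1 / (1 - 0.95) = 20, no matter how many workers are added, which is why the serial part of a job matters so much at scale.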
  • Data parallelism
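As a rough illustration of data parallelism, the sketch below splits a dataset into partitions, applies the same function to every partition concurrently, and then combines the partial results. All names (`chunk`, `partial_sum_of_squares`, `parallel_sum_of_squares`) are hypothetical. It uses threads from Python's standard library only to show the structure; in CPython, threads do not speed up CPU-bound work, so this is not a performance example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(records, num_chunks):
    """Split records into at most num_chunks contiguous partitions."""
    if not records:
        return []
    size = -(-len(records) // num_chunks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def partial_sum_of_squares(partition):
    """The same operation is applied independently to every partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(records, workers=4):
    partitions = chunk(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, partitions))
    # Combine the per-partition results into the final answer.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```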
  • Compute @ SARA
  • What's different about Hadoop?
    • No more do-it-yourself parallelism (it's hard!), but rather linearly scalable data parallelism
    • Separating the what from the how
    (NYT, 14/06/2006)
  • An introduction to Hadoop & MapReduce
  • A bit of history
    • 2002: Nutch*
    • 2004: MR/GFS**
    • 2006: Hadoop
    * http://nutch.apache.org/
    ** http://labs.google.com/papers/mapreduce.html, http://labs.google.com/papers/gfs.html
  • 2010 - 2012: A Hype in Production (http://wiki.apache.org/hadoop/PoweredBy)
  • Core principles
    • Scale out, not up
    • Move processing to the data
    • Process data sequentially, avoid random reads
    • Seamless scalability
    (Jimmy Lin, University of Maryland / Twitter, 2011)
  • A typical data-parallel problem in abstraction
    1. Iterate over a large number of records
    2. Extract something of interest
    3. Create an ordering in intermediate results
    4. Aggregate intermediate results
    5. Generate output
    MapReduce: a functional abstraction of step 2 (map) and step 4 (reduce)
    (Jimmy Lin, University of Maryland / Twitter, 2011)
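The five steps above can be sketched on a single machine as follows. The input format (lines of `year,temperature`) and the function name `max_temperature_per_year` are made up for illustration; the slides do not prescribe a particular problem.

```python
from itertools import groupby
from operator import itemgetter


def max_temperature_per_year(lines):
    # Steps 1 and 2: iterate over the records and extract (year, temperature).
    extracted = []
    for line in lines:
        year, temperature = line.split(",")
        extracted.append((year, int(temperature)))

    # Step 3: create an ordering so that equal keys end up next to each other.
    extracted.sort(key=itemgetter(0))

    # Step 4: aggregate the intermediate results per key.
    # Step 5: generate the output as a dictionary.
    return {
        year: max(temperature for _, temperature in group)
        for year, group in groupby(extracted, key=itemgetter(0))
    }


if __name__ == "__main__":
    sample = ["1950,0", "1950,22", "1949,111", "1949,78", "1950,-11"]
    print(max_temperature_per_year(sample))
```

Only the extraction in step 2 and the aggregation in step 4 are specific to the problem; the iteration, ordering, and output steps look the same for most data-parallel jobs.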
  • MapReduce
    Programmer specifies two functions
    • map (k, v) -> <k', v'>*
    • reduce (k', v') -> <k', v'>*
    All values associated with a single key are sent to the same reducer
    The framework handles the rest
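To make the two functions concrete, here is a toy, single-process imitation of the model using word count. It is not Hadoop's API: `run_mapreduce`, `map_word_count`, and `reduce_word_count` are hypothetical names, and the "framework" is a few lines that group values by key and assign each key to one of `num_reducers` partitions via a hash. In this sketch the reduce function receives all values for a key as a list.

```python
from collections import defaultdict


def map_word_count(key, value):
    """map (k, v) -> <k', v'>*: emit (word, 1) for every word in a line."""
    for word in value.split():
        yield word.lower(), 1


def reduce_word_count(key, values):
    """reduce: sum all counts that were emitted for a single word."""
    yield key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    # "Shuffle": every intermediate key is assigned to exactly one reducer,
    # so all values for that key end up in the same place.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            partitions[hash(out_key) % num_reducers][out_key].append(out_value)

    output = []
    for partition in partitions:
        for out_key in sorted(partition):  # keys are ordered within a reducer
            output.extend(reduce_fn(out_key, partition[out_key]))
    return output


if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The end")]
    print(dict(run_mapreduce(lines, map_word_count, reduce_word_count)))
```

The programmer writes only the first two functions; everything inside `run_mapreduce` stands in for what the framework takes care of.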
  • The rest? Scheduling, data distribution, ordering, synchronization, error handling...
  • An overview of a Hadoop cluster
  • The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephant Bird, and many others...
  • Hadoop @ SARA
  • Timeline
    • 2009: Piloting Hadoop on Cloud
    • 2010: Test cluster available for scientists
      • 6 machines * 4 cores / 24 TB storage / 16GB RAM
      • Just me!
    • 2011: Funding granted for production service
    • 2012: Production cluster available (~March)
      • 72 machines * 8 cores / 8 TB storage / 64GB RAM
      • Integration with Kerberos for secure multi-tenancy
      • 3 devops, team of consultants
  • Architecture
  • Components: Hadoop, Hive, Pig, HBase, HCatalog; others?
  • What are scientists doing?
    • Information Retrieval
    • Natural Language Processing
    • Machine Learning
    • Econometrics
    • Bioinformatics
    • Computational Ecology / Ecoinformatics
  • Machine learning: Infrawatch, Hollandse Brug
  • Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
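Multiplying out the figures on this slide gives roughly 4.6 * 10^11 measurements per year. The sketch below does that arithmetic. The sample size of 4 bytes is purely an assumption used to turn the count into a storage estimate; the slides do not state how large a measurement is.

```python
def samples_per_year(sensors: int, sample_rate_hz: int, days: int = 365) -> int:
    """Number of measurements produced in a year of continuous sampling."""
    seconds_per_year = 60 * 60 * 24 * days
    return sensors * sample_rate_hz * seconds_per_year


if __name__ == "__main__":
    total = samples_per_year(sensors=145, sample_rate_hz=100)
    print(f"{total:,} measurements per year")

    # Assumption for illustration only: each measurement takes 4 bytes.
    assumed_bytes_per_sample = 4
    terabytes = total * assumed_bytes_per_sample / 10**12
    print(f"about {terabytes:.2f} TB per year at {assumed_bytes_per_sample} bytes per sample")
```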
  • And others: NLP & IR
    • e.g. ClueWeb: a ~13.4 TB webcrawl
    • e.g. Twitter gardenhose data
    • e.g. Wikipedia dumps
    • e.g. del.icio.us & flickr tags
    • Finding named entities: [person company place] names
    • Creating inverted indexes
    • Piloting real-time search
    • Personalization
    • Semantic web
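One of the tasks listed above, creating an inverted index, maps naturally onto the map and reduce steps: map emits (term, document id) pairs and reduce collects the ids per term. The following is a small in-memory sketch with hypothetical names (`map_postings`, `reduce_postings`, `build_inverted_index`) and naive whitespace tokenization; it is not the code used by the researchers mentioned in the talk.

```python
from collections import defaultdict


def map_postings(doc_id, text):
    """Emit (term, doc_id) once for every distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Collect the sorted list of documents that contain the term."""
    return term, sorted(doc_ids)


def build_inverted_index(documents):
    # Group the intermediate (term, doc_id) pairs by term.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_postings(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())


if __name__ == "__main__":
    docs = {1: "Hadoop at SARA", 2: "Hadoop and MapReduce", 3: "SARA"}
    print(build_inverted_index(docs))
```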
  • Interest from industry We're opening shop. Come and pilot.
  • Final thoughts
    • The tide is rising and data volumes are not shrinking: let's ride that wave!
    • Hadoop is the first to provide commodity computing
      • Hadoop is not the only option
      • Hadoop is probably not the best option
      • Hadoop has momentum
      • And how many infrastructures do we need?
    • MapReduce fits surprisingly well as a programming model for data-parallelism
    • The data center is your computer
    • Where is the data scientist? Much to learn & teach!
  • Any questions? [email_address] @eevrt @sara_nl