H is for Hadoop

Introductory talk from 2008

Published in Technology, Education
  • September 2008
  • This is one of Doug Cutting's kid's toys. It is what Hadoop is named after.
  • Lots of people have those Google "our other computer is a datacentre" stickers. We just know where ours is and what runs on it. Soon our machines will have Hadoop stickers on them.
  • Search for the Google MapReduce paper.
  • Everyone talks about word counting and click logs; here's something more fun: mapping out all devices with the same ID, debouncing, and building up stats. Or by time: which are the peak times? Are there special groups (school or college students) who can be identified by the timings and dates of travel?
  • The NameNode really matters: lose that and the cluster is gone. Lose the JobTracker and all ongoing work is lost; currently the entire job chain needs to be restarted. But the rest of the cluster stays live.
  • This is how applications used to be written: an app server driving a cluster, with a database (Oracle, IBM DB2 or something else) behind it all; an O/R mapping infrastructure, either entity beans or Spring/Hibernate, to make Java classes persistent; and JSP front ends. On the side: message beans for queued operations, and CORBA IIOP or Java WS-* to talk to other parts of the enterprise.
  • Here is where things get interesting.
  • We may have our own cluster, but it isn't a stable one. We need to adapt to what gets provided to us.

Transcript

  • 1. September 2008 H is for Hadoop Steve Loughran [email_address] Julio Guijarro [email_address]
  • 2. What is Hadoop?
  • 3. A yellow elephant
  • 4. A use for a datacentre
  • 5. Hadoop is behind Yahoo!
    • Yahoo! has about 10,000 machines running Hadoop
    • Largest cluster is currently 1,600 nodes
    • Storage is about 1 petabyte of user data (compressed)
    • Yahoo! runs about 10,000 research jobs/week
    source: Eric Baldeschwieler, OSCON, July 25 2007
  • 6. Java Cloud Computing Edition
    • A filesystem that scales to petabytes
    • Google's MapReduce implemented in Java
    • The foundation for Yahoo!'s search, last.fm's music correlation, and other datamining applications
    • Open source: Apache hosted
    • A framework for data-centric computation
    • Commodity data processing for commodity data
  • 7. MapReduce
    • Map input data => (key,data')
    • Reduce (key, data')* => (key, data'')
    • Repeat until final output is generated
    • The fun comes applying it to terabytes of data
    • Uses: log analysis, correlations, statistics, indexing
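The map and reduce steps on the MapReduce slide can be sketched in a few lines of plain Python. This is a single-process illustration of the idea only, not Hadoop's API: the names `map_reduce`, `status_mapper` and `count_reducer`, and the sample log lines, are hypothetical and chosen to match the "log analysis" use listed on the slide.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map/reduce round in memory (illustration only)."""
    # Map: every input record yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    # Reduce: each key and its list of values yields output pairs.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def status_mapper(line):
    # Hypothetical log format: "<method> <path> <status>".
    yield line.split()[-1], 1


def count_reducer(key, values):
    yield key, sum(values)


# Made-up sample input, for illustration only.
log_lines = [
    "GET /index.html 200",
    "GET /missing.html 404",
    "GET /about.html 200",
]

counts = map_reduce(log_lines, status_mapper, count_reducer)
print(counts)
```

The output of one round is again a list of (key, value) pairs, so it can be fed to another `map_reduce` call, which is the "repeat until final output is generated" step. A real framework spreads the map and reduce calls across many machines; the grouping by key is what lets it do so.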
  • 8. Example problem: Bluetooth phones
    • Map: Bluetooth device ID
    • Reduce: debounce to list of sightings and duration
    • Map: sightings and durations
    • Reduce: statistics for each device, day of week, …
    lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
    found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
    lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
    found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
    found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
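The first map and reduce pair of the Bluetooth example might look like the sketch below, run here in one process over the five records from the slide. Several things are assumptions rather than anything stated in the talk: the 30-second debounce window, the rule that a "lost" event with no open sighting is ignored, the use of the epoch column for ordering, and the function names `map_event`, `reduce_sightings` and `sightings_by_device`.

```python
import csv
from collections import defaultdict

DEBOUNCE_SECONDS = 30  # assumed threshold; the talk does not give one

LOG = '''lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140'''


def map_event(line):
    """Map: key each record by Bluetooth device ID."""
    event, device_id, _iso_time, epoch = next(csv.reader([line]))
    return device_id, (int(epoch), event)


def reduce_sightings(events, debounce=DEBOUNCE_SECONDS):
    """Reduce: debounce one device's events into (start, end, duration).

    end and duration are None while the device is still in range.
    """
    sightings = []  # [start, end] pairs, end is None while open
    for epoch, event in sorted(events):
        last_end = sightings[-1][1] if sightings else None
        if event == "found":
            if sightings and last_end is not None and epoch - last_end <= debounce:
                sightings[-1][1] = None  # brief dropout: reopen the sighting
            elif not sightings or last_end is not None:
                sightings.append([epoch, None])  # a new sighting starts
            # otherwise the sighting is already open: ignore the duplicate
        elif event == "lost" and sightings and last_end is None:
            sightings[-1][1] = epoch  # close the open sighting
    return [(s, e, None if e is None else e - s) for s, e in sightings]


def sightings_by_device(lines):
    grouped = defaultdict(list)
    for line in lines:
        device_id, event = map_event(line)
        grouped[device_id].append(event)
    return {dev: reduce_sightings(evts) for dev, evts in grouped.items()}


result = sightings_by_device(LOG.splitlines())
for device, sightings in result.items():
    print(device, sightings)
```

With the assumed 30-second window, the 15-second gap between the second "lost" and the following "found" for the first device is treated as a dropout rather than as a departure and return. The second stage on the slide would take these per-device sightings as its input and key them by device, day of week, or hour to build statistics.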
  • 9. Datacentre View
  • 10. Old world: Java EE
  • 11.
  • 12. Layers on Top
    • Pig (from Pig Latin): MapReduce query language
    • Hive: SQL against the data (Facebook)
    • HBase: non-relational database
    • Mahout: machine learning
    • Distributed Lucene: search over HDFS
    • Hama: mathematics
  • 13. Limitations of Hadoop
    • HDFS
      • is not HA: the NameNode is a SPOF
      • does not like small files (neither does S3, GFS)
      • server requirements (esp. RAM) high
    • Performance, scalability limits being discovered
    • Configuration, lifecycle to be improved
    • Need Apache project for web log analysis
    • Diagnostics could be better
    • How power efficient is Hadoop?
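A back-of-envelope sketch can show why the small-files and RAM points on the limitations slide go together: the NameNode keeps metadata for every file and block in memory. The figures used here are assumptions, not numbers from the talk: roughly 150 bytes of heap per file or block object is a commonly quoted rule of thumb, and 64 MiB is taken as the block size. The function name `namenode_heap_bytes` is hypothetical.

```python
BYTES_PER_OBJECT = 150          # assumed rule of thumb, not from the talk
BLOCK_SIZE = 64 * 1024 * 1024   # assumed 64 MiB block size


def namenode_heap_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Rough NameNode memory estimate: one object per file plus one per block."""
    blocks_per_file = max(1, -(-file_size // block_size))  # ceiling division
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


TIB = 2 ** 40

# The same 1 TiB of data, stored two different ways.
small = namenode_heap_bytes(num_files=TIB // (16 * 1024), file_size=16 * 1024)
large = namenode_heap_bytes(num_files=TIB // 2 ** 30, file_size=2 ** 30)

print(f"16 KiB files: {small / 2 ** 30:.2f} GiB of metadata")
print(f"1 GiB files:  {large / 2 ** 20:.2f} MiB of metadata")
```

Under these assumptions the same volume of data costs several thousand times more NameNode memory when stored as small files, which is one reason to pack small records into larger files before loading them.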
  • 14. What to do?
    • Start collecting data now!
    • Look at Hadoop for all your large data storage needs
    • Look at outsourced hosting of the cluster
    • Or learn to manage your own
    • Help code the layers on top to meet your needs
    http://hadoop.apache.org
  • 15. What are we up to?
  • 16.
    • Make Hadoop deployment agile
    • Integrate with dynamic cluster deployments
  • 17. Around Hadoop ... with SmartFrog now Hardware Hadoop Vertical applications Management, Monitoring, Virtualization
  • 18.