H is for Hadoop


Introductory talk from 2008

Published in: Technology, Education

  • September 2008
  • This is one of Doug Cutting's kid's toys. It is what Hadoop is named after.
  • Lots of people have those Google "our other computer is a datacentre" stickers. We just know where ours is and what runs on it. Soon: our machines will have Hadoop stickers on them.
  • Search for the Google MapReduce paper.
  • Everyone talks about word counting and click logs; here's something more fun: mapping out all devices with the same ID, debouncing, building up stats. Or by time: which are the peak times? Are there special groups (school, college students) who can be identified by the timings and dates of their travel?
  • The NameNode really matters: lose that and the cluster is gone. Lose the JobTracker and all ongoing work is lost; currently the entire job chain needs to be restarted. But the rest of the cluster stays live.
  • This is how applications used to be written: an app server driving a cluster, with a database (Oracle, IBM DB2 or something else) behind it all; an O/R mapping infrastructure, either entity beans or Spring/Hibernate, to make Java classes persistent; and JSP front ends. On the side: message beans for queued operations, and CORBA IIOP or Java WS-* to talk to other parts of the enterprise.
  • Here is where things get interesting.
  • We may have our own cluster, but it isn't a stable one. We need to adapt to what gets provided to us.

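The "by time" question in the notes above (which are the peak times?) can be sketched as a single map/reduce pass. This is a minimal single-process illustration in Python, not Hadoop code and not the authors' implementation. It assumes the record layout of the sample lines on the Bluetooth slide (event, device ID, ISO timestamp, epoch seconds); the function names `map_hour`, `reduce_distinct` and `devices_per_hour` are made up for this sketch.

```python
import csv
from collections import defaultdict
from datetime import datetime

# Records copied from the Bluetooth slide: event, device ID, ISO time, epoch seconds.
SAMPLE = """\
lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
"""


def map_hour(line):
    """Map: emit (hour of day, device ID) for each 'found' event."""
    event, device, iso_time, _epoch = next(csv.reader([line]))
    if event == "found":
        yield datetime.fromisoformat(iso_time).hour, device


def reduce_distinct(hour, devices):
    """Reduce: count the distinct devices seen during that hour."""
    return hour, len(set(devices))


def devices_per_hour(lines):
    """Run the map, group by key (a stand-in for the shuffle), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for hour, device in map_hour(line):
            groups[hour].append(device)
    return dict(reduce_distinct(hour, devs) for hour, devs in groups.items())


print(devices_per_hour(SAMPLE.splitlines()))
```

Counting distinct devices rather than raw "found" events keeps a device that briefly drops out and reappears from being counted twice within the same hour.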
    1. H is for Hadoop. Steve Loughran [email_address], Julio Guijarro [email_address]. September 2008
    2. What is Hadoop?
    3. A yellow elephant
    4. A use for a datacentre
    5. Hadoop is behind Yahoo!
        • Yahoo! has about 10,000 machines running Hadoop
        • Largest cluster is currently 1,600 nodes
        • Storage is about 1 petabyte of user data (compressed)
        • Yahoo! runs about 10,000 research jobs/week
        source: Eric Baldeschwieler, OSCON, July 25 2007
    6. Java Cloud Computing Edition
        • A filesystem that scales to petabytes
        • Google's MapReduce implemented in Java
        • The foundation for Yahoo!'s search, last.fm's music correlation, and other datamining applications
        • Open source: Apache hosted
        • A framework for data-centric computation
        • Commodity data processing for commodity data
    7. MapReduce
        • Map input data => (key, data')
        • Reduce (key, data')* => (key, data'')
        • Repeat until final output is generated
        • The fun comes applying it to terabytes of data
        • Uses: log analysis, correlations, statistics, indexing
    8. Example problem: Bluetooth phones
        • Map: Bluetooth device ID
        • Reduce: debounce to list of sightings and duration
        • Map: sightings and durations
        • Reduce: statistics for each device, day of week, …
        lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
        found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
        lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
        found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
        found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
    9. Datacentre View
    10. old world: Java EE
    11. (untitled slide)
    12. Layers on Top
        • Pig (from Pig Latin): MapReduce query language
        • Hive: SQL against the data (Facebook)
        • HBase: non-relational database
        • Mahout: machine learning
        • Distributed Lucene: search over HDFS
        • Hama: mathematics
    13. Limitations of Hadoop
        • HDFS
            • is not HA: the NameNode is a SPOF
            • does not like small files (neither does S3, GFS)
            • server requirements (esp. RAM) high
        • Performance, scalability limits being discovered
        • Configuration, lifecycle to be improved
        • Need Apache project for web log analysis
        • Diagnostics could be better
        • How power efficient is Hadoop?
    14. What to do?
        • Start collecting data now!
        • Look at Hadoop for all your large data storage needs
        • Look at outsourced hosting of the cluster
        • Or learn to manage your own
        • Help code the layers on top to meet your needs
        http://hadoop.apache.org
    15. what are we up to?
    16. 
        • Make Hadoop deployment agile
        • Integrate with dynamic cluster deployments
    17. Around Hadoop ... with SmartFrog now Hardware Hadoop Vertical applications Management, Monitoring, Virtualization
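
The two-pass Bluetooth job from slides 7 and 8 (map by device ID, reduce by debouncing into sightings, then map the sightings and reduce to statistics) can be sketched in plain Python. This is a toy, single-process stand-in for the MapReduce flow, not Hadoop API code and not the authors' implementation. The 30-second debounce threshold, the handling of a "lost" event with no preceding "found", and every function name here are assumptions made for illustration; the statistics are per device only, leaving out the day-of-week breakdown the slide mentions.

```python
import csv
from collections import defaultdict

# Records copied from slide 8: event, device ID, ISO time, epoch seconds.
SAMPLE = """\
lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
"""

DEBOUNCE_GAP = 30  # seconds; an assumed threshold, not taken from the talk


def run_stage(records, mapper, reducer):
    """Toy stand-in for one MapReduce pass: map, group by key, reduce."""
    groups = defaultdict(list)  # plays the role of the shuffle
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


def map_device(line):
    """Map 1: key each event by its Bluetooth device ID."""
    event, device, _iso_time, epoch = next(csv.reader([line]))
    yield device, (int(epoch), event)


def reduce_debounce(device, events, gap=DEBOUNCE_GAP):
    """Reduce 1: merge found/lost events into (start, end) sightings.

    A 'lost' followed by a 'found' within `gap` seconds counts as a bounce
    and reopens the previous sighting. `end` is None while still visible.
    """
    sightings, start, last_lost = [], None, None
    for epoch, event in sorted(events):
        if event == "found":
            if start is not None:
                continue  # already visible: ignore the repeated 'found'
            if sightings and last_lost is not None and epoch - last_lost <= gap:
                start = sightings.pop()[0]  # bounce: reopen the last sighting
            else:
                start = epoch
        else:  # "lost"
            if start is None:
                start = epoch  # the 'found' was never seen; start is unknown
            sightings.append((start, epoch))
            start, last_lost = None, epoch
    if start is not None:
        sightings.append((start, None))  # device still visible at end of data
    return device, sightings


def map_durations(record):
    """Map 2: emit (device, seconds) for every closed sighting."""
    device, sightings = record
    for start, end in sightings:
        if end is not None:
            yield device, end - start


def reduce_stats(device, durations):
    """Reduce 2: per-device count of closed sightings and total seconds."""
    return device, {"sightings": len(durations), "seconds": sum(durations)}


sightings = run_stage(SAMPLE.splitlines(), map_device, reduce_debounce)
stats = run_stage(sightings, map_durations, reduce_stats)
print(sightings)
print(stats)
```

With the assumed 30-second threshold, the 14- and 15-second dropouts in the sample are treated as bounces, so each device ends up with a single sighting that is still open and the second pass has no closed sightings to summarise. Passing a tighter threshold, for example `lambda d, e: reduce_debounce(d, e, gap=10)` as the reducer, splits the first device's events into separate sightings instead.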