    1. H is for Hadoop. Steve Loughran, Julio Guijarro. September 2008
    2. What is Hadoop?
    3. A yellow elephant
    4. A use for a datacentre
    5. Hadoop is behind Yahoo! <ul><li>Yahoo! has about 10,000 machines running Hadoop</li><li>Largest cluster is currently 1,600 nodes</li><li>Storage is about 1 petabyte of user data (compressed)</li><li>Yahoo! runs about 10,000 research jobs/week</li></ul> Source: Eric Baldeschwieler, OSCON, July 25 2007
    6. Java Cloud Computing Edition <ul><li>A filesystem that scales to petabytes</li><li>Google's MapReduce implemented in Java</li><li>The foundation for Yahoo!'s search, last.fm's music correlation, and other data mining applications</li><li>Open source: Apache hosted</li><li>A framework for data-centric computation</li><li>Commodity data processing for commodity data</li></ul>
    7. MapReduce <ul><li>Map: input data => (key, data')</li><li>Reduce: (key, data')* => (key, data'')</li><li>Repeat until final output is generated</li><li>The fun comes from applying it to terabytes of data</li><li>Uses: log analysis, correlations, statistics, indexing</li></ul>
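
The map and reduce signatures on this slide can be illustrated with a small in-memory sketch. This is plain Python, not Hadoop's API: the `map_reduce` driver, the `status_mapper` and `count_reducer` functions, and the `<path> <status>` log format are all assumptions made up for the example. It shows one round of log analysis, counting requests per status code.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map/reduce round in memory (illustrative only, not Hadoop's API)."""
    groups = defaultdict(list)
    # Map phase: each input record yields zero or more (key, data') pairs,
    # which are grouped by key (the "shuffle" step).
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: (key, data')* collapses to a single (key, data'') entry.
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    # Hypothetical log format: "<path> <status>"
    _path, status = line.split()
    yield status, 1


def count_reducer(_key, values):
    return sum(values)


log_lines = ["/index.html 200", "/missing 404", "/about.html 200"]
counts = map_reduce(log_lines, status_mapper, count_reducer)
print(counts)
```

Feeding the output of one round into another `map_reduce` call corresponds to the "repeat until final output is generated" step. A real cluster would distribute the map and reduce phases across machines rather than run them in one process.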
    8. Example problem: Bluetooth phones <ul><li>Map: Bluetooth device ID</li><li>Reduce: debounce to list of sightings and duration</li><li>Map: sightings and durations</li><li>Reduce: statistics for each device, day of week, …</li></ul>
       Sample input:
       lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
       found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
       lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
       found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
       found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
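
The first map/reduce pair on this slide might look roughly like the sketch below, written in plain Python rather than against Hadoop's API. Several things here are assumptions, not taken from the slides: the 30-second `max_gap` debounce threshold, the rule that a leading `lost` event with no earlier `found` is ignored, and the rule that a sighting still open at the end of the data keeps the duration seen so far. The field layout (event, device ID, ISO time, epoch seconds) is inferred from the sample rows.

```python
import csv
from collections import defaultdict

SAMPLE = '''lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140'''


def sighting_mapper(line):
    # Map: key each event by its Bluetooth device ID.
    event, device_id, _iso_time, epoch = next(csv.reader([line]))
    yield device_id, (int(epoch), event)


def debounce_reducer(_device_id, events, max_gap=30):
    # Reduce: turn found/lost events into (start, duration) sightings.
    # max_gap is an assumed threshold: a device that reappears within
    # max_gap seconds of being lost continues the previous sighting.
    sightings = []  # each entry is [start_epoch, end_epoch]
    is_open = False
    for epoch, kind in sorted(events):
        if kind == "found":
            reappeared = sightings and epoch - sightings[-1][1] <= max_gap
            if not reappeared:
                sightings.append([epoch, epoch])
            is_open = True
        elif kind == "lost" and is_open:
            sightings[-1][1] = epoch
            is_open = False
    return [(start, end - start) for start, end in sightings]


def map_reduce(records, mapper, reducer):
    # Minimal in-memory driver: group mapper output by key, then reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


sightings = map_reduce(SAMPLE.splitlines(), sighting_mapper, debounce_reducer)
for device, seen in sightings.items():
    print(device, seen)
```

On the sample rows, device `00:0F:B3:92:05:D3` comes out as a single sighting starting at 1124313089 and lasting 796 seconds, because the 15-second dropout before the final `found` falls inside the assumed threshold. A second round would map these (start, duration) pairs to keys such as device and day of week and reduce them to statistics.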
    9. Datacentre View
    10. Old world: Java EE
    12. Layers on Top <ul><li>Pig (from Pig Latin): MapReduce query language</li><li>Hive: SQL against the data (Facebook)</li><li>HBase: non-relational database</li><li>Mahout: machine learning</li><li>Distributed Lucene: search over HDFS</li><li>Hama: mathematics</li></ul>
    13. Limitations of Hadoop <ul><li>HDFS<ul><li>is not highly available (HA): the NameNode is a single point of failure (SPOF)</li><li>does not like small files (neither do S3 or GFS)</li><li>has high server requirements, especially RAM</li></ul></li><li>Performance and scalability limits are still being discovered</li><li>Configuration and lifecycle management need improvement</li><li>An Apache project for web log analysis is needed</li><li>Diagnostics could be better</li><li>How power efficient is Hadoop?</li></ul>
    14. What to do? <ul><li>Start collecting data now!</li><li>Look at Hadoop for all your large data storage needs</li><li>Look at outsourced hosting of the cluster</li><li>Or learn to manage your own</li><li>Help code the layers on top to meet your needs</li></ul> http://hadoop.apache.org
