Hadoop - Overview

Technical package for a getting-started discussion of Hadoop.

1. Hadoop
   - What is it?
   - What is Map/Reduce?
   - Why does it exist?
   - Can I see an Example?
   - What's the "Architecture"?
   - Who's Using It?
   - Is there More Info?
2. Hadoop – What is It?
   - "Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework." [1]
   - An Apache project to clone Google's internal "MapReduce"
   - A paradigm for programming and processing large data sets
   - [1] http://wiki.apache.org/lucene-hadoop/
   - [2] http://wiki.apache.org/lucene-hadoop/ProjectDescription
3. Hadoop – What is MapReduce?
   - Simplified Data Processing on Large Clusters [1]
     - MapReduce is a programming model and runtime for processing and generating large data sets. A map function generates key/value pairs from the input data, and a reduce function then merges all values associated with the same key. Programs are automatically parallelized and executed by a run-time system that partitions the input data, schedules execution, and manages communication, including recovery from machine failures.
     - This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system!
   - Google stats [2]
   - Learn more [3]
   - [1] http://labs.google.com/papers/mapreduce.html
   - [2] http://googlesystem.blogspot.com/2008/01/google-reveals-more-mapreduce-stats.html
   - [3] http://code.google.com/edu/content/submissions/mapreduce/listing.html
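To make the map/shuffle/reduce flow above concrete, here is a minimal, framework-free Python sketch of the classic word count. The function names and the in-memory "shuffle" step are purely illustrative, not Hadoop APIs; in a real job the framework runs the map and reduce phases in parallel across the cluster.

    from collections import defaultdict

    def map_phase(lines):
        """Emit a (word, 1) key/value pair for every word in the input lines."""
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def shuffle(pairs):
        """Group all values by key, as the framework does between map and reduce."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(groups):
        """Merge all values associated with an equivalent key."""
        for key, values in groups:
            yield (key, sum(values))

    data = ["the quick brown fox", "the lazy dog"]
    # Prints the per-word counts, e.g. 'the' -> 2, 'fox' -> 1.
    print(dict(reduce_phase(shuffle(map_phase(data)))))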
4. Hadoop / MapReduce – Why?
   - Hadoop itself is specifically a reaction to the competitive advantage Google gains from its internal "MapReduce" system.
     - Google has not released "MapReduce" but has participated in and influenced Hadoop development
     - IBM and Yahoo! are among Hadoop's backers
   - Applies the practices of computational clusters to processing data
   - Companies have an increasing amount of information to process
     - Reasons include growth in customers and the increasing need for competitive advantage
   - Promotes an "infrastructure approach" to "application programming"
5. Hadoop – Give me an Example...
   - Map (Python built-in [1])
       >>> seq = range(8)
       >>> def add(x, y): return x + y
       >>> map(add, seq, seq)
       [0, 2, 4, 6, 8, 10, 12, 14]
   - Reduce (Python built-in [1])
       >>> def add(x, y): return x + y
       >>> reduce(add, range(1, 11))
       55
   - Hadoop / MapReduce
     - Evens & odds in [2, 3, 4, 5, 6, 7, 8]
       - Map evens to key '1' and odds to key '0':
         - Evens -> (1, 2), (1, 4), (1, 6), (1, 8)
         - Odds -> (0, 3), (0, 5), (0, 7)
       - Reduce (merge the values for each key):
         - Evens: ('Even', [2, 4, 6, 8]) and Odds: ('Odd', [3, 5, 7])
   - Hadoop examples [2] – e.g. counting words in a file
   - [1] http://docs.python.org/tut/node7.html
   - [2] http://lucene.apache.org/hadoop/docs/current/api/org/apache/hadoop/examples/package-summary.html
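The evens & odds example can be simulated in a few lines of plain Python; the sort-and-group step below stands in for the shuffle that the Hadoop framework performs between the map and reduce phases. This is only an illustration of the data flow, not Hadoop code.

    from itertools import groupby

    data = [2, 3, 4, 5, 6, 7, 8]

    # Map: tag each number with a key, 1 for even and 0 for odd.
    pairs = [(1 if n % 2 == 0 else 0, n) for n in data]

    # "Shuffle": group the pairs by key (groupby needs input sorted by key).
    pairs.sort(key=lambda kv: kv[0])
    grouped = {key: [n for _, n in group]
               for key, group in groupby(pairs, key=lambda kv: kv[0])}

    # Reduce: merge the grouped values under a readable label.
    print({('Even' if key == 1 else 'Odd'): values
           for key, values in grouped.items()})
    # -> {'Odd': [3, 5, 7], 'Even': [2, 4, 6, 8]}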
6. Hadoop – What's the Architecture?
   - Cluster code
     - Written in Java; consists of compute nodes ("tasktrackers") managed by a "jobtracker", plus a distributed file system (HDFS) with a "namenode" managing "datanodes"
   - Infrastructure – typically built on "commodity" hardware [2]
   - Application
     - JNI and C/C++ bindings
     - A "streaming" component for command-line interaction – http://lucene.apache.org/hadoop/docs/r0.15.2/streaming.html
   - Scale – "Hadoop has been demonstrated on clusters of up to 2000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 2.25 hours) and improving. Sort performances on 1400 nodes and 2000 nodes are pretty good too - sorting 14TB of data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5 hours." [1]
   - [1] http://wiki.apache.org/lucene-hadoop/FAQ#3
   - [2] http://wiki.apache.org/lucene-hadoop/MachineScaling
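Because the streaming interface above only requires an executable that reads lines from stdin and writes tab-separated key/value pairs to stdout, a word-count job can be written as two small scripts in any language. The sketch below assumes files named mapper.py and reducer.py; both names are placeholders.

    #!/usr/bin/env python
    # mapper.py: a minimal Hadoop Streaming mapper sketch. Streaming feeds input
    # lines on stdin and expects "key<TAB>value" lines on stdout.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: a minimal Hadoop Streaming reducer sketch. Streaming delivers the
    # mapper output sorted by key, so all pairs with equal keys arrive together.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, value = line.split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

A typical invocation looks like the following (the streaming jar name and the input/output paths vary by release and installation; these are placeholders):

    bin/hadoop jar contrib/hadoop-streaming.jar \
        -input input/ -output output/ \
        -mapper mapper.py -reducer reducer.py

The pair can also be tested without a cluster: cat input.txt | python mapper.py | sort | python reducer.py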
7. Hadoop – Who's Using It [1]?
   - Hadoop Korean User Group – 50-node cluster in a university network
     - Used for development projects
       - Retrieving and analyzing biomedical knowledge
       - Latent semantic analysis, collaborative filtering
       - HBase, HBase shell testing
   - Last.fm – used for chart calculation and web log analysis
     - 25-node cluster (dual Xeon LV 2 GHz, 4 GB RAM, 1 TB storage per node)
     - 10-node cluster (dual Xeon L5320 1.86 GHz, 8 GB RAM, 3 TB storage per node)
   - Nutch – flexible web search engine software
   - Powerset – natural language search
     - Up to 400 instances on Amazon EC2
     - Data storage in Amazon S3
   - Yahoo! – used to support research for ad systems and web search
     - >5000 nodes running Hadoop as of July 2007
     - Biggest cluster: 2000 nodes (2x4-CPU boxes with 3 TB of disk each)
     - Also used for scaling tests to support development of Hadoop on larger clusters
   - Google and IBM [2]
   - [1] http://wiki.apache.org/lucene-hadoop/PoweredBy
   - [2] http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm?campaign_id=rss_tech
8. Hadoop – More Information
   - http://wiki.apache.org/lucene-hadoop/ImportantConcepts
   - http://wiki.apache.org/lucene-hadoop/
   - Overview – http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
   - http://lucene.apache.org/hadoop/docs/r0.15.2/
     - Quickstart – http://lucene.apache.org/hadoop/docs/r0.15.2/quickstart.html
     - Tutorials – http://lucene.apache.org/hadoop/docs/r0.15.2/mapred_tutorial.html
     - Python tutorial – http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
   - Google Code for Educators
     - MapReduce in a Week – http://code.google.com/edu/content/submissions/mapreduce/listing.html
     - HBase: BigTable-like storage
       - Built on Hadoop's DFS, just as BigTable uses Google's GFS
       - http://wiki.apache.org/lucene-hadoop/Hbase
       - Bigtable paper – http://labs.google.com/papers/bigtable.html
   - FAQ – http://wiki.apache.org/lucene-hadoop/FAQ
   - IBM MapReduce Tools for Eclipse – http://www.alphaworks.ibm.com/tech/mapreducetools
