Introduction to Apache Hadoop

Apache Hadoop Presentation by Steve Watt at Data Day Austin 2011

  • Credit: Doug Cutting for slide information
  • Credit: Tom White for picture

    1. Introduction to Apache Hadoop – Steve Watt, IBM Big Data Lead – @wattsteve #datadayaustin – http://stevewatt.blogspot.com
    2. The Origins of Hadoop
    3. The Origins of Hadoop
       • A petabyte-scale explosion of data on the Internet and in the enterprise begs the following questions: How do we handle unstructured data? How do we scale?
       • An example: a need to process 100 TB datasets
         – On 1 node, scanning @ 50 MB/s: 23 days
         – On a 1000-node cluster, scanning @ 50 MB/s: 33 mins
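
    The slide's arithmetic, spelled out (decimal units, 1 TB = 10^6 MB):

        1 node:      100 TB / 50 MB/s = 100,000,000 MB / 50 MB/s = 2,000,000 s ≈ 23 days
        1000 nodes:  each node scans 100 GB = 100,000 MB / 50 MB/s = 2,000 s ≈ 33 mins
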
    4. The Origins of Hadoop
       • In 2004 Google published seminal whitepapers on a new programming paradigm to handle data at Internet scale (Google processes upwards of 20 PB per day using Map/Reduce)
       • http://research.google.com/people/sanjay/index.html
       • The Apache Foundation launched Hadoop – an open-source implementation of Google Map/Reduce and the distributed Google File System
    5. So what exactly is Apache Hadoop? It is a cluster technology with a single master and multiple slaves, designed for commodity hardware. It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce. As data is copied onto HDFS, the framework ensures the data is split into blocks and replicated to other machines (nodes) to provide redundancy. Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine (data locality). Hadoop may execute or re-execute a task on any node in the cluster, and node failures are handled automatically by the framework.
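
    A minimal sketch of that copy-and-replicate step from the command line, assuming a Hadoop install of this era with bin/hadoop on the path (the file paths are illustrative):

        # Copy a local file into HDFS; HDFS splits it into blocks (64 MB was
        # the era's default) and replicates each block (default factor: 3).
        bin/hadoop fs -put /local/data/weblogs.txt /user/hadoop/weblogs.txt

        # Inspect how the file was blocked and where the replicas live.
        bin/hadoop fsck /user/hadoop/weblogs.txt -files -blocks -locations
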
    6. Hadoop – The Hadoop Cluster – Distributed File System – Map/Reduce
    7. Hadoop – Map/Reduce
       • Setup
         – Jobs are submitted to each machine in the cluster to run against the blocks that are local to that particular machine
         – The job specifies an InputFormat, which knows how to read the data in the block
         – The InputFormat supplies a RecordReader, which identifies all the records in the block for processing
       • Map step
         – One map task for each block (aka input split)
         – The map function is called for each record in the input dataset
         – Produces a list of (key, value) pairs
       • Reduce step
         – The reducer receives a sorted list of keys with their corresponding values
         – The reducer is called once for each key
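
    To make the Map and Reduce steps concrete, here is the classic word-count job, a minimal sketch against the Hadoop 0.20-era org.apache.hadoop.mapreduce API (the class names and combiner choice follow the standard example, not this deck):

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

          // Map step: called once per record (here, one line of text);
          // emits a (word, 1) pair for every word in the line.
          public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
              }
            }
          }

          // Reduce step: called once per key with that key's sorted values;
          // sums the counts for each word.
          public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }
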
    8. Hadoop – Map/Reduce on the Cluster
    9. Hadoop – Map/Reduce Logical Flow
    10. Hadoop – Map/Reduce – JobTracker Details
    11. Hadoop – Map/Reduce – Job Details
    12. Examples of Industry using Hadoop
        • Trend analysis of existing unstructured data (such as mining log files for key metrics)
        • Targeted crawling (obtains the data) coupled with information extraction and classification (structures the data)
        • Text analytics – the ability to run extractors over unstructured data to cleanse, structure and normalize it so that it can be queried (via Pig / Hive / BigSheets)
        • A programming model for cloud computing: Hadoop jobs running natively in the cloud, over data stored in the cloud, and storing the output in the cloud – Amazon EC2
    13. The Hadoop Ecosystem
        • Provisioning: ClusterChef / Apache Whirr
        • Offline Systems (Analytics): Hadoop
        • Online Systems (OLTP @ Scale): Cassandra / HBase
        • Scripting: Pig / WuKong
        • DBA: Hive
        • Non-Programmer: BigSheets / DataMeer
        • Load Tooling: Nutch / SQOOP / Flume
        • Diagram: https://github.com/tomwhite/hadoop-ecosystem/raw/master/hadoop-ecosystem.dot.png
    14. Installing and Running Hadoop – Demo
        • Modes: Standalone, Pseudo-Distributed, Fully Distributed
        • Pseudo-Distributed steps (http://stevewatt.blogspot.com) – a sketch of the config files follows after this list:
          – Untar Hadoop in the desired directory
          – Set up passwordless SSH
          – Set JAVA_HOME in conf/hadoop-env.sh
          – Modify conf/hdfs-site.xml, conf/mapred-site.xml and conf/core-site.xml
          – Set conf/masters and conf/slaves to "localhost"
          – Format the NameNode: bin/hadoop namenode -format
          – Start Hadoop: bin/start-all.sh
          – Check runtime status: http://localhost:50030 and http://localhost:50070
          – Run the TeraGen/TeraSort/TeraValidate system test
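
    The deck doesn't show the config files themselves; this is a minimal pseudo-distributed sketch using the property names of Hadoop 0.20-era releases (the ports shown are the conventional single-node choices):

        <!-- conf/core-site.xml: the default filesystem -->
        <configuration>
          <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>

        <!-- conf/hdfs-site.xml: one node, so keep a single replica per block -->
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
        </configuration>

        <!-- conf/mapred-site.xml: where the JobTracker listens -->
        <configuration>
          <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
          </property>
        </configuration>

    For the system test in the last step, the TeraGen/TeraSort/TeraValidate programs ship in the examples jar (its exact name varies by release; hadoop-*-examples.jar is a stand-in, and the row count and paths are illustrative):

        bin/hadoop jar hadoop-*-examples.jar teragen 1000000 /tera/in
        bin/hadoop jar hadoop-*-examples.jar terasort /tera/in /tera/out
        bin/hadoop jar hadoop-*-examples.jar teravalidate /tera/out /tera/report
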
