July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Slides from Chad Vawter's presentation at July 2010 Triangle Hadoop Users Group

  • Speaker notes: Netlib and HPC at ORNL; PVM/MPI as precursors to Hadoop, though not for exactly the same purposes. Messaging/CEP/analytics: algorithmic trading, usually predictive data mining to define complex events ahead of CEP runtime. JTV/DHS: CEP for anti-money laundering (AML), etc. BI: 180,000 reports daily; large-scale ETL (Pentaho and/or Talend). Interests: links between Scala/Hadoop and R/Hadoop.


  • 1. Setting Up Your First Hadoop Cluster Chad Vawter TriHUG Meeting: July 20, 2010
  • 2. Speaker Background
    • Netlib and Parallel Virtual Machine (PVM)
    • High-volume messaging, complex event processing (CEP), and predictive data mining
    • SOA/ESB at the U.S. Department of Homeland Security
    • Banking: BPM, ETL, Reporting and Analytics
    • Interests: Mahout and R/Hadoop, Functional and OO languages for the JVM (Clojure, Scala, etc.)
  • 3. Goals
    • High-level overview of the prerequisites to Hadoop cluster installation and operation
    • High-level overview of the Hadoop configuration files
  • 4. Hadoop Prerequisites
    • Supported Operating Systems
      • Linux
      • Mac OS X
      • BSD
      • OpenSolaris
      • Windows
        • Need Cygwin (especially OpenSSH)
        • Java Service Wrapper from Tanuki Software
    • Supported Java (JRE) versions
      • Java 6 or later
  • 5. Let’s use Linux…
  • 6. Hadoop Distributions
    • Apache Hadoop
    • Cloudera
      • Cloudera’s Distribution for Hadoop (CDH)
        • Flume – streaming data collection (e.g., log files)
        • Oozie – Yahoo!’s workflow engine for complex Hadoop jobs and data pipelines
        • Sqoop - SQL-to-Hadoop database import and export tool
        • Hadoop User Environment (Hue) – UI framework and SDK for visual Hadoop applications
      • Cloudera Enterprise
        • CDH + management and monitoring tools and production support services
    • Yahoo! Distribution of Hadoop
      • Code patches for performance and stability
      • Security
      • Oozie
  • 7. Install the Apache Hadoop Distribution
    • Create a user and group for ownership and permissions
      • e.g., hadoop:hadoop
    • Download Hadoop from the Apache Hadoop releases page:
      • http://hadoop.apache.org/common/releases.html
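A minimal sketch of these installation steps on a Linux box. The user/group names follow the hadoop:hadoop convention from the slide; the release version (0.20.2, current around the time of this talk) and the install path /usr/local/hadoop are example choices, not requirements.

```shell
# Create a dedicated group and user to own the Hadoop installation.
sudo groupadd hadoop
sudo useradd -g hadoop -m hadoop

# Download a release tarball from the Apache releases page and unpack it.
# (hadoop-0.20.2 is an example version; substitute the release you chose.)
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xzf hadoop-0.20.2.tar.gz

# Move it into place and give the hadoop user ownership.
sudo mv hadoop-0.20.2 /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop
```

Repeat (or script) the same layout on every machine in the cluster so that paths are identical everywhere.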
  • 8. Hadoop Configuration
    • SSH configuration
      • Hadoop control scripts communicate with machines in a Hadoop cluster via SSH.
    • Hadoop environment configuration
      • Configure the environment in which the Hadoop daemons run.
    • Configuration parameters for the Hadoop daemons
      • NameNode / DataNode
      • JobTracker / TaskTracker
  • 9. SSH Configuration
    • Hadoop control scripts use SSH for cluster-wide operations, so…
    • In the hadoop user account’s home directory, generate a public/private key pair:
      • ssh-keygen -t rsa -f ~/.ssh/id_rsa
      • The private key will be in the ~/.ssh/id_rsa file.
      • The public key will be in the ~/.ssh/id_rsa.pub file.
  • 10. SSH Configuration (continued)
    • The public key must be in the ~/.ssh/authorized_keys file on each machine in the Hadoop cluster:
      • cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    • Use ssh-agent to avoid having to type the passphrase of the private key when connecting from one machine in the Hadoop cluster to another.
    • Run ssh-add to store the passphrase.
    • We now have secure, encrypted passwordless logins.
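Pulling slides 9 and 10 together, the whole SSH setup for the hadoop account might look like the sketch below (run as the hadoop user; ssh to localhost at the end is just a hypothetical sanity check):

```shell
# Generate a passphrase-protected RSA key pair for the hadoop user.
ssh-keygen -t rsa -f ~/.ssh/id_rsa

# Authorize the public key locally; copy the same id_rsa.pub line into
# ~/.ssh/authorized_keys on every other machine in the cluster.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Start ssh-agent and cache the passphrase once per session, so the
# Hadoop control scripts can ssh between nodes without prompting.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa

# Sanity check: this should log in without asking for a password.
ssh localhost hostname
```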
  • 11. The Hadoop “Environment”
    • Each machine in a Hadoop cluster has a configuration script for environment settings.
    • Edit the hadoop-env.sh Bash script on each machine, or have a mechanism for sharing environment settings; e.g., rsync.
    • Values for many environment variables can be identical for all machines in the cluster. Not all machines will have the same hardware profile, though. Configure each machine’s Hadoop environment so that it best uses its resources.
  • 12. hadoop-env.sh
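As an illustration, an excerpt from a conf/hadoop-env.sh might look like this. The variable names (JAVA_HOME, HADOOP_HEAPSIZE, HADOOP_LOG_DIR, HADOOP_SSH_OPTS) are the ones the stock script exposes; the values shown are examples to adjust per machine, and JAVA_HOME is the only setting that must be defined.

```shell
# conf/hadoop-env.sh (excerpt) -- example values, tune per machine.
export JAVA_HOME=/usr/lib/jvm/java-6-sun       # required; path is an example
export HADOOP_HEAPSIZE=1000                    # MB per daemon (1000 is the default)
export HADOOP_LOG_DIR=/var/log/hadoop          # keep logs off the install directory
export HADOOP_SSH_OPTS="-o ConnectTimeout=5"   # ssh options for the control scripts
```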
  • 13. Read-Only Default Configuration Files
    • src/core/core-default.xml
    • src/hdfs/hdfs-default.xml
    • src/mapred/mapred-default.xml
  • 14. Site-Specific Configuration Files
    • Override the values provided in the default configuration files:
      • conf/core-site.xml
      • conf/hdfs-site.xml
      • conf/mapred-site.xml
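For example, a minimal cluster might override just three properties across the three site files. The property names below (fs.default.name, dfs.replication, mapred.job.tracker) are the ones used by Hadoop 0.20; the hostname `master`, the ports, and the replication factor are example values.

```xml
<!-- conf/core-site.xml: default filesystem URI (hostname is an example) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: HDFS block replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: JobTracker address -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```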
  • 15. Other Configuration Files
    • slaves
      • This file defines which machines will run DataNodes and/or TaskTrackers.
      • Note: We don’t need to specify which machine(s) will run the NameNode and/or the JobTracker. The Hadoop control scripts start those daemons on whichever machine they are run from.
    • hadoop-metrics.properties
    • log4j.properties
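For example, a three-slave cluster's conf/slaves file is just one hostname per line (the names below are hypothetical):

```
slave1
slave2
slave3
```

The related conf/masters file, despite its name, lists the machine(s) that run the secondary namenode, not the NameNode itself.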
  • 16. Hadoop Startup
    • Format a new distributed file system:
    • bin/hadoop namenode -format
    • Start the HDFS on the designated NameNode:
    • bin/start-dfs.sh
    • The start-dfs.sh script consults the conf/slaves file on the NameNode and starts a DataNode daemon on each of the listed slaves.
    • Start MapReduce on the designated JobTracker:
    • bin/start-mapred.sh
    • The start-mapred.sh script consults the conf/slaves file on the JobTracker and starts a TaskTracker daemon on each of the listed slaves.
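Put together, a typical first boot might look like the sketch below, run as the hadoop user on the master node. The install path is the example /usr/local/hadoop from earlier; jps at the end is a hypothetical sanity check, not part of the Hadoop scripts.

```shell
cd /usr/local/hadoop

# One-time only: formatting destroys any existing HDFS metadata.
bin/hadoop namenode -format

# Start HDFS (NameNode here; DataNodes on each host in conf/slaves).
bin/start-dfs.sh

# Start MapReduce (JobTracker here; TaskTrackers on the slaves).
bin/start-mapred.sh

# Verify: jps should list NameNode and JobTracker on the master,
# and DataNode and TaskTracker on each slave.
jps
```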
  • 17. Hadoop Shutdown
    • Stop the HDFS on the designated NameNode:
    • bin/stop-dfs.sh
    • The stop-dfs.sh script consults the conf/slaves file on the NameNode and stops the DataNode daemon on each of the listed slaves.
    • Stop MapReduce on the designated JobTracker:
    • bin/stop-mapred.sh
    • The stop-mapred.sh script consults the conf/slaves file on the JobTracker and stops the TaskTracker daemon on each of the listed slaves.
  • 18. Other Hadoop Installation Options
    • Cloud Computing with Hadoop
      • Amazon EC2
        • Xen open-source virtual machine monitor (hypervisor)
      • Amazon Elastic MapReduce
      • VMware vCloud
      • Windows Azure?
  • 19. TriHUG Meeting Suggestions?
    • Hadoop Performance-Tuning with Advanced Configuration
    • Data Warehousing and Large-Scale Extraction, Transformation and Loading (ETL) with Hadoop
    • High-Volume Reporting with Hadoop
    • Hadoop and Object-Functional Languages for the JVM
    • Others?
  • 20. Resources - Hadoop
    • Apache Hadoop
      • http://hadoop.apache.org/
    • Hadoop: The Definitive Guide
      • http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979/ref=sr_1_1?ie=UTF8&s=books&qid=1279640275&sr=8-1
    • Downloading and Installing Hadoop
      • http://wiki.apache.org/hadoop/GettingStartedWithHadoop
    • Cloudera’s Hadoop Distribution
      • http://www.cloudera.com/
    • Yahoo’s Hadoop Distribution
      • http://developer.yahoo.com/hadoop/
  • 21. Resources - Hadoop (continued)
    • Supported Java Versions
      • http://wiki.apache.org/hadoop/HadoopJavaVersions
    • Hadoop on Windows with Eclipse
      • http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html
  • 22. Resources - Amazon EC2
    • Amazon Elastic Compute Cloud (EC2)
      • http://aws.amazon.com/ec2/
    • Amazon Elastic MapReduce
      • http://aws.amazon.com/elasticmapreduce/
    • EC2 Starter’s Guide for Ubuntu
      • https://help.ubuntu.com/community/EC2StartersGuide
  • 23. Resources - Miscellaneous
    • Xen Open-Source Virtual Machine Monitor
      • http://www.xen.org/
    • Virtualization - Comparison
      • http://www.virtualbox.org/wiki/VBox_vs_Others
  • 24. Keep in Touch
    • [email_address]