Hadoop
                 divide and conquer petabyte-scale data




© Matthew McCullough, Ambient Ideas, LLC
Metadata

Matthew McCullough
  Ambient Ideas, LLC
  matthewm@ambientideas.com
  http://ambientideas.com/blog
  @matthewmccull
  Code: http://github.com/matthewmccullough/hadoop-intro
http://delicious.com/matthew.mccullough/hadoop
“I use Hadoop often. That’s the sound my Tivo makes every time I skip a commercial.”
  -Brian Goetz, author of Java Concurrency in Practice
Why?
The computer you are using right now may very well have the fastest GHz processor you’ll ever own.
Scale up?
Scale out
Talk Roadmap

History
MapReduce
Filesystem
DSLs
History
Doug Cutting
open source
Lucene
Lucene
Nutch
Lucene
Nutch
Hadoop
Lucene
Nutch
Hadoop
Lucene
Mahout Nutch
   Hadoop
Today
0.21.0 current version

Dozens of companies contributing

Hundreds of companies using
Virtual
Machine
VM Sources

 Yahoo
  True to the OSS distribution
 Cloudera
  Desktop tools
 Both VMWare based
Set up the Environment


      • Launch “cloudera-training 0.3.3” VMWare instance
      • Open a terminal window
Lab
Log Files



      •Attach tail to all the Hadoop logs
Lab
processing / Data
Processing Framework          Data Storage
MapReduce                     HDFS
Hive QL                       Hive
Chukwa                        HBase
Flume                         Avro
ZooKeeper
Hybrid: Pig
Hadoop
Components
Hadoop Components
   Tool        Purpose
   Common      MapReduce
   HDFS        Filesystem
   Pig         Analyst MapReduce language
   HBase       Column-oriented data storage
   Hive        SQL-like language for HBase
   ZooKeeper   Workflow & distributed transactions
   Chukwa      Log file processing
the Players



Chukwa   ZooKeeper   Common   HBase   Hive   HDFS
Motivations
Original Goals


 Web Indexer
 Yahoo Search Engine
Pre-Hadoop
Shortcomings

 Search engine update frequency
 Storage costs
 Expense of durability
 Ad-hoc queries
Anti-Patterns Cured


 RAM-heavy RDBMS boxes
 Sharding
 Archiving
 Ever-smaller-range SQL queries
1 Gigabyte
1 Terabyte
1 Petabyte
16 Petabytes
Near-Linear Hardware Scalability
Applications

 Protein folding
  pharmaceutical research

 Search Engine Indexing
  walking billions of web pages

 Product Recommendations
  based on other customer purchases

 Sorting
  terabytes to petabytes in size

 Classification
  government intelligence
Contextual Ads
SELECT reccProd.name, reccProd.id
 FROM products reccProd
 WHERE purchases.customerId =
  (SELECT customerId
   FROM customers
   WHERE purchases.productId = thisProd)
 LIMIT 5
30%
of Amazon sales are from
recommendations
ACID


 ATOMICITY

 CONSISTENCY

 ISOLATION

 DURABILITY
CAP


 Consistency
 Availability
 Partition Tolerance
MapReduce
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, Google, Inc. (OSDI 2004)

“A programming model and implementation for processing and generating large data sets”
MapReduce
a word counting conceptual example
The Goal


Provide the occurrence count
of each distinct word across
all documents
Raw Data
             a folder of documents

mydoc1.txt     mydoc2.txt            mydoc3.txt
Map
break documents into words
Shuffle
physically group (relocate) by key
Reduce
count word occurrences
Reduce Again
  sort occurrences alphabetically
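The four phases above can be sketched in plain Java as a hypothetical single-process simulation (no Hadoop required); on a real grid each phase runs distributed, and the shuffle physically relocates data between nodes:

```java
import java.util.*;

public class WordCountSketch {

    // Map + shuffle + reduce over an in-memory "folder of documents"
    static Map<String, Integer> countWords(List<String> docs) {
        // Map: break documents into (word, 1) pairs
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : docs)
            for (String word : doc.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle: group pairs by key (on a real grid this relocates data)
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());

        // Reduce: count occurrences; the TreeMap plays the role of the
        // second reduce, keeping words sorted alphabetically
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(), e.getValue().size());
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b a", "b c", "a c c");
        System.out.println(countWords(docs));  // {a=3, b=2, c=3}
    }
}
```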
Grep.java
package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}  // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    Path tempDir = new Path("grep-temp-" +
        Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    try {
      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);

      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      sortJob.setNumReduceTasks(1);  // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setOutputKeyComparatorClass  // sort by decreasing freq
        (LongWritable.DecreasingComparator.class);

      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
RegexMapper.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** A {@link Mapper} that extracts text matching a regular expression. */
public class RegexMapper<K> extends MapReduceBase
    implements Mapper<K, Text, Text, LongWritable> {

  private Pattern pattern;
  private int group;

  public void configure(JobConf job) {
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    group = job.getInt("mapred.mapper.regex.group", 0);
  }

  public void map(K key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      output.collect(new Text(matcher.group(group)), new LongWritable(1));
    }
  }
}
Have Code,
  Will Travel


 Code travels to the data
 Opposite of traditional systems
Filesystem
“scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware”
HDFS Basics

 Open Source implementation of
 the Google File System (GFS) design
 Replicated data store
 Stored in 64MB blocks
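The block math behind those bullets can be sketched in Java (the 64 MB block size is from the slide; the 3x replication factor and 1 GB file size are typical, hypothetical values):

```java
public class HdfsBlockMath {
    static final long BLOCK = 64L * 1024 * 1024;  // 64 MB HDFS block

    // Number of blocks a file of the given size occupies
    static long blocks(long fileBytes) {
        return (fileBytes + BLOCK - 1) / BLOCK;  // ceiling division
    }

    // Raw bytes stored across the grid once replication is applied
    static long rawBytes(long fileBytes, int replicationFactor) {
        return fileBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;               // hypothetical file
        System.out.println(blocks(oneGb));              // 16 blocks
        System.out.println(rawBytes(oneGb, 3) / oneGb); // 3 (GB, with factor 3)
    }
}
```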
HDFS

 Rack location aware
 Configurable redundancy factor
 Self-healing
 Looks almost like *NIX filesystem
Why HDFS?


 Random reads
 Parallel reads
 Redundancy
Data Overload
Test HDFS


      •   Upload a file
      •   List directories
Lab
HDFS Upload


      •   Show contents of file in HDFS
      •   Show vocabulary of HDFS
Lab
HDFS Distcp


      •   Copy file to other nodes
      •   MapReduce job starts
Lab
HDFS Challenges

 Writes are re-writes today
  Append is planned
 Block size, alignment
 Small file inefficiencies
 NameNode SPOF
FSCK



      •Check filesystem
Lab
NO SQL
Data Categories
Structured




Semi-structured




 Unstructured
NOSQL is...
Semi-structured


                   not death of the RDBMS
                   no JOINing
                   no Normalization
                   big-data tools solving different
                   issues than RDBMSes
Grid Benefits
Scalable


 Data storage is pipelined
 Code travels to data
 Near linear hardware scalability
Optimization


 Preemptive execution
 Maximizes use of faster hardware
 New jobs trump P.E.
Fault Tolerant

 Configurable data redundancy
 Minimizes hardware failure impact
 Automatic job retries
 Self healing filesystem
Sproinnnng!

Bzzzt!

                       Poof!
server Funerals



No pagers go off when machines die
Report of dead machines once a week
 Clean out the carcasses
Robustness attributes prevented from bleeding into application code:

 Data redundancy
 Node death
 Retries
 Data geography
 Parallelism
 Scalability
Node TYPES
NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
[Diagram: robust primary NameNode (1) with hot-backup Secondary NameNode (0..1) via NFS; JobTracker; commodity TaskTrackers and DataNodes (1..N)]
NameNode


 Catalog of data blocks
  Prefers large files
  High memory consumption

 Journaled file system
  Saves state t...
[Diagram: robust primary NameNode and JobTracker (1); commodity TaskTracker and DataNode (1..N)]
Secondary NameNode

 Near-hot backup for NN
 Snapshots
  Missing journal entries
  Data loss in non-optimal setup
 Grid reconfig?
 Big IP device
[Diagram: NameNode (robust primary, 1) with Secondary NameNode as hot backup (0..1); JobTracker]
NameNode, Cont’d

 Backup via SNN
  Journal to NFS with replication?
 Single point of failure
  Failover plan
  Big IP device, virtual IP
[Diagram: NameNode (1) journaling to NFS; Secondary NameNode hot backup (0..1); JobTracker; commodity TaskTracker and DataNode (1..N) behind a virtual IP]
DataNode

 Anonymous
 Data block storage
 “No identity” is positive trait
 Commodity equipment
JobTracker
 Singleton
 Commonly co-located with NN
 Coordinates TaskTracker
 Load balances
 FairPlay scheduling
 Preemptive execution
TaskTracker


 Runs MapReduce jobs
 Reports back to JobTracker
Listing Nodes


      •Use Java’s jps tool to get a list of
       nodes on the box
      •sudo jps -l
Lab
Processing Nodes


 Anonymous
 “No identity” is positive trait
 Commodity equipment
Master Node

 Master is a special machine
 Use high quality hardware
 Single point of failure
 Recoverable
Key-value
  Store
HBase
HBase Basics
 Map-oriented storage
 Key value pairs
 Column families
 Stores to HDFS
 Fast
 Usable for synchronous responses
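The map-oriented, column-family model above can be pictured as nested sorted maps: row key → column family → column qualifier → value. A minimal in-memory sketch in Java (names are hypothetical; real HBase also versions each cell with a timestamp and persists to HDFS):

```java
import java.util.*;

public class ColumnFamilySketch {
    // row key -> column family -> column qualifier -> value
    static Map<String, Map<String, Map<String, String>>> table = new TreeMap<>();

    static void put(String row, String family, String qualifier, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .put(qualifier, value);
    }

    static String get(String row, String family, String qualifier) {
        return table.getOrDefault(row, Map.of())
                    .getOrDefault(family, Map.of())
                    .get(qualifier);
    }

    public static void main(String[] args) {
        // mirrors the shell example: put 'mylittletable', 'r2', 'mylittlecolumnfamily', 'x'
        put("r2", "mylittlecolumnfamily", "col", "x");
        System.out.println(get("r2", "mylittlecolumnfamily", "col"));  // x
    }
}
```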
Direct Comparison



 Amazon SimpleDB
 Google BigTable (Datastore)
Competitors


 Facebook Cassandra
 LinkedIn Project Voldemort
Shortcomings

 HQL
 WHERE clause limited to key
 Other column filters are full-table scans
 No ad-hoc queries
hbase>
help

create 'mylittletable', 'mylittlecolumnfamily'
describe 'mylittletable'

put 'mylittletable', 'r2', 'mylittlecolumnfamily', 'x'
get 'mylittletable', 'r2'
scan 'mylittletable'
HBase


      •   Create column family
      •   Insert data
      •   Select data
Lab
DSLs
Pig
Pig Basics

 Yahoo-authored add-on
 High-level language for authoring
 data analysis programs
 Console
PIG Questions
 Ask big questions on unstructured
 data
  How many ___?
  Should we ____?
 Decide on the questions you want to
 ask long after you’ve collected the data.
Pig Data

999991,female,Mary,T,Hargrave,600 Quiet Valley Lane,Los Angeles,CA,90017,US,Mary.T.Hargrave@dodgit.com,
999992,male,Harold,J,Candelario,294 Ford Street,OAKLAND,CA,94607,US,Harold.J.Candelario@dodgit.com,
Pig Sample

Person = LOAD 'people.csv' using PigStorage(',');
Names = FOREACH Person GENERATE $2 AS name;
OrderedNames = ORDER Names BY name ASC;
GroupedNames = GROUP OrderedNames BY name;
NameCount = FOREACH GroupedNames GENERATE group, COUNT(OrderedNames);
store NameCount into 'names.out';
Pig Scripting



      •Re-column and sort data
Lab
Hive
Hive Basics
 Authored by Facebook
 SQL interface to HBase
 Hive is low-level
 Hive-specific metadata
 Data warehousing
hive
  -e 'select t1.x from mylittletable t1'
hive -S
  -e 'select a.col from tab1 a'
  > dump.txt
SELECT * FROM shakespeare
WHERE freq > 100
SORT BY freq ASC
LIMIT 10;
Load Hive Data



      • Load and select Hive data
Lab
Sync, Async


RDBMS SQL is realtime
Hive is primarily asynchronous
Sqoop
Sqoop


 A utility
 Imports from RDBMS
 Outputs plaintext, SequenceFile, or Hive
sqoop --connect jdbc:mysql://database.example.com/
Data
Warehousing
Sync, Async


RDBMS is realtime
HBase is near realtime
Hive is asynchronous
Monitoring
Web Status Panels


 NameNode
 http://localhost:50070/
 JobTracker
 http://localhost:50030/
View Consoles


      •   Start job
      •   Watch consoles
      •   Browse HDFS via web interface
Lab
Configuration
     Files
Grid Execution
Single Node
 start-all.sh
 Needs XML config
 ssh account setup
 Multiple JVMs
 List of processes
  jps -l
Start and Stop Grid

      •cd /usr/lib/hadoop-0.20/bin
      •sudo -u hadoop ./stop-all.sh
      •sudo jps -l
      •tail
      •sudo -u hadoop ./start-all.sh
      •sudo jps -l
SSH Setup


 create key
 ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
 $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
[Diagram: start-all.sh on the robust primary (NameNode, JobTracker, 1) invokes start-dfs.sh and start-mapred.sh, reaching commodity TaskTracker and DataNode (1..N)]
Complete Grid


 start-all.sh
 Host names, ssh setup
 Default out-of-box experience
[Diagram: start-all.sh fans out start-dfs.sh and start-mapred.sh from the NameNode/JobTracker to each commodity TaskTracker/DataNode (1..N)]
Grid on the Cloud
EMR Pricing
Summary
Sapir-Whorf Hypothesis...
                Remixed

The ability to store and process
massive data influences what you
decide to store.
Hadoop
                 divide and conquer petabyte-scale data




© Matthew McCullough, Ambient Ideas, LLC
Metadata

Matthew McCullough
  Ambient Ideas, LLC
  matthewm@ambientideas.com
  http://ambientideas.com/blog
  @matthewmccull
  Code: http://github.com/matthewmccullough/hadoop-intro
Hadoop at JavaZone 2010
Matthew McCullough's presentation of Hadoop to the JavaZone 2010 conference in Oslo, Norway
