Hadoop

Introductory presentation on Apache Hadoop and Apache Hive.

Published in: Technology

  1. Hadoop. Scott Leberknight
  2. Yahoo! "Search Assist"
  3. Notable Hadoop users: Yahoo!, LinkedIn, Facebook, New York Times, Twitter, Rackspace, Baidu, eHarmony, eBay, Powerset. http://wiki.apache.org/hadoop/PoweredBy
  4. Hadoop in the Real World...
  5. Recommendation systems; financial analysis; natural language processing (NLP); correlation engines; image/video processing; data warehousing; log analysis; market research/forecasting
  6. Finance; social networking; health & life sciences; academic research; government; telecommunications
  7. History...
  8. Inspired by Google's GFS and MapReduce papers circa 2004; created by Doug Cutting; originally built to support distribution for the Nutch search engine; named after a stuffed elephant
  9. OK, so what exactly is Hadoop?
  10. An open source... batch/offline oriented... data & I/O intensive... general purpose framework for creating distributed applications that process huge amounts of data.
  11. One definition of "huge": 25,000 machines; more than 10 clusters; 3 petabytes of data (compressed, unreplicated); 700+ users; 10,000+ jobs/week
  12. Hadoop Major Components: Distributed File System (HDFS); Map/Reduce System
  13. But first, what isn't Hadoop?
  14. Hadoop is NOT: ...a relational database! ...an online transaction processing (OLTP) system! ...a structured data store of any kind!
  15. Hadoop vs. Relational
  16. Hadoop vs. Relational: scale-out vs. scale-up (*); key/value pairs vs. tables; say how to process the data vs. say what you want (SQL); offline/batch vs. online/real-time. (*) Sharding attempts to horizontally scale an RDBMS, but is difficult at best
  17. HDFS (Hadoop Distributed File System)
  18. Data is distributed and replicated over multiple machines; designed for large files (where "large" means GB to TB); block oriented; Linux-style commands, e.g. ls, cp, mv, rm, etc.
  19. NameNode file block mappings: /user/aaron/data1.txt -> 1, 2, 3; /user/aaron/data2.txt -> 4, 5; /user/andrew/data3.txt -> 6, 7. DataNode(s): [diagram: each of blocks 1 to 7 appears three times across the DataNodes]
  20. Self-healing: fault tolerant when nodes fail; rebalances files across the cluster. Scalable: just by adding new nodes!
  21. Map/Reduce
  22. Split input files (e.g. by HDFS blocks); operate on key/value pairs; mappers filter & transform input data; reducers aggregate mapper output
  23. Move code to data
  24. map: (K1, V1) -> list(K2, V2); reduce: (K2, list(V2)) -> list(K3, V3)
  25. Word Count (the canonical Map/Reduce example)
  26. the quick brown fox / jumped over / the lazy brown dog
  27. Map phase inputs, (K1, V1): (0, "the quick brown fox"), (20, "jumped over"), (32, "the lazy brown dog")
  28. Map phase outputs, list(K2, V2): ("the", 1), ("quick", 1), ("brown", 1), ("fox", 1), ("jumped", 1), ("over", 1), ("the", 1), ("lazy", 1), ("brown", 1), ("dog", 1)
  29. Reduce phase inputs, (K2, list(V2)): ("brown", (1, 1)), ("dog", (1)), ("fox", (1)), ("jumped", (1)), ("lazy", (1)), ("over", (1)), ("quick", (1)), ("the", (1, 1))
  30. Reduce phase outputs, list(K3, V3): ("brown", 2), ("dog", 1), ("fox", 1), ("jumped", 1), ("lazy", 1), ("over", 1), ("quick", 1), ("the", 2)
  31. WordCount in code...
  32. public class SimpleWordCount extends Configured implements Tool {
          public static class MapClass
                  extends Mapper<Object, Text, Text, IntWritable> { ... }
          public static class Reduce
                  extends Reducer<Text, IntWritable, Text, IntWritable> { ... }
          public int run(String[] args) throws Exception { ... }
          public static void main(String[] args) throws Exception { ... }
      }
  33. public static class MapClass
              extends Mapper<Object, Text, Text, IntWritable> {
          // Constant count emitted with every word
          private static final IntWritable ONE = new IntWritable(1);
          private Text word = new Text();
          @Override
          protected void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Emit (word, 1) for each whitespace-delimited token in the line
              StringTokenizer st = new StringTokenizer(value.toString());
              while (st.hasMoreTokens()) {
                  word.set(st.nextToken());
                  context.write(word, ONE);
              }
          }
      }
  34. public static class Reduce
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          private IntWritable count = new IntWritable();
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              // Sum all the counts received for this word
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              count.set(sum);
              context.write(key, count);
          }
      }
  35. public int run(String[] args) throws Exception {
          Configuration conf = getConf();
          Job job = new Job(conf, "Counting Words");
          job.setJarByClass(SimpleWordCount.class);
          job.setMapperClass(MapClass.class);
          job.setReducerClass(Reduce.class);
          // Output types must match what the reducer writes: (Text, IntWritable)
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.setInputPaths(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          return job.waitForCompletion(true) ? 0 : 1;
      }
  36. public static void main(String[] args) throws Exception {
          int result = ToolRunner.run(new Configuration(),
                                      new SimpleWordCount(), args);
          System.exit(result);
      }
  37. Map/Reduce Data Flow (Image from Hadoop in Action...great book!)
  38. Partitioning: deciding which keys go to which reducer; desire even distribution across reducers; skewed data can overload a single reducer!
  39. Map/Reduce Partitioning & Shuffling (Image from Hadoop in Action...great book!)
  40. Combiner: effectively a reduce in the mappers, a.k.a. "Local Reduce"
  41. Shuffling WordCount (looking at one mapper that sees the word "the" 1000 times): without combiner, data ("the", 1), # k/v pairs shuffled 1000; with combiner, data ("the", 1000), # k/v pairs shuffled 1
  42. Advanced Map/Reduce: Hadoop Streaming; chaining Map/Reduce jobs; joining data; Bloom filters
  43. Architecture
  44. HDFS: NameNode, Secondary NameNode, DataNode. Map/Reduce: JobTracker, TaskTracker
  45. [Diagram: Secondary NameNode, NameNode, and JobTracker; DataNode1, DataNode2, ... DataNodeN alongside TaskTracker1, TaskTracker2, ... TaskTrackerN, each running map and reduce tasks]
  46. NameNode: bookkeeper for HDFS; manages DataNodes; should not store data or run jobs; single point of failure!
  47. DataNode: stores actual file blocks on disk; does not store entire files! Reports block info to NameNode; receives instructions from NameNode
  48. Secondary NameNode: snapshot of NameNode; not a failover server for NameNode! Helps minimize downtime/data loss if NameNode fails
  49. JobTracker: partitions tasks across HDFS cluster; tracks map/reduce tasks; re-starts failed tasks on different nodes; speculative execution
  50. TaskTracker: tracks individual map & reduce tasks; reports progress to JobTracker
  51. Monitoring/Debugging
  52. Distributed processing; distributed debugging
  53. Logs: view task logs on the machine where the specific task was processed (or via web UI); $HADOOP_HOME/logs/userlogs on the task tracker
  54. Counters: define one or more counters; increment counters during map/reduce tasks; counter values are displayed in the job tracker UI
  55. IsolationRunner: re-run failed tasks with original input data; must set keep.failed.task.files to true
  56. Skipping Bad Records: data may not always be clean; new data may have new interesting twists; can you pre-process to filter & validate input?
  57. Performance Tuning
  58. Speculative execution (on by default); use a Combiner; reduce amount of input data; JVM re-use (be careful); refactor code/algorithms; data compression
  59. Managing Hadoop
  60. Lots of knobs; needs active management; "fair" scheduling; NameNode/SNN management; trash can; add/remove data nodes; network topology/rack awareness; permissions/quotas
  61. Hive
  62. Simulate structure for data stored in Hadoop; query language analogous to SQL (Hive QL); translates queries into Map/Reduce job(s)... ...so not for real-time processing!
  63. Queries: projection; joins (inner, outer, semi); grouping; aggregation; sub-queries; multi-table insert. Customizable: user-defined functions; input/output formats with SerDe
  64. Patent citation dataset (http://www.nber.org/patents), /user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt:
          "CITING","CITED"
          3858241,956203
          3858241,1324234
          3858241,3398406
          3858241,3557384
          3858241,3634889
          3858242,1515701
          3858242,3319261
          3858242,3668705
          3858242,3707004
          3858243,2949611
          3858243,3146465
          3858243,3156927
          3858243,3221341
          3858243,3574238
          ...
  65. create external table patent_citations (citing string, cited string)
      row format delimited fields terminated by ','
      stored as textfile
      location '/user/sleberkn/nber-patent/tables/patent_citation';
      create table citation_histogram (num_citations int, count int)
      stored as sequencefile;
  66. insert overwrite table citation_histogram
      select num_citations, count(num_citations) from
          (select cited, count(cited) as num_citations
           from patent_citations group by cited) citation_counts
      group by num_citations
      order by num_citations;
  67. Hadoop in the clouds
  68. Amazon EC2 + S3: EC2 instances are compute nodes (Map/Reduce). Storage options: HDFS on EC2 nodes; HDFS on EC2 nodes loading data from S3; native S3 (bypasses HDFS)
  69. Amazon Elastic MapReduce: interact via web-based console; submit Map/Reduce job (streaming, Hive, Pig, or JAR); EMR configures & launches Hadoop cluster for job; uses S3 for data input/output
  70. Recap...
  71. Hadoop = HDFS + Map/Reduce; distributed, parallel processing; designed for fault tolerance; horizontal scale-out; structure & queries via Hive
  72. References
  73. http://hadoop.apache.org/; http://hadoop.apache.org/hive/; Hadoop in Action, http://www.manning.com/lam/; Definitive Guide to Hadoop, 2nd ed., http://oreilly.com/catalog/0636920010388; Yahoo! Hadoop blog, http://developer.yahoo.net/blogs/hadoop/; Cloudera, http://www.cloudera.com/
  74. http://lmgtfy.com/?q=hadoop; http://www.letmebingthatforyou.com/?q=hadoop
  75. (my info) scott.leberknight@nearinfinity.com; www.nearinfinity.com/blogs/; twitter: sleberknight
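
The map, shuffle, and reduce phases walked through in the Word Count slides can be sketched in plain Java, with no Hadoop dependency, to make the (K1, V1) -> list(K2, V2) and (K2, list(V2)) -> list(K3, V3) signatures concrete. This is only an in-memory illustration: the class name `WordCountSketch` and its helper methods are hypothetical and are not part of the Hadoop API, and a real job distributes these steps across machines. It assumes Java 9 or later for `Map.entry`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the word count data flow; not Hadoop code.
public class WordCountSketch {

    // map: (K1, V1) -> list(K2, V2). The offset key is unused, as in the deck's mapper.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(Map.entry(st.nextToken(), 1));
        }
        return out;
    }

    // Shuffle: group mapper output by key (sorted), producing (K2, list(V2)).
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce: (K2, list(V2)) -> V3; sums the counts for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs all three phases over the input lines and returns word -> count.
    static TreeMap<String, Integer> wordCount(String[] lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1; // +1 for the newline, giving 0, 20, 32
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {"the quick brown fox", "jumped over", "the lazy brown dog"};
        // prints {brown=2, dog=1, fox=1, jumped=1, lazy=1, over=1, quick=1, the=2}
        System.out.println(wordCount(lines));
    }
}
```

The sorted `TreeMap` stands in for the sort that happens during the shuffle, which is why the reduce inputs on slide 29 appear in alphabetical key order.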

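The partitioning and combiner slides can be illustrated the same way. The sketch below assumes hash partitioning of the form `(hashCode & Integer.MAX_VALUE) % numReducers`, which mirrors the formula in Hadoop's default `HashPartitioner`; it is applied here to `java.lang.String` rather than Hadoop's `Text`, so the actual reducer numbers would differ in a real job. The class `ShuffleSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partitioning and a combiner ("local reduce"); not Hadoop code.
public class ShuffleSketch {

    // Picks a reducer for a key: non-negative hash modulo the number of reducers.
    // The same key always lands on the same reducer, so a very frequent key
    // (skewed data) concentrates its load on one reducer.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Combiner: pre-aggregates one mapper's (word, 1) pairs before the shuffle.
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (String word : emittedWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        // One mapper that sees the word "the" 1000 times, as on the shuffling slide.
        List<String> emitted = Collections.nCopies(1000, "the");
        Map<String, Integer> combined = combine(emitted);

        System.out.println("without combiner: " + emitted.size() + " pairs shuffled");
        System.out.println("with combiner: " + combined.size() + " pair shuffled " + combined);

        // Reducer assignment for a few words, assuming 4 reducers.
        for (String word : new String[]{"the", "quick", "brown", "fox"}) {
            System.out.println(word + " -> reducer " + partition(word, 4));
        }
    }
}
```

A combiner is only safe when the reduce operation can be applied to partial results, as summing can; that is why word count can reuse its reducer logic locally in each mapper.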