Introductory presentation on Apache Hadoop and Apache Hive.

Hadoop - Presentation Transcript

  • Hadoop (Scott Leberknight)
  • Yahoo! "Search Assist"
  • Notable Hadoop users: Yahoo!, LinkedIn, Facebook, New York Times, Twitter, Rackspace, Baidu, eHarmony, eBay, Powerset. http://wiki.apache.org/hadoop/PoweredBy
  • Hadoop in the Real World...
  • Recommendation systems, financial analysis, Natural Language Processing (NLP), correlation engines, data warehousing, image/video processing, market research/forecasting, log analysis
  • Finance, social networking, health & life sciences, academic research, government, telecommunications
  • History..
  • Inspired by the Google GFS and MapReduce papers, circa 2004. Created by Doug Cutting. Originally built to support distribution for the Nutch search engine. Named after a stuffed elephant.
  • OK, So what exactly is Hadoop?
  • An open source... batch/offline oriented... data & I/O intensive... general purpose framework for creating distributed applications that process huge amounts of data.
  • One definition of "huge": 25,000 machines; more than 10 clusters; 3 petabytes of data (compressed, unreplicated); 700+ users; 10,000+ jobs/week
  • Hadoop Major Components: Distributed File System (HDFS); Map/Reduce System
  • But first, what isn't Hadoop?
  • Hadoop is NOT: ...a relational database! ...an online transaction processing (OLTP) system! ...a structured data store of any kind!
  • Hadoop vs. Relational
  • Hadoop                      | Relational
    Scale-out                   | Scale-up(*)
    Key/value pairs             | Tables
    Say how to process the data | Say what you want (SQL)
    Offline/batch               | Online/real-time
    (*) Sharding attempts to horizontally scale an RDBMS, but is difficult at best
  • HDFS(Hadoop Distributed File System)
  • Data is distributed and replicated over multiple machines. Designed for large files (where "large" means GB to TB). Block oriented. Linux-style commands, e.g. ls, cp, mv, rm, etc.
  • (diagram) The NameNode holds the file-to-block mappings, e.g. /user/aaron/data1.txt -> 1, 2, 3; /user/aaron/data2.txt -> 4, 5; /user/andrew/data3.txt -> 6, 7. The blocks themselves are stored, replicated, across the DataNodes.
  • Fault tolerant when nodes fail. Self-healing: rebalances files across the cluster. Scalable just by adding new nodes!
  • Map/Reduce
  • Split input files (e.g. by HDFS blocks). Operate on key/value pairs. Mappers filter & transform input data. Reducers aggregate mapper output.
  • move code to data
  • map: (K1, V1) -> list(K2, V2); reduce: (K2, list(V2)) -> list(K3, V3)
  • Word Count(the canonical Map/Reduce example)
  • the quick brown fox jumped over the lazy brown dog
  • map phase - inputs (K1, V1): (0, "the quick brown fox") (20, "jumped over") (32, "the lazy brown dog")
  • map phase - outputs list(K2, V2): ("the", 1) ("quick", 1) ("brown", 1) ("fox", 1) ("jumped", 1) ("over", 1) ("the", 1) ("lazy", 1) ("brown", 1) ("dog", 1)
  • reduce phase - inputs (K2, list(V2)): ("brown", (1, 1)) ("dog", (1)) ("fox", (1)) ("jumped", (1)) ("lazy", (1)) ("over", (1)) ("quick", (1)) ("the", (1, 1))
  • reduce phase - outputs list(K3, V3): ("brown", 2) ("dog", 1) ("fox", 1) ("jumped", 1) ("lazy", 1) ("over", 1) ("quick", 1) ("the", 2)
  • WordCount in code..
  • public class SimpleWordCount extends Configured implements Tool {
        public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { ... }
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { ... }
        public int run(String[] args) throws Exception { ... }
        public static void main(String[] args) { ... }
    }
  • public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer st = new StringTokenizer(value.toString());
            while (st.hasMoreTokens()) {
                word.set(st.nextToken());
                context.write(word, ONE);
            }
        }
    }
  • public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable count = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            count.set(sum);
            context.write(key, count);
        }
    }
  • public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "Counting Words");
        job.setJarByClass(SimpleWordCount.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
  • public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new SimpleWordCount(), args);
        System.exit(result);
    }
  • Map/Reduce Data Flow (image from Hadoop in Action... great book!)
  • Partitioning: deciding which keys go to which reducer. Desire even distribution across reducers. Skewed data can overload a single reducer!
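To make the routing concrete, here is a plain-Java sketch of the logic in Hadoop's default HashPartitioner (hash code masked non-negative, modulo the number of reducers). The class and method names here are illustrative, not Hadoop's API:

```java
// Sketch of HashPartitioner's routing rule:
// partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
public class PartitionSketch {

    // The bitmask keeps the hash non-negative; the modulus picks a reducer.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of a key lands on the same reducer...
        System.out.println(partition("the", 4) == partition("the", 4)); // prints "true"
        // ...which is exactly why heavily skewed keys overload one reducer.
    }
}
```

Even distribution therefore depends on the keys' hash codes spreading well across the modulus.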
  • Map/Reduce Partitioning & Shuffling (image from Hadoop in Action... great book!)
  • Combiner: effectively a reduce in the mappers, a.k.a. "Local Reduce"
  • Shuffling WordCount data (# k/v pairs shuffled): without a combiner, ("the", 1) x 1000 = 1000 pairs; with a combiner, ("the", 1000) = 1 pair (looking at one mapper that sees the word "the" 1000 times)
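The savings above can be simulated in plain Java, with no Hadoop required. This sketch pre-aggregates one mapper's output the way a combiner would; in the real job, because word-count's reduce is associative and commutative, you would simply reuse the reducer via job.setCombinerClass(Reduce.class):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation of the combiner's effect on shuffle volume.
public class CombinerSketch {

    // Without a combiner: one ("word", 1) pair is shuffled per occurrence.
    static int pairsShuffledWithoutCombiner(List<String> mapperOutputWords) {
        return mapperOutputWords.size();
    }

    // With a combiner: one ("word", localCount) pair per distinct word.
    static int pairsShuffledWithCombiner(List<String> mapperOutputWords) {
        Map<String, Integer> localCounts = new HashMap<>();
        for (String word : mapperOutputWords) {
            localCounts.merge(word, 1, Integer::sum);
        }
        return localCounts.size();
    }

    public static void main(String[] args) {
        // One mapper that sees the word "the" 1000 times, as on the slide.
        List<String> words = Collections.nCopies(1000, "the");
        System.out.println(pairsShuffledWithoutCombiner(words)); // prints 1000
        System.out.println(pairsShuffledWithCombiner(words));    // prints 1
    }
}
```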
  • Advanced Map/Reduce Hadoop Streaming Chaining Map/Reduce jobs Joining data Bloom filters
  • Architecture
  • HDFS: NameNode, Secondary NameNode, DataNode. Map/Reduce: JobTracker, TaskTracker.
  • (diagram: the NameNode, Secondary NameNode, and JobTracker oversee DataNode1..N, each paired with a TaskTracker running map and reduce tasks)
  • NameNode: bookkeeper for HDFS. Manages DataNodes. Should not store data or run jobs. Single point of failure!
  • DataNode: stores actual file blocks on disk. Does not store entire files! Reports block info to the NameNode. Receives instructions from the NameNode.
  • Secondary NameNode: snapshot of the NameNode. Not a failover server for the NameNode! Helps minimize downtime/data loss if the NameNode fails.
  • JobTracker: partitions tasks across the cluster. Tracks map/reduce tasks. Re-starts failed tasks on different nodes. Speculative execution.
  • TaskTracker: tracks individual map & reduce tasks. Reports progress to the JobTracker.
  • Monitoring/ Debugging
  • distributed processing means distributed debugging
  • Logs: view task logs on the machine where a specific task was processed (or via the web UI). $HADOOP_HOME/logs/userlogs on the task tracker.
  • Counters: define one or more counters. Increment counters during map/reduce tasks. Counter values are displayed in the job tracker UI.
  • IsolationRunner: re-run failed tasks with the original input data. Must set keep.failed.tasks.files to true.
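A sketch of the setting named on the slide, in Hadoop's configuration-file format (in the 0.20-era releases this deck targets, it could also be passed per job as -Dkeep.failed.tasks.files=true):

```xml
<property>
  <name>keep.failed.tasks.files</name>
  <value>true</value>
</property>
```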
  • Skipping Bad Records: data may not always be clean. New data may have new interesting twists. Can you pre-process to filter & validate input?
  • Performance Tuning
  • Speculative execution (on by default); use a Combiner; reduce the amount of input data; JVM re-use (be careful); data compression; refactor code/algorithms
  • Managing Hadoop
  • Lots of knobs; needs active management: add/remove data nodes; network topology/rack awareness; "Fair" scheduling; trash can; NameNode/SNN management; permissions/quotas
  • Hive
  • Simulates structure for data stored in Hadoop. Query language analogous to SQL (Hive QL). Translates queries into Map/Reduce job(s)... ...so not for real-time processing!
  • Queries: projection, joins (inner, outer, semi), grouping, aggregation, sub-queries, multi-table insert. Customizable: user-defined functions, input/output formats with SerDe.
  • Patent citation dataset (http://www.nber.org/patents)
    /user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt:
    "CITING","CITED"
    3858241,956203
    3858241,1324234
    3858241,3398406
    3858241,3557384
    3858241,3634889
    3858242,1515701
    3858242,3319261
    3858242,3668705
    3858242,3707004
    3858243,2949611
    3858243,3146465
    3858243,3156927
    3858243,3221341
    3858243,3574238
    ...
  • create external table patent_citations (citing string, cited string)
    row format delimited fields terminated by ','
    stored as textfile
    location '/user/sleberkn/nber-patent/tables/patent_citation';

    create table citation_histogram (num_citations int, count int)
    stored as sequencefile;
  • insert overwrite table citation_histogram
    select num_citations, count(num_citations) from
        (select cited, count(cited) as num_citations
         from patent_citations
         group by cited) citation_counts
    group by num_citations
    order by num_citations;
  • Hadoop in the clouds
  • Amazon EC2 + S3: EC2 instances are compute nodes (Map/Reduce). Storage options: HDFS on EC2 nodes; HDFS on EC2 nodes loading data from S3; native S3 (bypasses HDFS).
  • Amazon Elastic MapReduce: interact via a web-based console. Submit a Map/Reduce job (streaming, Hive, Pig, or JAR). EMR configures & launches a Hadoop cluster for the job. Uses S3 for data input/output.
  • Recap..
  • Hadoop = HDFS + Map/Reduce. Distributed, parallel processing. Designed for fault tolerance. Horizontal scale-out. Structure & queries via Hive.
  • References
  • http://hadoop.apache.org/
    http://hadoop.apache.org/hive/
    Hadoop in Action: http://www.manning.com/lam/
    Hadoop: The Definitive Guide, 2nd ed.: http://oreilly.com/catalog/0636920010388
    Yahoo! Hadoop blog: http://developer.yahoo.net/blogs/hadoop/
    Cloudera: http://www.cloudera.com/
  • http://lmgtfy.com/?q=hadoop
    http://www.letmebingthatforyou.com/?q=hadoop
  • (my info) scott.leberknight@nearinfinity.com, www.nearinfinity.com/blogs/, twitter: sleberknight