Introductory presentation on Apache Hadoop and Apache Hive.

Usage Rights

© All Rights Reserved


    Hadoop Presentation Transcript

    • Hadoop (Scott Leberknight)
    • Yahoo! "Search Assist"
    • Notable Hadoop users: Yahoo!, LinkedIn, Facebook, New York Times, Twitter, Rackspace, Baidu, eHarmony, eBay, Powerset. http://wiki.apache.org/hadoop/PoweredBy
    • Hadoop in the Real World..
    • Recommendation systems, Financial analysis, Natural Language Processing (NLP), Correlation engines, Data warehousing, Image/video processing, Market research/forecasting, Log analysis
    • Finance, Social networking, Health & Life Sciences, Academic research, Government, Telecommunications
    • History..
    • Inspired by the Google File System and MapReduce papers circa 2003-2004. Created by Doug Cutting. Originally built to support distribution for the Nutch search engine. Named after a stuffed elephant
    • OK, So what exactly is Hadoop?
    • An open source... batch/offline oriented... data & I/O intensive... general purpose framework for creating distributed applications that process huge amounts of data.
    • One definition of "huge": 25,000 machines. More than 10 clusters. 3 petabytes of data (compressed, unreplicated). 700+ users. 10,000+ jobs/week
    • Hadoop Major Components: Distributed File System (HDFS), Map/Reduce System
    • But first, what isn't Hadoop?
    • Hadoop is NOT: ...a relational database! ...an online transaction processing (OLTP) system! ...a structured data store of any kind!
    • Hadoop vs. Relational
    • Hadoop: scale-out, key/value pairs, say how to process the data, offline/batch. Relational: scale-up(*), tables, say what you want (SQL), online/real-time. (*) Sharding attempts to horizontally scale an RDBMS, but is difficult at best
    • HDFS (Hadoop Distributed File System)
    • Data is distributed and replicated over multiple machines. Designed for large files (where "large" means GB to TB). Block oriented. Linux-style commands, e.g. ls, cp, mv, rm, etc.
    • NameNode File Block Mappings: /user/aaron/data1.txt -> 1, 2, 3; /user/aaron/data2.txt -> 4, 5; /user/andrew/data3.txt -> 6, 7. [Diagram: the numbered blocks replicated across several DataNodes]
    • Fault tolerant when nodes fail. Self-healing: rebalances files across the cluster. Scalable just by adding new nodes!
    • Map/Reduce
    • Split input files (e.g. by HDFS blocks). Operate on key/value pairs. Mappers filter & transform input data; Reducers aggregate mapper output
    • move code to data
    • map: (K1, V1) → list(K2, V2); reduce: (K2, list(V2)) → list(K3, V3)
    • Word Count(the canonical Map/Reduce example)
    • the quick brown fox jumped over the lazy brown dog
    • map phase - inputs (K1, V1): (0, "the quick brown fox") (20, "jumped over") (32, "the lazy brown dog")
    • map phase - outputs list(K2, V2): ("the", 1) ("quick", 1) ("brown", 1) ("fox", 1) ("jumped", 1) ("over", 1) ("the", 1) ("lazy", 1) ("brown", 1) ("dog", 1)
    • reduce phase - inputs (K2, list(V2)): ("brown", (1, 1)) ("dog", (1)) ("fox", (1)) ("jumped", (1)) ("lazy", (1)) ("over", (1)) ("quick", (1)) ("the", (1, 1))
    • reduce phase - outputs list(K3, V3): ("brown", 2) ("dog", 1) ("fox", 1) ("jumped", 1) ("lazy", 1) ("over", 1) ("quick", 1) ("the", 2)
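    The whole map → shuffle → reduce flow above can be sketched in plain Java, with no Hadoop cluster required. This is an illustrative single-JVM simulation (the class and method names are mine, not Hadoop APIs), but the type shapes mirror the (K1, V1) → list(K2, V2) → (K2, list(V2)) → list(K3, V3) signatures from the slides:

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class WordCountFlow {
        // map phase: one input line -> list of (word, 1) pairs
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) out.add(Map.entry(token, 1));
            }
            return out;
        }

        // shuffle: group mapper outputs by key -> (word, [1, 1, ...])
        static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> p : pairs) {
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            }
            return grouped;
        }

        // reduce phase: (word, [counts]) -> (word, sum)
        static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
            Map<String, Integer> out = new TreeMap<>();
            grouped.forEach((word, counts) ->
                out.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
            return out;
        }

        public static void main(String[] args) {
            List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
            mapped.addAll(map("the quick brown fox"));
            mapped.addAll(map("jumped over"));
            mapped.addAll(map("the lazy brown dog"));
            // prints {brown=2, dog=1, fox=1, jumped=1, lazy=1, over=1, quick=1, the=2}
            System.out.println(reduce(shuffle(mapped)));
        }
    }
    ```

    In real Hadoop the shuffle is done by the framework between machines; here it is just a grouping step, which is exactly the contract the framework guarantees to reducers.
    
    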
    • WordCount in code..
    • public class SimpleWordCount extends Configured implements Tool {
          public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { ... }
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { ... }
          public int run(String[] args) throws Exception { ... }
          public static void main(String[] args) { ... }
      }
    • public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private Text word = new Text();

          @Override
          protected void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer st = new StringTokenizer(value.toString());
              while (st.hasMoreTokens()) {
                  word.set(st.nextToken());
                  context.write(word, ONE);
              }
          }
      }
    • public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          private IntWritable count = new IntWritable();

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              count.set(sum);
              context.write(key, count);
          }
      }
    • public int run(String[] args) throws Exception {
          Configuration conf = getConf();
          Job job = new Job(conf, "Counting Words");
          job.setJarByClass(SimpleWordCount.class);
          job.setMapperClass(MapClass.class);
          job.setReducerClass(Reduce.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.setInputPaths(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          return job.waitForCompletion(true) ? 0 : 1;
      }
    • public static void main(String[] args) throws Exception {
          int result = ToolRunner.run(new Configuration(), new SimpleWordCount(), args);
          System.exit(result);
      }
    • Map/Reduce Data Flow (Image from Hadoop in Action...great book!)
    • Partitioning: Deciding which keys go to which reducer. Desire even distribution across reducers. Skewed data can overload a single reducer!
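    Hadoop's default partitioner (HashPartitioner) picks the reducer by hashing the key. The scheme can be sketched in plain Java (the class name here is mine; only the formula matches Hadoop's default behavior):

    ```java
    public class HashPartitionDemo {
        // Default hash partitioning: mask off the sign bit so negative
        // hash codes still map to a valid reducer index, then mod by
        // the number of reducers.
        static int partition(String key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }

        public static void main(String[] args) {
            for (String key : new String[] {"the", "quick", "brown", "fox"}) {
                System.out.println(key + " -> reducer " + partition(key, 3));
            }
        }
    }
    ```

    The assignment is deterministic, so every ("the", 1) pair from every mapper lands on the same reducer, which is what makes the per-key grouping work. It is also oblivious to frequency: a very hot key still goes to one reducer, which is exactly the skew problem the slide warns about.
    
    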
    • Map/Reduce Partitioning & Shuffling(Image from Hadoop in Action...great book!)
    • Combiner: Effectively a reduce in the mappers, a.k.a. "Local Reduce"
    • Shuffling WordCount data, # of k/v pairs shuffled: without combiner, ("the", 1) x 1000; with combiner, ("the", 1000) x 1 (looking at one mapper that sees the word "the" 1000 times)
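    That 1000-to-1 saving can be demonstrated with a small plain-Java simulation (class and method names are mine, for illustration only):

    ```java
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class CombinerSavings {
        // Without a combiner: one ("the", 1) pair shuffled per occurrence.
        static int pairsShuffledWithoutCombiner(String[] tokens) {
            return tokens.length;
        }

        // With a combiner: the mapper pre-aggregates locally, so only one
        // ("the", n) pair per distinct word leaves this mapper.
        static int pairsShuffledWithCombiner(String[] tokens) {
            Map<String, Integer> local = new HashMap<>();
            for (String t : tokens) local.merge(t, 1, Integer::sum);
            return local.size();
        }

        public static void main(String[] args) {
            String[] tokens = new String[1000];
            Arrays.fill(tokens, "the");  // a mapper that sees "the" 1000 times
            System.out.println("without combiner: " + pairsShuffledWithoutCombiner(tokens)); // 1000
            System.out.println("with combiner:    " + pairsShuffledWithCombiner(tokens));    // 1
        }
    }
    ```

    In a real job, wiring this in is usually one line in the driver, job.setCombinerClass(Reduce.class), since a word-count reducer that just sums values is safe to reuse as a combiner (the operation is associative and commutative).
    
    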
    • Advanced Map/Reduce Hadoop Streaming Chaining Map/Reduce jobs Joining data Bloom filters
    • Architecture
    • HDFS: NameNode, Secondary NameNode, DataNode. Map/Reduce: JobTracker, TaskTracker
    • [Diagram: NameNode, Secondary NameNode, and JobTracker coordinating DataNode1..DataNodeN paired with TaskTracker1..TaskTrackerN, each TaskTracker running map and reduce tasks]
    • NameNode: Bookkeeper for HDFS. Manages DataNodes. Should not store data or run jobs. Single point of failure!
    • DataNode: Stores actual file blocks on disk. Does not store entire files! Reports block info to the NameNode. Receives instructions from the NameNode
    • Secondary NameNode: Snapshot of the NameNode. Not a failover server for the NameNode! Helps minimize downtime/data loss if the NameNode fails
    • JobTracker: Partitions tasks across the HDFS cluster. Tracks map/reduce tasks. Re-starts failed tasks on different nodes. Speculative execution
    • TaskTracker: Tracks individual map & reduce tasks. Reports progress to the JobTracker
    • Monitoring/Debugging
    • Distributed processing means distributed debugging
    • Logs: View task logs on the machine where a specific task was processed (or via the web UI). $HADOOP_HOME/logs/userlogs on the task tracker
    • Counters: Define one or more counters. Increment counters during map/reduce tasks. Counter values displayed in the job tracker UI
    • IsolationRunner: Re-run failed tasks with original input data. Must set keep.failed.tasks.files to true
    • Skipping Bad Records: Data may not always be clean. New data may have new interesting twists. Can you pre-process to filter & validate input?
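    A pre-processing filter of the kind the slide suggests might look like the following plain-Java sketch. The expected record format (two comma-separated numeric fields, matching the patent citation data shown later) and all names here are illustrative, not part of any Hadoop API:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class InputValidator {
        static int skipped = 0;

        // Keep only lines that look like "citing,cited" with numeric fields;
        // count everything else as skipped instead of crashing the job.
        static List<String> filterValid(List<String> lines) {
            List<String> valid = new ArrayList<>();
            for (String line : lines) {
                String[] fields = line.split(",");
                if (fields.length == 2 && fields[0].matches("\\d+") && fields[1].matches("\\d+")) {
                    valid.add(line);
                } else {
                    skipped++;  // in a real job this would increment a Hadoop counter
                }
            }
            return valid;
        }

        public static void main(String[] args) {
            List<String> input = List.of(
                "3858241,956203", "\"CITING\",\"CITED\"", "3858241,1324234", "garbage");
            List<String> clean = filterValid(input);
            System.out.println(clean.size() + " valid, " + skipped + " skipped"); // 2 valid, 2 skipped
        }
    }
    ```

    Doing this in the mapper (write valid records, count the rest) keeps bad input visible in the counters UI rather than silently dropped.
    
    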
    • Performance Tuning
    • Speculative execution (on by default). Use a Combiner. JVM re-use (be careful). Reduce amount of input data. Data compression. Refactor code/algorithms
    • Managing Hadoop
    • Lots of knobs. Needs active management. Trash can. Add/remove data nodes. Network topology/rack awareness. "Fair" scheduling. NameNode/SNN management. Permissions/quotas
    • Hive
    • Simulates structure for data stored in Hadoop. Query language analogous to SQL (Hive QL). Translates queries into Map/Reduce job(s)... ...so not for real-time processing!
    • Queries: Projection, Joins (inner, outer, semi), Grouping, Aggregation, Sub-queries, Multi-table insert. Customizable: User-defined functions, Input/output formats with SerDe
    • Patent citation dataset (http://www.nber.org/patents), e.g. /user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt:
      "CITING","CITED"
      3858241,956203
      3858241,1324234
      3858241,3398406
      3858241,3557384
      3858241,3634889
      3858242,1515701
      3858242,3319261
      3858242,3668705
      3858242,3707004
      3858243,2949611
      3858243,3146465
      3858243,3156927
      3858243,3221341
      3858243,3574238
      ...
    • create external table patent_citations (citing string, cited string)
      row format delimited fields terminated by ','
      stored as textfile
      location '/user/sleberkn/nber-patent/tables/patent_citation';

      create table citation_histogram (num_citations int, count int)
      stored as sequencefile;
    • insert overwrite table citation_histogram
      select num_citations, count(num_citations) from
          (select cited, count(cited) as num_citations
           from patent_citations group by cited) citation_counts
      group by num_citations
      order by num_citations;
    • Hadoop in the clouds
    • Amazon EC2 + S3. EC2 instances are compute nodes (Map/Reduce). Storage options: HDFS on EC2 nodes; HDFS on EC2 nodes loading data from S3; Native S3 (bypasses HDFS)
    • Amazon Elastic MapReduce: Interact via web-based console. Submit a Map/Reduce job (streaming, Hive, Pig, or JAR). EMR configures & launches a Hadoop cluster for the job. Uses S3 for data input/output
    • Recap..
    • Hadoop = HDFS + Map/Reduce. Distributed, parallel processing. Designed for fault tolerance. Horizontal scale-out. Structure & queries via Hive
    • References
    • http://hadoop.apache.org/
      http://hadoop.apache.org/hive/
      Hadoop in Action: http://www.manning.com/lam/
      Hadoop: The Definitive Guide, 2nd ed.: http://oreilly.com/catalog/0636920010388
      Yahoo! Hadoop blog: http://developer.yahoo.net/blogs/hadoop/
      Cloudera: http://www.cloudera.com/
    • http://lmgtfy.com/?q=hadoop
      http://www.letmebingthatforyou.com/?q=hadoop
    • (my info) scott.leberknight@nearinfinity.com, www.nearinfinity.com/blogs/, twitter: sleberknight