Hadoop
 

Introductory presentation on Apache Hadoop and Apache Hive.

    Hadoop Presentation Transcript

    • Hadoop, Scott Leberknight
    • Yahoo! "Search Assist"
    • Notable Hadoop users: Yahoo!, LinkedIn, Facebook, New York Times, Twitter, Rackspace, Baidu, eHarmony, eBay, Powerset (http://wiki.apache.org/hadoop/PoweredBy)
    • Hadoop in the Real World..
    • Recommendation systems; Financial analysis; Natural Language Processing (NLP); Correlation engines; Data warehousing; Image/video processing; Market research/forecasting; Log analysis
    • Finance; Social networking; Health & Life Sciences; Academic research; Government; Telecommunications
    • History..
    • Inspired by Google's GFS and MapReduce papers (circa 2003-2004); Created by Doug Cutting; Originally built to support distribution for the Nutch search engine; Named after a stuffed elephant
    • OK, So what exactly is Hadoop?
    • An open source... batch/offline oriented... data & I/O intensive... general purpose framework for creating distributed applications that process huge amounts of data.
    • One definition of "huge": 25,000 machines; More than 10 clusters; 3 petabytes of data (compressed, unreplicated); 700+ users; 10,000+ jobs/week
    • Hadoop Major Components: Distributed File System (HDFS); Map/Reduce System
    • But first, what isn't Hadoop?
    • Hadoop is NOT: ...a relational database! ...an online transaction processing (OLTP) system! ...a structured data store of any kind!
    • Hadoop vs. Relational
    • Hadoop                         Relational
      Scale-out                      Scale-up (*)
      Key/value pairs                Tables
      Say how to process the data    Say what you want (SQL)
      Offline/batch                  Online/real-time
      (*) Sharding attempts to horizontally scale RDBMS, but is difficult at best
    • HDFS (Hadoop Distributed File System)
    • Data is distributed and replicated over multiple machines; Designed for large files (where "large" means GB to TB); Block oriented; Linux-style commands, e.g. ls, cp, mv, rm, etc.
    • NameNode, File Block Mappings: /user/aaron/data1.txt -> 1, 2, 3; /user/aaron/data2.txt -> 4, 5; /user/andrew/data3.txt -> 6, 7. DataNode(s): [diagram: blocks 1-7 spread across the DataNodes, each block stored three times]
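
The file-to-block bookkeeping on this slide can be sketched in plain Java. This is a hypothetical illustration, not HDFS code: the class `NameNodeSketch`, its methods, the DataNode names, and the round-robin placement are all assumptions made for the example (real HDFS placement is rack-aware). A replication factor of 3 matches the diagram, where every block appears three times.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of NameNode-style metadata; not the HDFS implementation.
class NameNodeSketch {
    private final Map<String, List<Integer>> fileToBlocks = new LinkedHashMap<>();
    private final Map<Integer, List<String>> blockToNodes = new HashMap<>();
    private final List<String> dataNodes;
    private final int replication;
    private int nextNode = 0;

    NameNodeSketch(List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("need at least as many DataNodes as replicas");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.replication = replication;
    }

    // Record which blocks make up a file and assign each block to
    // `replication` distinct DataNodes (naive round-robin, an assumption).
    void addFile(String path, List<Integer> blockIds) {
        fileToBlocks.put(path, new ArrayList<>(blockIds));
        for (int blockId : blockIds) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                holders.add(dataNodes.get((nextNode + r) % dataNodes.size()));
            }
            nextNode++;
            blockToNodes.put(blockId, holders);
        }
    }

    // For each block of a file, the DataNodes a client could read it from.
    Map<Integer, List<String>> locate(String path) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (int blockId : fileToBlocks.getOrDefault(path, Collections.emptyList())) {
            result.put(blockId, blockToNodes.get(blockId));
        }
        return result;
    }

    public static void main(String[] args) {
        NameNodeSketch nn = new NameNodeSketch(Arrays.asList("dn1", "dn2", "dn3", "dn4"), 3);
        nn.addFile("/user/aaron/data1.txt", Arrays.asList(1, 2, 3));
        nn.addFile("/user/aaron/data2.txt", Arrays.asList(4, 5));
        System.out.println(nn.locate("/user/aaron/data1.txt"));
    }
}
```

The point of the sketch is that the NameNode holds only metadata (paths, block IDs, locations), while the block contents live on the DataNodes.
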
    • Self-healing: fault tolerant when nodes fail; rebalances files across cluster; scalable just by adding new nodes!
    • Map/Reduce
    • Split input files (e.g. by HDFS blocks); Operate on key/value pairs; Mappers filter & transform input data; Reducers aggregate mapper output
    • move code to data
    • map: (K1, V1) -> list(K2, V2); reduce: (K2, list(V2)) -> list(K3, V3)
    • Word Count (the canonical Map/Reduce example)
    • the quick brown fox jumped over the lazy brown dog
    • map phase - inputs (K1, V1): (0, "the quick brown fox") (20, "jumped over") (32, "the lazy brown dog")
    • map phase - outputs list(K2, V2): ("the", 1) ("quick", 1) ("brown", 1) ("fox", 1) ("jumped", 1) ("over", 1) ("the", 1) ("lazy", 1) ("brown", 1) ("dog", 1)
    • reduce phase - inputs (K2, list(V2)): ("brown", (1, 1)) ("dog", (1)) ("fox", (1)) ("jumped", (1)) ("lazy", (1)) ("over", (1)) ("quick", (1)) ("the", (1, 1))
    • reduce phase - outputs list(K3, V3): ("brown", 2) ("dog", 1) ("fox", 1) ("jumped", 1) ("lazy", 1) ("over", 1) ("quick", 1) ("the", 2)
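
The four phases above can be traced without a cluster. The following is a minimal plain-Java sketch that imitates map, shuffle, and reduce in memory. The class name `WordCountPhases` and its method names are hypothetical, and nothing here uses the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free walk-through of the map -> shuffle -> reduce phases.
class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(new SimpleEntry<>(st.nextToken(), 1));
        }
        return out;
    }

    // Shuffle: group values by key. TreeMap keeps the keys sorted,
    // matching the sorted order in the reduce-phase slides.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of counts for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three phases over the input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
                "the quick brown fox", "jumped over", "the lazy brown dog")));
        // prints {brown=2, dog=1, fox=1, jumped=1, lazy=1, over=1, quick=1, the=2}
    }
}
```
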
    • WordCount in code..
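    • import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class SimpleWordCount extends Configured implements Tool {

          // Mapper: emits (word, 1) for every token in an input line
          public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { ... }

          // Reducer: sums the counts collected for each word
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { ... }

          // Configures and submits the job
          public int run(String[] args) throws Exception { ... }

          public static void main(String[] args) throws Exception { ... }
      }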
    • public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {

          // IntWritable takes an int, so the literal is 1 rather than 1L
          private static final IntWritable ONE = new IntWritable(1);
          private Text word = new Text();

          @Override
          protected void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Split the line on whitespace and emit (word, 1) for each token
              StringTokenizer st = new StringTokenizer(value.toString());
              while (st.hasMoreTokens()) {
                  word.set(st.nextToken());
                  context.write(word, ONE);
              }
          }
      }
    • public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

          private IntWritable count = new IntWritable();

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              // Sum every count emitted for this word
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              count.set(sum);
              context.write(key, count);
          }
      }
    • public int run(String[] args) throws Exception {
          Configuration conf = getConf();
          Job job = new Job(conf, "Counting Words");
          job.setJarByClass(SimpleWordCount.class);
          job.setMapperClass(MapClass.class);
          job.setReducerClass(Reduce.class);
          job.setOutputKeyClass(Text.class);
          // The reducer writes IntWritable values, so the output value class must match
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.setInputPaths(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          return job.waitForCompletion(true) ? 0 : 1;
      }
    • public static void main(String[] args) throws Exception {
          int result = ToolRunner.run(new Configuration(), new SimpleWordCount(), args);
          System.exit(result);
      }
    • Map/Reduce Data Flow (Image from Hadoop in Action...great book!)
    • Partitioning: Deciding which keys go to which reducer; Desire even distribution across reducers; Skewed data can overload a single reducer!
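
One common way to decide which reducer receives a key is to hash the key and take the result modulo the number of reducers. The sketch below shows that scheme in plain Java, along with how a skewed key piles records onto one reducer. The class `HashPartitionSketch` and its methods are hypothetical names for this example. The hash-and-modulo formula is the same idea Hadoop's default `HashPartitioner` uses, but this is not Hadoop code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of hash partitioning; not Hadoop's own Partitioner class.
class HashPartitionSketch {

    // Map a key to a reducer index in [0, numReducers).
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode() still produces a valid, non-negative index.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Count how many records each reducer would receive.
    static int[] load(List<String> keys, int numReducers) {
        int[] perReducer = new int[numReducers];
        for (String key : keys) {
            perReducer[partitionFor(key, numReducers)]++;
        }
        return perReducer;
    }

    public static void main(String[] args) {
        // "the" is a skewed key: every occurrence goes to the same reducer.
        List<String> keys = Arrays.asList("the", "the", "the", "the", "quick", "brown", "fox");
        System.out.println(Arrays.toString(load(keys, 3)));
    }
}
```

Because all records with the same key must reach the same reducer, no hash function can spread a single hot key; that is the skew problem the slide warns about.
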
    • Map/Reduce Partitioning & Shuffling (Image from Hadoop in Action...great book!)
    • Combiner: Effectively a reduce in the mappers, a.k.a. "Local Reduce"
    • Shuffling WordCount (looking at one mapper that sees the word "the" 1000 times)
                          data            # k/v pairs shuffled
      without combiner    ("the", 1)      1000
      with combiner       ("the", 1000)   1
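
The saving in the table above can be reproduced with a small plain-Java sketch. It assumes one mapper that sees "the" 1000 times and applies a local sum before anything is shuffled. `CombinerSketch` and `combine` are hypothetical names, and this is not the Hadoop combiner API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a "local reduce" on one mapper's output.
class CombinerSketch {

    // Sum the (token, 1) pairs per key on the mapper side, so only one
    // pair per distinct key has to be sent to the reducers.
    static Map<String, Integer> combine(List<String> mapperTokens) {
        Map<String, Integer> partialSums = new LinkedHashMap<>();
        for (String token : mapperTokens) {
            partialSums.merge(token, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> tokens = Collections.nCopies(1000, "the");
        Map<String, Integer> combined = combine(tokens);
        System.out.println("without combiner: " + tokens.size() + " pairs shuffled");
        System.out.println("with combiner: " + combined.size() + " pair shuffled " + combined);
    }
}
```

This works for word count because addition is associative and commutative, so partial sums can be summed again in the reducer without changing the result.
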
    • Advanced Map/Reduce: Hadoop Streaming; Chaining Map/Reduce jobs; Joining data; Bloom filters
    • Architecture
    • HDFS: NameNode, Secondary NameNode, DataNode; Map/Reduce: JobTracker, TaskTracker
    • [Diagram] Master daemons: Secondary NameNode, NameNode, JobTracker. Worker nodes 1..N: each runs a DataNode and a TaskTracker, and each TaskTracker runs map and reduce tasks
    • NameNode: Bookkeeper for HDFS; Manages DataNodes; Should not store data or run jobs; Single point of failure!
    • DataNode: Store actual file blocks on disk; Does not store entire files! Report block info to NameNode; Receive instructions from NameNode
    • Secondary NameNode: Snapshot of NameNode; Not a failover server for NameNode! Help minimize downtime/data loss if NameNode fails
    • JobTracker: Partition tasks across HDFS cluster; Track map/reduce tasks; Re-start failed tasks on different nodes; Speculative execution
    • TaskTracker: Track individual map & reduce tasks; Report progress to JobTracker
    • Monitoring/Debugging
    • distributed processing, distributed debugging
    • Logs: View task logs on machine where specific task was processed (or via web UI); $HADOOP_HOME/logs/userlogs on task tracker
    • Counters: Define one or more counters; Increment counters during map/reduce tasks; Counter values displayed in job tracker UI
    • IsolationRunner: Re-run failed tasks with original input data; Must set keep.failed.task.files to true
    • Skipping Bad Records: Data may not always be clean; New data may have new interesting twists; Can you pre-process to filter & validate input?
    • Performance Tuning
    • Speculative execution (on by default); Use a Combiner; Reduce amount of input data; JVM Re-use (be careful); Data compression; Refactor code/algorithms
    • Managing Hadoop
    • Lots of knobs; Needs active management; Trash can; Add/remove data nodes; Network topology/rack awareness; "Fair" scheduling; NameNode/SNN management; Permissions/quotas
    • Hive
    • Simulate structure for data stored in Hadoop; Query language analogous to SQL (Hive QL); Translates queries into Map/Reduce job(s)... ...so not for real-time processing!
    • Queries: Projection; Joins (inner, outer, semi); Grouping; Aggregation; Sub-queries; Multi-table insert. Customizable: User-defined functions; Input/output formats with SerDe
    • Patent citation dataset (http://www.nber.org/patents)
      /user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt
      "CITING","CITED"
      3858241,956203
      3858241,1324234
      3858241,3398406
      3858241,3557384
      3858241,3634889
      3858242,1515701
      3858242,3319261
      3858242,3668705
      3858242,3707004
      3858243,2949611
      3858243,3146465
      3858243,3156927
      3858243,3221341
      3858243,3574238
      ...
    • create external table patent_citations (citing string, cited string)
      row format delimited fields terminated by ','
      stored as textfile
      location '/user/sleberkn/nber-patent/tables/patent_citation';

      create table citation_histogram (num_citations int, count int)
      stored as sequencefile;
    • insert overwrite table citation_histogram
      select num_citations, count(num_citations)
      from (select cited, count(cited) as num_citations
            from patent_citations
            group by cited) citation_counts
      group by num_citations
      order by num_citations;
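
To make the two-level grouping in the query above concrete, here is a plain-Java sketch of the same computation on an in-memory list. It first counts how often each patent is cited, then counts how many patents share each citation count. The class `CitationHistogramSketch` and the sample rows in `main` are made up for illustration; they are not NBER data, and this is not how Hive executes the query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of the citation_histogram query.
class CitationHistogramSketch {

    // Each row is {citing, cited}, mirroring the patent_citations table.
    static Map<Long, Long> histogram(List<String[]> rows) {
        // Inner query: select cited, count(cited) ... group by cited
        Map<String, Long> citationCounts = rows.stream()
                .collect(Collectors.groupingBy(row -> row[1], Collectors.counting()));

        // Outer query: group by num_citations, ordered by num_citations
        // (TreeMap keeps the keys sorted).
        return citationCounts.values().stream()
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample rows: X is cited three times, Y and Z once each.
        List<String[]> rows = Arrays.asList(
                new String[] {"A", "X"},
                new String[] {"B", "X"},
                new String[] {"C", "X"},
                new String[] {"A", "Y"},
                new String[] {"B", "Z"});
        System.out.println(histogram(rows)); // prints {1=2, 3=1}
    }
}
```

In the result, the key is the number of citations and the value is how many patents received that many citations.
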
    • Hadoop in the clouds
    • Amazon EC2 + S3: EC2 instances are compute nodes (Map/Reduce); Storage options: HDFS on EC2 nodes; HDFS on EC2 nodes loading data from S3; Native S3 (bypasses HDFS)
    • Amazon Elastic MapReduce: Interact via web-based console; Submit Map/Reduce job (streaming, Hive, Pig, or JAR); EMR configures & launches Hadoop cluster for job; Uses S3 for data input/output
    • Recap..
    • Hadoop = HDFS + Map/Reduce; Distributed, parallel processing; Designed for fault tolerance; Horizontal scale-out; Structure & queries via Hive
    • References
    • http://hadoop.apache.org/
      http://hadoop.apache.org/hive/
      Hadoop in Action: http://www.manning.com/lam/
      Definitive Guide to Hadoop, 2nd ed.: http://oreilly.com/catalog/0636920010388
      Yahoo! Hadoop blog: http://developer.yahoo.net/blogs/hadoop/
      Cloudera: http://www.cloudera.com/
    • http://lmgtfy.com/?q=hadoop
      http://www.letmebingthatforyou.com/?q=hadoop
    • (my info) scott.leberknight@nearinfinity.com; www.nearinfinity.com/blogs/; twitter: sleberknight