Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 2 September 1, 2011
Jason Baldridge and Matt Lease
https://sites.google.com/a/utcompling.com/dicta-f11/

Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011) Presentation Transcript

  • 1. Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M, University of Texas at Austin, Fall 2011. Lecture 2, September 1, 2011. Jason Baldridge, Department of Linguistics, University of Texas at Austin (jasonbaldridge at gmail dot com). Matt Lease, School of Information, University of Texas at Austin (ml at ischool dot utexas dot edu).
  • 2. Acknowledgments Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park. Some figures courtesy of: • Chuck Lam’s Hadoop In Action (2011) • Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
  • 3. Roots in Functional Programming [diagram: Map applies a function f to each input element independently; Fold aggregates the results with a function g; a sketch in plain Java follows this slide]
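To make the functional-programming roots concrete, here is a minimal plain-Java sketch of map and fold over an in-memory list; the class and interface names are invented for illustration. MapReduce lifts these same two operations to cluster scale: map processes elements independently, fold aggregates the results.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Minimal sketch: map applies f to every element independently;
    // fold aggregates the results with g.
    public class MapFoldSketch {
        interface Fn<A, B> { B apply(A x); }
        interface Fn2<A, B, C> { C apply(A x, B y); }

        static <A, B> List<B> map(Fn<A, B> f, List<A> xs) {
            List<B> out = new ArrayList<B>();
            for (A x : xs) out.add(f.apply(x));   // no shared state between elements
            return out;
        }

        static <A, B> B fold(Fn2<B, A, B> g, B init, List<A> xs) {
            B acc = init;
            for (A x : xs) acc = g.apply(acc, x); // combine running result with each element
            return acc;
        }

        public static void main(String[] args) {
            List<Integer> nums = Arrays.asList(1, 2, 3, 4, 5);
            List<Integer> squares = map(new Fn<Integer, Integer>() {
                public Integer apply(Integer x) { return x * x; }
            }, nums);
            Integer sum = fold(new Fn2<Integer, Integer, Integer>() {
                public Integer apply(Integer acc, Integer x) { return acc + x; }
            }, 0, squares);
            System.out.println(sum); // prints 55
        }
    }
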
  • 4. Divide and Conquer [diagram: the “work” is partitioned into w1, w2, w3; each is processed by a “worker”, producing r1, r2, r3; the partial results are combined into the final “result”]
  • 5. MapReduce
  • 6. “Big Ideas”  Scale “out”, not “up”  Limits of SMP and large shared-memory machines  Move processing to the data  Clusters have limited bandwidth  Process data sequentially, avoid random access  Seeks are expensive, disk throughput is reasonable  Seamless scalability  From the mythical man-month to the tradable machine-hour
  • 7. Typical Large-Data Problem  Iterate over a large number of records  Compute something of interest from each  Shuffle and sort intermediate results  Aggregate intermediate results  Generate final output Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)
  • 8. MapReduce Data Flow [figure courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52]
  • 9. MapReduce “Runtime” Handles scheduling  Assigns workers to map and reduce tasks Handles “data distribution”  Moves processes to data Handles synchronization  Gathers, sorts, and shuffles intermediate data Handles errors and faults  Detects worker failures and restarts Built on a distributed file system
  • 10. MapReduce. Programmers specify two functions: map ( K1, V1 ) → list ( K2, V2 ) and reduce ( K2, list(V2) ) → list ( K3, V3 ). Note the correspondence of types: map output → reduce input. Data Flow  Input → “input splits”: each a sequence of logical (K1,V1) “records”  Map • Each split processed by the same map node • map invoked iteratively: once per record in the split • For each record processed, map may emit 0-N (K2,V2) pairs  Reduce • reduce invoked iteratively for each ( K2, list(V2) ) intermediate value • For each such pair processed, reduce may emit 0-N (K3,V3) pairs  Each reducer’s output written to a persistent file in HDFS
  • 11. Input File Input File InputSplit InputSplit InputSplit InputSplit InputSplit InputFormat RecordReader RecordReader RecordReader RecordReader RecordReader Mapper Mapper Mapper Mapper Mapper Intermediates Intermediates Intermediates Intermediates Intermediates. Source: redrawn from a slide by Cloudera, cc-licensed
  • 12. Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30. Data Flow  Input → “input splits”: each a sequence of logical (K1,V1) “records”  For each split, for each record, do map(K1,V1) (multiple calls)  Each map call may emit any number of (K2,V2) pairs (0-N). Run-time  Groups all values with the same key into ( K2, list(V2) )  Determines which reducer will process each key  Copies data across network as needed for reducer  Ensures intra-node sort of keys processed by each reducer • No guarantee by default of inter-node total sort across reducers
  • 13. “Hello World”: Word Count. map ( K1=String, V1=String ) → list ( K2=String, V2=Integer ) and reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer ). Map(String docid, String text): for each word w in text: Emit(w, 1); Reduce(String term, Iterator<Int> values): int sum = 0; for each v in values: sum += v; Emit(term, sum);
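As a hedged illustration, the pseudocode above might look roughly as follows in Hadoop’s newer API (org.apache.hadoop.mapreduce), using the standard Text and IntWritable types; the class names are ours, and a matching driver is sketched after slide 29.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // K1=LongWritable (byte offset), V1=Text (line), K2=Text (word), V2=IntWritable (count)
    public class WordCount {

        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit (w, 1) for each word w in the line
                StringTokenizer tok = new StringTokenizer(value.toString());
                while (tok.hasMoreTokens()) {
                    word.set(tok.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text term, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Sum all counts for this term and emit (term, sum)
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(term, new IntWritable(sum));
            }
        }
    }
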
  • 14. [diagram: mappers emit (key, value) pairs such as (a,1) (b,2) (c,3) (c,6) (a,5) (c,2) (b,7) (c,8); Shuffle and Sort aggregates values by key into a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]; reducers then produce the final (r, s) outputs. Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52]
  • 15. Partition Given: map ( K1, V1 ) → list ( K2, V2 ) and reduce ( K2, list(V2) ) → list ( K3, V3 ). partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]  Each distinct key (with associated values) sent to a single reducer • Same reduce node may process multiple keys in separate reduce() calls  Balances workload across reducers: equal number of keys to each • Default: simple hash of the key, e.g., hash(k’) mod N (# reducers)  Customizable • Some keys require more computation than others • e.g. value skew, or key-specific computation performed • For skew, sampling can dynamically estimate distribution & set partition • Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
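For reference, the default behavior described above corresponds to Hadoop’s HashPartitioner; a custom partitioner simply overrides getPartition. Below is an illustrative sketch (new API, invented class name) that mirrors the hash(k’) mod N default; note that reducer indices in code run over [0..N-1].

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Mirrors the default hash-partitioning behavior: hash(key) mod numReducers.
    // A custom partitioner would replace this logic, e.g. to counter key skew.
    public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask off the sign bit so the result is a valid reducer index in [0, numPartitions)
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
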
  • 16. Secondary Sorting (Lin 57, White 241)  How to output sorted bigrams (1st word, then list of 2nds)?  What if we use word1 as the key, word2 as the value?  What if we use <first>--<second> as the key?  Pattern  Create a composite key of (first, second)  Define a Key Comparator based on both words • This will produce the sort order we want (aa ab ac ba bb bc ca cb…)  Define a partition function based only on first word • All bigrams with the same first word go to the same reducer • How do you know when the first word changes across invocations?  Preserve state in the reducer across invocations • Will be called separately for each bigram, but we want to remember the current first word across bigrams seen  Hadoop also provides Group Comparator
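A minimal sketch of the partition-on-first-word piece of this pattern, under the simplifying assumption that the composite bigram key is encoded as a single Text of the form first<TAB>second; a real implementation would typically use a custom WritableComparable (such as the TextPair class developed in White’s book) together with the key and group comparators described above.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assumes the composite bigram key is a Text of the form "first\tsecond"
    // (a simplification for this sketch).
    public class FirstWordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Partition on the first word only, so all bigrams sharing a first word
            // reach the same reducer; the key comparator still sorts on both words.
            String first = key.toString().split("\t", 2)[0];
            return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
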
  • 17. Combine Given: map ( K1, V1 ) → list ( K2, V2 ) and reduce ( K2, list(V2) ) → list ( K3, V3 ). combine ( K2, list(V2) ) → list ( K2, V2 ) Optional optimization  Local aggregation to reduce network traffic  No guarantee it will be used, nor how many times it will be called  Semantics of program cannot depend on its use Signature: same input as reduce, same output as map  Combine may be run repeatedly on its own output  Lin: Associative & Commutative ⇒ combiner = reducer • See next slide
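Two small illustrations. First, because word count’s reduce is associative and commutative, the reducer class itself can be registered as the combiner (see the driver sketch after slide 29). Second, to see why program semantics must not depend on whether or how often the combiner runs, consider averaging: the plain-Java sketch below (invented class name, not Hadoop code) shows that combining by taking local means changes the answer.

    import java.util.Arrays;
    import java.util.List;

    // Averaging is not associative: "combine by averaging locally, then average
    // the local averages" gives a different answer than averaging everything at once.
    public class CombinerPitfall {
        static double mean(List<Double> xs) {
            double sum = 0;
            for (double x : xs) sum += x;
            return sum / xs.size();
        }

        public static void main(String[] args) {
            List<Double> split1 = Arrays.asList(1.0, 2.0, 3.0); // one mapper's values for a key
            List<Double> split2 = Arrays.asList(10.0);          // another mapper's values

            double trueMean = mean(Arrays.asList(1.0, 2.0, 3.0, 10.0));           // 4.0
            double combinedMean = mean(Arrays.asList(mean(split1), mean(split2))); // (2 + 10) / 2 = 6.0

            System.out.println(trueMean + " vs " + combinedMean); // 4.0 vs 6.0 -- not the same!
        }
    }
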
  • 18. Functional Properties  Associative: f( a, f(b,c) ) = f( f(a,b), c )  Grouping of operations doesn’t matter  YES: Addition, multiplication, concatenation  NO: division, subtraction, NAND  NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )  Commutative: f(a,b) = f(b,a)  Ordering of arguments doesn’t matter  YES: addition, multiplication, NAND  NO: division, subtraction, concatenation  concatenate(“a”, “b”) != concatenate(“b”, “a”)  Distributive  White (p. 32) and Lam (p. 84) mention with regard to combiners  But really, go with associative + commutative in Lin (pp. 20, 27)
  • 19. [diagram: same word-count flow as slide 14, but each mapper’s output first passes through a local combine step (e.g. c:3 and c:6 combine to c:9) and a partition step before Shuffle and Sort deliver the aggregated values to the reducers.]
  • 20. [diagram, adapted from (Dean and Ghemawat, OSDI 2004): (1) the user program submits the job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read their input splits; (4) and write intermediate files to local disk; (5) reduce workers remote-read the intermediate data; (6) and write the final output files.]
  • 21. Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178 Shuffle and 2 Sorts As map emits values, local sorting runs in tandem (1st sort) Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output) Partition determines which (logical) reducer Rj each key will go to Node’s TaskTracker tells JobTracker it has keys for Rj JobTracker determines node to run Rj based on data locality When local map/combine/sort finishes, sends data to Rj’s node Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort) For each (K, list(V)) tuple in merged output, call reduce(…)
  • 22. Distributed File System  Don’t move data… move computation to the data!  Store data on the local disks of nodes in the cluster  Start up the workers on the node that has the data local  Why?  Not enough RAM to hold all the data in memory  Disk access is slow, but disk throughput is reasonable  A distributed file system is the answer  GFS (Google File System) for Google’s MapReduce  HDFS (Hadoop Distributed File System) for Hadoop
  • 23. GFS: Assumptions  Commodity hardware over “exotic” hardware  Scale “out”, not “up”  High component failure rates  Inexpensive commodity components fail all the time  “Modest” number of huge files  Multi-gigabyte files are common, if not encouraged  Files are write-once, mostly appended to  Perhaps concurrently  Large streaming reads over random access  High sustained throughput over low latency. GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
  • 24. GFS: Design Decisions Files stored as chunks  Fixed size (64MB) Reliability through replication  Each chunk replicated across 3+ chunkservers Single master to coordinate access, keep metadata  Simple centralized management No data caching  Little benefit due to large datasets, streaming reads Simplify the API  Push some of the issues onto the client (e.g., data layout) HDFS = GFS clone (same basic ideas)
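As a small illustration of the large-streaming-read access pattern, here is a hedged sketch of reading a file through Hadoop’s FileSystem API over HDFS; the class name and path are made up for the example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch of a streaming read from HDFS through the FileSystem API.
    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();         // picks up cluster config (core-site.xml etc.)
            FileSystem fs = FileSystem.get(conf);             // HDFS if the default file system points at it
            Path path = new Path("/user/example/input.txt");  // hypothetical path
            FSDataInputStream in = fs.open(path);             // streaming read; no client-side caching
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }
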
  • 25. Basic Cluster Components 1 “Manager” node (can be split onto 2 nodes)  Namenode (NN)  Jobtracker (JT) 1-N “Worker” nodes  Tasktracker (TT)  Datanode (DN) Optional Secondary Namenode  Periodic backups of Namenode in case of failure
  • 26. Hadoop Architecture Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25
  • 27. Namenode Responsibilities Managing the file system namespace:  Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc. Coordinating file operations:  Directs clients to datanodes for reads and writes  No data is moved through the namenode Maintaining overall health:  Periodic communication with the datanodes  Block re-replication and rebalancing  Garbage collection
  • 28. Putting everything together… [diagram: a namenode (running the namenode daemon) and a job submission node (running the jobtracker) coordinate a set of slave nodes; each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system.]
  • 29. Anatomy of a Job MapReduce program in Hadoop = Hadoop job  Jobs are divided into map and reduce tasks (+ more!)  An instance of running a task is called a task attempt  Multiple jobs can be composed into a workflow Job submission process  Client (i.e., driver program) creates a job, configures it, and submits it to job tracker  JobClient computes input splits (on client end)  Job data (jar, configuration XML) are sent to JobTracker  JobTracker puts job data in shared location, enqueues tasks  TaskTrackers poll for tasks  Off to the races…
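A hedged sketch of such a driver program using the new API (see slides 30-32), wiring in the WordCount mapper and reducer sketched after slide 13; class names and paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch of a driver program: create a job, configure it, submit it.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");            // 0.20-era constructor
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.WordCountMapper.class);
            job.setCombinerClass(WordCount.WordCountReducer.class); // valid: reduce is associative & commutative
            job.setReducerClass(WordCount.WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits computed client-side
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not already exist

            // Submit to the JobTracker and wait; TaskTrackers poll for the resulting tasks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
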
  • 30. Why have 1 API when you can have 2? White pp. 25-27, Lam pp. 77-80 Hadoop 0.19 and earlier had “old API” Hadoop 0.21 and forward has “new API” Hadoop 0.20 has both!  Old API most stable, but deprecated  Current books use old API predominantly, but discuss changes • Example code using new API available online from publisher  Some old API classes/methods not yet ported to new API  Cloud9 uses both, and you can too
  • 31. Old API Mapper (interface)  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)  void configure(JobConf job)  void close() throws IOException Reducer/Combiner  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)  void configure(JobConf job)  void close() throws IOException Partitioner  int getPartition(K2 key, V2 value, int numPartitions)
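For concreteness, word count written against the old API might look like the sketch below (illustrative class names): Mapper and Reducer are interfaces, MapReduceBase supplies default configure() and close(), output goes through an OutputCollector, and the reducer receives an Iterator.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Word count against the old API (org.apache.hadoop.mapred).
    public class OldApiWordCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer tok = new StringTokenizer(value.toString());
                while (tok.hasMoreTokens()) {
                    word.set(tok.nextToken());
                    output.collect(word, ONE);   // old API: collect(), not Context.write()
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text term, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {       // old API: explicit Iterator
                    sum += values.next().get();
                }
                output.collect(term, new IntWritable(sum));
            }
        }
    }
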
  • 32. New API org.apache.hadoop.mapred now deprecated; instead use org.apache.hadoop.mapreduce & org.apache.hadoop.mapreduce.lib Mapper, Reducer now abstract classes, not interfaces Use Context instead of OutputCollector and Reporter  Context.write(), not OutputCollector.collect() Reduce takes value list as Iterable, not Iterator  Can use Java’s foreach syntax for iterating Can throw InterruptedException as well as IOException JobConf & JobClient replaced by Configuration & Job
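For contrast with the new-API driver sketched after slide 29, here is a hedged old-API driver for the same job, using the JobConf and JobClient classes that Configuration and Job replace; class names are illustrative and assume the old-API word count sketched after slide 31.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Old-API driver: job setup through JobConf, submission through JobClient.runJob().
    public class OldApiWordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(OldApiWordCountDriver.class);
            conf.setJobName("word count (old API)");

            conf.setMapperClass(OldApiWordCount.Map.class);
            conf.setCombinerClass(OldApiWordCount.Reduce.class);
            conf.setReducerClass(OldApiWordCount.Reduce.class);

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);   // blocks until the job completes
        }
    }
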