Hadoop and beyond: power tools for data mining

A brief survey of great tools for dealing with big datasets. Given as an invited lecture for students taking the Cloud Computing module at Birkbeck and UCL.



  1. Hadoop and beyond: power tools for data mining
     Mark Levy, 13 March 2013
     Cloud Computing Module, Birkbeck/UCL
  2. Hadoop and beyond
     Outline:
     • the data I work with
     • Hadoop without Java
     • Map-Reduce unfriendly algorithms
     • Hadoop without Map-Reduce
     • alternatives in the cloud
     • alternatives on your laptop
  3. NB
     • all software mentioned is Open Source
     • won't cover key-value stores
     • I don't use all of these tools
  4. Last.fm: scrobbling
  5. Last.fm: scrobbling
  6. Last.fm: tagging
  7. Last.fm: personalised radio
  8. Last.fm: recommendations
  9. Last.fm: recommendations
  10. Last.fm datasets
      Core datasets:
      • 45M users, many active
      • 60M artists
      • 100M audio fingerprints
      • 600M tracks (hmm...)
      • 19M physical recordings
      • 3M distinct tags
      • 2.5M <user,item,tag> taggings per month
      • 1B <user,time,track> scrobbles per month
      • full user-track graph has ~50B edges (more often work with ~500M edges)
  11. Problem Scenario 1
      Need Hadoop, don't want Java:
      • need to build prototypes, fast
      • need to do interactive data analysis
      • want terse, highly readable code
      • improve maintainability
      • improve correctness
  12. Hadoop without Java
      Some options:
      • Hive (Facebook)
      • Pig (Yahoo!)
      • Cascading (OK, it's still Java...)
      • Scalding (Twitter)
      • Hadoop streaming (various)
      not to mention 11 more listed here:
      http://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html
  13. Apache Hive
      SQL access to data on Hadoop
      pros:
      • minimal learning curve
      • interactive shell
      • easy to check correctness of code
      cons:
      • can be inefficient
      • hard to fix when it is
  14. Word count in Hive
      CREATE TABLE input (line STRING);
      LOAD DATA LOCAL INPATH '/input' OVERWRITE INTO TABLE input;
      SELECT word, COUNT(*) FROM input
      LATERAL VIEW explode(split(line, ' ')) wTable AS word
      GROUP BY word;
      [but would you use SQL to count words?]
  15. Apache Pig
      High-level scripting language for Hadoop
      pros:
      • more primitive operations than Hive (and UDFs)
      • more flexible than Hive
      • interactive shell
      cons:
      • harder learning curve than Hive
      • tempting to write longer programs, but no code modularity beyond functions
  16. Word count in Pig
      A = load '/input';
      B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      C = filter B by word matches '\\w+';
      D = group C by word;
      E = foreach D generate COUNT(C), group;
      store E into '/output/wordcount';
      [apply operations to "relations" (tuples)]
  17. Cascading
      Java data pipelining for Hadoop
      pros:
      • as flexible as Pig
      • uses a real programming language
      • ideal for longer workflows
      cons:
      • new concepts to learn ("spout", "sink", "tap", ...)
      • still verbose (full word count example code > 150 lines)
  18. Word count in Cascading
      Scheme sourceScheme = new TextLine(new Fields("line"));
      Tap source = new Hfs(sourceScheme, "/input");
      Scheme sinkScheme = new TextLine(new Fields("word", "count"));
      Tap sink = new Hfs(sinkScheme, "/output/wordcount", SinkMode.REPLACE);
      Pipe assembly = new Pipe("wordcount");
      String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
      Function function = new RegexGenerator(new Fields("word"), regex);
      assembly = new Each(assembly, new Fields("line"), function);
      assembly = new GroupBy(assembly, new Fields("word"));
      Aggregator count = new Count(new Fields("count"));
      assembly = new Every(assembly, count);
      Properties properties = new Properties();
      FlowConnector.setApplicationJarClass(properties, Main.class);
      FlowConnector flowConnector = new FlowConnector(properties);
      Flow flow = flowConnector.connect("word-count", source, sink, assembly);
      flow.complete();
  19. Scalding
      Scala data pipelining for Hadoop
      pros:
      • as flexible as Pig
      • uses a real programming language
      • much terser than Java
      cons:
      • community still small (but in use at Twitter)
      • ???
  20. Word count in Scalding
      import com.twitter.scalding._

      class WordCountJob(args: Args) extends Job(args) {
        TextLine(args("input"))
          .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
          .groupBy('word) { _.size }
          .write(Tsv(args("output")))
      }
      [and a one-liner to run it]
  21. Hadoop streaming
      Map-reduce in any language, e.g. the Dumbo wrapper for Python
      pros:
      • use your favourite language for map-reduce
      • easy to mix local and cloud processing
      cons:
      • limited community
      • limited functionality beyond map-reduce
  22. Word count in Dumbo
      def map(key, text):
          # ignore key
          for word in text.split():
              yield word, 1

      def reduce(word, counts):
          yield word, sum(counts)

      import dumbo
      dumbo.run(map, reduce, combiner=reduce)
      [and a one-liner to run it]
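A mapper/reducer pair like the one above can be checked without a cluster. This is a minimal local sketch of the streaming semantics (map, then a sort-then-group shuffle, then reduce); the `run_local` helper is an assumed name, not part of Dumbo or Hadoop.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, text):
    # emit (word, 1) for every word; the key (byte offset) is ignored
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_local(records):
    # simulate the shuffle: sort mapper output by key, then group for the reducer
    mapped = [pair for key, text in records for pair in map_fn(key, text)]
    mapped.sort(key=itemgetter(0))
    out = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for k, v in reduce_fn(word, (count for _, count in group)):
            out[k] = v
    return out

counts = run_local([(0, "to be or not to be")])
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The same functions can then be handed to `dumbo.run` unchanged, which is the main attraction of the streaming approach.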
  23. Problem Scenario 1b
      Need Hadoop, don't want Java:
      • drive native code in parallel
      E.g. audio analysis for:
      • beat locations, bpm
      • key estimation
      • chord sequence estimation
      • energy
      • music/speech?
      • ...
  24. Audio Analysis
      Problem:
      • millions of audio tracks on own dfs
      • long-running C++ analysis code
      • depends on numerous libraries
      • verbose output
  25. Audio Analysis
      Solution:
      • bash + Dumbo Hadoop streaming
      Outline:
      • build C++ code
      • zip up binary and libs
      • send zipfile and some track IDs to each machine
      • extract and run binary in map task with subprocess.Popen()
  26. Audio Analysis
      class AnalysisMapper:
          def __init__(self):
              extract("analyzer.tar.bz2", "bin")

          def map(self, key, trackID):
              file = fetch_audio_file(trackID)
              proc = subprocess.Popen(
                  ["bin/analyzer", file],
                  stdout=subprocess.PIPE)
              (out, err) = proc.communicate()
              yield trackID, out
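The mapper on the slide is pseudocode (`extract` and `fetch_audio_file` are the speaker's helpers). The core pattern, driving a native binary with `subprocess` and capturing its stdout, can be sketched as below; here the Python interpreter itself stands in for the shipped `bin/analyzer` binary, and `analyze` is a hypothetical name.

```python
import subprocess
import sys

def analyze(path):
    # run an external analyzer on one input and capture its stdout;
    # sys.executable stands in for the real C++ binary "bin/analyzer"
    proc = subprocess.Popen(
        [sys.executable, "-c",
         "import sys; print('analysis of ' + sys.argv[1])", path],
        stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("analyzer failed for " + path)
    return out.decode().strip()

print(analyze("track42.mp3"))  # analysis of track42.mp3
```

In the real mapper the return value is simply yielded with the track ID, so the verbose analyzer output lands in HDFS keyed by track.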
  27. Problem Scenario 2
      Map-reduce unfriendly computation:
      • iterative algorithms on same data
      • huge mapper output ("map-increase")
      • curse of the slowest reducer
  28. Graph Recommendations
      Random walk on the user-item graph
      [diagram: user node U connected through other users and tracks to track node t]
  29. Graph Recommendations
      Many short routes from U to t ⇒ recommend!
      [same diagram, highlighting several short paths from U to t]
  30. Graph Recommendations
      The random walk is equivalent to:
      • Label Propagation (Baluja et al., 2008)
      • belongs to a family of algorithms that are easy to code in map-reduce
  31. Label Propagation
      User-track graph, edge weights = scrobbles:
      [diagram: users U, V, W, X linked to tracks a–f, edges labelled with scrobble counts]
  32. Label Propagation
      User nodes are labelled with their scrobbled tracks:
      [diagram: U carries labels (a,0.2)(b,0.4)(c,0.4); V carries (b,0.5)(d,0.5);
      W carries (b,0.2)(d,0.3)(e,0.5); X carries (a,0.3)(d,0.3)(e,0.4)]
  33. Label Propagation
      Propagate, accumulate, normalise:
      [diagram: a track node accumulates 1 × (b,0.5),(d,0.5) from V and the
      edge-weighted labels (b,0.2),(d,0.3),(e,0.5) from W, normalising to
      (b,0.37),(d,0.47),(e,0.17); on the next iteration e will propagate to user V]
  34. Label Propagation
      After some iterations:
      • labels at item nodes = similar items
      • new labels at user nodes = recommendations
  35. Map-Reduce Graph Algorithms
      general approach, assuming:
      • no global state
      • state at a node is recomputed from scratch from incoming messages on each iteration
      other examples:
      • breadth-first search
      • page rank
  36. Map-Reduce Graph Algorithms
      inputs:
      • adjacency lists, state at each node
      output:
      • updated state at each node
      e.g. adjacency list for node U:
      U, [(a,2),(b,4),(c,4)]
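The recompute-state-from-incoming-messages pattern above covers breadth-first search too: a node's state is its current distance, and each iteration it messages distance + 1 to its neighbours. A minimal local sketch, with `bfs_step` as an assumed helper name and the shuffle done with a plain dict instead of a reducer:

```python
def bfs_step(adj, dist):
    # "map": each node with a known distance sends dist+1 to its neighbours;
    # "reduce": each node keeps the minimum distance it has heard of
    msgs = {}
    for node, d in dist.items():
        for nbr in adj[node]:
            msgs.setdefault(nbr, []).append(d + 1)
    new_dist = dict(dist)
    for node, incoming in msgs.items():
        cand = min(incoming)
        if node not in new_dist or cand < new_dist[node]:
            new_dist[node] = cand
    return new_dist

# tiny undirected graph: U - a - t, U - b
adj = {"U": ["a", "b"], "a": ["U", "t"], "b": ["U"], "t": ["a"]}
dist = {"U": 0}            # state: known shortest distances from U
for _ in range(3):         # iterate until distances stop changing (3 suffices here)
    dist = bfs_step(adj, dist)
print(dist)  # {'U': 0, 'a': 1, 'b': 1, 't': 2}
```

On Hadoop each iteration is one full map-reduce job, which is exactly why the next slides argue this is an awkward fit.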
  37. Label Propagation
      class PropagatingMapper:
          def map(self, nodeID, value):
              # value holds label-weight pairs
              # and the adjacency list for the node
              labels, adj_list = value
              for node, weight in adj_list:
                  # send a "stripe" of label-weight
                  # pairs to each neighbouring node
                  msg = [(label, prob * weight)
                         for label, prob in labels]
                  yield node, msg
  38. Label Propagation
      class Reducer:
          def reduce(self, nodeID, msgs):
              # accumulate
              labels = defaultdict(lambda: 0)
              for msg in msgs:
                  for label, w in msg:
                      labels[label] += w
              # normalise, prune
              normalise(labels, MAX_LABELS_PER_NODE)
              yield nodeID, labels
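The mapper and reducer above can be exercised end-to-end on a toy graph. This is a simplified local sketch, not the speaker's implementation: the helper names (`propagate`, `combine`) are assumptions, `normalise` is written out as sort-prune-rescale, and for brevity only nodes that receive messages keep state between iterations.

```python
from collections import defaultdict

MAX_LABELS_PER_NODE = 10

def propagate(node_id, labels, adj_list):
    # mapper: send each neighbour the node's labels scaled by the edge weight
    for nbr, weight in adj_list:
        yield nbr, [(label, prob * weight) for label, prob in labels]

def combine(node_id, msgs):
    # reducer: accumulate incoming label weights, then normalise and prune
    labels = defaultdict(float)
    for msg in msgs:
        for label, w in msg:
            labels[label] += w
    top = sorted(labels.items(), key=lambda kv: -kv[1])[:MAX_LABELS_PER_NODE]
    total = sum(w for _, w in top)
    return {label: w / total for label, w in top}

# toy graph: users U and V share track b, so after two hops
# V's track d shows up at U as a recommendation candidate
adj = {"U": [("a", 1), ("b", 1)], "V": [("b", 1), ("d", 1)],
       "a": [("U", 1)], "b": [("U", 1), ("V", 1)], "d": [("V", 1)]}
state = {"U": {"a": 0.5, "b": 0.5}, "V": {"b": 0.5, "d": 0.5}}

for _ in range(2):  # two propagation iterations
    inbox = defaultdict(list)
    for node, labels in state.items():
        for nbr, msg in propagate(node, labels.items(), adj[node]):
            inbox[nbr].append(msg)
    state = {node: combine(node, msgs) for node, msgs in inbox.items()}

print(sorted(state["U"]))  # ['a', 'b', 'd'] -- d arrived via b and V
```

Note that even in this toy, every iteration re-sends label stripes along every edge, which previews the scaling problems on the next slide.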
  39. Label Propagation
      Not map-reduce friendly:
      • send the graph over the network on every iteration
      • huge mapper output:
        • mappers soon send MAX_LABELS_PER_NODE updates along every edge
      • some reducers receive huge input:
        • too slow if the reducer streams the data, OOM otherwise
      • NB can't partition real graphs to avoid this:
        • many natural graphs are scale-free, e.g. in the AltaVista web graph the top 1% of nodes are adjacent to 53% of edges
  40. Problem Scenario 2b
      Map-reduce unfriendly computation:
      • shared memory
      Examples:
      • almost all machine learning:
        • split training examples between machines
        • all machines need to read/write many shared parameter values
  41. Hadoop without map-reduce
      Graph processing:
      • Apache Giraph (Facebook)
      Hadoop YARN:
      • Knitting Boar, Iterative Reduce
        http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
      • ???
  42. Alternatives in the cloud
      Graph processing:
      • GraphLab (CMU)
      Task-specific:
      • Yahoo! LDA
      General:
      • HPCC
      • Spark (Berkeley)
  43. Spark and Shark
      In-memory cluster computing
      pros:
      • fast!! (Shark claims up to 100x faster than Hive)
      • code in Scala or Java or Python
      • can run on Hadoop YARN or Apache Mesos
      • ideal for iterative algorithms, nearline analytics
      • includes a Pregel clone & stream processing
      cons:
      • hardware requirements???
  44. GraphLab
      Distributed graph processing
      pros:
      • vertex-centric programming model
      • handles true web-scale graphs
      • many toolkits already:
        • collaborative filtering, topic modelling, graphical models, machine vision, graph analysis
      cons:
      • new applications require non-trivial C++ coding
  45. Word count in Spark
      val file = spark.textFile("hdfs://input")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://output/wordcount")
  46. Logistic regression in Spark
      val points = spark.textFile(…).map(parsePoint).cache()
      var w = Vector.random(D) // current separating plane
      for (i <- 1 to ITERATIONS) {
        val gradient = points.map(p =>
          (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
      }
      println("Final separating plane: " + w)
      [points remain in memory for all iterations]
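The same computation can be followed in plain Python. This sketch uses the identical per-point gradient expression as the Spark snippet (for labels y in {-1, +1}), but computed serially, on generated toy data, and with a learning rate added since the slide's update omits one; all constants here are illustrative choices.

```python
import math
import random

random.seed(0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# toy separable data: label +1 above the line x0 + x1 = 0, else -1
points = [([x0, x1], 1.0 if x0 + x1 > 0 else -1.0)
          for x0, x1 in [(random.uniform(-1, 1), random.uniform(-1, 1))
                         for _ in range(200)]]

D, ITERATIONS, LR = 2, 100, 0.1
w = [random.uniform(-0.01, 0.01) for _ in range(D)]

for _ in range(ITERATIONS):
    # same gradient term as the Spark snippet: (sigmoid(y * w.x) - 1) * y * x
    gradient = [0.0] * D
    for x, y in points:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [w[j] - LR * gradient[j] for j in range(D)]

accuracy = sum((dot(w, x) > 0) == (y > 0) for x, y in points) / len(points)
print(accuracy)  # separable toy data, so this should be close to 1.0
```

Spark's win is that the `points.map(...).reduce(_ + _)` inside the loop runs over a cached, partitioned dataset, so only the small gradient vector crosses the network per iteration.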
  47. Alternatives on your laptop
      Graph processing:
      • GraphChi (CMU)
      Machine learning:
      • sofia-ml (Google)
      • vowpal wabbit (Yahoo!, Microsoft)
  48. GraphChi
      Graph processing on your laptop
      pros:
      • still handles graphs with billions of edges
      • graph structure can be modified at runtime
      • Java/Scala ports under active development
      • some toolkits available:
        • collaborative filtering, graph analysis
      cons:
      • existing C++ toolkit code is hard to extend
  49. vowpal wabbit
      classification, regression, LDA, bandits, ...
      pros:
      • handles huge ("terafeature") training datasets
      • very fast
      • state-of-the-art algorithms
      • can run in distributed mode on Hadoop streaming
      cons:
      • hard-core documentation
  50. Take homes
      Think before you use Hadoop:
      • use your laptop for most problems
      • use a graph framework for graph data
      Keep your Hadoop code simple:
      • if you're just querying data, use Hive
      • if not, use a workflow framework
      Check out the competition:
      • Spark and HPCC look impressive
  51. Thanks for listening!
      gamboviol@gmail.com
      @gamboviol
