
Modern technologies in data science

Dr. Hsieh teaches how to use the state-of-the-art Apache Spark library to conduct data analysis on the Hadoop platform at ISSNIP 2015 in Singapore. He starts with the basic operations, such as map, reduce, flatten, and more, and then explains Spark's extensions, including MLlib, GraphX, and Spark SQL.

  1. Chu-Cheng Hsieh: Modern technologies in data science
  2. www.linkedin.com/in/chucheng
  3. Task: Find “Error” in /var/log/system.log
  4. /* Java */
     BufferedReader br = new BufferedReader(new FileReader("system.log"));
     try {
         String line = br.readLine();
         while (line != null) {
             if (line.contains("Error")) {
                 System.out.println(line);
             }
             line = br.readLine();
         }
     } finally {
         br.close();
     }
  5. # Python
     with open("system.log", "r") as ins:
         for line in ins:
             if "Error" in line:
                 print(line[:-1])
  6. # Bash
     grep "Error" system.log

     # !! Best !!
     grep -B 3 -A 2 "Error" system.log | less
  7. Task: Find “Error” in /var/log/system.log. What if system.log is > 1 TB? 10 TB? … 1 PB?
  8. More Data, More CPUs
  9. Credit: http://www.cse.wustl.edu/~jain/
  10. (1) Multi-threaded coding is really, really painful. (2) The maximum number of cores in a single machine is limited.
  11.
  12. US$390 billion
  13. Map Reduce
  14. An easy way of “multi-core” programming.
  15. Map
  16. Q. How to eat a pizza in 1 minute?
  17. Ans.
  18. Map = “things that can be divided and conquered”
  19. Reduce
  20. Reduce = “merge the pieces to make the result meaningful”
  21. “Word Count” on 4 nodes
  22. “map”
  23. “reduce”
  24. Map-reduce: an easy way to “collaborate”.
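      A minimal sketch of the same idea on plain Scala collections, before the full Hadoop version on the next slide (illustrative code with hypothetical input lines):

      // Word count on local collections, mirroring the “map” and “reduce” phases.
      val lines = List("one two two", "three three three")

      // "map" phase: emit (word, 1) for every word in every line
      val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

      // "reduce" phase: merge the per-word counts by key
      val counts = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

      println(counts)  // e.g. Map(one -> 1, two -> 2, three -> 3); ordering may vary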
  25. import java.io.IOException;
      import java.util.*;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;
      import org.apache.hadoop.util.*;

      public class WordCount {
        //////////// MAPPER function //////////////
        public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            Text word = new Text();
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, new IntWritable(1));
            }
          }
        }

        //////////// REDUCER function /////////////
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          /////////// JOB description ///////////
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setMapperClass(Map.class);
          conf.setReducerClass(Reduce.class);
          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          JobClient.runJob(conf);
        }
      }
  26. a = load 'input.txt';
      -- map
      b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
      -- reduce
      c = group b by word;
      d = foreach c generate COUNT(b), group;
      store d into 'wordcount.txt';
  27. What’s wrong with Pig?
  28. Word count + “case insensitive”
  29. Case insensitive:
      REGISTER myudfs.jar;
      a = load 'input.txt';
      -- map
      b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
      b2 = foreach b generate myudfs.UPPER(word) as word;
      -- reduce
      c = group b2 by word;
      d = foreach c generate COUNT(b2), group;
      store d into 'wordcount.txt';
  30. package myudfs;
      import java.io.IOException;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;

      public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
          try {
            String str = (String) input.get(0);
            return str.toUpperCase();
          } catch (Exception e) {
            throw new IOException("Error in input row", e);
          }
        }
      }
  31.
  32. https://spark.apache.org/
  33. chsieh@dev-sfear01:~$ spark-shell --master local[4]
      Spark assembly has been built with Hive, including Datanucleus jars on classpath
      15/03/26 14:32:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Welcome to
        [Spark ASCII-art logo]  version 1.3.0
      Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76)
      Type in expressions to have them evaluated.
      Type :help for more information.
      Spark context available as sc.
      SQL context available as sqlContext.

      scala>
  34. chsieh@dev-sfear01:~/SparkTutorial$ cat wc.txt
      one two two three three three four four four four five five five five five
  35. val file = sc.textFile("wc.txt")                    // RDD[String] = MapPartitionsRDD
      val counts = file.flatMap(line => line.split(" "))  // RDD[String] = MapPartitionsRDD
                       .map(word => (word, 1))            // RDD[(String, Int)] = MapPartitionsRDD
                       .reduceByKey( (l, r) => l + r)     // RDD[(String, Int)] = ShuffledRDD

      counts.collect()
      // Array[(String, Int)] =
      //   Array((two,2), (one,1), (three,3), (five,5), (four,4))

      counts.saveAsTextFile("file:///path/to/wc_result")  // Use hdfs:// for the Hadoop file system
  36. flatMap( … ) = map( … ) and then flatten( )

      scala> List(List(1, 2), Set(3, 4)).flatten
      res0: List[Int] = List(1, 2, 3, 4)

      // flatten: create an empty list, traverse each item once,
      // and remove only one level of nesting
      scala> val l = List(List(1, 2), List(3, 4, List(4, 5))).flatten
      l: List[Any] = List(1, 2, 3, 4, List(4, 5))
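      The same distinction on an RDD, as a quick sketch (assuming the sc from the shell session above and a tiny hypothetical dataset):

      val lines = sc.parallelize(Seq("one two", "three"))

      lines.map(_.split(" ")).collect()
      // e.g. Array(Array(one, two), Array(three))   // one array per input line

      lines.flatMap(_.split(" ")).collect()
      // e.g. Array(one, two, three)                 // the arrays are flattened into words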
  37. // reduceByKey
      ("two", 1)   ("two", 1)     // same key: "two"
          l            r
      => ("two", 2)
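      The same merge in the shell, as a brief sketch (assuming sc; the data is hypothetical):

      val pairs = sc.parallelize(Seq(("two", 1), ("two", 1), ("one", 1)))
      pairs.reduceByKey( (l, r) => l + r ).collect()
      // e.g. Array((two,2), (one,1))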
  38. Why Spark?
  39. It’s super fast!
  40. How can it be so fast?
  41. RDD (Resilient Distributed Datasets)
  42. How to create an RDD?
  43. Method #1 (Parallelized)
      scala> val data = Array(1, 2, 3, 4, 5)
      data: Array[Int] = Array(1, 2, 3, 4, 5)
      scala> val distData = sc.parallelize(data)
      distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize
  44. Method #2 (External Dataset)
      scala> val distFile = sc.textFile("data.txt")
      distFile: RDD[String] = MappedRDD@1d4cee08

      URI:
       Local path
       Amazon S3 => s3n://
       Hadoop => hdfs://
       etc.
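      A few hypothetical examples of those URI schemes (the paths and bucket name are placeholders):

      val fromLocal = sc.textFile("file:///path/to/data.txt")  // local file system
      val fromHdfs  = sc.textFile("hdfs:///path/to/data.txt")  // Hadoop file system
      val fromS3    = sc.textFile("s3n://my-bucket/data.txt")  // Amazon S3 (s3n scheme, as on the slide)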
  45. Desired properties for map-reduce
       Distributed
       Lazy (optimize as much as you can)
       Persistence (caching)
  46. [Diagram]
      Hadoop MapReduce: Input (disk) → MR 1 → Tuples (disk) → MR 2 → Tuples (disk) → MR 3 → Tuples (disk) → MR 4 → Output (disk)
      Spark:            Input (disk) → t1 → RDD1 (in memory) → t2 → RDD2 (on disk) → t3 → RDD3 (in memory) → t4 → Output (disk)
  47. We have an RDD; then how do we “operate” on it?
  48. # Transformation (can wait)
      RDD1 => “map” => RDD2
      # Action (cannot wait)
      RDD1 => “reduce” => ??
  49. scala> val data = Array(1, 2, 3, 4, 5)
      data: Array[Int] = Array(1, 2, 3, 4, 5)

      scala> val distData = sc.parallelize(data)
      distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:23

      scala> val addone = distData.map(x => x + 1)    // Transformation
      addone: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at <console>:25

      scala> val back = addone.map(x => x - 1)        // Transformation
      back: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:27

      scala> val sum = back.reduce( (l, r) => l + r)  // Action
      sum: Int = 15
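      Because transformations are lazy, Spark only records a lineage until an action runs. One way to inspect that lineage (the output shape shown here is illustrative):

      scala> back.toDebugString
      // (4) MapPartitionsRDD[7] at map at <console>:27
      //  |  MapPartitionsRDD[6] at map at <console>:25
      //  |  ParallelCollectionRDD[5] at parallelize at <console>:23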
  50. Passing functions

      object Util {
        def addOne(x: Int) = { x + 1 }
      }

      val addone_v1 = distData.map(x => x + 1)
      // or
      val addone_v2 = distData.map(Util.addOne)
  51. Popular transformations
      map(func): run func(x)
      filter(func): return x if func(x) is true
      sample(withReplacement, fraction, seed)
      union(otherDataset)
      intersection(otherDataset)
  52. Popular transformations (cont.), assuming RDD[(K, V)]
      groupByKey([numTasks]): return a dataset of (K, Iterable<V>) pairs
      reduceByKey(func, [numTasks]): groupByKey and then reduce by “func”
      join(otherDataset, [numTasks]): (K, V) join (K, W) => (K, (V, W))
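      A small sketch of join on two pair RDDs (hypothetical data):

      val profits  = sc.parallelize(Seq(("us", 10), ("uk", 5)))
      val capitals = sc.parallelize(Seq(("us", "Washington"), ("uk", "London")))
      profits.join(capitals).collect()
      // e.g. Array((us,(10,Washington)), (uk,(5,London)))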
  53. Popular actions
      reduce(func): (left, right) => func(left, right)
      collect(): force computing the transformations
      count()
      first()
      take(n)
      persist(): mark the RDD to be kept once computed (see the next slide)
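      A quick tour of these actions in the shell (assuming sc):

      val nums = sc.parallelize(1 to 5)
      nums.count()                    // 5
      nums.first()                    // 1
      nums.take(3)                    // Array(1, 2, 3)
      nums.reduce( (l, r) => l + r )  // 15
      nums.collect()                  // Array(1, 2, 3, 4, 5)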
  54. More about “persist”: reuse
      [Diagram] Input (disk) → t1 → RDD1 (in memory) → t2 → RDD2 (on disk) → t3 → RDD3 (in memory) → t4 → Output (disk);
      RDD3 is persist()-ed, so a later transformation t5 can build RDD4 (in memory) from it without recomputation
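      A minimal sketch of the reuse pattern, assuming the wc.txt from earlier:

      val counted = sc.textFile("wc.txt")
                      .flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
      counted.persist()   // or counted.cache(); marks the RDD to be kept in memory
      counted.count()     // first action: computes the lineage and caches the result
      counted.collect()   // later actions reuse the cached partitions instead of recomputing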
  55. You are a master now.
  56. The journey continues
  57.
  58. val file = sc.textFile("wc.txt")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey( (l, r) => l + r)

      // Using SQL syntax is possible
      val sq = new org.apache.spark.sql.SQLContext(sc)
      import sq.implicits._
      case class WC(word: String, count: Int)
      val wordcount = counts.map(col => WC(col._1, col._2))
      val df = wordcount.toDF()
      df.registerTempTable("tbl")
      val avg = sq.sql("SELECT AVG(count) FROM tbl")
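      To actually see the query result, a small follow-up sketch (the value assumes the wc.txt contents shown earlier):

      avg.show()     // prints the single-row result as a table
      avg.collect()  // e.g. Array([3.0]); the average of the counts 1..5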
  59. Why Spark SQL?
  60. Guess what I’m doing here?

      // row: (country, city, profit)
      data.filter( _._1 == "us" )
          .map { case (country, city, profit) => (city, profit) }
          .groupBy( _._1 )
          .mapValues( v => v.reduce( (a, b) => (a._1, a._2 + b._2) ) )
          .values
          .sortBy( x => x._2, false )
          .take(3)
  61. Find the top 3 cities in the US with the highest profit

      sq.sql("""
        SELECT city, SUM(profit) AS p
        FROM data
        WHERE country = 'us'
        GROUP BY city
        ORDER BY p DESC
        LIMIT 3
      """)

      // versus, with rows of (country, city, profit):
      data.filter( _._1 == "us" )
          .map { case (country, city, profit) => (city, profit) }
          .groupBy( _._1 )
          .mapValues( v => v.reduce( (a, b) => (a._1, a._2 + b._2) ) )
          .values
          .sortBy( x => x._2, false )
          .take(3)
  62.
  63.
  64. chsieh@dev-sfear01:~/SparkTutorial$ cat kmeans_data.txt
      0.0 0.0 0.0
      0.1 0.1 0.1
      0.5 0.5 0.8
      9.0 9.0 9.0
      9.1 9.1 9.1
      9.2 9.2 9.2
  65. import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      // Load and parse the data
      val data = sc.textFile("kmeans_data.txt")
      val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

      // Cluster the data into two classes using KMeans
      val numClusters = 2
      val numIterations = 20
      val clusters = KMeans.train(parsedData, numClusters, numIterations)

      // Show results
      scala> clusters.clusterCenters
      res0: Array[org.apache.spark.mllib.linalg.Vector] = Array(
        [9.099999999999998,9.099999999999998,9.099999999999998],
        [0.19999999999999998,0.19999999999999998,0.3])
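      A small follow-up sketch: assigning new points to the trained clusters (the indices follow the centers printed above):

      clusters.predict(Vectors.dense(0.2, 0.2, 0.2))  // 1, the cluster near the origin
      clusters.predict(Vectors.dense(9.0, 9.0, 9.0))  // 0, the cluster around 9.1
      clusters.computeCost(parsedData)                // within-set sum of squared errors (WSSSE)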
  66.
  67.
  68.
  69. PageRank: Random Surfer Model. The probability that a Web surfer reaches a page after many clicks, following random links (random clicks).
  70. Credit: http://en.wikipedia.org/wiki/PageRank
  71. PageRank
       PR(p) = PR(p1)/c1 + … + PR(pk)/ck, where pi is a page pointing to p and ci is the number of outgoing links on pi
       One equation for every page
       N equations, N unknown variables
      Credit: Prof. John Cho / CS144 (UCLA)
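      To make the “N equations” idea concrete, a toy fixed-point iteration for these equations (no damping factor) on a hypothetical three-page web:

      // links: page -> pages it points to (hypothetical example data)
      val links = Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1))
      var ranks = links.keys.map(p => p -> 1.0 / links.size).toMap

      for (_ <- 1 to 50) {
        // each page pi sends PR(pi)/ci to every page it links to
        val contribs = links.toSeq.flatMap { case (p, outs) => outs.map(q => q -> ranks(p) / outs.size) }
        ranks = contribs.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }
      }
      println(ranks)  // converges to roughly Map(1 -> 0.4, 2 -> 0.2, 3 -> 0.4)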
  72. users.txt
      1,BarackObama,Barack Obama
      2,ladygaga,Goddess of Love
      3,jeresig,John Resig
      4,justinbieber,Justin Bieber
      6,matei_zaharia,Matei Zaharia
      7,odersky,Martin Odersky
      8,anonsys
  73. followers.txt
      2 1
      4 1
      1 2
      6 3
      7 3
      7 6
      6 7
      3 7
  74. import org.apache.spark.graphx.GraphLoader

      // Load the edges as a graph
      val graph = GraphLoader.edgeListFile(sc, "followers.txt")

      // Run PageRank
      val ranks = graph.pageRank(0.0001).vertices  // (id, rank)

      // Join the ranks with the usernames
      val users = sc.textFile("users.txt").map { line =>
        val fields = line.split(",")
        (fields(0).toLong, fields(1))              // (id, username)
      }
      val ranksByUsername = users.join(ranks).map {
        case (id, (username, rank)) => (username, rank)
      }

      // Print the result
      println(ranksByUsername.collect().mkString("\n"))
  75.
