[DSC 2016] Series Session: Yung-Chuan Lee / A Single Spark Can Start a Prairie Fire - A First Look at Machine Learning with Spark

This course is designed for Spark beginners. In six hours it takes you from nothing to a working Spark development environment and, through hands-on exercises, introduces the use and development of Spark's machine learning library (MLlib). The labs are written in Scala, the language Spark itself is implemented in, and use Scala IDE for Eclipse plus the related libraries to build a local development environment, so the IDE's development and debugging features can speed up the workflow. The course also shows how to deploy to the Spark platform and run data analysis jobs with spark-submit. It covers the most commonly used machine learning methods: classification, regression, and clustering. If you are interested in Spark but do not know where to start, or want a quick first look at machine learning on Spark, this course is for you!



  1. 1. Yung-Chuan Lee, 2016.12.18
  2. 2. "Laws [Data applications] are like sausages. It is better not to see them being made." —Otto von Bismarck 2
  3. 3. Agenda ! Spark basics ◦ ● Spark ● Scala ● RDD ! LAB: build a development environment from scratch ◦ ● Spark with Scala IDE ! Spark MLlib ◦ ● Scala + lambda + Spark MLlib ● Clustering, Classification, Regression 3
  4. 4. ! github page: https://github.com/yclee0418/sparkTeach ◦ installation: Spark ◦ codeSample: Spark ● exercise - ● https://github.com/yclee0418/sparkTeach/tree/master/ codeSample/exercise ● final - ● https://github.com/yclee0418/sparkTeach/tree/master/ codeSample/final 4
  5. 5. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib 5 Outline
  6. 6. ! ◦ 2020 44ZB(IDC 2013~2014) ◦ ! ◦ MapReduce by Google(2004) ◦ Hadoop HDFS MapReduce by Yahoo!(2005) ◦ Spark Hadoop 10~1000 by AMPLab (2013) ! [ ]Spark Hadoop 6 –
  7. 7. ! AMPLab ! ! API ◦ Java Scala Python R ! One Stack to rule them all ◦ SQL Streaming ◦ RDD 7 Spark
  8. 8. ! Cluster Manager ◦ Standalone – Spark Manager ◦ Apache Mesos ◦ Hadoop YARN 8 Spark
  9. 9. ! [exercise]Spark ◦ JDK 1.8 ◦ spark-2.0.1.tgz(http://spark.apache.org/downloads.html) ◦ Terminal (for Mac) ● cd /Users/xxxxx/Downloads ( ) ● tar -xvzf spark-2.0.1.tgz ( ) ● sudo mv spark-2.0.1 /usr/local/spark (spark /usr/local) ● cd /usr/local/spark ● ./build/sbt package( spark 1 ) ● ./bin/spark-shell ( Spark shell pwd / usr/local/spark) 9 Spark (2.0.1) [Tips] https://goo.gl/oxNbIX ./bin/run-example org.apache.spark.examples.SparkPi
  10. 10. ! Spark Shell is Spark's interactive command line ◦ handy for trying Spark out ! Start spark-shell ◦ [from the Spark install directory] run ./bin/spark-shell ! Try these commands ◦ var res1: Int = 3 + 5 ◦ import org.apache.spark.rdd._ ◦ val intRdd: RDD[Int]=sc.parallelize(List(1,2,3,4,5)) ◦ intRdd.collect ◦ val txtRdd=sc.textFile("file:///usr/local/spark/README.md") ◦ txtRdd.count ! Leave spark-shell ◦ [type] :quit or press Ctrl+D 10 Spark Shell: interact with Spark using Scala [Tips]: ➢ What is the difference between var and val? ➢ What types do intRdd and txtRdd have? ➢ What does typing org. and pressing [Tab] do? ➢ See http://localhost:4040
  11. 11. ! Spark ! RDD(Resilient Distributed Dataset) ! Scala ! Spark MLlib 11 Outline
  12. 12. ! Proposed by Google ! Work is split into Map and Reduce phases ! MapReduce signatures ◦ Map: (K1, V1) -> list(K2, V2) ◦ Reduce: (K2, list(V2)) -> list(K3, V3) ! Classic example: Word Count 12 Before RDD: MapReduce
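To make the Map and Reduce signatures above concrete, here is a tiny word-count sketch in plain Scala (my own illustration, not code from the slides; the input document is made up):

    // Map: (docId, text) -> list((word, 1));  Reduce: (word, list(counts)) -> (word, total)
    val docs = Seq(("doc1", "to be or not to be"))
    val mapped = docs.flatMap { case (_, text) => text.split(" ").map(w => (w, 1)) }
    val reduced = mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2).sum) }
    // reduced: Map(to -> 2, be -> 2, or -> 1, not -> 1)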
  13. 13. ! MapReduce on Hadoop Word Count … ◦ iteration iteration ( ) … 13 Hadoop … HDFS
  14. 14. ! Spark's core abstraction – RDD (Resilient Distributed Datasets) ◦ In-Memory Data Processing and Sharing ◦ fault tolerant and efficient ! Fault tolerance ◦ the lineage records how each RDD was derived ◦ lost partitions can be recomputed from the lineage ! Three kinds of operations ◦ Transformations: lazy, in-memory, extend the lineage and produce a new RDD ◦ Action: triggers the computation and returns a value or writes to storage ◦ Persistence: keeps an RDD cached in memory 14 Spark example: 1+2+3+4+5 = 15 via a Transformation plus an Action
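A minimal sketch of the three operation types above, assuming the sc provided by spark-shell (illustrative only):

    val nums = sc.parallelize(List(1, 2, 3, 4, 5))   // build an RDD
    val doubled = nums.map(_ * 2)                    // Transformation: lazy, only the lineage is recorded
    doubled.persist()                                // Persistence: cache the result once it is computed
    val total = nums.reduce(_ + _)                   // Action: runs the job and returns 1+2+3+4+5 = 15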
  15. 15. 15 RDD RDD Ref: http://spark.apache.org/docs/latest/programming-guide.html#transformations
  16. 16. ! SparkContext.textFile – RDD ! map: RDD RDD ! filter: RDD RDD ! reduceByKey: RDD Key RDD Key ! groupByKey: RDD Key RDD ! join cogroup: RDD Key RDD ! sortBy reverse: RDD ! take(N): RDD N RDD ! saveAsTextFile: RDD 16 RDD
  17. 17. ! count: RDD ! collect: RDD Collection(Seq ! head(N): RDD N ! mkString: Collection 17 [Tips] • • Transformation
  18. 18. ! [Exercise] Try these in spark-shell ◦val intRDD = sc.parallelize(List(1,2,3,4,5,6,7,8,9,0)) ◦intRDD.map(x => x + 1).collect() ◦intRDD.filter(x => x > 5).collect() ◦intRDD.stats ◦val mapRDD=intRDD.map{x=>("g"+(x%3), x)} ◦mapRDD.groupByKey.foreach{x=>println("key: %s, vals=%s".format(x._1, x._2.mkString(",")))} ◦mapRDD.reduceByKey(_+_).foreach(println) ◦mapRDD.reduceByKey{case(a,b) => a+b}.foreach(println) 18 RDD exercise
  19. 19. ! [Exercise] (The Gettysburg Address) ◦ (The Gettysburg Address)(https:// docs.google.com/file/d/0B5ioqs2Bs0AnZ1Z1TWJET2NuQlU/ view) gettysburg.txt ◦ gettysburg.txt ( ) ● ◦ ◦ ◦ 19 RDD (Word Count ) sc.textFile flatMap split toLowerCase, filter sortBy foreach https://github.com/yclee0418/sparkTeach/blob/master/codeSample/exercise/ WordCount_Rdd.txt take(5) foreach reduceByKey
  20. 20. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib 20 Outline
  21. 21. ! Scala Scalable Language ( ) ! Scala ◦ lambda expression Scala Scala: List(1,2,3,4,5).foreach(x=>println(item %d.format(x))) Java: Int[] intArr = new Array[] {1,2,3,4,5}; for (int x: intArr) println(String.format(item %d, x)); ! scala Java .NET ! ( actor model akka) ! Spark
  22. 22. ! import ◦ import org.apache.spark.SparkContext ◦ import org.apache.spark.rdd._ (import every class under rdd) ◦ import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } (import selected clustering classes) ! Variable declaration ◦ val int1: Int = 5 (immutable; reassigning is a compile error) ◦ var int2: Int = 5 (mutable; can be reassigned) ◦ val int = 5 (type is inferred) ! Function definition (no return value) ◦ def voidFunc(param1: Type, param2: Type2) = { … } 22 Scala basics def setLogger = { Logger.getLogger("com").setLevel(Level.OFF) Logger.getLogger("io").setLevel(Level.OFF) }
  23. 23. ! ( ) ◦ def rtnFunc1(param1: Type, param2: Type2): Type3 = { val v1:Type3 = … v1 // } ! ( ) ◦ def rtnFunc2(param1: Type, param2: Type2): (Type3, Type4) = { val v1: Type3 = … val v2: Type4= … (v1, v2) // } 23 Scala def getMinMax(intArr: Array[Int]):(Int,Int) = { val min=intArr.min val max=intArr.max (min, max) }
  24. 24. ! ◦ val res = rtnFunc1(param1, param2) ( res ) ◦ val (res1, res2) = rtnFunc2(param1, param2) ( res1,res2 ) ◦ val (_, res2) = rtnFunc2(param1, param2) ( ) ! For Loop ◦ for (i <- collection) { … } ! For Loop ( yield ) ◦ val rtnArr = for (i <- collection) yield { … } 24 Scala val intArr = Array(1,2,3,4,5,6,7,8,9) val multiArr= for (i <- intArr; j <- intArr) yield { i*j } //multiArr 81 99 val (min,max)=getMinMax(intArr) val (_, max)=getMinMax(intArr)
  25. 25. ! Tuple ◦ Tuple ◦ val v=(v1,v2,v3...) v._1, v._2, v._3… ◦ lambda ◦ lambda (_) 25 Scala val intArr = Array(1,2,3,4,5,7,8,9) val res=getMinMax(intArr) //res=(1,9)=>tuple val min=res._1 // res val max=res._2 // res val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple val intArr2=intArr.map(x=> (x._1 * x._2 * x._3)) //intArr2: Array[Int] = Array(6, 120, 504) val intArr3=intArr.filter(x=> (x._1 + x._2 > x._3)) //intArr3: Array[(Int, Int, Int)] = Array((4,5,6), (7,8,9)) val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple def getThird(x:(Int,Int,Int)): Int = { (x._3) } val intArr2=intArr.map(getThird(_)) val intArr2=intArr.map(x=>getThird(x)) // //intArr2: Array[Int] = Array(3, 6, 9)
  26. 26. ! Class ◦ Scala classes are more concise than JAVA classes ● constructor parameters marked val/var become public fields instead of private/protected ones ● the class signature is the primary constructor 26 Scala Scala: class Person(userID: Int, name: String) // parameters are private class Person(val userID: Int, var name: String) // public; userID is read-only (val) val person = new Person(102, "John Smith") // create a Person instance person.userID // 102 Java equivalent: public class Person { private final int userID; private final String name; public Person(int userID, String name) { this.userID = userID; this.name = name; }}
  27. 27. ! Object ◦ Scala has no static members; it uses singleton objects instead ◦ members of a Scala object behave like static members ● a Scala object is a singleton: the single instance is created for you ! Scala Object vs Class ◦ use an object for utilities and for the Spark Driver Program entry point ◦ use a class for entities 27 Scala Scala Object: object Utility { def isNumeric(input: String): Boolean = input.trim().matches("""[+-]?((\d+(e\d+)?[lL]?)|(((\d+(\.\d*)?)|(\.\d+))(e\d+)?[fF]?))""") def toDouble(input: String): Double = { val rtn = if (input.isEmpty() || !isNumeric(input)) Double.NaN else input.toDouble rtn }} val d = Utility.toDouble("20") // no new needed
  28. 28. ! Create an array ◦ val intArr = Array(1,2,3,4,5,6,7,8,9) ! Concatenate arrays ◦ val intArrExtra = intArr ++ Array(0,11,12) ! map: transform each element ! filter: keep matching elements ! join: combine two Maps by key ! sortBy, reverse: ordering ! take(N): first N elements 28 scala val intArr = Array(1,2,3,4,5,6,7,8,9) val intArr2=intArr.map(_ * 2) //intArr2: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18) val intArr3=intArr.filter(_ > 5) //intArr3: Array[Int] = Array(6, 7, 8, 9) val intArr4=intArr.reverse //intArr4: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
  29. 29. ! sum: ◦ val sum = Array(1,2,3,4,5,7,8,9).sum ! max: ◦ val max = Array(1,2,3,4,5,7,8,9).max ! min: ◦ val min = Array(1,2,3,4,5,7,8,9).min ! distinct: 29 scala val intArr = Array(1,2,3,4,5,7,8,9) val sum = intArr.sum //sum = 45 val max = intArr.max //max = 9 val min = intArr.min //min = 1 val disc = Array(1,1,1,2,2,2,3,3).distinct //disc = Array(1,2,3)
  30. 30. ! spark-shell ! ScalaIDE for eclipse 4.4.1 ◦ http://scala-ide.org/download/sdk.html ◦ ◦ ( ) ◦ ◦ ScalaIDE 30 (IDE)
  31. 31. ! Driver Program(word complete breakpoint ) ! spark-shell jar ! ◦Eclipse 4.4.2 (Luna) ◦ Scala IDE 4.4.1 ◦ Scala 2.11.8 and Scala 2.10.6 ◦ Sbt 0.13.8 ◦ Scala Worksheet 0.4.0 ◦ Play Framework support 0.6.0 ◦ ScalaTest support 2.10.0 ◦ Scala Refactoring 0.10.0 ◦ Scala Search 0.3.0 ◦ Access to the full Scala IDE ecosystem 31 Scala IDE for eclipse
  32. 32. ! Scala IDE Driver Program ◦ Scala Project ◦ Build Path ● Spark ● Scala ◦ package ● package ( ) ◦ scala object ◦ ◦ debug ◦ Jar ◦ spark-submit Spark 32 Scala IDE Driver Program
  33. 33. ! Scala IDE ◦ FILE -> NEW -> Scala Project ◦ project FirstScalaProj ◦ JRE 1.8 (1.7 ) ◦ ◦ Finish 33 Scala Project
  34. 34. ! Package Explorer Project Explorer FirstScalaProj Build Path -> Configure Build Path 34 Build Path [Tips]: Q: Package Project Explorer A: ! Scala perspective ! Scala perspective -> Window -> Show View
  35. 35. ! Spark Driver Program Build Path ◦ jar ◦ Scala Library Container 2.11.8(IDE 2.11.8 ) ! Configure Build Path Java Build Path Libraries - >Add External JARs… ◦Spark Jar Spark /assembly/target/scala-2.11/jars/ ◦ jar ! Java Build Path Scala Library Container 2.11.8 35 Build Path
  36. 36. ! Package Explorer FirstScalaProj src package ◦ src ->New->Package( Ctrl N) ◦ bikeSharing Package ! FirstScalaProj data (Folder) input 36 Package
  37. 37. ! (gettysburg.txt)copy data ! bikeSharing Package Scala Object BikeCountSort ! 37 Scala Object
  38. 38. package practice1 //spark lib import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.rdd._ //log import org.apache.log4j.Logger import org.apache.log4j.Level object WordCount { def main(args: Array[String]): Unit = { // keep the console log quiet Logger.getLogger("org").setLevel(Level.ERROR) //mask MLlib INFO msgs val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]")) val rawRdd = sc.textFile("data/gettysburg.txt").flatMap { x => x.split(" ") } // lower-case every word and drop empty strings val txtRdd = rawRdd.map { x => x.toLowerCase.trim }.filter { x => !x.equals("") } val countRdd = txtRdd.map { x => (x, 1) } // map each word to (word, 1) val resultRdd = countRdd.reduceByKey { case (a, b) => a + b } // sum the counts per word with reduceByKey val sortResRdd = resultRdd.sortBy((x => x._2), false) // sort by count, descending sortResRdd.take(5).foreach { println } // print the top 5 words sortResRdd.saveAsTextFile("data/wc_output") } } 38 WordCount: import the libraries, define an object with a main method, write the results with saveAsTextFile
  39. 39. ! word complete ALT / word complete ! ( tuple ) 39 IDE
  40. 40. ! debug configuration ◦ icon Debug Configurations ◦ Scala Application Debug ● Name WordCount ● Project FirstScalaProj ● Main Class practice1.WordCount ◦ Launcher 40 Debug Configuration
  41. 41. ! icon Debug Configuration Debug console 41 [Tips] • data/output sortResRdd ( part-xxxx ) • Log Level console • output
  42. 42. ! Spark-Submit JAR ◦ Package Explorer FirstScalaProj - >Export...->Java/JAR file-> FirstScalaProj src JAR File 42 JAR
  43. 43. ! input output JAR File ◦ data JAR File 43 Spark-submit
  44. 44. ! spark-submit 44 Spark-submit 1.submit 2. lunch works 3. return status
  45. 45. ! Command Line JAR File ! Spark-submit ./bin/spark-submit --class <main-class> (package scala object ) --master <master-url> ( master URL local[Worker thread num]) --deploy-mode <deploy-mode> ( Worker Cluster Client Client) --conf <key>=<value> ( Spark ) ... # other options <application-jar> (JAR ) [application-arguments] ( Driver main ) 45 Spark-submit submit JOB Spark /bin/spark-submit --class practice1.WordCount -- master local[*] WordCount.jar [Tips]: ! spark-submit JAR data ! merge output ◦ linux: cat data/output/part-* > res.txt ◦ windows: type dataoutputpart-* > res.txt
  46. 46. ! [Exercise] In the wordCount package, add a WordCount2 object ◦ for gettysburg.txt, besides the counts, also report the positions where each word appears ● Hint1: pair each word with its position (index) ● val posRdd=txtRdd.zipWithIndex() ● Hint2: use reduceByKey or groupByKey to collect the indexes per word (see the sketch below) 46 Word Count exercise
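A hedged sketch of how the two hints could fit together (not the official solution; it assumes txtRdd is the cleaned RDD[String] of words from the WordCount example):

    val posRdd = txtRdd.zipWithIndex()    // Hint 1: RDD[(word, position)]
    val wordPos = posRdd.groupByKey()     // Hint 2: word -> all positions where it appears
    wordPos.take(5).foreach { case (w, idxs) => println("%s -> %s".format(w, idxs.mkString(","))) }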
  47. 47. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 47 Outline
  48. 48. ! ◦ ◦ ! ◦ ◦ 48 Tasks Experience Performance
  49. 49. ! ! ! ! ! ! ! ! ! DNA ! ! 49
  50. 50. ! (Supervised learning) ◦ (Training Set) ◦ (Features) (Label) ◦ Regression Classification ( ) 50 http://en.proft.me/media/science/ml_svlw.jpg
  51. 51. ! (Unsupervised learning) ◦ ( Label) ◦ ◦ Clustering ( KMeans) 51http://www.cnblogs.com/shishanyuan/p/4747761.html
  52. 52. ! MLlib Machine Learning library Spark ! ◦ RDD ◦ 52 Spark MLlib http://www.cnblogs.com/shishanyuan/p/4747761.html
  53. 53. 53 Spark MLlib https://www.safaribooksonline.com/library/view/spark-for-python/9781784399696/graphics/B03986_04_02.jpg
  54. 54. ! Bike Sharing Dataset ( ) ! https://archive.ics.uci.edu/ml/datasets/ Bike+Sharing+Dataset ◦ ● hour.csv: 2011.01.01~2012.12.30 17,379 ● day.csv: hour.csv 54 Spark MLlib Let’s biking
  55. 55. 55 Bike Sharing Dataset Features Label (for hour.csv only) (0 to 6) (1 to 4)
  56. 56. ! ◦ (Summary Statistics): MultivariateStatisticalSummary Statistics ◦ Feature ( ) Label ( ) (correlation) Statistics ! ◦ Clustering KMeans ! ◦ Classification Decision Tree LogisticRegressionWithSGD ! ◦ Regression Decision Tree LinearRegressionWithSGD 56 Spark MLlib
  57. 57. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 57 Outline
  58. 58. ! Summary Statistics ◦ describe the overall shape of the data (count, mean, variance, max, min) ◦ a first step in understanding a dataset ◦ Spark provides two ways ● Method 1: an RDD[Double/Float/Int] has a built-in stats method ● Method 2: build an RDD[Vector] and call Statistics.colStats 58
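A minimal sketch of the two approaches above (assuming the sc from spark-shell; the numbers are made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics
    // Method 1: an RDD of Double has built-in stats
    val doubles = sc.parallelize(List(1.0, 2.0, 3.0, 4.0, 5.0))
    println(doubles.stats())              // count, mean, stdev, max, min
    // Method 2: an RDD[Vector] goes through Statistics.colStats for per-column summaries
    val vectors = sc.parallelize(List(Vectors.dense(1.0, 10.0), Vectors.dense(2.0, 20.0)))
    val summary = Statistics.colStats(vectors)
    println(summary.mean)                 // per-column means
    println(summary.variance)             // per-column variances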
  59. 59. 59 ! (correlation) ◦ (Correlation ) ◦ Spark Pearson Spearman ◦ r Statistics.corr ● 0 < | r | < 0.3 ( ) ● 0.3 <= | r | < 0.7 ( ) ● 0.7 <= | r | < 1 ( ) ● r = 1 ( )
  60. 60. 60 A. Scala B. Package Scala Object C. data Folder D. Library
  61. 61. ! ScalaIDE Scala folder package Object ◦ SummaryStat ( ) ● src ● bike (package ) ● BikeSummary (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 61
  62. 62. 62 A. import B. main Driver Program C. Log D. SparkContext
  63. 63. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat.{ MultivariateStatisticalSummary, Statistics } object BikeSummary { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]")) } } ! spark-shell creates the SparkContext sc for you ! in a Driver Program you create sc yourself ◦ appName - the Driver Program's name ◦ master - the master URL 63
  64. 64. 64 ! prepare ◦ input file Features Label RDD
  65. 65. ! Use lines.map to turn the features (columns 3~14) and the label (column 17) into an RDD ! Candidate RDD element types: ◦ RDD[Array] ◦ RDD[Tuple] ◦ RDD[BikeShareEntity] prepare def prepare(sc: SparkContext): RDD[???] = { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData:RDD[???]=lines.map{ … } //??? depends on your impl } 65
  66. 66. ! RDD[Array]: ◦ val bikeData:RDD[Array[Double]] =lines.map{x=>(x.slice(3,13).map(_.toDouble) ++ Array(x(16).toDouble))} ◦ Trade-off: prepare is easy to write, but the result is painful to use later (you have to remember each column's index in the Array) and easy to get wrong ! RDD[Tuple]: ◦ val bikeData:RDD[(Double, Double, Double, …, Double)] =lines.map{case(season,yr,mnth,…,cnt)=>(season.toDouble, yr.toDouble, mnth.toDouble,…cnt.toDouble)} ◦ Trade-off: prepare is harder to write and still awkward to use later, but less error-prone (well-named variables can receive the returned values) ◦ Example: val features = bikeData.map{case(season,yr,mnth,…,cnt)=> (season, yr, mnth, …, windspeed)} 66
  67. 67. ! RDD[case class]: ◦ val bikeData:RDD[BikeShareEntity] = lines.map{ x=> BikeShareEntity(⋯)} ◦ Trade-off: prepare is painful to write, but the result is pleasant to use later (you work with entity objects instead of column positions, a cleaner abstraction) and hard to get wrong ◦ Example: val labelRdd = bikeData.map{ ent => { ent.label }} Case class definition case class BikeShareEntity(instant: String,dteday:String,season:Double, yr:Double,mnth:Double,hr:Double,holiday:Double,weekday:Double, workingday:Double,weathersit:Double,temp:Double, atemp:Double,hum:Double,windspeed:Double,casual:Double, registered:Double,cnt:Double) 67 map the split lines into RDD[BikeShareEntity] val bikeData = lines.map { x => BikeShareEntity(x(0), x(1), x(2).toDouble, x(3).toDouble,x(4).toDouble, x(5).toDouble, x(6).toDouble,x(7).toDouble,x(8).toDouble, x(9).toDouble,x(10).toDouble,x(11).toDouble,x(12).toDouble, x(13).toDouble,x(14).toDouble,x(15).toDouble,x(16).toDouble) }
  68. 68. 68 ! (Class) ! prepare ◦ input file Features Label RDD
  69. 69. Entity Class //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat. { MultivariateStatisticalSummary, Statistics } object BikeSummary { case class BikeShareEntity(⋯⋯) def main(args: Array[String]): Unit = { Logger.getLogger(com).setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]")) } } 69
  70. 70. 70 ! getFeatures ◦ ! printSummary ◦ console ! printCorrelation ◦ console
  71. 71. printSummary def printSummary(entRdd: RDD[BikeShareEntity]) = { val dvRdd = entRdd.map { x => Vectors.dense(getFeatures(x)) } // build an RDD[Vector] // use Statistics.colStats to compute the Summary Statistics val summaryAll = Statistics.colStats(dvRdd) println("mean: " + summaryAll.mean.toArray.mkString(",")) // per-column mean println("variance: " + summaryAll.variance.toArray.mkString(",")) // per-column variance } 71 getFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = { // pick the fields to summarize val featureArr = Array(bikeData.casual, bikeData.registered, bikeData.cnt) featureArr }
  72. 72. printCorrelation def printCorrelation(entRdd: RDD[BikeShareEntity]) = { // build an RDD[Double] for each field val cntRdd = entRdd.map { x => x.cnt } val yrRdd = entRdd.map { x => x.yr } // correlation between yr and cnt val yrCorr = Statistics.corr(yrRdd, cntRdd) println("correlation: %s vs %s: %f".format("yr", "cnt", yrCorr)) val seaRdd = entRdd.map { x => x.season } // season val seaCorr = Statistics.corr(seaRdd, cntRdd) println("correlation: %s vs %s: %f".format("season", "cnt", seaCorr)) } 72
  73. 73. A. Run it ◦ add BikeSummary.scala to the SummaryStat project ◦ put hour.csv in the data folder ◦ complete BikeSummary (fill in the TODOs) B. Extend it ◦ extend getFeatures and printSummary ● print the temperature (temp), humidity (hum) and wind speed (windspeed) summaries to the console ● for each yr and mnth, print the temp, hum, windspeed and cnt summaries to the console 73 for (yr <- 0 to 1) for (mnth <- 1 to 12) { val yrMnRdd = entRdd.filter { ??? }.map { x => Vectors.dense(getFeatures(x)) } val summaryYrMn = Statistics.colStats( ??? ) println("====== summary yr=%d, mnth=%d ==========".format(yr,mnth)) println("mean: " + ???) println("variance: " + ???) }
  74. 74. A. ◦ BikeSummary printCorrelation ◦ hour.csv [yr~windspeed] cnt console B. feature ◦ printCorrelation ● yr mnth feature( yrmo yrmo=yr*12+mnth) yrmo cnt 74
  75. 75. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 75 Outline
  76. 76. ! Traing Set ( Label) ! (cluster) ! ! ◦ 76 Clustering
  77. 77. ! K-Means clustering ! Given points (x1,x2,...,xn), K-Means partitions the n points into K clusters (k≤n) so that the WCSS (within-cluster sum of squares) is minimized ! Steps: A. pick K initial centers B. assign each point to the nearest of the K centers C. recompute each center as the mean of its assigned points D. repeat B and C until the centers stop moving 77 one K-Means RUN performs several iterations
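A toy one-dimensional sketch of one pass over steps B and C above (my own illustration; in practice MLlib does this for you, as the following slides show):

    val points  = Seq(1.0, 2.0, 9.0, 10.0)
    var centers = Seq(0.0, 5.0)                                               // step A: pick K = 2 initial centers
    val assigned = points.groupBy(p => centers.minBy(c => math.abs(p - c)))   // step B: assign each point to its nearest center
    centers = assigned.values.map(ps => ps.sum / ps.size).toSeq               // step C: recompute the centers; repeat B/C until they converge
    // centers is now Seq(1.5, 9.5)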
  78. 78. 78 K-Means ref: http://mropengate.blogspot.tw/2015/06/ai-ch16-5-k-introduction-to-clustering.html
  79. 79. ! KMeans.train Model(KMeansModel ◦ val model=KMeans.train(data, numClusters, maxIterations, runs) ● data (RDD[Vector]) ● numClusters (K) ● maxIterations run Iteration iteration maxIterations model ● runs KMeans run model ! model.clusterCenters Feature ! model.computeCost WCSS model 79 K-Means in Spark MLlib
  80. 80. 80 K-Means BikeSharing ! hour.csv KMeans console ◦ Features yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed,cnt( cnt Label Feature ) ◦ numClusters 5 ( 5 ) ◦ maxIterations 20 ( run 20 iteration) ◦ runs 3 3 Run model)
  81. 81. 81 Model A. Scala B. Package Scala Object C. data Folder D. Library Model K
  82. 82. ! ScalaIDE Scala folder package Object ◦ Clustering ( ) ● src ● bike (package ) ● BikeShareClustering (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 82
  83. 83. 83 A. import B. main Driver Program C. Log D. SparkContext Model Model K
  84. 84. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import KMeans library import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } object BikeShareClustering { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]")) } } ! import the KMeans Library ! create sc in the Driver Program ◦ appName - the Driver Program's name ◦ master - the master URL 84
  85. 85. 85 ! (Class) ! prepare ◦ input file Features Label RDD ! BikeSummary Model Model K
  86. 86. 86 ! getFeatures ◦ ! KMeans ◦ KMeans.train KMeansModel ! getDisplayString ◦ Model Model K
  87. 87. getFeatures getDisplayString getFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.cnt, bikeData.yr, bikeData.season, bikeData.mnth, bikeData.hr, bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed, bikeData.casual, bikeData.registered) featureArr } 87 getDisplayString def getDisplayString(centers:Array[Double]): String = { val dispStr = """cnt: %.5f, yr: %.5f, season: %.5f, mnth: %.5f, hr: %.5f, holiday: %.5f, weekday: %.5f, workingday: %.5f, weathersit: %.5f, temp: %.5f, atemp: %.5f, hum: %.5f,windspeed: %.5f, casual: %.5f, registered: %.5f""" .format(centers(0), centers(1),centers(2), centers(3),centers(4), centers(5),centers(6), centers(7),centers(8), centers(9),centers(10), centers(11),centers(12), centers(13),centers(14)) dispStr }
  88. 88. KMeans // turn the features into an RDD[Vector] val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations, 3 runs 88 var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => { println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray))) clusterIdx += 1 } } // print the centers sorted by cnt
  89. 89. //K-Means import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } object BikeShareClustering { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger // create the SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]")) println("============== preparing data ==================") val bikeData = prepare(sc) // read hour.csv into RDD[BikeShareEntity] bikeData.persist() println("============== clustering by KMeans ==================") // turn the features into an RDD[Vector] val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations, 3 runs var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => { println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray))) clusterIdx += 1 } } // print the centers sorted by cnt bikeData.unpersist() } 89
  90. 90. 90 ! yr season mnth hr cnt ! weathersit cnt ( ) ! temp atemp cnt ( ) ! hum cnt ( ) ! correlation
  91. 91. 91 ! K Model WCSS ! WCSS K Model Model K
  92. 92. ! model.computeCost returns the WCSS of the model (smaller WCSS means tighter clusters) ! try several numClusters values and compare the WCSS for each K ! WCSS always drops as K grows, so look for the point of diminishing returns 92 K-Means println("============== tuning parameters ==================") for (k <- Array(5,10,15,20,25)) { // compare the WCSS for different numClusters val iterations = 20 val tm = KMeans.train(featureRdd, k, iterations, 3) println("k=%d, WCSS=%f".format(k, tm.computeCost(featureRdd))) } ============== tuning parameters ================== k=5, WCSS=89540755.504054 k=10, WCSS=36566061.126232 k=15, WCSS=23705349.962375 k=20, WCSS=18134353.720998 k=25, WCSS=14282108.404025
  93. 93. A. ◦ BikeShareClustering.scala Scala ◦ hour.csv data ◦ BikeShareClustering ( TODO B. feature ◦ BikeClustering ● yrmo getFeatures KMeans console yrmo ● numClusters (ex:50,75,100) 93 K-Means ! K-Means ! KMeans
  94. 94. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 94 Outline
  95. 95. ! (Binary Classification) (Multi-Class Classification) ! ! ◦ (logistic regression) (decision trees) (naive Bayes) ◦ 95
  96. 96. ! ! (Features) (Label) ! (Random Forest) ! 96
  97. 97. ! import org.apache.spark.mllib.tree.DecisionTree ! import org.apache.spark.mllib.tree.model.DecisionTreeModel ! DecisionTree.trainClassifier Model(DecisionTreeModel ◦ val model=DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins) ● trainData RDD[LabeledPoint] ● numClasses 2 ● categoricalFeaturesInfo trainData categorical Map[ Index, ] continuous ● Map(0->2,4->10) 1,5 categorical 2,10 ● impurity (Gini Entropy) ● maxDepth overfit ● maxBins ● categoricalFeaturesInfo maxBins categoricalFeaturesInfo 97 Decision Tree in Spark MLlib
  98. 98. ! ( ) ! threshold( ) ◦ Features yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed ◦ Label cnt 200 1 0 ◦ numClasses 2 ◦ impurity gini ◦ maxDepth 5 ◦ maxBins 30 98
  99. 99. 99 Model A. Scala B. Package Scala Object C. data Folder D. Library Model
  100. 100. ! ScalaIDE Scala folder package Object ◦ Classification ( ) ● src ● bike (package ) ● BikeShareClassificationDT (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 100
  101. 101. 101 Model Model A. import B. main Driver Program C. Log D. SparkContext
  102. 102. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import decision tree library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareClassificationDT { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationDT").setMaster("local[*]")) } } ! import the Decision Tree Library ! create sc in the Driver Program ◦ appName - the Driver Program's name ◦ master - the master URL 102
  103. 103. 103 Model Model ! (Class) ◦ BikeSummary ! prepare ◦ input file Features Label RDD[LabeledPoint] ◦ RDD[LabeledPoint] ! getFeatures ◦ Model feature ! getCategoryInfo ◦ categroyInfoMap
  104. 104. prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData = lines.map{ x => BikeShareEntity(⋯) } //RDD[BikeShareEntity] val lpData=bikeData.map { x => { val label = if (x.cnt > 200) 1 else 0 // 1 if cnt is above 200, otherwise 0 val features = Vectors.dense(getFeatures(x)) new LabeledPoint(label, features) // a LabeledPoint is a label plus a feature Vector } } // randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  105. 105. 105 getFeatures getCategoryInfo getFeatures方法 def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1, bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr } // season Feature 1 getCategoryInfo方法 def getCategoryInfo(): Map[Int, Int]= { val categoryInfoMap = Map[Int, Int]( (/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4)) categoryInfoMap } //( featureArr index, distinct )
  106. 106. 106 Model Model ! trainModel ◦ DecisionTree.trainClassifier Model ! evaluateModel ◦ AUC trainModel Model
  107. 107. 107 trainModel evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int,cateInfo: Map[Int, Int]): (DecisionTreeModel, Double) = { val startTime = new DateTime() // val model = DecisionTree.trainClassifier(trainData, 2, cateInfo, impurity, maxDepth, maxBins) // Model val endTime = new DateTime() // val duration = new Duration(startTime, endTime) // //MyLogger.debug(model.toDebugString) // Decision Tree (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) // RDD[( )] AUC } val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auc = metrics.areaUnderROC()// areaUnderROC auc auc }
  108. 108. 108 Model Model! tuneParameter ◦ impurity Max Depth Max Bin trainModel evaluateModel AUC
  109. 109. AUC (Area under the Curve of ROC) - confusion matrix: for actual Positive (Label 1), a positive prediction is a true positive (TP) and a negative prediction a false negative (FN); for actual Negative (Label 0), a positive prediction is a false positive (FP) and a negative prediction a true negative (TN). ! True Positive Rate (TPR): the fraction of actual positives that are predicted positive ◦ TPR=TP/(TP+FN) ! False Positive Rate (FPR): the fraction of actual negatives that are predicted positive ◦ FPR=FP/(FP+TN)
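A small hand-worked sketch of the TPR and FPR formulas above (my own illustration; the (prediction, label) pairs are made up and mirror the scoreAndLabels built in evaluateModel):

    val scoreAndLabels = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))
    val tp = scoreAndLabels.count { case (p, l) => p == 1.0 && l == 1.0 }   // true positives
    val fn = scoreAndLabels.count { case (p, l) => p == 0.0 && l == 1.0 }   // false negatives
    val fp = scoreAndLabels.count { case (p, l) => p == 1.0 && l == 0.0 }   // false positives
    val tn = scoreAndLabels.count { case (p, l) => p == 0.0 && l == 0.0 }   // true negatives
    val tpr = tp.toDouble / (tp + fn)   // TPR = TP / (TP + FN) = 2/3
    val fpr = fp.toDouble / (fp + tn)   // FPR = FP / (FP + TN) = 1/2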
  110. 110. ! Plot FPR on the X axis against TPR on the Y axis to get the ROC curve ! AUC is the area under the ROC curve 110 Interpreting AUC: AUC = 1 means perfect (100%) prediction; 0.5 < AUC < 1 means better than random guessing (higher is better); AUC = 0.5 means no better than random; AUC < 0.5 means worse than random. AUC (Area under the Curve of ROC)
  111. 111. tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint], cateInfo: Map[Int, Int]) = { val impurityArr = Array("gini", "entropy") val depthArr = Array(3, 5, 10, 15, 20, 25) val binsArr = Array(50, 100, 200) val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield { // train a model and compute its AUC for every combination val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo) val auc = evaluateModel(validateData, model) println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f".format(impurity, maxDepth, maxBins, auc)) (impurity, maxDepth, maxBins, auc) } val bestEval = (evalArr.sortBy(_._4).reverse)(0) // keep the combination with the highest AUC println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f".format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  112. 112. 112 Decision Tree //MLlib lib import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.evaluation._ import org.apache.spark.mllib.linalg.Vectors //decision tree import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareClassificationDT { case class BikeShareEntity(…) // case class def main(args: Array[String]): Unit = { MyLogger.setLogger val doTrain = (args != null && args.length > 0 && "Y".equals(args(0))) val sc = new SparkContext(new SparkConf().setAppName("ClassificationDT").setMaster("local[*]")) println("============== preparing data ==================") val (trainData, validateData) = prepare(sc) val cateInfo = getCategoryInfo() if (!doTrain) { println("============== train Model (CateInfo)==================") val (modelC, durationC) = trainModel(trainData, "gini", 5, 30, cateInfo) val aucC = evaluateModel(validateData, modelC) println("validate auc(CateInfo)=%f".format(aucC)) } else { println("============== tuning parameters(cateInfo) ==================") tuneParameter(trainData, validateData, cateInfo) } } }
  113. 113. A. ◦ BikeShareClassificationDT.scala Scala ◦ hour.csv data ◦ BikeShareClassificationDT ( TODO B. feature ◦ BikeShareClassificationDT ● category AUC ● feature ( |correlation| > 0.1 ) Model AUC 113 Decision Tree ============== tuning parameters(cateInfo) ================== parameter: impurity=gini, maxDepth=3, maxBins=50, auc=0.835524 parameter: impurity=gini, maxDepth=3, maxBins=100, auc=0.835524 parameter: impurity=gini, maxDepth=3, maxBins=200, auc=0.835524 parameter: impurity=gini, maxDepth=5, maxBins=50, auc=0.851846 parameter: impurity=gini, maxDepth=5, maxBins=100, auc=0.851846 parameter: impurity=gini, maxDepth=5, maxBins=200, auc=0.851846
  114. 114. ! (simple linear regression, :y=ax+b) (y) ◦ (x) (y) ! (Logistic regression) ◦ ! S (sigmoid) p(probability) 0.5 [ ] [ ] 114
  115. 115. ! import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel } ! LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) Model(LogisticRegressionModel ◦val model=LogisticRegressionWithSGD.train(trainData,numIterations, stepSize, miniBatchFraction) ● trainData RDD[LabeledPoint] ● numIterations (SGD) 100 ● stepSize SGD 1 ● miniBatchFraction 0~1 1 115 Logistic Regression in Spark http://www.csie.ntnu.edu.tw/~u91029/Optimization.html
  116. 116. ! LogisticRegression train Categorical Feature one-of- k(one-hot) encoding ! One-of-K encoding: ◦ N (N= ) ◦ index 1 0 116 Categorical Features weather Value Clear 1 Mist 2 Light Snow 3 Heavy Rain 4 weathersit Index 1 0 2 1 3 2 4 3 INDEX Map Index Encode 0 1000 1 0100 2 0010 3 0001 Encoding
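A minimal one-of-K sketch for the weathersit table above (illustrative only; the full getCategoryFeature helper appears a few slides later):

    val weatherMap = Map(1.0 -> 0, 2.0 -> 1, 3.0 -> 2, 4.0 -> 3)   // value -> index
    def oneHot(value: Double): Array[Double] = {
      val arr = Array.ofDim[Double](weatherMap.size)
      arr(weatherMap(value)) = 1.0                                 // set only the matching index to 1
      arr
    }
    oneHot(2.0)   // Array(0.0, 1.0, 0.0, 0.0)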
  117. 117. 117 Model A. Scala B. Package Scala Object C. data Folder D. Library Model
  118. 118. ! ScalaIDE Scala folder package Object ◦ Classification ( ) ● src ● bike (package ) ● BikeShareClassificationLG (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 118
  119. 119. 119 Model Model A. import B. main Driver Program C. Log D. SparkContext
  120. 120. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Logistic Regression library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.feature.StandardScaler import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel } object BikeShareClassificationLG { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationLG").setMaster("local[*]")) } } ! import the Logistic Regression Library ! create sc in the Driver Program ◦ appName - the Driver Program's name ◦ master - the master URL 120
  121. 121. 121 Model Model ! (Class) ◦ BikeSummary ! prepare ◦ input file Features Label RDD[LabeledPoint] ◦ RDD[LabeledPoint] ! getFeatures ◦ Model feature ! getCategoryFeature ◦ 1-of-k encode Array[Double]
  122. 122. One-Of-K def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = { … val bikeData = lines.map{ x => new BikeShareEntity(x) } //RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.getField("weathersit") }.distinct().collect().zipWithIndex.toMap // build the value-to-index Map val lpData=bikeData.map { x => { val label = x.getLabel() val features = Vectors.dense(x.getFeatures(weatherMap)) new LabeledPoint(label, features) // a LabeledPoint is a label plus a feature Vector } } … } def getFeatures(weatherMap: Map[Double, Int]) = { var rtnArr: Array[Double] = Array() var weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0) val index = weatherMap(getField("weathersit")) //weathersit=2; index=1 weatherArray(index) = 1 //weatherArray=Array(0,1,0,0) rtnArr = rtnArr ++ weatherArray …. }
  123. 123. ! Standardization rescales each feature to zero mean / unit variance ! use StandardScaler def prepare(sc): RDD[LabeledPoint] = { … val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) } // fit the StandardScaler on the whole feature RDD val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd) val lpData2= bikeData.map { x => { val label = x.getLabel() // standardize the features before building the LabeledPoint val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap))) new LabeledPoint(label, features) } } …
  124. 124. prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.weathersit }.distinct().collect().zipWithIndex.toMap // Index Map //Standardize val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))} val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap) // Category feature val lpData = bikeData.map { x => { val label = if (x.cnt > 200) 1 else 0 // 200 1 0 val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))) new LabeledPoint(label, features) }} // 6:4 val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  125. 125. 125 getFeatures getFeatures方法 def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int], mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int], weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int], weatherMap: Map[Double, Int]): Array[Double] = { var featureArr: Array[Double] = Array() // featureArr ++= getCategoryFeature(bikeData.yr, yrMap) featureArr ++= getCategoryFeature(bikeData.season, seasonMap) featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap) featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap) featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap) featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap) featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap) featureArr ++= getCategoryFeature(bikeData.hr, hrMap) // featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr }
  126. 126. 126 getCategoryFeature getCategoryFeature方法 def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = { var featureArray = Array.ofDim[Double](categoryMap.size) val index = categoryMap(fieldVal) featureArray(index) = 1 featureArray }
  127. 127. 127 Model Model ! trainModel ◦ DecisionTree.trainClassifier Model ! evaluateModel ◦ AUC trainModel Model
  128. 128. 128 trainModel evaluateModel def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): (LogisticRegressionModel, Double) = { val startTime = new DateTime() // LogisticRegressionWithSGD.train val model = LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) val endTime = new DateTime() val duration = new Duration(startTime, endTime) //MyLogger.debug(model.toPMML()) // model debug (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: LogisticRegressionModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) // RDD[( )] AUC } val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auc = metrics.areaUnderROC()// areaUnderROC auc auc }
  129. 129. 129 Model Model! tuneParameter ◦ iteration stepSize miniBatchFraction trainModel evaluateModel AUC
  130. 130. tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = { val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100) val stepSizeArr: Array[Double] = Array(10, 50, 100, 200) val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1) val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr; miniBatchFraction <- miniBatchFractionArr) yield { // train a model and compute its AUC for every combination val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction) val auc = evaluateModel(validateData, model) println("parameter: iteration=%d, stepSize=%f, batchFraction=%f, auc=%f".format(iteration, stepSize, miniBatchFraction, auc)) (iteration, stepSize, miniBatchFraction, auc) } val bestEval = (evalArr.sortBy(_._4).reverse)(0) // keep the combination with the highest AUC println("best parameter: iteration=%d, stepSize=%f, batchFraction=%f, auc=%f".format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  131. 131. A. ◦ BikeShareClassificationLG.scala Scala ◦ hour.csv data ◦ BikeShareClassificationLG ( TODO B. feature ◦ BikeShareClassificationLG ● category AUC ● feature ( |correlation| > 0.1 ) Model AUC 131 Logistic Regression ============== tuning parameters(Category) ================== parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.500000, auc=0.857073 parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.800000, auc=0.855904 parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=1.000000, auc=0.855685 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.500000, auc=0.852388 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.800000, auc=0.852901 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=1.000000, auc=0.853237 parameter: iteraion=5, stepSize=100.000000, miniBatchFraction=0.500000, auc=0.852087
  132. 132. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 132 Outline
  133. 133. ! ! ! ◦ (Least Squares) Lasso (ridge regression) 133
  134. 134. ! import org.apache.spark.mllib.tree.DecisionTree ! import org.apache.spark.mllib.tree.model.DecisionTreeModel ! DecisionTree.trainRegressor Model(DecisionTreeModel ◦ val model=DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity, maxDepth, maxBins) ● trainData RDD[LabeledPoint] ● categoricalFeaturesInfo trainData categorical Map[ Index, ] continuous ● Map(0->2,4->10) 1,5 categorical 2,10 ● impurity ( variance) ● maxDepth overfit ● maxBins ● categoricalFeaturesInfo maxBins categoricalFeaturesInfo 134 Decision Tree Regression in Spark
  135. 135. ! Model ◦ Features yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed ◦ Label cnt ◦ impurity gini ◦ maxDepth 5 ◦ maxBins 30 135
  136. 136. 136 Model A. Scala B. Package Scala Object C. data Folder D. Library Model
  137. 137. ! ScalaIDE Scala folder package Object ◦ Regression ( ) ● src ● bike (package ) ● BikeShareRegressionDT (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 137
  138. 138. 138 Model Model A. import B. main Driver Program C. Log D. SparkContext
  139. 139. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import decision tree library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareRegressionDT { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionDT").setMaster("local[*]")) } } ! import the Decision Tree Library ! create sc in the Driver Program ◦ appName - the Driver Program's name ◦ master - the master URL 139
  140. 140. 140 Model Model ! (Class) ◦ BikeSummary ! prepare ◦ input file Features Label RDD[LabeledPoint] ◦ RDD[LabeledPoint] ! getFeatures ◦ Model feature ! getCategoryInfo ◦ categroyInfoMap
  141. 141. prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData = lines.map{ x => BikeShareEntity(⋯) } //RDD[BikeShareEntity] val lpData=bikeData.map { x => { val label = x.cnt // the prediction target is the rental count column val features = Vectors.dense(getFeatures(x)) new LabeledPoint(label, features) // a LabeledPoint is a label plus a feature Vector } } // randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  142. 142. 142 getFeatures getCategoryInfo getFeatures方法 def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1, bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr } // season Feature 1 getCategoryInfo方法 def getCategoryInfo(): Map[Int, Int]= { val categoryInfoMap = Map[Int, Int]( (/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4)) categoryInfoMap } //( featureArr index, distinct )
  143. 143. 143 Model Model ! trainModel ◦ DecisionTree.trainRegressor Model ! evaluateModel ◦ RMSE trainModel Model
  144. 144. ! RMSE (root-mean-square error, also called root-mean-square deviation) measures the difference between predicted and observed values ! it is the square root of the mean of the squared errors, so it has the same units as the label, much like a sample standard deviation of the residuals ! smaller is better 144 RMSE (root-mean-square error)
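A minimal worked RMSE sketch (my own illustration; the prediction/label pairs are made up):

    import scala.math.sqrt
    val predAndLabel = Seq((180.0, 200.0), (95.0, 100.0), (40.0, 30.0))
    val mse  = predAndLabel.map { case (p, l) => (p - l) * (p - l) }.sum / predAndLabel.size
    val rmse = sqrt(mse)   // square root of the mean squared error, in the same units as the label (about 13.2 here)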
  145. 145. 145 trainModel evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int, cateInfo: Map[Int,Int]): (DecisionTreeModel, Double) = { val startTime = new DateTime() // val model = DecisionTree.trainRegressor(trainData, cateInfo, impurity, maxDepth, maxBins) // Model val endTime = new DateTime() // val duration = new Duration(startTime, endTime) // //MyLogger.debug(model.toDebugString) // Decision Tree (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) // RDD[( )] RMSE } val metrics = new RegressionMetrics(scoreAndLabels) val rmse = metrics.rootMeanSquaredError()// rootMeanSquaredError rmse rmse }
  146. 146. 146 Model Model! tuneParameter ◦ Max Depth Max Bin trainModel evaluateModel RMSE
  147. 147. tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = { val impurityArr = Array("variance") val depthArr = Array(3, 5, 10, 15, 20, 25) val binsArr = Array(50, 100, 200) val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield { // train a model and compute its RMSE for every combination val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo) val rmse = evaluateModel(validateData, model) println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f".format(impurity, maxDepth, maxBins, rmse)) (impurity, maxDepth, maxBins, rmse) } val bestEvalAsc = (evalArr.sortBy(_._4)) val bestEval = bestEvalAsc(0) // the lowest RMSE is best println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f".format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  148. 148. A. ◦ BikeShareRegressionDT.scala Scala ◦ hour.csv data ◦ BikeShareRegressionDT.scala ( TODO B. feature ◦ BikeShareRegressionDT ● feature dayType(Double ) dayType ● holiday=0 workingday=0 dataType=0 ● holiday=1 dataType=1 ● holiday=0 workingday=1 dataType=2 ● dayType feature Model( getFeatures getCategoryInfo) ◦ Categorical Info 148 Decision Tree ============== tuning parameters(CateInfo) ================== parameter: impurity=variance, maxDepth=3, maxBins=50, rmse=118.424606 parameter: impurity=variance, maxDepth=3, maxBins=100, rmse=118.424606 parameter: impurity=variance, maxDepth=3, maxBins=200, rmse=118.424606 parameter: impurity=variance, maxDepth=5, maxBins=50, rmse=93.138794 parameter: impurity=variance, maxDepth=5, maxBins=100, rmse=93.138794 parameter: impurity=variance, maxDepth=5, maxBins=200, rmse=93.138794
  149. 149. ! Least Squares ! 149
  150. 150. ! import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LinearRegressionModel} ! LinearRegressionWithSGD.train(trainData, numIterations, stepSize) returns a Model (LinearRegressionModel) ◦ val model=LinearRegressionWithSGD.train(trainData, numIterations, stepSize) ● trainData: the training data, RDD[LabeledPoint] ● numIterations: the number of gradient descent (SGD) iterations ● stepSize: the SGD step size, default 1 ● miniBatchFraction: the fraction of data used per iteration, between 0 and 1, default 1 150 Least Squares Regression in Spark
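Since the following slides reuse the LabeledPoint preparation from the classification lab, here is a hedged sketch of what trainModel and evaluateModel could look like for LinearRegressionWithSGD (my own sketch mirroring the decision-tree version; the parameter values are illustrative, not tuned):

    import org.apache.spark.mllib.regression.{ LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel }
    import org.apache.spark.mllib.evaluation.RegressionMetrics
    import org.apache.spark.rdd.RDD

    def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double): LinearRegressionModel =
      LinearRegressionWithSGD.train(trainData, numIterations, stepSize)   // returns a LinearRegressionModel

    def evaluateModel(validateData: RDD[LabeledPoint], model: LinearRegressionModel): Double = {
      val predAndLabel = validateData.map { data => (model.predict(data.features), data.label) }
      new RegressionMetrics(predAndLabel).rootMeanSquaredError            // RMSE, as in the decision-tree lab
    }

    // val model = trainModel(trainData, 100, 0.1)
    // val rmse  = evaluateModel(validateData, model)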
  151. 151. 151 Model A. Scala B. Package Scala Object C. data Folder D. Library Model
  152. 152. ! ScalaIDE Scala folder package Object ◦ Regression ( ) ● src ● bike (package ) ● BikeShareRegressionLR (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 152
  153. 153. 153 Model Model A. import B. main Driver Program C. Log D. SparkContext
  154. 154. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import linear regression library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.feature.StandardScaler import org.apache.spark.mllib.regression.{ LinearRegressionWithSGD, LinearRegressionModel } object BikeShareRegressionLR { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionLR").setMaster("local[*]")) } } ! import the Linear Regression Library (note: LinearRegressionWithSGD lives in the regression package, not classification) ! create sc in the Driver Program ◦ appName - the Driver Program's name ◦ master - the master URL 154
155. 155. 155 Prepare the data for the Model
! Entity class
◦ BikeSummary
! prepare
◦ reads the input file and turns Features and Label into an RDD[LabeledPoint]
◦ splits the RDD[LabeledPoint] into training and validation sets
! getFeatures
◦ builds the feature array the Model is trained on
! getCategoryFeature
◦ 1-of-k encodes a categorical field into an Array[Double]
156. 156. One-Of-K encoding
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  …
  val bikeData = lines.map { x => new BikeShareEntity(x) } //RDD[BikeShareEntity]
  val weatherMap = bikeData.map { x => x.getField("weathersit") }
    .distinct().collect().zipWithIndex.toMap //build the value-to-index Map
  val lpData = bikeData.map { x => {
    val label = x.getLabel()
    val features = Vectors.dense(x.getFeatures(weatherMap))
    new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector
  }}
  …
}
def getFeatures(weatherMap: Map[Double, Int]) = {
  var rtnArr: Array[Double] = Array()
  val weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0)
  val index = weatherMap(getField("weathersit")) //weathersit=2; index=1
  weatherArray(index) = 1 //weatherArray=Array(0,1,0,0)
  rtnArr = rtnArr ++ weatherArray
  ….
}
157. 157. ! Standardize the features: rescale each feature to zero mean and unit variance ((value - mean) / standard deviation)
! Use StandardScaler
def prepare(sc: SparkContext): RDD[LabeledPoint] = {
  …
  val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) }
  //fit the StandardScaler on the whole feature RDD
  val stdScaler = new StandardScaler(withMean = true, withStd = true).fit(featureRdd)
  val lpData2 = bikeData.map { x => {
    val label = x.getLabel()
    //standardize the features before building the LabeledPoint
    val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap)))
    new LabeledPoint(label, features)
  }}
  …
158. 158. The complete prepare method
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  …
  val bikeData = lines.map { x => new BikeShareEntity(x) } //RDD[BikeShareEntity]
  val weatherMap = bikeData.map { x => x.weathersit }.distinct().collect().zipWithIndex.toMap //value-to-index Map (one per categorical field)
  //Standardize: fit the scaler on the one-of-k encoded feature vectors
  val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)) }
  val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap)
  //build LabeledPoints with the encoded, standardized Category features
  val lpData = bikeData.map { x => {
    val label = x.cnt //the rental count is the label
    val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)))
    new LabeledPoint(label, features)
  }}
  //split 6:4 into training and validation sets
  val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
  (trainData, validateData)
}
159. 159. 159 getFeatures
The getFeatures method
def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int],
    mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int],
    weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int], weatherMap: Map[Double, Int]): Array[Double] = {
  var featureArr: Array[Double] = Array()
  //1-of-k encode each categorical field and append it to the feature array
  featureArr ++= getCategoryFeature(bikeData.yr, yrMap)
  featureArr ++= getCategoryFeature(bikeData.season, seasonMap)
  featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap)
  featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap)
  featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap)
  featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap)
  featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap)
  featureArr ++= getCategoryFeature(bikeData.hr, hrMap)
  //append the continuous (numeric) features as-is
  featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed)
  featureArr
}
160. 160. 160 getCategoryFeature
The getCategoryFeature method
def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = {
  val featureArray = Array.ofDim[Double](categoryMap.size) //all zeros
  val index = categoryMap(fieldVal)                        //index of this category value
  featureArray(index) = 1                                  //set the matching position to 1
  featureArray
}
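A small usage sketch of the method above; the concrete weatherMap here is hypothetical, but it matches the weathersit=2 example shown on slide 156:
// weathersit values 1~4 mapped to indexes 0~3 (hypothetical map, normally built by zipWithIndex)
val weatherMap = Map(1.0 -> 0, 2.0 -> 1, 3.0 -> 2, 4.0 -> 3)

getCategoryFeature(2.0, weatherMap)   // returns Array(0.0, 1.0, 0.0, 0.0)
getCategoryFeature(4.0, weatherMap)   // returns Array(0.0, 0.0, 0.0, 1.0)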
161. 161. 161 Train and evaluate the Model
! trainModel
◦ uses LinearRegressionWithSGD.train to train the Model
! evaluateModel
◦ computes the RMSE of the Model returned by trainModel on the validation data
162. 162. 162 trainModel / evaluateModel
def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double,
               miniBatchFraction: Double): (LinearRegressionModel, Double) = {
  val startTime = new DateTime()   // record the start time (joda-time)
  // train the Model with LinearRegressionWithSGD.train
  val model = LinearRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  //MyLogger.debug(model.toPMML()) // dump the model for debugging
  (model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: LinearRegressionModel): Double = {
  val scoreAndLabels = validateData.map { data =>
    val predict = model.predict(data.features)
    (predict, data.label)          // RDD[(prediction, label)] used to compute the RMSE
  }
  val metrics = new RegressionMetrics(scoreAndLabels)
  val rmse = metrics.rootMeanSquaredError   // rootMeanSquaredError gives the RMSE
  rmse
}
163. 163. 163 Tune the Model parameters! tuneParameter
◦ loops over combinations of iteration count, stepSize and miniBatchFraction, calls trainModel and evaluateModel for each one, and keeps the combination with the lowest RMSE
164. 164. 164 tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = {
  val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100)
  val stepSizeArr: Array[Double] = Array(10, 50, 100, 200)
  val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1)
  val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr; miniBatchFraction <- miniBatchFractionArr) yield {
    // train a Model for each parameter combination and evaluate its RMSE
    val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction)
    val rmse = evaluateModel(validateData, model)
    println("parameter: iteration=%d, stepSize=%f, miniBatchFraction=%f, rmse=%f"
      .format(iteration, stepSize, miniBatchFraction, rmse))
    (iteration, stepSize, miniBatchFraction, rmse)
  }
  val bestEvalAsc = evalArr.sortBy(_._4)
  val bestEval = bestEvalAsc(0)   // the combination with the smallest RMSE
  println("best parameter: iteration=%d, stepSize=%f, miniBatchFraction=%f, rmse=%f"
    .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
165. 165. A. Hands-on
◦ Copy BikeShareRegressionLR.scala into the Scala project
◦ Put hour.csv under the data folder
◦ Complete the TODO parts of BikeShareRegressionLR.scala
B. Add a new feature
◦ In BikeShareRegressionLR, add a dayType feature (Double):
● holiday=0 and workingday=0 → dayType=0
● holiday=1 → dayType=1
● holiday=0 and workingday=1 → dayType=2
● add dayType to the feature set and retrain the Model (modify getFeatures / getCategoryFeature)
165 Linear Regression ============== tuning parameters(Category) ==================
parameter: iteration=5, stepSize=0.010000, miniBatchFraction=0.500000, rmse=256.370620
parameter: iteration=5, stepSize=0.010000, miniBatchFraction=0.800000, rmse=256.376770
parameter: iteration=5, stepSize=0.010000, miniBatchFraction=1.000000, rmse=256.407185
parameter: iteration=5, stepSize=0.025000, miniBatchFraction=0.500000, rmse=250.037095
parameter: iteration=5, stepSize=0.025000, miniBatchFraction=0.800000, rmse=250.062817
parameter: iteration=5, stepSize=0.025000, miniBatchFraction=1.000000, rmse=250.126173
166. 166. ! Random Forest is an ensemble method that trains a multitude of Decision Trees
◦ for classification, the prediction is the mode of the individual trees' outputs
◦ for regression, the prediction is the mean of the individual trees' outputs
! Advantages
◦ less prone to overfitting than a single decision tree
◦ tolerant of noisy data and missing values
166 Random Forest (RandomForest)
167. 167. ! import org.apache.spark.mllib.tree.RandomForest
! import org.apache.spark.mllib.tree.model.RandomForestModel
! RandomForest.trainRegressor returns the trained Model (a RandomForestModel)
◦ val model = RandomForest.trainRegressor(trainData, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
● trainData: the training data, an RDD[LabeledPoint]
● categoricalFeaturesInfo: marks which features of trainData are categorical, as a Map[feature index, number of categories]; features not in the map are treated as continuous
● e.g. Map(0->2, 4->10) means the 1st and 5th features are categorical, with 2 and 10 categories respectively
● numTrees: number of trees in the forest (more trees give a more stable Model but take longer to train)
● impurity: the impurity measure (regression uses variance)
● featureSubsetStrategy: how many features to consider at each split (auto is recommended)
● maxDepth: maximum depth of each tree
● a deeper tree can overfit
● and takes longer to train
● maxBins: maximum number of bins used when splitting features
● maxBins must be at least the largest number of categories declared in categoricalFeaturesInfo
167 Random Forest Regression in Spark
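A small sketch of the call with concrete, illustrative parameter values; trainData is assumed to be an RDD[LabeledPoint] prepared earlier, and the categorical map below is hypothetical:
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

// hypothetical: feature 0 has 2 categories, feature 4 has 10 categories; all others are continuous
val categoricalFeaturesInfo = Map(0 -> 2, 4 -> 10)
val numTrees = 3
val featureSubsetStrategy = "auto"   // let Spark pick how many features each split considers
val impurity = "variance"            // regression trees split on variance
val maxDepth = 10
val maxBins = 100                    // must be >= the largest category count above

val model: RandomForestModel = RandomForest.trainRegressor(
  trainData, categoricalFeaturesInfo, numTrees,
  featureSubsetStrategy, impurity, maxDepth, maxBins)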
168. 168. 168 trainModel / evaluateModel
def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (RandomForestModel, Double) = {
  val startTime = new DateTime()   // record the start time (joda-time)
  val cateInfo = BikeShareEntity.getCategoryInfo(true) // categoricalFeaturesInfo
  val model = RandomForest.trainRegressor(trainData, cateInfo, 3, "auto", impurity, maxDepth, maxBins) // train a Model with 3 trees
  val endTime = new DateTime()     // record the end time
  val duration = new Duration(startTime, endTime) // training duration
  //MyLogger.debug(model.toDebugString)           // print the trees in the forest
  (model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: RandomForestModel): Double = {
  val scoreAndLabels = validateData.map { data =>
    val predict = model.predict(data.features)
    (predict, data.label)          // RDD[(prediction, label)] used to compute the RMSE
  }
  val metrics = new RegressionMetrics(scoreAndLabels)
  val rmse = metrics.rootMeanSquaredError   // rootMeanSquaredError gives the RMSE
  rmse
}
169. 169. ! [Exercise]
◦ Import Regression.zip (Package, Object, data and Build Path) into Scala IDE
◦ Complete BikeShareRegressionRF
◦ Compare the RandomForest results with the Decision Tree results
169 RandomForest Regression
170. 170. ! Introduction to Spark ! RDD (Resilient Distributed Datasets) ! Scala ! Spark MLlib ! Case sharing: Etu & udn Hadoop Competition 170 Outline
  171. 171. ! Etu & udn Hadoop Competition 2016 
! A big-data competition held by Etu and udn, analyzing udn e-commerce behaviour data (together with Open Data)

172. 172. ! EHC task: using the browsing (View), order (Order) and member (Member) data from 2015/6 ~ 2015/10, predict which store (Storeid) and product category (Cat_id1, Cat_id2) combinations each member will order in 2015/11 172
173. 173. 1) Turn the raw Data into Feature + Label LabeledPoint Data
◦ Features: View/Order records from June to September
◦ Label: whether an Order occurred in October (1 if ordered, 0 otherwise)
◦ Features also include member attributes, weather data, …
2) Split the LabeledPoint Data into a Training Set and a Validating Set (6:4 Split)
3) Train and tune the Machine Learning Model with the Training Set and Validating Set
4) Build the Testing Set
◦ Features: View/Order records from June to October
◦ Features: same definitions as in 1)
5) Predict the Testing Set with the Model from 3)
6) Output the prediction results
7) Iterate over 1) ~ 6)
173
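A rough sketch of steps 1) and 2) under assumed inputs; featuresByKey and octOrderKeys are hypothetical names (the real feature pipeline is summarized on the next slides), and sc is the SparkContext:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// hypothetical toy inputs; in the real flow these come from the Jun-Sep View/Order logs
val featuresByKey = sc.parallelize(Seq(
  "u1-s1-c1-c2" -> Array(3.0, 1.0, 2.0),
  "u2-s9-c3-c7" -> Array(0.0, 5.0, 1.0)
))
val octOrderKeys = Set("u1-s1-c1-c2")   // keys that actually placed an order in October

// 1) attach the 1/0 label to every key
val labeledData = featuresByKey.map { case (key, features) =>
  val label = if (octOrderKeys.contains(key)) 1.0 else 0.0
  LabeledPoint(label, Vectors.dense(features))
}

// 2) split 6:4 into Training Set and Validating Set
val Array(trainSet, validateSet) = labeledData.randomSplit(Array(0.6, 0.4))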
174. 174. – Features(I)
! View/Order records are aggregated per uid-storeid-cat_id1-cat_id2 key to build the Features
! RFM-style features computed from the June-September View/Order records
◦ View – viewRecent, viewCnt, viewLast1MCnt, viewLast2MCnt (most recent view, total views in June-September, views in the last month, views in the last two months)
◦ Order – orderRecent, orderCnt, orderLast1MCnt, orderLast2MCnt (most recent order, total orders in June-September, orders in the last month, orders in the last two months)
◦ avgDaySpan, avgViewCnt, lastViewDate, lastViewCnt (average span in days between views, average view count, last view date, last view count)
174
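For illustration only, a minimal sketch of how counters like these could be aggregated with an RDD; viewLog and its layout are assumptions, not the competition code:
// hypothetical view log: (uid-storeid-cat_id1-cat_id2 key, view date "yyyyMMdd")
val viewLog = sc.parallelize(Seq(
  ("u1-s1-c1-c2", "20150601"), ("u1-s1-c1-c2", "20150712"), ("u2-s9-c3-c7", "20150903")
))

// viewCnt: total number of views per key over June-September
val viewCnt = viewLog.mapValues(_ => 1).reduceByKey(_ + _)

// lastViewDate-style feature: the latest view date per key
val lastViewDate = viewLog.reduceByKey((a, b) => if (a > b) a else b)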
175. 175. – Features(II)
! Member attributes
◦ gender, ageScore, cityScore (gender, age range encoded as a score, city encoded as a score)
◦ ageScore: 1~11
● EX: if (ages.equals("20...")) ageScore = 1
◦ cityScore: 1~24
● EX: if (livecity.equals("...")) cityScore = 24
! Missing Values
◦ when the field value is "N", a default is used
● Gender: 2 (unknown)
● Ages: 35-39
● City:
175
176. 176. – Features(III)
! Weather data (Open Data)
◦ http://www.cwb.gov.tw/V7/climate/monthlyData/mD.htm
◦ monthly weather data for June to October
◦ processed file: https://drive.google.com/file/d/0B-b4FvCO9SYoN2VBSVNjN3F3a0U/view?usp=sharing
! In total 35 Features per uid-storeid-cat_id1-cat_id2 key
176
177. 177. 177 – LabeledPoint Data
! Numeric features are bucketed: Sort the raw values, split them into N groups, and Encode each group as a level
! EX: viewCnt (5-level Encoding) – the cut points 7, 3, 2, 1 separate the five buckets viewCnt=5, viewCnt=4, viewCnt=3, viewCnt=2, viewCnt=1
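A sketch of the encoding under one reading of that example, where the cut points 7/3/2/1 are treated as lower bounds of the buckets (the real cut points would come from sorting the actual data):
// map a raw view count to a 5-level score, using the slide's example cut points
def encodeViewCnt(raw: Double): Double = raw match {
  case v if v >= 7 => 5.0
  case v if v >= 3 => 4.0
  case v if v >= 2 => 3.0
  case v if v >= 1 => 2.0
  case _           => 1.0
}

encodeViewCnt(10)  // 5.0
encodeViewCnt(1)   // 2.0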
178. 178. – Machine Learning(I)
! Xgboost (Extreme Gradient Boosting) was used as the model
◦ Input: LabeledPoint Data (the Training Set)
● 35 Features
● 1/0 Label (Label=1 if the key was ordered, 0 otherwise)
◦ Parameters:
● max_depth: maximum depth of each Tree
● nround: number of boosting rounds
● objective: binary:logistic (binary classification)
◦ Implement:
178
val param = List("objective" -> "binary:logistic", "max_depth" -> 6)
val model = XGBoost.train(trainSet, param, nround, 2, null, null)
179. 179. – Machine Learning(II)
! Xgboost
◦ Evaluate (with the Validating Set):
● val predictRes = model.predict(validateSet)
● evaluate the predictions with the F_measure
◦ Parameter Tuning:
● grid search over max_depth=(5~10) and nround=(10~25)
● best result: max_depth=6, nround=10
179
Precision = 0.16669166166766647
F1 measure = 0.15969926394341
Accuracy = 0.15065655700028824
Micro recall = 0.21370309951060
Micro precision = 0.3715258082813
Micro F1 measure = 0.271333885
180. 180. – Machine Learning(III)
! Performance Improvement
◦ drop the N least useful Features from the model to reduce the feature count
180
Execution time: 90000ms -> 72000ms (local mode)
181. 181. 181 spark-submit
! Submit the job to the yarn resource manager
◦ spark-submit dispatches the JOB to the Workers in the cluster
spark-submit --class ehc.RecommandV4 --deploy-mode cluster --master yarn ehcFinalV4.jar
! Do not hard-code the master URL when creating the SparkContext
new SparkContext(new SparkConf().setAppName("ehcFinal051").setMaster("local[4]"))
➔ remove the setMaster call (let spark-submit decide the master)
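A small sketch of that change (the appName and the local[4] fallback are just examples): leave the master unset in code so the --master flag passed to spark-submit takes effect, and only set it explicitly for local IDE runs:
import org.apache.spark.{SparkConf, SparkContext}

// packaged for spark-submit: no setMaster, the --master yarn flag decides where it runs
val conf = new SparkConf().setAppName("ehcFinal051")
// local IDE debugging only:
// val conf = new SparkConf().setAppName("ehcFinal051").setMaster("local[4]")
val sc = new SparkContext(conf)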
182. 182. 182 Spark-submit Run Script Sample
###### Script: submit the Driver Program to Spark (yarn resource manager) via spark-submit ######
###### for linux-like system #########
# delete output on hdfs first
`hadoop fs -rm -R -f /user/team007/data/output`
# submit spark job
echo -e "processing spark job"
spark-submit --deploy-mode cluster --master yarn --jars lib/jcommon-1.0.23.jar,lib/joda-time-2.2.jar --class ehc.RecommandV4 ehcFinalV4.jar Y
# write to result_yyyyMMddHHmmss.txt
echo -e "write to outFile"
hadoop fs -cat /user/team007/data/output/part-* > 'result_'`date +%Y%m%d%H%M%S`'.txt'
183. 183. 183 – Lessons learned: Features ! choosing good Features had the biggest impact on the score ! the Feature set was refined iteratively
184. 184. – Lessons learned: implementation
! Input data was processed on a Single Node
◦ results from the Workers are merged
◦ and sorted by uid-storeid-cat_id1-cat_id2
! F-Measure
◦ used to evaluate the Model locally before submission
◦ computed with Spark's MultilabelMetrics
184
val scoreAndLabels: RDD[(Array[Double], Array[Double])] = …
val metrics = new MultilabelMetrics(scoreAndLabels)
println(s"F1 measure = ${metrics.f1Measure}")
185. 185. 185 Summary ! Spark MLlib makes it straightforward to train and tune models ◦ but Feature Engineering still determines most of the quality ! Keep exploring the rest of Spark MLlib 185
  186. 186. 186
