
Introduction to Spark

Introduction to Spark, DataFrames, and machine learning


  1. 1. © 2014 MapR Technologies 1© 2014 MapR Technologies An Overview of Apache Spark
  2. 2. © 2014 MapR Technologies 2 Find my presentation and other related resources here: http://events.mapr.com/JacksonvilleJava (you can find this link in the event’s page at meetup.com) Today’s Presentation Whiteboard & demo videos Free On-Demand Training Free eBooks Free Hadoop Sandbox And more…
  3. 3. © 2014 MapR Technologies 3 Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Examples and Resources
  4. 4. © 2014 MapR Technologies 4© 2014 MapR Technologies MapReduce Refresher
  5. 5. © 2014 MapR Technologies 5 MapReduce: A Programming Model • MapReduce: Simplified Data Processing on Large Clusters (published 2004) • Parallel and Distributed Algorithm: • Data Locality • Fault Tolerance • Linear Scalability
  6. 6. © 2014 MapR Technologies 6 The Hadoop Strategy http://developer.yahoo.com/hadoop/tutorial/module4.html Distribute data (share nothing) Distribute computation (parallelization without synchronization) Tolerate failures (no single point of failure) Node 1 Mapping process Node 2 Mapping process Node 3 Mapping process Node 1 Reducing process Node 2 Reducing process Node 3 Reducing process
  7. 7. © 2014 MapR Technologies 7 Chunks are replicated across the cluster Distribute Data: HDFS User process NameNode . . . network HDFS splits large data files into chunks (64 MB) metadata access physical data access Location metadata DataNodes store & retrieve data data
  8. 8. © 2014 MapR Technologies 8 Distribute Computation MapReduce Program Data Sources Hadoop Cluster Result
  9. 9. © 2014 MapR Technologies 9 MapReduce Execution and Data Flow • Map – Loads data and emits key/value pairs • Shuffle – Sorts and collects key-based data • Reduce – Receives (key, list) pairs to process and output (Diagram: input splits from HDFS are read by RecordReaders into (k, v) pairs, mapped on each node, partitioned and exchanged across nodes in the shuffle, sorted, reduced, and the final (k, v) pairs written back to HDFS via OutputFormats.)
  10. 10. © 2014 MapR Technologies 10 MapReduce Example: Word Count Output "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax the, 1 time, 1 has, 1 come, 1 … and, 1 … and, 1 … and, [1, 1, 1] come, [1,1,1] has, [1,1] the, [1,1,1] time, [1,1,1,1] … and, 3 come, 3 has, 2 the, 3 time, 4 … Input Map Shuffle and Sort Reduce Output Reduce
  11. 11. © 2014 MapR Technologies 11 Tolerate Failures Hadoop Cluster Failures are expected & managed gracefully DataNode fails -> name node will locate replica MapReduce task fails -> job tracker will schedule another one Data
  12. 12. © 2014 MapR Technologies 12 MapReduce Processing Model • For complex work, chain jobs together – Use a higher level language or DSL that does this for you
  13. 13. © 2014 MapR Technologies 13 Typical MapReduce Workflows Input to Job 1 SequenceFile Last Job Maps Reduces SequenceFile Job 1 Maps Reduces SequenceFile Job 2 Maps Reduces Output from Job 1 Output from Job 2 Input to last job Output from last job HDFS Iteration is slow because it writes/reads data to disk
  14. 14. © 2014 MapR Technologies 14 Iterations Step Step Step Step Step In-memory Caching • Data Partitions read from RAM instead of disk
  15. 15. © 2014 MapR Technologies 15 MapReduce Design Patterns • Summarization – Inverted index, counting • Filtering – Top ten, distinct • Aggregation • Data Organization – Partitioning • Join – Join data sets • Metapattern – Job chaining
  16. 16. © 2014 MapR Technologies 16 Inverted Index Example come, (alice.txt) do, (macbeth.txt) has, (alice.txt) time, (alice.txt, macbeth.txt) . . . "The time has come," the Walrus said alice.txt tis time to do it macbeth.txt time, alice.txt has, alice.txt come, alice.txt .. tis, macbeth.txt time, macbeth.txt do, macbeth.txt …
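A rough sketch of the same inverted-index pattern written with the Spark API that the rest of this deck introduces (the file names and the SparkContext sc are assumptions, not part of the slide):

    // Build (word, file) pairs from each input file, then group the files per word.
    val files = Seq("alice.txt", "macbeth.txt")
    val wordToFiles = sc.union(files.map { file =>
      sc.textFile(file)
        .flatMap(line => line.split("\\W+"))
        .filter(_.nonEmpty)
        .map(word => (word.toLowerCase, file))
    })
    // e.g. (time, Set(alice.txt, macbeth.txt))
    val invertedIndex = wordToFiles.distinct().groupByKey().mapValues(_.toSet)
    invertedIndex.take(5).foreach(println)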
  17. 17. © 2014 MapR Technologies 20 MapReduce: The Good • Optimized to read and write large data files quickly – Process data on the node where it lives – Minimize disk head movement • Scalable • Built-in fault tolerance • Developer focuses on Map/Reduce, not infrastructure • Simple API?
  18. 18. © 2014 MapR Technologies 21 MapReduce: The Bad • Optimized for disk IO – Doesn’t leverage memory well – Iterative algorithms go through disk IO path again and again • Primitive API – simple abstraction – Key/Value in/out
  19. 19. © 2014 MapR Technologies 22 Free Hadoop MapReduce On Demand Training • https://www.mapr.com/services/mapr-academy/big-data-hadoop- online-training
  20. 20. © 2014 MapR Technologies 23 What is Hive? • Data warehouse on top of Hadoop – Analytical querying of data without programming • SQL-like execution for Hadoop – SQL evaluates to MapReduce code – Submits jobs to your cluster
  21. 21. © 2014 MapR Technologies 24 Hive Architecture (Diagram: the command line interface, web interface, and JDBC/ODBC clients reach the Hive driver (compiler, optimizer, executor) via the Thrift server; the driver submits MapReduce jobs to Hadoop (JobTracker, NameNode, DataNodes + TaskTrackers); schema metadata, such as the Hive table definition mapped to an HBase trades_tall table, is stored in the Hive metastore.)
  22. 22. © 2014 MapR Technologies 26 Hive HBase – External Table (Sample rows: key AMZN_986186008 with cf1:price 12.34, cf1:vol 1000; key AMZN_986186007 with cf1:price 12.00, cf1:vol 50.) SQL evaluates to MapReduce code: SELECT AVG(price) FROM trades WHERE key LIKE "AMZN"; Selection: the WHERE key LIKE clause; Projection: SELECT price; Aggregation: AVG(price)
  23. 23. © 2014 MapR Technologies 27 Hive HBase – Hive Query SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE "GOOG” ; HBase Tables Queries Parser Planner Execution
  24. 24. © 2014 MapR Technologies 28 Hive Query Plan • EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator predicate: (key like 'GOOG%') (type: boolean) Select Operator Group By Operator Reduce Operator Tree: Group By Operator Select Operator File Output Operator
  25. 25. © 2014 MapR Technologies 29 Hive Query Plan – (2) output hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; col0 Trades table group aggregations: avg(price) scan filter Select key like 'GOOG% Select price Group by map() map() map() reduce() reduce()
  26. 26. © 2014 MapR Technologies 32 What is a Directed Acyclic Graph (DAG)? • Graph – vertices (points) and edges (lines) • Directed – edges go in a single direction only • Acyclic – no looping
  27. 27. © 2014 MapR Technologies 33 Hive Query Plan – MapReduce Execution (Diagram: an unoptimized plan of three MapReduce jobs of table scans, reduce sinks, joins, and aggregations is optimized into a single job.)
  28. 28. © 2014 MapR Technologies 35 Free HBase On Demand Training (includes Hive and MapReduce with HBase) • https://www.mapr.com/services/mapr-academy/big-data-hadoop- online-training
  29. 29. © 2014 MapR Technologies 40© 2014 MapR Technologies Apache Spark
  30. 30. © 2014 MapR Technologies 42 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  31. 31. © 2014 MapR Technologies 43 Why Apache Spark? • Distributed parallel cluster computing – Scalable, fault tolerant • Programming framework – APIs in Java, Scala, Python – Less code than MapReduce • Fast – Runs computations in memory – Tasks are threads
  32. 32. © 2014 MapR Technologies 46 Spark Use Cases • Iterative Algorithms on large amounts of data • Some Example Use Cases: – Anomaly detection – Classification – Predictions – Recommendations
  33. 33. © 2014 MapR Technologies 47 Why Iterative Algorithms • Algorithms that need iterations – Clustering (K-Means, Canopy, …) – Gradient descent (e.g., Logistic Regression, Matrix Factorization) – Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality, …) – Alternating Least Squares (ALS) – Graph communities / dense sub-components – Inference (belief propagation) – …
  34. 34. © 2014 MapR Technologies 48 Data Sources • Local Files • S3 • Hadoop Distributed Filesystem – any Hadoop InputFormat • HBase • other NoSQL data stores
  35. 35. © 2014 MapR Technologies 51© 2014 MapR Technologies How Spark Works
  36. 36. © 2014 MapR Technologies 52 Spark Programming Model sc=new SparkContext rDD=sc.textfile(“hdfs://…”) rDD.map Driver Program SparkContext cluster Worker Node Task Task Task Worker Node
  37. 37. © 2014 MapR Technologies 53 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • read only collection of elements
  38. 38. © 2014 MapR Technologies 54 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • Read-only collection of elements • Operated on in parallel • Cached in memory – or on disk • Fault tolerant
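A minimal spark-shell sketch of those properties (the file path is a placeholder):

    // Read-only, partitioned collection created from a file.
    val lines = sc.textFile("hdfs:///path/to/data.txt")
    // One task per partition runs in parallel.
    println(lines.partitions.length)
    // Keep the partitions in memory; if a node fails, lost partitions are recomputed from lineage.
    lines.cache()
    println(lines.count())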
  39. 39. © 2014 MapR Technologies 55 Working With RDDs RDD textFile = sc.textFile("SomeFile.txt")
  40. 40. © 2014 MapR Technologies 56 Working With RDDs RDD RDD RDD RDD Transformations linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line) linesRDD = sc.textFile("LogFile.txt")
  41. 41. © 2014 MapR Technologies 57 Working With RDDs RDD RDD RDD RDD Transformations Action Value linesWithErrorRDD.count() 6 linesWithErrorRDD.first() # Error line linesRDD = sc.textFile("LogFile.txt") linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
  42. 42. © 2014 MapR Technologies 58 Example Spark Word Count in Scala (Diagram: the Walrus quotation flows through intermediate pairs like (the, 1), (time, 1), (and, 1) to the final counts the, 20; time, 4; and, 12.) // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) // Transform into pairs and count. val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} // Save the word count back out to a text file. counts.saveAsTextFile(outputFile)
  43. 43. © 2014 MapR Technologies 59 Example Spark Word Count in Scala 59 HadoopRDD textFile // Load input data. val input = sc.textFile(inputFile) RDD partitions MapPartitionsRDD
  44. 44. © 2014 MapR Technologies 60 Example Spark Word Count in Scala 60 // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) HadoopRDD textFile flatmap MapPartitionsRDD MapPartitionsRDD
  45. 45. © 2014 MapR Technologies 61 FlatMap flatMap(line => line.split(" ")) is a 1-to-many mapping (Diagram: the line "Ships and wax" becomes the elements Ships, and, wax in the RDD<String> wordsRDD.)
  46. 46. © 2014 MapR Technologies 62 Example Spark Word Count in Scala 62 textFile flatmap map val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) // Transform into pairs val counts = words.map(word => (word, 1)) HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD
  47. 47. © 2014 MapR Technologies 63 Map map(word => (word, 1)) is a 1-to-1 mapping (Diagram: the elements Ships, and, wax become the pairs (Ships, 1), (and, 1), (wax, 1) in the PairRDD<String, Integer> word1s.)
  48. 48. © 2014 MapR Technologies 64 Example Spark Word Count in Scala 64 textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} HadoopRDD MapPartitionsRDD MapPartitionsRDD ShuffledRDD MapPartitionsRDD
  49. 49. © 2014 MapR Technologies 65 reduceByKey reduceByKey{case (x, y) => x + y} PairRDD function: reduceByKey(func: (value, value) ⇒ value): RDD[(key, value)] (Diagram: the pairs (and, 1), (and, 1), (wax, 1) are combined per key, e.g. case (0, 1) => 0 + 1 then case (1, 1) => 1 + 1, giving (and, 2), (wax, 1) in the resulting PairRDD<String, Integer> counts.)
  50. 50. © 2014 MapR Technologies 66 Example Spark Word Count in Scala textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} val countArray = counts.collect() HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD collect ShuffledRDD Array
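Pulled together as a small standalone application (the object name and the way the SparkContext is built are illustrative assumptions, not from the slides):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val Array(inputFile, outputFile) = args
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val counts = sc.textFile(inputFile)       // load input data
          .flatMap(line => line.split(" "))       // split into words
          .map(word => (word, 1))                 // pair each word with 1
          .reduceByKey { case (x, y) => x + y }   // sum counts per word
        counts.saveAsTextFile(outputFile)         // write results
        sc.stop()
      }
    }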
  51. 51. © 2014 MapR Technologies 67 Demo Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
  52. 52. © 2014 MapR Technologies 68© 2014 MapR Technologies Components Of Execution
  53. 53. © 2014 MapR Technologies 69 Spark RDD DAG -> Physical Execution Plan (Diagram: the RDD graph sc.textFile(…) -> HadoopRDD -> flatMap -> map -> MapPartitionsRDDs -> reduceByKey -> ShuffledRDD -> collect is translated into a physical plan of Stage 1 and Stage 2, split at the shuffle.)
  54. 54. © 2014 MapR Technologies 70 Physical Execution Plan -> Stages and Tasks (Diagram: the DAG's Stage 1 and Stage 2 are split into task sets; the task scheduler sends each task to an executor thread on a worker node, where it processes one partition, reading HDFS blocks from the local DataNode through the block cache.)
  55. 55. © 2014 MapR Technologies 71 Summary of Components • Task : unit of execution • Stage: Group of Tasks – Base on partitions of RDD – Tasks run in parallel • DAG : Logical Graph of RDD operations • RDD : Parallel dataset with partitions 71
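One way to see tasks, stages, and the RDD graph from the shell is the lineage string; a hedged sketch (the path is a placeholder):

    val counts = sc.textFile("hdfs:///path/to/input.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    // Prints the chain of RDDs; the indentation marks the shuffle boundary where the DAG splits into stages.
    println(counts.toDebugString)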
  56. 56. © 2014 MapR Technologies 72 How a Spark Application Runs on a Hadoop Cluster (Diagram: the driver program on the client node creates a SparkContext (sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.map) and negotiates resources with the YARN ResourceManager (with ZooKeeper); YARN NodeManagers launch executors on the worker nodes, where tasks run as threads against cached partitions backed by local HDFS DataNode blocks.)
  57. 57. © 2014 MapR Technologies 73 Deploying Spark – Cluster Manager Types • Standalone mode • Mesos • YARN • EC2 • GCE
  58. 58. © 2014 MapR Technologies 74 RDD Transformations and Actions RDD RDD RDD RDDTransformations Action Value Transformations (define a new RDD) map filter sample union groupByKey reduceByKey join cache … Actions (return a value) reduce collect count save lookupKey …
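A few of those operations in one short sketch (the numbers are made up for illustration):

    val nums = sc.parallelize(1 to 10)
    // Transformations define new RDDs lazily; nothing runs yet.
    val evens = nums.filter(_ % 2 == 0)
    val pairs = evens.map(n => (n % 4, n))
    val sums  = pairs.reduceByKey(_ + _)
    // Actions trigger execution and return values to the driver.
    println(sums.collect().mkString(", "))
    println(nums.count())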
  59. 59. © 2014 MapR Technologies 75 MapR Tutorial: Getting Started with Spark on MapR Sandbox • https://www.mapr.com/products/mapr-sandbox- hadoop/tutorials/spark-tutorial
  60. 60. © 2014 MapR Technologies 76 MapR Blog: Getting Started with the Spark Web UI • https://www.mapr.com/blog/getting-started-spark-web-ui
  61. 61. © 2014 MapR Technologies 77© 2014 MapR Technologies Example: Log Mining
  62. 62. © 2014 MapR Technologies 78 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Based on slides from Pat McDonough at
  63. 63. © 2014 MapR Technologies 79 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver
  64. 64. © 2014 MapR Technologies 80 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”)
  65. 65. © 2014 MapR Technologies 81 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”) Base RDD
  66. 66. © 2014 MapR Technologies 82 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver
  67. 67. © 2014 MapR Technologies 83 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver Transformed RDD
  68. 68. © 2014 MapR Technologies 84 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count()
  69. 69. © 2014 MapR Technologies 85 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Action
  70. 70. © 2014 MapR Technologies 86 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3
  71. 71. © 2014 MapR Technologies 87 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver tasks tasks tasks
  72. 72. © 2014 MapR Technologies 88 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Read HDFS Block Read HDFS Block Read HDFS Block
  73. 73. © 2014 MapR Technologies 89 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data
  74. 74. © 2014 MapR Technologies 90 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results
  75. 75. © 2014 MapR Technologies 91 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count()
  76. 76. © 2014 MapR Technologies 92 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() tasks tasks tasks Driver
  77. 77. © 2014 MapR Technologies 93 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Process from Cache Process from Cache Process from Cache
  78. 78. © 2014 MapR Technologies 94 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver results results results
  79. 79. © 2014 MapR Technologies 95 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Cache your data  Faster Results
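The same log-mining flow written out once in Scala for reference (paths are placeholders; note that the field separator is a tab character, which the slide text renders as "t"):

    val lines = sc.textFile("hdfs:///path/to/logs")
    val errors = lines.filter(_.startsWith("ERROR"))
    // The third tab-separated field holds the message text.
    val messages = errors.map(_.split("\t")(2))
    messages.cache()
    // The first action reads the blocks from HDFS and populates the cache ...
    println(messages.filter(_.contains("mysql")).count())
    // ... later actions are served from the cached partitions, hence the faster results.
    println(messages.filter(_.contains("php")).count())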
  80. 80. © 2014 MapR Technologies 96© 2014 MapR Technologies Dataframes
  81. 81. © 2014 MapR Technologies 97 Scenario: Online Auction Data AuctionID Bid Bid Time Bidder Bidder Rate Open Bid Price Item Days To Live 8213034705 95 2.927373 jake7870 0 95 117.5 xbox 3 8213034705 115 2.943484 davidbresler2 1 95 117.5 xbox 3 8213034705 100 2.951285 gladimacowgirl 58 95 117.5 xbox 3 8213034705 117.5 2.998947 daysrus 95 95 117.5 xbox 3
  82. 82. © 2014 MapR Technologies 98 Creating DataFrames // Define the schema using a case class case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Int, openbid: Float, price: Float, item: String, daystolive: Int)
  83. 83. © 2014 MapR Technologies 99 Creating DataFrames // Create the RDD val auctionRDD=sc.textFile("/user/user01/data/ebay.csv") .map(_.split(",")) // Map the data to the Auction class val auctions=auctionRDD.map(a=>Auction(a(0), a(1).toFloat, a(2).toFloat, a(3), a(4).toInt, a(5).toFloat, a(6).toFloat, a(7), a(8).toInt))
  84. 84. © 2014 MapR Technologies 100 Creating DataFrames // Convert RDD to DataFrame val auctionsDF=auctions.toDF() //To see the data auctionsDF.show() auctionid bid bidtime bidder bidderrate openbid price item daystolive 8213034705 95.0 2.927373 jake7870 0 95.0 117.5 xbox 3 8213034705 115.0 2.943484 davidbresler2 1 95.0 117.5 xbox 3 …
  85. 85. © 2014 MapR Technologies 101 Inspecting Data: Examples //To see the schema auctionsDF.printSchema() root |-- auctionid: string (nullable = true) |-- bid: float (nullable = false) |-- bidtime: float (nullable = false) |-- bidder: string (nullable = true) |-- bidderrate: integer (nullable = true) |-- openbid: float (nullable = false) |-- price: float (nullable = false) |-- item: string (nullable = true) |-- daystolive: integer (nullable = true)
  86. 86. © 2014 MapR Technologies 102 Inspecting Data: Examples //Total number of bids val totbids=auctionsDF.count() //How many bids per item? auctionsDF.groupBy("auctionid", "item").count.show auctionid item count 3016429446 palm 10 8211851222 xbox 28 3014480955 palm 12 8214279576 xbox 4 3014844871 palm 18 3014012355 palm 35 1641457876 cartier 2
  87. 87. © 2014 MapR Technologies 103 Querying DataFrames with SQL // Register the DF as a table auctionsDF.registerTempTable("auction") val results =sqlContext.sql("SELECT auctionid, MAX(price) as maxprice FROM auction GROUP BY item,auctionid") results.show() auctionid maxprice 3019326300 207.5 8213060420 120.0 . . .
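The same question can be expressed with the DataFrame API instead of a SQL string; a hedged equivalent sketch:

    import org.apache.spark.sql.functions.max

    // Max price per item/auction using DataFrame operators.
    val maxPrices = auctionsDF
      .groupBy("item", "auctionid")
      .agg(max("price").alias("maxprice"))
    maxPrices.show()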
  88. 88. © 2014 MapR Technologies 104 The physical plan for DataFrames
  89. 89. © 2014 MapR Technologies 105 DataFrame Execution Plan // Print the physical plan to the console auctionsDF.select("auctionid").distinct.explain() == Physical Plan == Distinct false Exchange (HashPartitioning [auctionid#0], 200) Distinct true Project [auctionid#0] PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3, bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
  90. 90. © 2014 MapR Technologies 106 MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data • https://www.mapr.com/blog/using-apache-spark-dataframes- processing-tabular-data
  91. 91. © 2014 MapR Technologies 107© 2014 MapR Technologies Machine Learning
  92. 92. © 2014 MapR Technologies 108 Collaborative Filtering with Spark • Recommend Items – (filtering) • Based on User preferences data – (collaborative)
  93. 93. © 2014 MapR Technologies 109 Alternating Least Squares • Approximates the sparse user-item rating matrix • as the product of two dense matrices, the User and Item factor matrices
  94. 94. © 2014 MapR Technologies 110 Parse Input // parse input UserID::MovieID::Rating def parseRating(str: String): Rating= { val fields = str.split("::") Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble) } // create an RDD of Ratings objects val ratingsRDD = ratingText.map(parseRating).cache()
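The snippet above assumes MLlib's Rating class and a ratingText RDD are already in scope; for completeness, a sketch of those pieces (the file path is a placeholder):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Raw input, one line per rating: UserID::MovieID::Rating (a trailing ::Timestamp field, if present, is ignored)
    val ratingText = sc.textFile("hdfs:///path/to/ratings.dat")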
  95. 95. © 2014 MapR Technologies 111 ML Cross Validation Process Data Model Training/ Building Model Testing Test Set Train Test loop Training Set
  96. 96. © 2014 MapR Technologies 112 Create Model // Randomly split ratings RDD into training data RDD (80%) and test data RDD (20%) val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L) val trainingRatingsRDD = splits(0).cache() val testRatingsRDD = splits(1).cache() // build an ALS user-product matrix model with rank=20, iterations=10 val model = (new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD))
  97. 97. © 2014 MapR Technologies 113 Get predictions // Get the top 5 movie predictions for user 4169 val topRecsForUser = model.recommendProducts(4169, 5) // get (user,product) pairs val testUserProductRDD = testRatingsRDD.map { case Rating(user, product, rating) => (user, product) } // get predictions for (user,product) pairs val predictionsForTestRDD = model.predict(testUserProductRDD)
  98. 98. © 2014 MapR Technologies 114 Test Model // prepare predictions for comparison val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map{ case Rating(user, product, rating) => ((user, product), rating) } // prepare test for comparison val testKeyedByUserProductRDD = testRatingsRDD.map{ case Rating(user, product, rating) => ((user, product), rating) } //Join the test with predictions val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD.join(predictionsKeyedByUserProductRDD)
  99. 99. © 2014 MapR Technologies 115 Test Model val falsePositives =(testAndPredictionsJoinedRDD.filter{ case ((user, product), (ratingT, ratingP)) => (ratingT <= 1 && ratingP >=4) }) falsePositives.take(2) Array[((Int, Int), (Double, Double))] = ((3842,2858),(1.0,4.106488210964762)), ((6031,3194),(1.0,4.790778049100913))
  100. 100. © 2014 MapR Technologies 116 Test Model //Evaluate the model using Mean Absolute Error (MAE) between test and predictions val meanAbsoluteError = testAndPredictionsJoinedRDD.map { case ((user, product), (testRating, predRating)) => val err = (testRating - predRating) Math.abs(err) }.mean() meanAbsoluteError: Double = 0.7244940545944053
  101. 101. © 2014 MapR Technologies 117 Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/ • Blogs and Tutorials: – Movie Recommendations with Collaborative Filtering – Spark Streaming
  102. 102. © 2014 MapR Technologies 118 Machine Learning Blog • https://www.mapr.com/blog/parallel-and-iterative-processing- machine-learning-recommendations-spark
  103. 103. © 2014 MapR Technologies 119© 2014 MapR Technologies Examples and Resources
  104. 104. © 2014 MapR Technologies 120 Coming soon Free Spark On Demand Training • https://www.mapr.com/services/mapr-academy/
  105. 105. © 2014 MapR Technologies 121 Find my presentation and other related resources here: http://events.mapr.com/JacksonvilleJava (you can find this link in the event’s page at meetup.com) Today’s Presentation Whiteboard & demo videos Free On-Demand Training Free eBooks Free Hadoop Sandbox And more…
  106. 106. © 2014 MapR Technologies 122 Spark on MapR • Certified Spark Distribution • Fully supported and packaged by MapR in partnership with Databricks – mapr-spark package with Spark, Shark, Spark Streaming today – Spark-python, GraphX and MLLib soon • YARN integration – Spark can then allocate resources from cluster when needed
  107. 107. © 2014 MapR Technologies 123 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
  108. 108. © 2014 MapR Technologies 124 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies
