
Large volume data analysis on the Typesafe Reactive Platform


Large volume data analysis on the Typesafe Reactive Platform. Scala Days 2015 Amsterdam slides


  1. Martin Zapletal @zapletal_martin, Cake Solutions @cakesolutions
  2. ● Increasing importance of data analytics
     ● Current state:
       ○ Destructive updates
       ○ Analytics tools with poor scalability and integration
       ○ Manual processes
       ○ Slow iterations
       ○ Not suitable for large amounts of data
  3. ● Shared memory, disk, shared nothing, threads, mutexes, transactional memory, message passing, CSP, actors, futures, coroutines, evented, dataflow, ...
     "We can think of two reasons for using distributed machine learning: because you have to (so much data), or because you want to (hoping it will be faster). Only the first reason is good." - Zygmunt Z
     (Chart: elapsed times for 20 PageRank iterations [1, 2])
  4. ● Microsoft's data centers: an average failure rate of 5.2 devices per day and 40.8 links per day, with a median time to repair of approximately five minutes (and a maximum of one week).
     ● A new Google cluster over one year: five rack issues with 40-80 machines seeing 50 percent packet loss; eight network maintenance events (four of which might cause ~30-minute random connectivity losses); three router failures (resulting in the need to pull traffic immediately for an hour).
     ● CENIC: 500 isolating network partitions, with median durations of 2.7 and 32 minutes and 95th percentiles of 19.9 minutes and 3.7 days, for software and hardware problems respectively. [3]
  5. ● A partition separated a MongoDB primary from its 2 secondaries. 2 hours later the old primary rejoined and rolled back everything on the new primary.
     ● A network partition isolated the Redis primary from all secondaries. Every API call caused the billing system to recharge customer credit cards automatically, resulting in 1.1 percent of customers being overbilled over a period of 40 minutes.
     ● A partition caused inconsistency in the MySQL database. Because foreign key relationships were not consistent, GitHub showed private repositories on the wrong users' dashboards and incorrectly routed some newly created repositories.
     ● For several seconds, Elasticsearch is happy to believe two nodes in the same cluster are both primaries, will accept writes on both of those nodes, and later discard the writes to one side.
     ● RabbitMQ lost ~35% of acknowledged writes under those conditions.
     ● Redis threw away 56% of the writes it told us succeeded.
     ● In Riak, last-write-wins resulted in dropping 30-70% of writes, even with the strongest consistency settings.
     ● MongoDB "strictly consistent" reads see stale versions of documents, but they can also return garbage data from writes that never should have occurred. [4]
  6. ● Complementary approaches
     ● Distributed data processing frameworks: Apache Spark won the Daytona GraySort 100TB benchmark
     ● Distributed databases
  7. ● Whole lifecycle of data
     ● Data processing - Futures, Akka, Akka Cluster, Reactive Streams, Spark, ...
     ● Data stores
     ● Integration
     ● Distributed computing primitives
     ● Cluster managers and task schedulers
     ● Deployment, configuration management and DevOps
     ● Data analytics and machine learning
  8. ACID / Mutable State
  9. (Diagrams of common data platform architectures [5]:
     ● Batch-Pipeline: Flume feeds HDFS ("all your data"), Sqoop loads a serving DB, Hive and Impala query it, Oozie orchestrates
     ● CQRS: clients send commands to one DB; queries are answered from denormalised / precomputed views in another
     ● Kappa architecture: Kafka holds all your data; a stream processor maintains views that clients query
     ● Lambda architecture: a batch layer and a fast stream layer feed a serving layer that answers queries)
  10. (Diagram: the Google data infrastructure stack [6])
  11. ● Basic building block of neural networks: a = f(Σ(y * w) + b)
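
      A minimal sketch of this building block in plain Scala. The sigmoid used as f is an assumption here, standing in for the Neuron.sigmoid referenced on later slides:

      // Sigmoid activation function (assumed; the slides reference Neuron.sigmoid).
      def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

      // a = f(Σ(y * w) + b): weighted sum of the inputs plus a bias, passed through f.
      def neuron(weights: Seq[Double], inputs: Seq[Double], bias: Double)
                (f: Double => Double): Double =
        f(weights.zip(inputs).map { case (w, y) => w * y }.sum + bias)

      // Example: a three-input neuron with the weights and bias used on later slides.
      val a = neuron(Seq(0.3, 0.3, 0.3), Seq(1.0, 0.0, 1.0), 0.2)(sigmoid)
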
  12. ● Multi Layer Perceptron (Feed Forward Neural Network)
      ● Network training
        ○ Many "optimal" solutions
        ○ Optimization and training techniques - LBFGS, backpropagation, batch and online gradient descent, Downpour SGD, Sandblaster LBFGS
        ○ New methods for large networks - deep learning
      ● We will only need forward propagation
  13. (Diagram: a network trained for XOR; the figures show the learned weights -10.895, 1.195, -24.584, -1.159, 7.360, -40.119, 1.991, 35.369, -24.687, -53.197, -8.627, -57.122, 2.616, 61.488, -52.985, -22.904, -67.173, 22.172, -53.706, 27.098, -0.375, and an example input (1, 0) with output 0.999595)
      Output 2.613296075440797E-4 for input Vector(0, 0)
      Output 0.9989222606269823 for input Vector(0, 1)
      Output 0.9995952194411893 for input Vector(1, 0)
      Output 4.0074182099155245E-7 for input Vector(1, 1)
  14. trait HasInput {
        var input: Node = _
        def addInput(i: Node): Unit = input = i
      }

      trait HasOutput {
        var output: Node = _
        def addOutput(o: Node): Unit = output = o
      }

      class Edge extends HasInput with HasOutput {
        var weight: Double = 0.3
        def run(in: Input) = output.run(WeightedInput(in.feature, weight))
      }
  15. class Perceptron extends Neuron {
        override var activationFunction: Double => Double = Neuron.sigmoid
        override var bias: Double = 0.2

        var inputs: Seq[Edge] = Seq()
        var outputs: Seq[Edge] = Seq()
        var weightsT: Seq[Double] = Vector()
        var featuresT: Seq[Double] = Vector()

        private def allInputsAvailable(w: Seq[Double], f: Seq[Double], in: Seq[Edge]) =
          w.length == in.length && f.length == in.length

        override def run(in: WeightedInput): Unit = {
          featuresT = featuresT :+ in.feature
          weightsT = weightsT :+ in.weight

          if (allInputsAvailable(weightsT, featuresT, inputs)) {
            val activation = activationFunction(weightsT.zip(featuresT).map(x => x._1 * x._2).sum + bias)
            featuresT = Vector()
            weightsT = Vector()
            outputs.foreach(_.run(Input(activation)))
          }
        }
      }
  16. val hiddenLayer1 = new Perceptron()
      val edgei1h1 = new Edge()
      edgei1h1.addInput(inputLayer1)
      edgei1h1.addOutput(hiddenLayer1)
      hiddenLayer1.addInputs(Seq(edgei1h1, edgei2h1, edgei3h1))
      hiddenLayer1.addOutputs(Seq(edgeh1o1))

      Source.fromFile("src/main/resources/data2.csv")
        .getLines()
        .foreach { l =>
          val splits = l.split(",")
          inputLayer1.run(WeightedInput(splits(0).toDouble, 1))
          inputLayer2.run(WeightedInput(splits(1).toDouble, 1))
          inputLayer3.run(WeightedInput(splits(2).toDouble, 1))
        }

      (Sample rows from data2.csv, with values such as 0, 0.0001, 0.0002)
  17. Output 0 with result 0.6294598811729977 in 14:49:14.971
      Output 1 with result 0.6294629986168121 in 14:49:14.975
      Output 2 with result 0.6294661160218618 in 14:49:14.976
      Output 3 with result 0.6294692333881344 in 14:49:14.976
      Output 4 with result 0.6294723507156179 in 14:49:14.977
      Output 5 with result 0.6294754680043 in 14:49:14.978
      Output 6 with result 0.6294785852541688 in 14:49:14.978
      Output 7 with result 0.6294817024652116 in 14:49:14.979
      Output 8 with result 0.6294848196374169 in 14:49:14.979
      Output 9 with result 0.6294879367707719 in 14:49:14.980
      Output 10 with result 0.6294910538652648 in 14:49:14.980
      Output 11 with result 0.6294941709208833 in 14:49:14.981
      Output 12 with result 0.6294972879376152 in 14:49:14.981
      Output 13 with result 0.6295004049154483 in 14:49:14.982
      Output 14 with result 0.6295035218543708 in 14:49:14.982
      Output 15 with result 0.6295066387543699 in 14:49:14.983
      Output 16 with result 0.6295097556154339 in 14:49:14.983
      Output 17 with result 0.6295128724375503 in 14:49:14.983
      Output 18 with result 0.6295159892207073 in 14:49:14.984
      Output 19 with result 0.6295191059648922 in 14:49:14.984
      Output 20 with result 0.6295222226700935 in 14:49:14.985
      ...
  18. Source.fromFile("src/main/resources/data2.csv")
        .getLines()
        .toList
        .par
        .foreach { l =>
          ...
        }

      Output 0 with result 0.6615020337700888 in 12:15:53.564
      Output 0 with result 0.6622847063345205 in 12:15:53.564

      (Note the two different results for Output 0: parallel execution races on the neurons' shared mutable state.)
  19. object Perceptron {
        def activation(w: Vector[Double], f: Vector[Double], bias: Double,
                       activationFunction: Double => Double) =
          activationFunction(w.zip(f).map(x => x._1 * x._2).sum + bias)
      }

      object Network {
        def feedForward(features: Vector[Double],
                        network: Seq[Vector[Vector[Double] => Double]]): Vector[Double] =
          network.foldLeft(features)((b, a) => a.map(_(b)))
      }

      val network = Seq[Vector[Vector[Double] => Double]](
        Vector(
          Perceptron.activation(Vector(0.3, 0.3, 0.3), _, 0.2, Neuron.sigmoid),
          Perceptron.activation(Vector(0.3, 0.3, 0.3), _, 0.2, Neuron.sigmoid)),
        Vector(Perceptron.activation(Vector(0.3, 0.3, 0.3), _, 0.2, Neuron.sigmoid))
      )

      Network.feedForward(Vector(splits(0).toDouble, splits(1).toDouble, splits(2).toDouble), network)
  20. ● Actor framework for truly concurrent and distributed systems
      ● Thread-safe mutable state
      ● Actors can send messages, create new actors, and change their behaviour
      ● Multiple options for expressing a neural network
  21. def props() = Props(behaviour)

      def behaviour =
        addInput(addOutput(feedForward(_, _, 0.2, sigmoid, Vector(), Vector()), _))

      private def allInputsAvailable(w: Vector[Double], f: Vector[Double], in: Seq[ActorRef[Nothing]]) =
        w.length == in.length && f.length == in.length

      def feedForward(
          inputs: Seq[ActorRef[Nothing]],
          outputs: Seq[ActorRef[Input]],
          bias: Double,
          activationFunction: Double => Double,
          weightsT: Vector[Double],
          featuresT: Vector[Double]): Behavior[NodeMessage] =
        Partial[NodeMessage] {
          case WeightedInput(f, w) =>
            val featuresTplusOne = featuresT :+ f
            val weightsTplusOne = weightsT :+ w

            if (allInputsAvailable(featuresTplusOne, weightsTplusOne, inputs)) {
              val activation = activationFunction(weightsTplusOne.zip(featuresTplusOne).map(x => x._1 * x._2).sum + bias)
              outputs.foreach(_ ! Input(activation))
              feedForward(inputs, outputs, bias, activationFunction, Vector(), Vector())
            } else {
              feedForward(inputs, outputs, bias, activationFunction, weightsTplusOne, featuresTplusOne)
            }
        }
  22. Activation 0.5498414227985574 using features Vector(0.0, 0.0, 1.0E-4)
      Activation 0.549856273704096 using features Vector(0.0, 1.0E-4, 2.0E-4)
      Activation 0.5498711245207856 using features Vector(1.0E-4, 2.0E-4, 2.0E-4)
      Activation 0.6294619594716266 using features Vector(0.5498414227985574, 0.549856273704096)
      Activation 0.5498859752486001 using features Vector(3.0E-4, 4.0E-4, 0.0)
      Activation 0.5499082511736245 using features Vector(3.0E-4, 4.0E-4, 3.0E-4)
      Activation 0.6294661160203068 using features Vector(0.5498711245207856, 0.5498859752486001)
      Activation 0.5499453772705898 using features Vector(4.0E-4, 5.0E-4, 6.0E-4)
      Activation 0.549893400579171 using features Vector(1.0E-4, 5.0E-4, 2.0E-4)
      Activation 0.5499453772705898 using features Vector(7.0E-4, 0.0, 8.0E-4)
      Activation 0.6294692333865788 using features Vector(0.5499082511736245, 0.549893400579171)
      Activation 0.5499231016791383 using features Vector(1.0E-4, 9.0E-4, 2.0E-4)
      Activation 0.5499231016791383 using features Vector(3.0E-4, 4.0E-4, 5.0E-4)
      Activation 0.549967652661781 using features Vector(6.0E-4, 5.0E-4, 7.0E-4)
      Activation 0.5500047778685057 using features Vector(6.0E-4, 8.0E-4, 9.0E-4)
      Activation 0.5500196277952787 using features Vector(7.0E-4, 8.0E-4, 0.001)
      Activation 0.5500716018368813 using features Vector(9.0E-4, 0.0011, 0.0012)
      Activation 0.5501458485714356 using features Vector(0.0013, 0.0014, 0.0015)
      Activation 0.5501532731220817 using features Vector(0.0016, 0.001, 0.0017)
  23. ● A sequential program always has a single total order of operations
      ● A distributed system gives no ordering guarantees by default
      ● Akka guarantees only that, for a given pair of actors, messages sent directly from the first to the second are not received out of order (the guarantee is not transitive)
  24. ● At-most-once: messages may be lost
      ● At-least-once: messages may be duplicated, but not lost
      ● Exactly-once: requires acknowledgement; cf. the Two Generals' Problem [8]
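
      At-least-once can be built from persisted sends, acknowledgements and redelivery, as Akka Persistence's AtLeastOnceDelivery trait does. A sketch; the message and event names (Msg, Confirm, MsgSent, MsgConfirmed) are illustrative, not from the talk:

      import akka.actor.ActorPath
      import akka.persistence.{AtLeastOnceDelivery, PersistentActor}

      case class Msg(deliveryId: Long, payload: String)
      case class Confirm(deliveryId: Long)
      case class MsgSent(payload: String)
      case class MsgConfirmed(deliveryId: Long)

      class Sender(destination: ActorPath) extends PersistentActor with AtLeastOnceDelivery {
        override def persistenceId: String = "sender-1"

        override def receiveCommand: Receive = {
          case payload: String =>
            // Persist the intent first, then deliver; redelivery repeats until confirmed.
            persist(MsgSent(payload)) { e =>
              deliver(destination)(deliveryId => Msg(deliveryId, e.payload))
            }
          case Confirm(deliveryId) =>
            persist(MsgConfirmed(deliveryId))(e => confirmDelivery(e.deliveryId))
        }

        override def receiveRecover: Receive = {
          case MsgSent(payload)         => deliver(destination)(id => Msg(id, payload))
          case MsgConfirmed(deliveryId) => confirmDelivery(deliveryId)
        }
      }

      The receiver must acknowledge with Confirm and deduplicate by deliveryId if exactly-once processing is wanted on top.
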
  25. (Diagram: messages processed in the order 1, 4, 7, 2, 3, 5, 6, 8, 9, 10, 11 - delivery order differs from send order)
  26. (Diagram: with lost or duplicated messages the receiver's state is uncertain: ?, ? + 1, ? + 2, ...)
  27. Output 76 with result 0.6298492571946717 in 2015-05-21 17:26:56.504
      Output 77 with result 0.6298357692712147 in 2015-05-21 17:26:56.504
      Output 78 with result 0.6298679316729997 in 2015-05-21 17:26:56.504
      Output 79 with result 0.6298674125610421 in 2015-05-21 17:26:56.504
      Output 80 with result 0.6298866035455875 in 2015-05-21 17:26:56.504
      Output 81 with result 0.6298959406028078 in 2015-05-21 17:26:56.504
      Output 82 with result 0.6299052760580531 in 2015-05-21 17:26:56.504
      Output 83 with result 0.6299057948796922 in 2015-05-21 17:26:56.505
      Output 84 with result 0.6299094252583786 in 2015-05-21 17:26:56.505
      Output 85 with result 0.6299332807659122 in 2015-05-21 17:26:56.505
      Output 86 with result 0.6299426148804811 in 2015-05-21 17:26:56.505
      Output 87 with result 0.6299462447313531 in 2015-05-21 17:26:56.505
      Output 88 with result 0.6299612820238325 in 2015-05-21 17:26:56.505
      [INFO] [05/21/2015 17:26:56.504] [akka-akka.actor.default-dispatcher-13] [akka://akka/user/hiddenLayer1] Message [Node$WeightedInput] from Actor[akka://akka/deadLetters] to Actor[akka://akka/user/hiddenLayer1#162015581] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
      Output 89 with result 0.6299706150403437 in 2015-05-21 17:26:56.505
      Output 90 with result 0.6299799476901787 in 2015-05-21 17:26:56.506
      [INFO] [05/21/2015 17:26:56.504] [akka-akka.actor.default-dispatcher-13] [akka://akka/user/hiddenLayer1] Message [Node$WeightedInput] from Actor[akka://akka/deadLetters] to Actor[akka://akka/user/hiddenLayer1#162015581] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
      Output 91 with result 0.6299809846419518 in 2015-05-21 17:26:56.506
      Output 92 with result 0.6299986118617977 in 2015-05-21 17:26:56.506
  28. ● Model parallelism
      ● Actor creation: manual or via Cluster Sharding

      val idExtractor: ShardRegion.IdExtractor = {
        case i: AddInputs => (i.recipientId.toString, i) // recipientId is a hypothetical field; the slide elides the body
      }

      val shardResolver: ShardRegion.ShardResolver = {
        case i: AddInputs => (math.abs(i.recipientId.hashCode) % 100).toString // hypothetical shard allocation
      }

      (Diagram: the model partitioned across Machine1-Machine4 [9])
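
      Registering the sharded entry type might look like this, a sketch against the Akka 2.3-era Cluster Sharding API (assuming an ActorSystem named system):

      // Start a shard region hosting Perceptron entries on this cluster node.
      val perceptronRegion: ActorRef = ClusterSharding(system).start(
        Perceptron.shardName,    // typeName
        Some(Props[Perceptron]), // entryProps (None on proxy-only nodes)
        idExtractor,
        shardResolver)

      // Messages go to the region, which routes them to the right entity,
      // creating it on demand:
      // perceptronRegion ! AddInputs(...)
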
  29. class Perceptron() extends Actor with Neuron {
        ...
        override def receive = run orElse addInput orElse addOutput

        val shardRegion = ClusterSharding(context.system).shardRegion(Edge.shardName)

        def run: Receive = {
          case WeightedInput(_, f, w) =>
            featuresT = featuresT :+ f
            weightsT = weightsT :+ w

            if (allInputsAvailable(weightsT, featuresT, inputs)) {
              val activation = activationFunction(weightsT.zip(featuresT).map(x => x._1 * x._2).sum + bias)
              featuresT = Vector()
              weightsT = Vector()
              outputs.foreach(o => shardRegion ! Input(o, activation))
            }
        }
      }
  30. Output 18853 with result 0.6445355972059068 in 17:33:12.248
      Output 18854 with result 0.6392081778097862 in 17:33:12.248
      Output 18855 with result 0.6476549338361918 in 17:33:12.248
      Output 18856 with result 0.6413832367161323 in 17:33:12.248
      [17:33:12.353] [ClusterSystem-akka.actor.default-dispatcher-21] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@127.0.0.1:2551] - Leader is removing unreachable node [akka.tcp://ClusterSystem@127.0.0.1:54495]
      [17:33:12.388] [ClusterSystem-akka.actor.default-dispatcher-22] [akka.tcp://ClusterSystem@127.0.0.1:2551/user/sharding/PerceptronCoordinator] Member removed [akka.tcp://ClusterSystem@127.0.0.1:54495]
      [17:33:12.388] [ClusterSystem-akka.actor.default-dispatcher-35] [akka.tcp://ClusterSystem@127.0.0.1:2551/user/sharding/EdgeCoordinator] Member removed [akka.tcp://ClusterSystem@127.0.0.1:54495]
      [17:33:12.415] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/user/sharding/Edge/e-2-1-3-1] null java.lang.NullPointerException
      [17:33:12.436] [ClusterSystem-akka.actor.default-dispatcher-2] [akka://ClusterSystem/user/sharding/Edge/e-2-1-3-1] null java.lang.NullPointerException
      [17:33:12.436] [ClusterSystem-akka.actor.default-dispatcher-2] [akka://ClusterSystem/user/sharding/Edge/e-2-1-3-1] null java.lang.NullPointerException
  31. class Edge extends PersistentActor with HasInput with HasOutput {
        override def persistenceId: String = self.path.name

        var weight: Double = 0.3

        override def receiveCommand: Receive = run orElse addInput orElse addOutput
        override def receiveRecover: Receive = recover orElse addInputRecover orElse addOutputRecover

        val shardRegion = ClusterSharding(context.system).shardRegion(Perceptron.shardName)

        def run: Receive = {
          case Input(r, f) =>
            shardRegion ! WeightedInput(output, f, weight)
          case UpdateWeightCommand(r, w) =>
            persist(UpdatedWeightEvent(r, w)) { event => weight = event.weight }
        }

        def recover: Receive = {
          case UpdatedWeightEvent(_, w) => weight = w
        }
      }
  32. ● Data parallelism
      (Diagram: the same computation applied to independent partitions of the data)
  33. "ElasticSearch gives up on partition tolerance. It means, if enough nodes fail, the cluster state turns red and ES does not proceed to operate on that index. ES is not giving up on availability. Every request will be responded to, either true (with result) or false (error)."
      ● Synchronous and asynchronous replication
      ● Availability and consistency during partition [4]
  34. (Diagram: replicas r0, r1, r2; r0 records a write with Clock: (r0 -> 1), Value: x)
  35. (Diagram: the write (r0 -> 1), x has replicated to r0, r1 and r2; r2 then writes (r2 -> 1), y)
  36. (Diagram: all replicas converge to (r0 -> 1, r2 -> 1), y - the second write causally follows the first)
  37. (Diagram: versions (r0 -> 1), x and (r2 -> 1), y have concurrent vector clocks on r0, r1, r2 - a conflict)
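
      A minimal vector clock sketch in plain Scala showing why (r0 -> 1), x and (r2 -> 1), y conflict. This only illustrates the comparison rule; it is not Eventuate's VectorTime:

      case class VClock(entries: Map[String, Long]) {
        // Advance this replica's component by one.
        def increment(replica: String): VClock =
          VClock(entries.updated(replica, entries.getOrElse(replica, 0L) + 1))

        // Pointwise maximum: the clock of a merged version.
        def merge(that: VClock): VClock =
          VClock((entries.keySet ++ that.entries.keySet).map { k =>
            k -> math.max(entries.getOrElse(k, 0L), that.entries.getOrElse(k, 0L))
          }.toMap)

        // This clock happened before (or equals) the other.
        def before(that: VClock): Boolean =
          entries.forall { case (k, v) => v <= that.entries.getOrElse(k, 0L) }

        // Neither happened before the other: concurrent updates, i.e. a conflict.
        def conflicts(that: VClock): Boolean = !before(that) && !that.before(this)
      }

      val cx = VClock(Map("r0" -> 1L)) // (r0 -> 1), x
      val cy = VClock(Map("r2" -> 1L)) // (r2 -> 1), y
      cx.conflicts(cy)                 // true: concurrent writes
      cx.merge(cy)                     // (r0 -> 1, r2 -> 1), the clock after resolution
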
  38. class Edge(
          override val aggregateId: Option[String],
          override val replicaId: String,
          override val eventLog: ActorRef) extends EventsourcedActor with HasInput with HasOutput {

        var weight: Double = 0.3

        override def onCommand: Receive = run orElse addInput orElse addOutput

        private var versionedState: ConcurrentVersions[Double, Double] =
          ConcurrentVersions(0.3, (s, a) => a)

        ...

        override def onEvent: Receive = {
          case UpdatedWeightEvent(w) =>
            versionedState = versionedState.update(w, lastVectorTimestamp, lastEmitterReplicaId)

            if (versionedState.conflict) {
              val conflictingVersions = versionedState.all
              val avg = conflictingVersions.map(_.value).sum / conflictingVersions.size
              val newTimestamp = conflictingVersions.map(_.updateTimestamp).foldLeft(VectorTime())(_.merge(_))
              versionedState.update(avg, newTimestamp, replicaId)
              versionedState = versionedState.resolve(newTimestamp)
              weight = versionedState.all.head.value
            } else {
              weight = versionedState.all.head.value
            }
        }
      }
  39. ● Replica r0 - update weight to 0, 1, 2
      ● Replica r1 - 3, 4, 5
      ● Replica r2 - 6, 7, 8

      Conflicting versions on replica 0
        value 4.0 vector clock VectorTime(r1 -> 1)
        value 7.0 vector clock VectorTime(r2 -> 1)
      Conflicting versions on replica 0 resolved
        value 5.5 vector clock VectorTime(r1 -> 1, r2 -> 1)
      Conflicting versions on replica 0
        value 5.5 vector clock VectorTime(r1 -> 1, r2 -> 1)
        value 0.0 vector clock VectorTime(r0 -> 1)
      Conflicting versions on replica 0 resolved
        value 2.75 vector clock VectorTime(r1 -> 1, r2 -> 1, r0 -> 1)
      Conflicting versions on replica 0
        value 2.75 vector clock VectorTime(r1 -> 1, r2 -> 1, r0 -> 1)
        value 3.0 vector clock VectorTime(r1 -> 2)
      Conflicting versions on replica 0 resolved
        value 2.875 vector clock VectorTime(r1 -> 2, r2 -> 1, r0 -> 1)
      Conflicting versions on replica 0
        value 5.0 vector clock VectorTime(1-e1 -> 5, r2 -> 1, r0 -> 1)
        value 6.0 vector clock VectorTime(r2 -> 3, 1-e1 -> 1)
      Conflicting versions on replica 0 resolved
        value 5.5 vector clock VectorTime(1-e1 -> 5, r2 -> 3, r0 -> 1)
  40. class Edge extends Actor with HasInput with HasOutput {
        var weight: Double = 0.3

        val replicator = DataReplication(context.system).replicator
        implicit val cluster = Cluster(context.system)

        replicator ! Subscribe(self.path.name, self)

        override def receive: Receive = run orElse addInput orElse addOutput

        def run: Receive = {
          ...
          case UpdateWeight(w) =>
            replicator ! Update(self.path.name, GCounter(), WriteLocal)(_ + w)
          case Changed(key, GCounter(mergedWeight)) if key == self.path.name =>
            weight = mergedWeight
        }
      }
  41. ● Replica r0 - update weight to 0, 1, 2
      ● Replica r1 - 3, 4, 5
      ● Replica r2 - 6, 7, 8

      Weight on replica r2 changed to 21
      Weight on replica r0 changed to 3
      Weight on replica r1 changed to 12
      Weight on replica r2 changed to 24
      Weight on replica r0 changed to 36
      Weight on replica r1 changed to 15
      Weight on replica r2 changed to 36
      Weight on replica r1 changed to 36

      (Note that the replicas converge on 36 = 3 + 12 + 21, the sum of all updates: a grow-only counter accumulates every increment rather than keeping any single written weight.)
  42. ● Publisher and subscriber
      ● Source[Circle].map(_.toSquare).filter(_.color == blue)
      ● Lazy topology definition
      (Diagram: Publisher -> toSquare -> color == blue -> Subscriber, with backpressure flowing upstream)
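
      A runnable sketch of the same pipeline shape with the Akka Streams 1.0-era API, plain integers standing in for the slide's Circle/Square types:

      import akka.actor.ActorSystem
      import akka.stream.ActorMaterializer
      import akka.stream.scaladsl.Source

      implicit val system = ActorSystem("streams")
      implicit val materializer = ActorMaterializer()

      // The topology is only a description until it is materialized by run*;
      // demand flows upstream from the subscriber (backpressure).
      Source(1 to 10)
        .map(_ * 2)         // stands in for .map(_.toSquare)
        .filter(_ % 3 == 0) // stands in for .filter(_.color == blue)
        .runForeach(println)
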
  43. input ~> network(topology, weights) ~> zipWithIndex ~> formatPrintSink

      def buildLayer(
          layer: Int,
          input: Outlet[DenseMatrix[Double]],
          topology: Array[Int],
          weights: DenseMatrix[Double]): Outlet[DenseMatrix[Double]] = {
        val currentLayer = builder.add(hiddenLayer(layer))
        input ~> currentLayer.in0
        hiddenLayerWeights(topology, layer, weights) ~> currentLayer.in1

        if (layer < topology.length - 1) buildLayer(layer + 1, currentLayer.out, topology, weights)
        else currentLayer.out
      }
  44. def hiddenLayer(layer: Int) = {
        def feedForward(features: DenseMatrix[Double], weightMatrices: DenseMatrix[Double]) = {
          val bias = 0.2
          val activation: DenseMatrix[Double] = weightMatrices * features
          activation(::, *) :+= bias
          sigmoid.inPlace(activation)
          activation
        }

        FlowGraph.partial() { implicit builder: FlowGraph.Builder[Unit] =>
          import akka.stream.scaladsl.FlowGraph.Implicits._

          val zipInputAndWeights = builder.add(Zip[DenseMatrix[Double], DenseMatrix[Double]]())
          val feedForwardFlow = builder.add(Flow[(DenseMatrix[Double], DenseMatrix[Double])]
            .map(x => feedForward(x._1, x._2)))

          zipInputAndWeights.out ~> feedForwardFlow

          new FanInShape2(zipInputAndWeights.in0, zipInputAndWeights.in1, feedForwardFlow.outlet)
        }
      }
  45. (Diagram: within a layer, feature vector n is zipped with the network weights for that layer and feedForward computes the activation, producing feature vector n+1; a final ZipWithIndex pairs each output with its index)
  46. (Visualisation: back pressure in Reactive Streams / Akka Streams [10])
  47. ● In-memory dataflow distributed data processing framework, for streaming and batch
      ● Distributes computation using a higher-level API
      ● Moves computation to data
      ● Fault tolerant
      ● Caching
      ● Transformations
        ○ Lazy, they form the DAG (see the sketch after this slide)
        ○ map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, repartition, cartesian, glom, ...
      ● Actions
        ○ Execute the DAG, retrieve the result
        ○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
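
      For example, a sketch of the lazy evaluation model, assuming a SparkContext sc is in scope:

      // Transformations only record lineage in the DAG; nothing runs here.
      val rdd = sc.parallelize(1 to 1000000)
      val evens = rdd.filter(_ % 2 == 0).map(_ * 2L)

      // The action triggers execution of the DAG and returns a result to the driver.
      val total = evens.count()
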
  48. (DAG: textFile -> map -> map -> reduceByKey -> collect)

      sc.textFile("counts")
        .map(line => line.split("\t"))
        .map(word => (word(0), word(1).toInt))
        .reduceByKey(_ + _)
        .collect() [11]
  49. ● Accumulators
        ○ Processes can only add
        ○ Associative, commutative operation
        ○ Only the driver program can read the value
        ○ Exactly-once semantics only guaranteed for actions

      object DoubleAccumulatorParam extends AccumulatorParam[Double] {
        def zero(initialValue: Double): Double = 0
        def addInPlace(d1: Double, d2: Double): Double = d1 + d2
      }
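
      A usage sketch for the AccumulatorParam above (Spark 1.x accumulator API, assuming a SparkContext sc):

      // Create the accumulator on the driver using the custom param.
      val sum = sc.accumulator(0.0)(DoubleAccumulatorParam)

      // Tasks may only add to it; the merged value is readable on the driver.
      sc.parallelize(1 to 100).foreach(i => sum += i.toDouble)
      println(sum.value) // 5050.0
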
  50. def forwardRun(
          topology: Array[Int],
          data: DenseMatrix[Double],
          weightMatrices: Array[DenseMatrix[Double]]): DenseMatrix[Double] = {
        val bias = 0.2
        val outArray = new Array[DenseMatrix[Double]](topology.size)
        val blas = BLAS.getInstance()
        outArray(0) = data

        for (i <- 1 until topology.size) {
          val weights = hiddenLayerWeights(topology, i, weightMatrices)
          val outputCurrent = new DenseMatrix[Double](weights.rows, data.cols)
          val outputPrevious = outArray(i - 1)

          blas.dgemm("N", "N", outputCurrent.rows, outputCurrent.cols, weights.cols, 1.0,
            weights.data, weights.offset, weights.majorStride,
            outputPrevious.data, outputPrevious.offset, outputPrevious.majorStride,
            1.0, outputCurrent.data, outputCurrent.offset, outputCurrent.rows)

          outArray(i) = outputCurrent
          outArray(i)(::, *) :+= bias
          sigmoid.inPlace(outArray(i))
        }

        outArray(topology.size - 1)
      }
  51. val sc = new SparkContext("local", "Neural Network")

      val result = sc.textFile("src/main/resources/data.csv", 3)
        .map { l =>
          val splits = l.split(",")
          val features = splits.map(_.toDouble)
          new DenseMatrix(3, 1, Array(features(0), features(1), features(2)))
        }
        .map(in => forwardRun(topology, in, weights))
  52. (Diagram: the data is partitioned; feedForward runs on each partition in parallel and collect() gathers the results)
  53. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext.implicits._

      val resultDF = result.toDF("result")

      resultDF
        .filter(resultDF("result") > "String")
        .select(resultDF("result") + "String")
      // StructType(StructField(result,DoubleType,true))

      resultDF.registerTempTable("results")

      val filtered3 = sqlContext.sql(
        "SELECT result + \"String\" " +
        "FROM (" +
          "SELECT result " +
          "FROM results) r " +
        "WHERE r.result >= \"String\"")
  54. ● Multiple phases
      ● Catalyst [12]
  55. object PushPredicateThroughProject extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transform {
          case filter @ Filter(condition, project @ Project(fields, grandChild)) =>
            val sourceAliases = fields.collect {
              case a @ Alias(c, _) => (a.toAttribute: Attribute) -> c
            }.toMap
            project.copy(child = filter.copy(
              replaceAlias(condition, sourceAliases),
              grandChild))
        }
      }

      case Divide(e1, e2) =>
        val eval1 = expressionEvaluator(e1)
        val eval2 = expressionEvaluator(e2)

        eval1.code ++ eval2.code ++
        q"""
          var $nullTerm = false
          var $primitiveTerm: ${termForType(e1.dataType)} = 0

          if (${eval1.nullTerm} || ${eval2.nullTerm}) {
            $nullTerm = true
          } else if (${eval2.primitiveTerm} == 0)
            $nullTerm = true
          else {
            $primitiveTerm = ${eval1.primitiveTerm} / ${eval2.primitiveTerm}
          }
        """.children
  56. === Result of Batch Resolution ===
      === Result of Batch Remove SubQueries ===
      === Result of Batch ConstantFolding ===
      === Result of Batch Filter Pushdown ===

      == Parsed Logical Plan ==
      'Project [('result + String) AS c0#2]
       'Filter ('r.result >= String)
        'Subquery r
         'Project ['result]
          'UnresolvedRelation [results], None

      == Analyzed Logical Plan ==
      Project [(CAST(result#1, DoubleType) + CAST(String, DoubleType)) AS c0#2]
       Filter (CAST(result#1, DoubleType) >= CAST(String, DoubleType))
        Subquery r
         Project [result#1]
          Subquery results
           Project [_1#0 AS result#1]
            LogicalRDD [_1#0], MapPartitionsRDD[5] at map at SQLContext.scala:394

      == Optimized Logical Plan ==
      LocalRelation [c0#2], []

      == Physical Plan ==
      LocalTableScan [c0#2], []
  57. case class Person(age: Int, height: Double)
      val people = sc.parallelize((0 to 100).map(x => Person(x, x)))

      people
        .map(p => Person(p.age, p.height * 2.54))
        .filter(_.age < 35)

      people
        .filter(_.age < 35)
        .map(p => Person(p.age, p.height * 2.54))

      people
        .map(p => Person(p.age, p.height * 2.54))
        .filter(_.height < 170)

      people
        .filter(_.height < 170)
        .map(p => Person(p.age, p.height * 2.54))

      (The first two pipelines are equivalent, since the filter reads only age, which the map preserves, so the predicate can be pushed down; the last two differ, because the map rescales height before the filter reads it.)
  58. 1) Choose the best combination of tools for a given use case.
      2) Understand the internals of the selected tools.
      3) The environment is often fully asynchronous and distributed.
  59. ● Jobs at www.cakesolutions.net/careers
      ● Code at https://github.com/zapletal-martin/reactive-deep-learning
      ● Twitter @zapletal_martin
  60. [1] http://www.csie.ntu.edu.tw/~cjlin/talks/twdatasci_cjlin.pdf
      [2] http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/
      [3] https://queue.acm.org/detail.cfm?id=2655736
      [4] https://aphyr.com/
      [5] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
      [6] http://malteschwarzkopf.de/research/assets/google-stack.pdf
      [7] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
      [8] http://en.wikipedia.org/wiki/Two_Generals%27_Problem
      [9] http://static.googleusercontent.com/media/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf
      [10] http://www.smartjava.org/content/visualizing-back-pressure-and-reactive-streams-akka-streams-statsd-grafana-and-influxdb
      [11] http://www.slideshare.net/LisaHua/spark-overview-37479609
      [12] https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
