
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. A lambda-architecture system involves three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries; each comes with its own set of requirements.

The batch layer aims at perfect accuracy by processing the entire available dataset, an immutable, append-only set of raw data, with a distributed processing system. Its output is typically stored in a read-only database, with each run completely replacing the existing precomputed views. Apache Hadoop, Pig, and Hive are the de facto batch-processing systems.

In the speed layer, data is processed in a streaming fashion, and real-time views are built from the most recent data. The speed layer is thus responsible for filling the "gap" caused by the batch layer's lag in providing views based on the latest data. Its views may not be as accurate as the batch layer's views, which are computed over the full dataset, so they are eventually replaced by them. Traditionally, Apache Storm is used in this layer.

The serving layer stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.

One example of lambda architecture in a machine-learning context is a fraud-detection system. In the speed layer, incoming streaming data is used for online learning: the model trained in the batch layer is updated to incorporate recent events. Periodically, the model is rebuilt from the full dataset.
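The batch/speed interplay can be sketched in plain Python (an illustrative toy, not the talk's actual code): a logistic-regression model is trained in the batch layer over the full dataset, and the speed layer then folds in each new event with a single online SGD step.

```python
import math

def sgd_update(w, x, y, lr=0.1):
    """One online gradient step for logistic regression.

    w and x are equal-length weight/feature lists; y is the 0/1 label.
    This is the speed layer's incremental update."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of the positive class
    return [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]

def batch_train(data, epochs=50):
    """Batch-layer rebuild: run SGD over the full dataset repeatedly."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            w = sgd_update(w, x, y)
    return w

# Batch layer: model trained on historical (toy) data.
history = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
w = batch_train(history)

# Speed layer: fold in a recent event without retraining from scratch.
w = sgd_update(w, [0.0, 1.0], 1)
```

Eventually the speed layer's incrementally updated weights are superseded by the next full `batch_train` run, mirroring how batch views replace real-time views.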

Why Spark for lambda architecture? Traditionally, different technologies are used in the batch and speed layers. If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in both SQL and Java/Scala. This very quickly becomes a maintenance nightmare. With Spark, we have a unified development framework for the batch and speed layers at scale. This talk shows an end-to-end example implemented in Spark and discusses the development, testing, maintenance, and deployment of a lambda-architecture system with Apache Spark.


  1. Lambda Architecture with Apache Spark. DB Tsai (dbtsai@alpinenow.com), Machine Learning Engineering Lead @ Alpine Data Labs. Next.ML Conference, Jan 17, 2015. Learn more about Advanced Analytics at http://www.alpinenow.com
  2. Lambda Architecture: •  Batch layer: manages the entire available dataset, an immutable, append-only set of raw data, using a distributed processing system. •  Speed layer: processes data in a streaming fashion with low latency; real-time views are built from the most recent data. •  Serving layer: stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.
  3. Lambda Architecture: https://www.mapr.com/developercentral/lambda-architecture
  4. Traditional Lambda Architecture: •  Traditionally, different technologies are used in the batch and speed layers. •  If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in both SQL and Java/Scala. •  This very quickly becomes a maintenance nightmare.
  5. Unified Development Framework
  6. Batch Layer: •  Empowers users to iterate through the data by utilizing the in-memory cache. •  Logistic regression runs up to 100x faster than Hadoop M/R in memory. •  We're able to train exact models without doing any approximation.
  7. Apache Spark: Utilizing the In-Memory Cache for M/R Jobs. •  Iterative algorithms scan through the data each time. •  With Spark, data is cached in memory after the first iteration. •  Quasi-Newton methods enhance the in-memory benefits. (Benchmark figure on slide: 921s vs. 97s per iteration.)
  8. Speed Layer: •  Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. •  It receives streaming input and divides the data into batches, which are then processed by the Spark engine. •  As a result, developers can maintain the same Java/Scala code in the batch and speed layers.
  9. MapReduce Review: •  MapReduce: Simplified Data Processing on Large Clusters, 2004. •  Scales linearly. •  Data locality. •  Fault tolerance in data and computation.
  10. Hard Disk Failures from Google's 2007 Study: •  1.7% of disks failed in the first year of their life. •  Three-year-old disks were failing at a rate of 8.6%. •  For a hypothetical eight-disk server, the probability that no disk fails in the first year is 0.983^8, roughly 87%. •  The key contribution of the MapReduce framework is not the actual map and reduce functions, but the scalability and fault tolerance achieved with commodity hardware.
  11. Hadoop MapReduce Review: •  Mapper: loads the data and emits a set of key-value pairs. •  Reducer: collects the key-value pairs with the same key, processes them, and outputs the result. •  Combiner: can reduce shuffle traffic by combining key-value pairs locally before they reach the reducer. •  Good: built-in fault tolerance, scalable, and production-proven in industry. •  Bad: optimized for disk I/O without leveraging memory well; iterative algorithms go through disk I/O again and again; the primitive API is not easy or clean to develop against.
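The mapper/combiner/reducer flow can be sketched in plain Python (single-process stand-ins for the distributed roles, not the Hadoop API), using word count as the canonical example:

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Sum counts locally, per "node", before the shuffle.
    local = defaultdict(int)
    for word, n in pairs:
        local[word] += n
    return list(local.items())

def reducer(shuffled):
    # Collect pairs with the same key and sum their counts.
    totals = defaultdict(int)
    for word, n in shuffled:
        totals[word] += n
    return dict(totals)

# Two "nodes", each holding one input split.
splits = [["to be or not", "to be"], ["not to be"]]
shuffle = []
for split in splits:
    pairs = [p for line in split for p in mapper(line)]
    shuffle.extend(combiner(pairs))  # combiner shrinks shuffle traffic
counts = reducer(shuffle)
```

Without the combiner, every `(word, 1)` pair would cross the shuffle; with it, each "node" sends at most one pair per distinct word.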
  12. Spark MapReduce: •  Spark also uses MapReduce as a programming model, but with much richer APIs in Java, Scala, and Python. •  With Scala's expressive APIs, 5-10x less code. •  Not just a distributed computation framework: Spark provides several pre-built components empowering users to implement applications faster and more easily: Spark Streaming, Spark SQL, MLlib (machine learning), and GraphX (graph processing).
  13. Hadoop M/R vs. Spark M/R. (Code comparison shown on slide.)
  14. Supervised Learning: •  Binary classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes. •  Multiclass classification: decision trees, naïve Bayes (coming soon: multinomial logistic regression in GLMNET). •  Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2). •  Currently, the regularization in linear models penalizes all the weights, including the intercept, which is not desired in some use cases. Alpine has a GLMNET implementation using OWL-QN that can exactly reproduce R's GLMNET results at scale; we're in the process of merging it into the MLlib community.
  15. Unsupervised Learning: •  K-Means •  Collaborative filtering (ALS) •  SVD •  PCA •  Feature extraction and transformation. http://spark.apache.org/docs/1.2.0/mllib-guide.html
  16. Resilient Distributed Datasets (RDDs): •  An RDD is a fault-tolerant collection of elements that can be operated on in parallel. •  RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat. •  RDDs can be cached in memory or on disk.
  17. RDD Persistence/Cache: •  An RDD can be persisted using the persist() or cache() methods. The first time it is computed in an action, it is kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it. •  A persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.
  18. RDD Operations, two types: •  Transformations: create a new dataset from an existing one. They are lazy: they do not compute their results right away. By default, each transformed RDD may be recomputed each time you run an action on it; you may also persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. (Note: after transformations, the dataset can become imbalanced across executors; this can be addressed with repartition.) •  Actions: return a value to the driver program after running a computation on the dataset.
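The lazy-transformation versus eager-action split can be mimicked in plain Python with generator expressions (an analogy, not Spark code): chaining generators only builds a pipeline, and nothing executes until a terminal call pulls the data through.

```python
# Record when each element is actually computed.
log = []

def trace(x):
    log.append(x)
    return x * 2

data = range(5)
doubled = (trace(x) for x in data)           # "transformation": lazy, log stays empty
evens = (x for x in doubled if x % 4 == 0)   # chained transformation, still lazy

assert log == []        # nothing has been computed yet
result = list(evens)    # "action": pulls every element through the pipeline
```

After the `list` call, `log` shows all five source elements were touched, exactly once, and only at action time.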
  19. Transformations: •  map(func): returns a new distributed dataset formed by passing each element of the source through a function func. •  filter(func): returns a new dataset formed by selecting those elements of the source on which func returns true. •  flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). •  mapPartitions(func): similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. http://spark.apache.org/docs/latest/programming-guide.html#transformations
  20. Actions: •  reduce(func): aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. •  collect(): returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. •  count(), first(), take(n), saveAsTextFile(path), etc. http://spark.apache.org/docs/latest/programming-guide.html#actions
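Because reduce's function must be commutative and associative, a distributed reduce can combine per-partition partial results in any order. A plain-Python sketch of that two-stage shape (not Spark code):

```python
from functools import reduce

def add(a, b):
    # Commutative and associative, so it is safe to parallelize.
    return a + b

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]  # four "partitions"

# Stage 1: each partition reduces locally.
partials = [reduce(add, chunk) for chunk in chunks]
# Stage 2: the driver combines the partial results.
total = reduce(add, partials)
```

Had `add` been non-associative (e.g. subtraction), the two-stage result would depend on how the data was partitioned, which is why Spark imposes this requirement.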
  21. Computing the Mean of Data
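A mean is commonly computed in this style by mapping each value to a (sum, count) pair and reducing with an associative combine; a plain-Python sketch of that shape (an assumed reconstruction, not the slide's code):

```python
from functools import reduce

values = [3.0, 5.0, 7.0, 9.0]

# Map every value to a (sum, count) pair...
pairs = [(v, 1) for v in values]

# ...then reduce with an associative combine, as a distributed mean would.
def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])

total, count = reduce(combine, pairs)
mean = total / count
```

Note that reducing the raw means of partitions directly would be wrong for unevenly sized partitions; carrying (sum, count) pairs is what keeps the combine associative.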
  22. (Code example shown on slide.)
  23. (Code example shown on slide.)
  24. Lab 1
  25. Spark Streaming: Discretized Streams: •  A DStream is the basic abstraction provided by Spark Streaming, built on Spark's RDDs. •  Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates internally to operations on the underlying RDDs.
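The discretization idea can be mimicked in a few lines of plain Python (an analogy for micro-batching, not the Spark Streaming API): the "stream" is chopped into fixed-size batches, and each DStream operation is applied batch by batch.

```python
def micro_batches(stream, interval):
    """Split an ordered event stream into fixed-size batches, the way
    Spark Streaming discretizes a live stream into per-interval RDDs."""
    for i in range(0, len(stream), interval):
        yield stream[i:i + interval]

# Any operation on the "DStream" is applied to each underlying batch.
events = ["a b", "b c", "c", "a a", "b"]
batch_word_counts = []
for batch in micro_batches(events, 2):
    words = [w for line in batch for w in line.split()]
    batch_word_counts.append(len(words))
```

In real Spark Streaming the interval is a time window rather than an element count, but the per-batch processing shape is the same.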
  26. Word Count in Batch Processing
  27. Word Count in Streaming Processing
  28. Lab 2
  29. Lab 2: •  You need another bash shell in the Docker container to run Netcat as a data server. •  In production, people often use Kafka as the data server. •  docker ps // to find the current container ID •  docker exec -it <ID> bash // to launch a new shell
  30. Lab 2 (continued)
  31. UpdateStateByKey Operation: the updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. •  Define the state: the state can be of arbitrary data type. •  Define the state-update function: specify, with a function, how to update the state using the previous state and the new values from the input stream.
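A plain-Python sketch of the two pieces, state plus update function, keeping a running count per key across batches (the function mirrors updateStateByKey's "new values plus previous state" shape, but this is not Spark code):

```python
from collections import defaultdict

def update_state(new_values, previous_state):
    """State-update function: fold the batch's new per-key values
    into the prior state (None on first sight of the key)."""
    return (previous_state or 0) + sum(new_values)

def run_batch(state, batch):
    """Apply the update function to every key seen in this batch."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    for key, values in grouped.items():
        state[key] = update_state(values, state.get(key))
    return state

state = {}
state = run_batch(state, [("a", 1), ("b", 1), ("a", 1)])
state = run_batch(state, [("b", 1), ("c", 1)])
```

In Spark the `state` dict is itself a distributed, checkpointed state stream rather than a local dict; only the update function is user code.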
  32. UpdateStateByKey Operation (code shown on slide)
  33. Computing the Mean of Streaming Data: •  The current sum and count at time t have to be accessible at time t + 1 to compute the new mean of the stream. •  Without updateStateByKey, the operations at times t and t + 1 are independent. •  A checkpoint directory has to be configured to persist the state across time steps.
  34. Computing the Mean of Streaming Data (code shown on slide)
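The streaming mean can be sketched by carrying a (sum, count) state from batch to batch, so the state at time t is available at time t + 1 (plain Python, not the slide's Spark code):

```python
def update_mean_state(new_values, state):
    """Carry (sum, count) forward so the mean at time t+1
    can build on the totals accumulated up to time t."""
    total, count = state or (0.0, 0)
    return (total + sum(new_values), count + len(new_values))

state = None  # no checkpoint yet; in Spark this would be restored from the checkpoint dir
for batch in [[2.0, 4.0], [6.0], [8.0, 10.0]]:
    state = update_mean_state(batch, state)
    running_mean = state[0] / state[1]
```

Storing (sum, count) rather than the mean itself is what makes each batch's update exact; averaging per-batch means would weight small batches too heavily.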
  35. (Code example shown on slide.)
  36. (Code example shown on slide.)
  37. Lab 3
  38. Online Learning Example
  39. Thank you.
