Learn more about Advanced Analytics at http://www.alpinenow.com
Lambda Architecture with Apache Spark
DB Tsai
dbtsai@alpinenow.com
Machine Learning Engineering Lead @ Alpine Data Labs
Next.ML Conference
Jan 17, 2015
Lambda Architecture
•  Batch layer: manages the complete dataset, an immutable,
append-only set of raw data, using a distributed processing
system.
•  Speed layer: processes data in a streaming fashion with low
latency; its real-time views are built from the most recent
data.
•  Serving layer: stores the results of the batch layer and the
speed layer, and responds to ad-hoc queries with low latency.
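The interplay of the three layers can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk: the page-view events, the "page" field, and the function names are hypothetical, and in-memory counters stand in for real distributed stores.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: recompute the view from the full, append-only dataset."""
    return Counter(event["page"] for event in master_dataset)


def update_realtime_view(realtime_view, event):
    """Speed layer: fold one freshly arrived event into the real-time view."""
    realtime_view[event["page"]] += 1
    return realtime_view


def query(batch_view, realtime_view, page):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view[page] + realtime_view[page]


# Hypothetical events; the batch view covers the older data,
# the real-time view covers events that arrived after the last batch run.
master_dataset = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
recent_events = [{"page": "/home"}, {"page": "/pricing"}]

batch_view = build_batch_view(master_dataset)
realtime_view = Counter()
for event in recent_events:
    update_realtime_view(realtime_view, event)

home_views = query(batch_view, realtime_view, "/home")
print(home_views)
```

A query only ever reads the two precomputed views, which is what keeps the serving layer fast.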
Lambda Architecture
https://www.mapr.com/developercentral/lambda-architecture
Traditional Lambda Architecture
•  Traditionally, different technologies are used in the batch
layer and the speed layer.
•  If your batch system is implemented with Apache Pig and
your speed layer with Apache Storm, you have to write and
maintain the same logic twice: in Pig Latin and in
Java/Scala.
•  This quickly becomes a maintenance nightmare.
Unified Development Framework
Batch Layer
•  Empower users to iterate
through the data by utilizing
the in-memory cache.
•  Logistic regression runs up
to 100x faster than Hadoop
M/R in memory.
•  We’re able to train exact
models without doing any
approximation.
Apache Spark Utilizing in-memory Cache for M/R job
•  Iterative algorithms scan through the data each time.
•  With Spark, data is cached in memory after the first iteration.
•  Quasi-Newton methods enhance the in-memory benefits.
[Figure: running-time comparison on 150m rows; the two values shown are 921s and 97s.]
Speed Layer
•  Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant processing of live data streams.
•  Spark Streaming receives streaming input and divides the data
into batches, which are then processed by the Spark engine.
•  As a result, developers can maintain the same Java/Scala code
in the batch and speed layers.
MapReduce Review
•  MapReduce – Simplified Data Processing on Large
Clusters, 2004.
•  Scales Linearly
•  Data Locality
•  Fault Tolerance in Data and Computation
Hard Disk Failures from Google’s 2007 Study
•  1.7% of disks failed in the first
year of their life.
•  Three-year-old disks were
failing at a rate of 8.6%.
•  For a hypothetical eight-disk server, assuming independent failures
at the 1.7% first-year rate, the probability that no disk fails in
the first year is about 87% (0.983^8).
•  The key contributions of the MapReduce framework are not
the actual map and reduce functions, but the scalability and
fault-tolerance achieved with commodity hardware.
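The eight-disk estimate can be checked with a short calculation. The sketch below assumes that disks fail independently and that each disk fails with the quoted annual rate; both are simplifying assumptions, and the function name is illustrative. Under them, the chance that none of eight disks fails in the first year comes out at roughly 0.87.

```python
def prob_no_failure(annual_failure_rate, num_disks):
    """Probability that none of num_disks fails within the year,
    assuming independent failures at the same annual rate."""
    return (1.0 - annual_failure_rate) ** num_disks


# Rates quoted on the slide: 1.7% in the first year, 8.6% for three-year-old disks.
first_year = prob_no_failure(0.017, 8)
third_year = prob_no_failure(0.086, 8)

print(f"no failure among 8 new disks:        {first_year:.3f}")
print(f"no failure among 8 three-year disks: {third_year:.3f}")
```

Even with low per-disk rates, the probability of at least one failure grows quickly with cluster size, which is why fault tolerance has to be built into the framework.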
Hadoop MapReduce Review
•  Mapper: Loads the data and emits a set of key-value pairs
•  Reducer: Collects the key-value pairs that share the same key,
processes them, and outputs the result.
•  Combiner: Can reduce shuffle traffic by combining key-value
pairs locally before going to reducer.
•  Good: Built in fault tolerance, scalable, and production proven
in industry.
•  Bad: Optimized for disk IO and does not leverage memory well;
iterative algorithms go through disk IO again and again;
the primitive API makes it hard to write clean code.
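The mapper, combiner, and reducer roles described above can be illustrated with a word count in plain Python. This is a single-process sketch, not Hadoop code: the function names and the two hypothetical input splits are assumptions, and the shuffle is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word, 1)


def combiner(pairs):
    """Pre-aggregate pairs locally, before anything crosses the network."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Collapse all values that share a key into one result."""
    return (key, sum(values))


# Two hypothetical input splits, each handled by its own mapper.
splits = [["a b a", "b a"], ["b c"]]

raw_pair_count = 0
combined = []
for split in splits:
    mapped = [pair for line in split for pair in mapper(line)]
    raw_pair_count += len(mapped)
    combined.extend(combiner(mapped))

word_counts = dict(reducer(k, v) for k, v in shuffle(combined).items())
print(word_counts)
print(raw_pair_count, "pairs emitted,", len(combined), "pairs shuffled")
```

The combiner does not change the result; it only reduces how many pairs have to be shuffled.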
Spark MapReduce
•  Spark also uses MapReduce as a programming model, but
with much richer APIs in Java, Scala, and Python.
•  With Scala’s expressive APIs, 5-10x less code.
•  Not just a distributed computation framework: Spark provides
several pre-built components that help users implement
applications faster and more easily.
- Spark Streaming
- Spark SQL
- MLlib (Machine Learning)
- GraphX (Graph Processing)
Hadoop M/R vs Spark M/R
•  Hadoop
•  Spark
Supervised Learning
•  Binary Classification: linear SVMs (SGD), logistic regression
(L-BFGS and SGD), decision trees, random forests (Spark 1.2), and
naïve Bayes.
•  Multiclass Classification: decision trees, naïve Bayes (coming
soon: multinomial logistic regression in GLMNET).
•  Regression: linear least squares (SGD), Lasso (SGD +
soft-threshold), ridge regression (SGD), decision trees, and random
forests (Spark 1.2).
•  Currently, the regularization in linear models penalizes all the
weights, including the intercept, which is undesirable in some
use cases. Alpine has a GLMNET implementation using OWLQN that
can exactly reproduce the results of R’s GLMNET package at scale.
We’re in the process of contributing it to MLlib.
Unsupervised Learning
•  K-Means
•  Collaborative filtering (ALS)
•  SVD
•  PCA
•  Feature extraction and transformation
http://spark.apache.org/docs/1.2.0/mllib-guide.html
Resilient Distributed Datasets (RDDs)
•  RDD is a fault-tolerant collection of elements that can be
operated on in parallel.
•  RDDs can be created by parallelizing an existing
collection in your driver program, or referencing a dataset
in an external storage system, such as a shared
filesystem, HDFS, HBase, Hive, or any data source
offering a Hadoop InputFormat.
•  RDDs can be cached in memory or on disk
RDD Persistence/Cache
•  RDD can be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it
will be kept in memory on the nodes. Spark’s cache is
fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations
that originally created it.
•  Persisted RDD can be stored using a different storage
level, allowing you, for example, to persist the dataset on
disk, persist it in memory but as serialized Java objects
(to save space), replicate it across nodes, or store it off-
heap in Tachyon.
RDD Operations - two types of operations
•  Transformations: Creates a new dataset from an existing
one. They are lazy, in that they do not compute their
results right away. By default, each transformed RDD may
be recomputed each time you run an action on it. You
may also persist an RDD in memory using the persist (or
cache) method, in which case Spark will keep the
elements around on the cluster for much faster access
the next time you query it. (PS, after transformations, the
dataset can be imbalanced in each executor, and this can
be addressed by repartition.)
•  Actions: Returns a value to the driver program after
running a computation on the dataset.
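The difference between lazy transformations and actions, and the effect of caching, can be modeled with a toy class in plain Python. This is not Spark's implementation: LazyDataset and its compute_count counter are hypothetical names used for illustration, and the cache here applies only to the dataset it is called on.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records its lineage, computes only on an action."""

    def __init__(self, source, transforms=()):
        self._source = source
        self._transforms = tuple(transforms)
        self._use_cache = False
        self._cached = None
        self.compute_count = 0  # how many times the lineage was evaluated

    def map(self, func):
        # Transformation: returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._transforms + (("map", func),))

    def filter(self, func):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._transforms + (("filter", func),))

    def cache(self):
        # Mark the dataset so the first computed result is kept in memory.
        self._use_cache = True
        return self

    def _compute(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_count += 1
        data = list(self._source)
        for kind, func in self._transforms:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        if self._use_cache:
            self._cached = data
        return data

    def collect(self):
        # Action: triggers the computation and returns all elements.
        return list(self._compute())

    def count(self):
        # Action: triggers the computation and returns the number of elements.
        return len(self._compute())


uncached = LazyDataset(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
computed_before_action = uncached.compute_count
uncached.collect()
uncached.count()

cached = LazyDataset(range(5)).map(lambda x: x * x).cache()
cached.collect()
cached.count()

print(computed_before_action, uncached.compute_count, cached.compute_count)
```

Without caching, every action re-runs the whole lineage; with caching, only the first action pays for the computation.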
Transformations
•  map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•  filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•  flatMap(func) - Similar to map, but each input item can be
mapped to 0 or more output items (so func should return a Seq
rather than a single item).
•  mapPartitions(func) - Similar to map, but runs separately on
each partition (block) of the RDD, so func must be of type
Iterator<T> => Iterator<U> when running on an RDD of type T.
http://spark.apache.org/docs/latest/programming-guide.html#transformations
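The semantics of these four transformations can be mimicked on a partitioned list in plain Python. The helpers below are illustrative stand-ins, not Spark APIs: a dataset is modeled as a list of partitions, and the trailing underscores only avoid clashing with Python built-ins.

```python
from itertools import chain

# A dataset modeled as two partitions (an assumption for illustration).
partitions = [[1, 2, 3], [4, 5]]


def map_(parts, func):
    """Apply func to every element; one output per input."""
    return [[func(x) for x in part] for part in parts]


def filter_(parts, func):
    """Keep only the elements for which func returns true."""
    return [[x for x in part if func(x)] for part in parts]


def flat_map(parts, func):
    """func returns a sequence per element; the sequences are flattened."""
    return [list(chain.from_iterable(func(x) for x in part)) for part in parts]


def map_partitions(parts, func):
    """func receives an iterator over a whole partition and returns an iterable."""
    return [list(func(iter(part))) for part in parts]


doubled = map_(partitions, lambda x: 2 * x)
odds = filter_(partitions, lambda x: x % 2 == 1)
# Even numbers expand to two items, odd numbers to none.
expanded = flat_map(partitions, lambda x: [x, -x] if x % 2 == 0 else [])
# One sum per partition, computed with a single pass over each iterator.
partition_sums = map_partitions(partitions, lambda it: [sum(it)])

print(doubled, odds, expanded, partition_sums)
```

The per-partition variant is useful when setup work, such as opening a connection, should happen once per partition rather than once per element.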
Actions
•  reduce(func) - Aggregate the elements of the dataset
using a function func (which takes two arguments and
returns one). The function should be commutative and
associative so that it can be computed correctly in
parallel.
•  collect() - Return all the elements of the dataset as an
array at the driver program. This is usually useful after a
filter or other operation that returns a sufficiently small
subset of the data.
•  count(), first(), take(n), saveAsTextFile(path), etc.
http://spark.apache.org/docs/latest/programming-guide.html#actions
Computing the mean of data
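The code for this slide did not survive in the transcript. One common way to compute a mean with map and reduce is to map each value to a (sum, count) pair and reduce the pairs with an associative, commutative function. The plain-Python sketch below follows that idea; it is an assumption about the approach, not the original slide code, and the partitioned input is hypothetical.

```python
from functools import reduce


def add_pairs(a, b):
    """Associative and commutative merge of two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def mean(partitions):
    """Reduce each partition locally, then merge the partial results."""
    partials = [
        reduce(add_pairs, ((x, 1) for x in part))
        for part in partitions
        if part  # skip empty partitions
    ]
    total, count = reduce(add_pairs, partials)
    return total / count


partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
result = mean(partitions)
# Because add_pairs is associative and commutative, partition order does not matter.
result_reordered = mean(list(reversed(partitions)))
print(result, result_reordered)
```

Averaging the per-partition means directly would give a wrong answer when partitions differ in size, which is why the count travels along with the sum.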
Lab 1)
Spark Streaming: Discretized Streams
•  DStream is the basic abstraction provided by Spark
Streaming over Spark’s RDDs.
•  Each RDD in a DStream contains data from a certain
interval. Any operation applied on a DStream translates
to operations on the underlying RDDs internally.
Word Count in Batch Processing
Word Count in Streaming Processing
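The word-count code for the batch and streaming slides is missing from the transcript. The plain-Python sketch below illustrates the underlying point: the same word_count function runs once over a whole dataset and once per micro-batch. The discretize helper is hypothetical and cuts batches by record count for simplicity, whereas a DStream cuts them by time interval.

```python
from collections import Counter


def word_count(lines):
    """Shared logic: count words in any collection of lines."""
    return Counter(word for line in lines for word in line.split())


def discretize(stream, batch_size):
    """Cut a stream into micro-batches of batch_size records each."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


lines = ["spark streaming", "spark batch", "lambda architecture", "spark"]

# Batch processing: one pass over the complete dataset.
batch_result = word_count(lines)

# Streaming: the same function applied to each micro-batch as it arrives.
stream_results = [word_count(batch) for batch in discretize(iter(lines), 2)]

print(batch_result)
print(stream_results)
```

Summing the per-batch counters reproduces the batch result, which is the consistency a Lambda Architecture relies on.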
Lab 2)
Lab 2)
•  Open another bash shell in the Docker container to run Netcat as a
data server.
•  In production, people often use Kafka as the data server.
•  docker ps // to find the ID of the running container
•  docker exec -it <container ID> bash // to launch a new shell
Lab 2)
UpdateStateByKey Operation
The updateStateByKey operation allows you to maintain
arbitrary state while continuously updating it with new
information.
•  Define the state - The state can be of arbitrary data type.
•  Define the state update function - Specify with a function
how to update the state using the previous state and the
new values from input stream.
UpdateStateByKey Operation
Computing the Mean of Streaming Data
•  The current sum and count at time t have to be accessible
at time (t + 1) to compute the new mean of the stream.
•  Without updateStateByKey, the operations at time t
and (t + 1) are independent.
•  A checkpoint directory has to be configured to
persist the state across time steps.
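The idea of carrying a (sum, count) state from one batch to the next can be sketched in plain Python. The update function below roughly mirrors the shape of an updateStateByKey update function (new values for a key plus the previous state, returning the new state), but everything else is a hypothetical stand-in: the state lives in a dictionary rather than in checkpointed storage, and only keys that receive new values are touched.

```python
def update_state(new_values, state):
    """Combine the new values for a key with the previous (sum, count) state."""
    total, count = state if state is not None else (0.0, 0)
    return (total + sum(new_values), count + len(new_values))


def run_stream(batches):
    """Process micro-batches in order, keeping per-key state between them."""
    state = {}
    means_per_batch = []
    for batch in batches:
        # Group the batch's values by key.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Update the state of every key that received new values.
        for key, values in grouped.items():
            state[key] = update_state(values, state.get(key))
        # The running mean is derived from the accumulated state.
        means_per_batch.append({k: s / c for k, (s, c) in state.items()})
    return means_per_batch


# Hypothetical stream of (key, value) records, already cut into three batches.
batches = [
    [("temp", 10.0), ("temp", 20.0)],
    [("temp", 30.0)],
    [("temp", 40.0), ("temp", 50.0)],
]
running_means = run_stream(batches)
print(running_means)
```

Keeping the sum and count, rather than the mean itself, is what makes the update exact regardless of how many values each batch contains.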
Computing the Mean of Streaming Data
Lab 3)
Online Learning Example
Thank you.

More Related Content

What's hot

Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Timothy Spann
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
Sachin Aggarwal
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Lucidworks
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
Serg Masyutin
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Simon Ambridge
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
Knoldus Inc.
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
Piotr Czarnas
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Mammoth Data
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Spark Summit
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Container Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleContainer Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycle
Michael Mueller
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van Hovell
Spark Summit
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
Anirvan Chakraborty
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson
 

What's hot (20)

Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Container Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleContainer Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycle
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van Hovell
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 

Viewers also liked

∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
Pradeeban Kathiravelu, Ph.D.
 
powerpoint feb
powerpoint febpowerpoint feb
powerpoint feb
imu409
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning Classifiers
Patrick Nicolas
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
Pradeeban Kathiravelu, Ph.D.
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
Yahoo Developer Network
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
Armando Vieira
 
Intrusion detection using data mining
Intrusion detection using data miningIntrusion detection using data mining
Intrusion detection using data mining
balbeerrawat
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
Sujee Maniyam
 
Ids presentation
Ids presentationIds presentation
Ids presentation
Solmaz Salehian
 
Analysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningAnalysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data Mining
Pritesh Ranjan
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
TUMRA | Big Data Science - Gain a competitive advantage through Big Data & Data Science
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance Benchmarks
Hortonworks
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection Systems
Omar Shaya
 
Enterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen EngagementEnterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen Engagement
SAP Asia Pacific
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Sean Zhong
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
Pradeeban Kathiravelu, Ph.D.
 
Big Data Solutions Executive Overview
Big Data Solutions Executive OverviewBig Data Solutions Executive Overview
Big Data Solutions Executive Overview
RCG Global Services
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection
amiable_indian
 

Viewers also liked (20)

∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
 
powerpoint feb
powerpoint febpowerpoint feb
powerpoint feb
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning Classifiers
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 
Intrusion detection using data mining
Intrusion detection using data miningIntrusion detection using data mining
Intrusion detection using data mining
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Ids presentation
Ids presentationIds presentation
Ids presentation
 
Analysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningAnalysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data Mining
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance Benchmarks
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection Systems
 
Enterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen EngagementEnterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen Engagement
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
 
Big Data Solutions Executive Overview
Big Data Solutions Executive OverviewBig Data Solutions Executive Overview
Big Data Solutions Executive Overview
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection
 

Similar to 2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
2014-08-14 Alpine Innovation to Spark
2014-08-14 Alpine Innovation to Spark2014-08-14 Alpine Innovation to Spark
2014-08-14 Alpine Innovation to Spark
DB Tsai
 
Spark Overview and Performance Issues
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

  • 1. Lambda Architecture with Apache Spark. DB Tsai (dbtsai@alpinenow.com), Machine Learning Engineering Lead @ Alpine Data Labs. Next.ML Conference, Jan 17, 2015. Learn more about Advanced Analytics at http://www.alpinenow.com
  • 2. Lambda Architecture
      • Batch layer: manages the entire dataset, an immutable, append-only set of raw data, using a distributed processing system.
      • Speed layer: processes data in a streaming fashion with low latency; the real-time views are built from the most recent data.
      • Serving layer: stores the results from the batch layer and the speed layer, and responds to queries in a low-latency, ad hoc way.
  • 3. Lambda Architecture. https://www.mapr.com/developercentral/lambda-architecture
  • 4. Traditional Lambda Architecture
      • Traditionally, different technologies are used in the batch layer and the speed layer.
      • If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in SQL and in Java/Scala.
      • This very quickly becomes a maintenance nightmare.
  • 5. Unified Development Framework
  • 6. Batch Layer
      • Empower users to iterate through the data by utilizing the in-memory cache.
      • Logistic regression runs up to 100x faster than Hadoop M/R in memory.
      • We’re able to train exact models without doing any approximation.
  • 7. Apache Spark: Utilizing the In-Memory Cache for M/R Jobs
      • Iterative algorithms scan through the data each time.
      • With Spark, data is cached in memory after the first iteration.
      • Quasi-Newton methods enhance the in-memory benefits.
      • (Chart values: 921 s and 97 s; 150m rows.)
  • 8. Speed Layer
      • Spark Streaming: an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
      • Spark Streaming receives streaming input and divides the data into batches, which are then processed by the Spark engine.
      • As a result, developers can maintain the same Java/Scala code in the batch and speed layers.
  • 9. MapReduce Review
      • MapReduce: Simplified Data Processing on Large Clusters, 2004.
      • Scales linearly
      • Data locality
      • Fault tolerance in data and computation
  • 10. Hard Disk Failures from Google’s 2007 Study
      • 1.7% of disks failed in the first year of their life.
      • Three-year-old disks were failing at a rate of 8.6%.
      • For a hypothetical eight-disk server, the probability that none of the disks fails in the first year is 81%.
      • The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault tolerance achieved with commodity hardware.
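The eight-disk figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes disk failures are independent and identically distributed, which the slide does not state; `prob_no_failures` is a hypothetical helper name. Under that assumption the 1.7% first-year rate gives roughly 87%, while a figure near 81% corresponds to a per-disk rate of about 2.6%, so the slide's number may rest on a different rate or assumption.

```python
def prob_no_failures(annual_failure_rate: float, n_disks: int) -> float:
    """Probability that none of n_disks fails within a year.

    Assumes failures are independent, so the per-disk survival
    probabilities simply multiply.
    """
    return (1.0 - annual_failure_rate) ** n_disks


# Rates quoted on the slide: 1.7% (first year) and 8.6% (three-year-old disks).
p_first_year = prob_no_failures(0.017, 8)
p_third_year = prob_no_failures(0.086, 8)

# Per-disk rate that would produce an 81% chance of no failure among 8 disks.
implied_rate = 1.0 - 0.81 ** (1.0 / 8)

print(f"first year: {p_first_year:.3f}")
print(f"third year: {p_third_year:.3f}")
print(f"rate implied by 81%: {implied_rate:.3f}")
```

Either way, the point of the slide holds: even with a small per-disk failure rate, a multi-disk server is quite likely to see a failure, which is why fault tolerance has to be built into the framework.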
  • 11. Hadoop MapReduce Review
      • Mapper: loads the data and emits a set of key-value pairs.
      • Reducer: collects the key-value pairs that share a key, processes them, and outputs the result.
      • Combiner: can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.
      • Good: built-in fault tolerance, scalable, and production-proven in industry.
      • Bad: optimized for disk I/O without leveraging memory well; iterative algorithms go through disk I/O again and again; the primitive API is neither easy nor clean to develop against.
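The three roles above can be illustrated with a single-process word count. This is a toy sketch in plain Python, not Hadoop's API; the function names (`mapper`, `combiner`, `shuffle`, `reducer`, `run_job`) and the input splits are made up for illustration.

```python
from collections import Counter, defaultdict


def mapper(line):
    # Emit one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]


def combiner(pairs):
    # Pre-aggregate the pairs produced from one input split, so fewer
    # pairs have to cross the (simulated) network.
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    return key, sum(values)


def run_job(splits, use_combiner=True):
    shuffled = []
    for split in splits:
        mapped = [pair for line in split for pair in mapper(line)]
        shuffled.extend(combiner(mapped) if use_combiner else mapped)
    result = dict(reducer(k, v) for k, v in shuffle(shuffled).items())
    return result, len(shuffled)


splits = [["a b a", "b a"], ["c a"]]
counts, pairs_with_combiner = run_job(splits, use_combiner=True)
_, pairs_without_combiner = run_job(splits, use_combiner=False)
print(counts, pairs_with_combiner, pairs_without_combiner)
```

The result is identical with or without the combiner; only the number of pairs that reach the shuffle changes.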
  • 12. Spark MapReduce
      • Spark also uses MapReduce as a programming model, but with much richer APIs in Java, Scala, and Python.
      • With Scala's expressive APIs, 5-10x less code.
      • Not just a distributed computation framework: Spark provides several pre-built components that help users implement applications faster and more easily.
          - Spark Streaming
          - Spark SQL
          - MLlib (Machine Learning)
          - GraphX (Graph Processing)
  • 13. Hadoop M/R vs Spark M/R
      • Hadoop
      • Spark
  • 14. Supervised Learning
      • Binary classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.
      • Multiclass classification: decision trees, naïve Bayes (coming soon: multinomial logistic regression in GLMNET).
      • Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2).
      • Currently, the regularization in linear models penalizes all the weights, including the intercept, which is not desired in some use cases. Alpine has a GLMNET implementation using OWLQN that exactly reproduces the results of R's GLMNET package, with scalability. We're in the process of merging it into MLlib.
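The intercept remark can be made concrete with the two objectives side by side. This is a generic sketch for an L2-penalized linear model, not the exact form used in MLlib or GLMNET; the symbols (loss \ell, weights w, intercept b, strength \lambda) and the factor 1/2 are assumptions chosen for illustration.

```latex
% Intercept penalized together with the weights:
\min_{w,\,b}\; \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\; w^{\top}x_i + b\right)
  + \frac{\lambda}{2}\left(\lVert w\rVert_2^2 + b^2\right)

% Intercept left unpenalized:
\min_{w,\,b}\; \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\; w^{\top}x_i + b\right)
  + \frac{\lambda}{2}\lVert w\rVert_2^2
```

In the first form, increasing \lambda shrinks b toward zero along with w, which biases predictions when the labels are not centered; the second form lets b absorb the overall offset regardless of \lambda.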
  • 15. Unsupervised Learning
      • K-Means
      • Collaborative filtering (ALS)
      • SVD
      • PCA
      • Feature extraction and transformation
      http://spark.apache.org/docs/1.2.0/mllib-guide.html
  • 16. Resilient Distributed Datasets (RDDs)
      • An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
      • RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.
      • RDDs can be cached in memory or on disk.
  • 17. RDD Persistence/Cache
      • An RDD can be persisted by calling the persist() or cache() method on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
      • A persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.
  • 18. RDD Operations: two types of operations
      • Transformations: create a new dataset from an existing one. They are lazy, in that they do not compute their results right away. By default, each transformed RDD may be recomputed each time you run an action on it. You may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. (Note: after transformations, the dataset can become imbalanced across executors; this can be addressed with repartition.)
      • Actions: return a value to the driver program after running a computation on the dataset.
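The interplay of lazy transformations, recomputation, and caching described above can be modeled in a few lines. The class below is a toy stand-in written in plain Python, not Spark's RDD implementation; `LazyDataset` and `load` are hypothetical names, and caching here only applies to the dataset it is called on.

```python
class LazyDataset:
    """Toy model of an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that (re)loads the raw data
        self._steps = tuple(steps)   # recorded transformations (the lineage)
        self._persist = False
        self._cached = None

    # Transformations: return a new dataset, compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def cache(self):
        self._persist = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        data = list(self._source())
        for kind, func in self._steps:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        if self._persist:
            self._cached = data      # kept after the first action
        return data

    # Actions: trigger the computation and return a value.
    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


reads = {"n": 0}

def load():
    reads["n"] += 1                  # count how often the source is scanned
    return range(10)

squares_of_evens = LazyDataset(load).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
reads_after_definition = reads["n"]

squares_of_evens.count()
squares_of_evens.collect()
reads_without_cache = reads["n"]

cached = LazyDataset(load).filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
cached.count()
result = cached.collect()
reads_with_cache = reads["n"] - reads_without_cache
```

Defining the pipeline scans nothing; without `cache()` every action rescans the source, while with it the source is scanned once and later actions reuse the stored result.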
  • 19. Transformations
      • map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
      • filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
      • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
      • mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
      http://spark.apache.org/docs/latest/programming-guide.html#transformations
  • 20. Actions
      • reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
      • collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
      • count(), first(), take(n), saveAsTextFile(path), etc.
      http://spark.apache.org/docs/latest/programming-guide.html#actions
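The semantics of these operations can be mimicked on a list of in-memory partitions. The helpers below are plain-Python analogues with made-up names (`map_`, `filter_`, `flat_map`, `map_partitions`, `reduce_`, `collect`), not Spark calls; the last lines show why reduce needs an associative, commutative function once the data is split across partitions.

```python
from functools import reduce
from itertools import chain
from operator import add, sub

# A dataset split into three partitions, as it might be across executors.
partitions = [[1, 2, 3], [4, 5], [6]]


def map_(parts, func):
    return [[func(x) for x in part] for part in parts]


def filter_(parts, func):
    return [[x for x in part if func(x)] for part in parts]


def flat_map(parts, func):
    # func returns a sequence per element; the sequences are flattened.
    return [list(chain.from_iterable(func(x) for x in part)) for part in parts]


def map_partitions(parts, func):
    # func receives an iterator over a whole partition and returns an iterable.
    return [list(func(iter(part))) for part in parts]


def reduce_(parts, func):
    # Reduce inside each partition first, then combine the partial results.
    partials = [reduce(func, part) for part in parts if part]
    return reduce(func, partials)


def collect(parts):
    return list(chain.from_iterable(parts))


squares = collect(map_(partitions, lambda x: x * x))
evens = collect(filter_(partitions, lambda x: x % 2 == 0))
expanded = collect(flat_map(partitions, lambda x: [x, x * 10]))
partition_sums = collect(map_partitions(partitions, lambda it: [sum(it)]))

total = reduce_(partitions, add)                # same as a sequential sum
parallel_sub = reduce_(partitions, sub)         # depends on the partitioning
sequential_sub = reduce(sub, collect(partitions))
```

Addition gives the same answer however the data is partitioned; subtraction does not, which is the failure mode the slide's "commutative and associative" requirement guards against.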
  • 21. Computing the mean of data
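The code for this slide did not survive in the transcript. One common way to compute a mean in the map/reduce style is to map each value to a (sum, count) pair and reduce the pairs; the sketch below shows that idea in plain Python over in-memory partitions and is not the author's original code. The function name `mean` and the sample data are made up.

```python
from functools import reduce


def mean(partitions):
    """Mean via map to (value, 1) pairs followed by a pairwise reduce."""
    def add_pairs(a, b):
        # Associative and commutative, so partitions can be reduced in any order.
        return (a[0] + b[0], a[1] + b[1])

    # Map step: each value becomes a (sum, count) pair.
    mapped = [[(x, 1) for x in part] for part in partitions]
    # Reduce inside each partition, then across partitions.
    partials = [reduce(add_pairs, part) for part in mapped if part]
    total, count = reduce(add_pairs, partials)
    return total / count


data = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
overall_mean = mean(data)

# Averaging the per-partition means would weight partitions equally
# and give a different (wrong) answer when partitions differ in size.
mean_of_means = sum(sum(p) / len(p) for p in data) / len(data)
```

Carrying the count alongside the sum is what makes the reduction safe to run in parallel.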
  • 24. Lab 1)
  • 25. Spark Streaming: Discretized Streams
      • DStream is the basic abstraction provided by Spark Streaming over Spark’s RDDs.
      • Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs internally.
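The idea of a discretized stream can be sketched without Spark: slice timestamped records into fixed intervals, then apply the same batch function to each slice. Everything below is a plain-Python illustration with made-up names (`discretize`, `word_count`) and data; for brevity, intervals with no records are skipped rather than emitted as empty batches.

```python
from collections import Counter, defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, record) events into consecutive interval batches."""
    batches = defaultdict(list)
    for timestamp, record in events:
        batches[int(timestamp // batch_interval)].append(record)
    return [batches[i] for i in sorted(batches)]


def word_count(batch):
    # The same function could be used unchanged on a full batch dataset.
    return Counter(word for line in batch for word in line.split())


# (seconds since start, line of text)
events = [(0.2, "a b"), (0.9, "a"), (1.1, "b b"), (2.5, "c a")]

micro_batches = discretize(events, batch_interval=1.0)
per_batch_counts = [word_count(batch) for batch in micro_batches]
```

Because `word_count` only ever sees an ordinary collection, the batch and streaming paths share one piece of logic, which is the maintenance argument made earlier in the deck.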
  • 26. Word Count in Batch Processing
  • 27. Word Count in Streaming Processing
  • 28. Lab 2)
  • 29. Lab 2)
      • You need another bash shell in the Docker container to run Netcat as a data server.
      • In production, people often use Kafka as the data server.
      • docker ps // to find the ID of the running container
      • docker exec -it <container ID> bash // to launch a new shell
  • 30. Lab 2)
  • 31. UpdateStateByKey Operation. The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information.
      • Define the state: the state can be of an arbitrary data type.
      • Define the state update function: specify with a function how to update the state using the previous state and the new values from the input stream.
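The two steps above, a state and an update function, can be sketched as a running count per key. This is a single-process illustration in plain Python, not Spark Streaming itself; `update_state_by_key` is a hypothetical driver loop, and the argument order of `update_count` (new values first, previous state second, `None` when the key is new) is an assumption modeled on the PySpark form of the operation.

```python
def update_count(new_values, previous_state):
    """State update function: running count of occurrences for one key."""
    return (previous_state or 0) + sum(new_values)


def update_state_by_key(state, batch_pairs, update_func):
    """Apply update_func to every key seen so far, using this batch's values."""
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)

    new_state = {}
    for key in set(state) | set(grouped):
        # Keys with no new data in this batch receive an empty list.
        new_state[key] = update_func(grouped.get(key, []), state.get(key))
    return new_state


batches = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1)],
    [("c", 1), ("a", 1)],
]

state = {}
history = []
for batch in batches:
    state = update_state_by_key(state, batch, update_count)
    history.append(dict(state))
```

The state survives from one batch to the next, which is exactly what a plain per-batch word count cannot provide.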
  • 32. UpdateStateByKey Operation
  • 33. Computing the Mean of Streaming Data
      • The current sum and count at time t have to be accessible at time (t + 1) to compute the new mean of the stream.
      • Without updateStateByKey, the operations at time t and (t + 1) are independent.
      • A checkpoint directory has to be configured to persist the state across time.
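Carrying the sum and count from time t to time (t + 1) can be sketched as a state update over successive batches. The code below is a plain-Python illustration of that bookkeeping, not the slide's original Spark code; `update_mean_state` and the sample batches are made up, and checkpointing is not modeled.

```python
def update_mean_state(new_values, state):
    """Fold one batch of values into the running (sum, count) state."""
    total, count = state if state is not None else (0.0, 0)
    return (total + sum(new_values), count + len(new_values))


batches = [[1.0, 2.0, 3.0], [], [10.0]]   # the second interval has no data

state = None
running_means = []
for batch in batches:
    state = update_mean_state(batch, state)
    total, count = state
    # Guard against division by zero before any data has arrived.
    running_means.append(total / count if count else None)
```

An empty batch leaves the mean unchanged because the state, not the batch, holds the history.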
  • 34. Computing the Mean of Streaming Data
  • 37. Lab 3)
  • 38. Online Learning Example
  • 39. Thank you.