SlideShare a Scribd company logo
1 of 43
Download to read offline
MINING BIG DATA STREAMS
WITH APACHE SAMOA
Albert Bifet @abifet
#J_OnTheBeach
Malaga, 20 May 2016
MOTIVATION
REALTIME ANALYTICS
REALTIME ANALYTICS
eal time analytics
REALTIME ANALYTICS
real time analytics
APACHE SA(MOA)VISION
• Data Stream mining platform
• Library of state-of-the-art algorithms

for practitioners
• Development and collaboration framework

for researchers
• Algorithms & Systems
IMPORTANCE
• Example: spam detection in
comments onYahoo News
• Trends change in time
• Need to retrain model with
new data
Importance$of$O
•  As$spam$trends$change
retrain$the$model$with
INTERNET OF THINGS
• EMC Digital Universe, 2014
digital universe
Figure 3: EMC Digital Universe, 2014
7
BIG DATA STREAM
• Volume +Velocity (+Variety)
• Too large for single commodity
server main memory
• Too fast for single commodity
server CPU
• A solution should be:
• Distributed
• Scalable
BIG DATA
PROCESSING ENGINES
• Low latency
• High Latency (Not real time)
apache storm
Storm characteristics for real-time data processing workloads
1 Fast
2 Scalable
3 Fault-tolerant
4 Reliable
5 Easy to operate
apache samza from linkedin
Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,
2 a distributed execution environment,
3 an API for stream processing,
4 fault tolerance,
5 Kafka integration
real time computation: streaming computation
MapReduce Limitations
Example
How compute in real time (latency less than 1 second):
1 predictions
2 frequent items as Twitter hashtags
3 sentiment analysis
14
apache spark streaming
MACHINE LEARNING
• Classification
• Regression
• Clustering
• Frequent Pattern Mining
WHAT IS MOA?
MOA
• {M}assive {O}nline {A}nalysis is a framework for online learning
from data streams.
• It is closely related to WEKA
• It includes a collection of offline and online as well as tools for
evaluation:
• classification, regression
• clustering, frequent pattern mining
• Easy to extend, design and run experiments
{M}assive {O}nline {A}
MOA (Bifet et al. 20
{M}assive {O}nline {A}nalysis is a framework
learning from data streams.
It is closely related to WEKA
STREAM SETTING
• Process an example at a time,and
inspect it only once (at most)
• Use a limited amount of memory
• Work in a limited amount of
time
• Be ready to predict at any point
STREAM EVALUATION
• Holdout Evaluation
• InterleavedTest-Then-Train or
Prequential
STREAM EVALUATION
Holdout an independent test set
• Apply the current decision model
to the test set, at regular time
intervals
• The loss estimated in the holdout
is an unbiased estimator
STREAM EVALUATION
Prequential Evaluation
• The error of a model is computed
from the sequence of examples.
• For each example in the stream, the
actual model makes a prediction based
only on the example attribute-values.
CLUSTERING
COMMAND LINE
• java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar
moa.DoTask "EvaluatePeriodicHeldOutTest -l
DecisionStump -s generators.WaveformGenerator -n
100000 -i 100000000 -f 1000000" > dsresult.csv
• This command creates a comma separated values file:
• training the DecisionStump classifier on the WaveformGenerator data,
• using the first 100 thousand examples for testing,
• training on a total of 100 million examples,
• and testing every one million examples
WHAT IS APACHE SAMOA?
STREAMING MODEL
• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time (concept drift)
• Approximation algorithms

(small error with high probability)
• Single pass, one data item at a time
• Sub-linear space and time per data item
TAXONOMY
Data
Mining
Distributed
Batch
Hadoop
Mahout
Stream
Storm, S4,
Samza
SAMOA
Non
Distributed
Batch
R,
WEKA,
…
Stream
MOA
ARCHITECTURE
An adapter for integrating Apache Flink into Apache SAMOA was implemente
n scope of this master thesis, with the main parts of its implementation bein
addressed in this section. With the use of our adapter, ML algorithms can b
executed on top of Apache Flink. The implemented adapter will be used for th
evaluation of the ML pipelines and HT algorithm variations.
Figure 20: Apache SAMOA’s high level architecture.
STATUSSTATUS
• Parallel algorithms
• Classification (Vertical HoeffdingTree)
• Clustering (CluStream)
• Regression (Adaptive Model Rules)
• Execution engines
IS SAMOA USEFUL FORYOU?
• Only if you need to deal with:
• Large fast data
• Evolving process (model updates)
• What is happening now?
• Use feedback in real-time
• Adapt to changes faster
ML DEVELOPER API
Processing Item
Processor
Stream
ML DEVELOPER API
TopologyBuilder builder;
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);
Processor join = new JoinProcessor());
builder.addProcessor(join)
.connectInputShuffle(streamOne)
.connectInputKey(streamTwo);
VERTICAL HOEFFDINGTREE
(VHT)
DECISIONTREE
• Nodes are tests on attributes
• Branches are possible
outcomes
• Leafs are class assignments


 Class
Instance
Attributes
Road
Tested?
Mileage?
Age?
NoYes
High
✅
❌
Low
OldRecent
✅ ❌
Car deal?
HOEFFDINGTREE
• Sample of stream enough for near optimal decision
• Estimate merit of alternatives from prefix of stream
• Choose sample size based on statistical principles
• When to expand a leaf?
• Let x1 be the most informative attribute,

x2 the second most informative one
• Hoeffding bound: split if G(x1, x2) > ✏ =
r
R2 ln(1/ )
2n
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
PARALLEL DECISIONTREES
• Which kind of parallelism?
• Task
• Data
• Horizontal
• Vertical
Data
Attributes
Instances
HORIZONTAL PARALLELISM
Y. Ben-Haim and E.Tom-Tov,“A Streaming Parallel DecisionTree Algorithm,” JMLR, vol. 11, pp.
849–872, 2010
Stats
Stats
Stats
Stream
Histograms
Model
Instances
Model Updates
Aggregation to
compute splits
Single attribute
tracked in
multiple node
32
HOEFFDINGTREE
PROFILING
Other
6 %
Split
24 %
Learn
70 %
CPU time for training

100 nominal and 100
numeric attributes
VERTICAL PARALLELISM
Single attribute tracked
in single node
Stats
Stats
Stats
Stream
Model
Attributes
Splits
ADVANTAGES OFVERTICAL
• High number of attributes => high level of parallelism

(e.g., documents)
• Vs task parallelism
• Parallelism observed immediately
• Vs horizontal parallelism
• Reduced memory usage (no model replication)
• Parallelized split computation
VERTICAL HOEFFDINGTREE
Control
Split
Result
Source (n) Model (n) Stats (n) Evaluator (1)
InstanceStream
Shuffle Grouping
Key Grouping
All Grouping
ACCURACY
No. Leaf Nodes VHT2 –
tree-100
30
Very close and
very high accuracy
PERFORMANCE
35
0
50
100
150
200
250
MHT VHT2-par-3
ExecutionTime(seconds)
Classifier
Profiling Results for text-10000
with 100000 instances
t_calc
t_comm
t_serial
Throughput
VHT2-par-3: 2631 inst/sec
MHT : 507 inst/sec
SUMMARY
• Streaming is an importantV of Big Data
• Mining big data streams is an open field
• MOA: Massive Online Analytics
• Available and open-source http://moa.cms.waikato.ac.nz/
• SAMOA:A Platform for Mining Big Data Streams
• Available and open-source (incubating @ASF)

http://samoa.incubator.apache.org
OPEN CHALLENGES
• Distributed stream mining algorithms
• Active & semi-supervised learning + crowdsourcing
• Millions of classes (e.g.,Wikipedia pages)
• Multi-target learning
• System issues (load balancing, communication)
• Programming paradigms and abstractions
SAMOATEAM
Albert

Bifet
Matthieu

Morel
Gianmarco

De Francisci Morales
Arinto

Murdopo
Nicolas

Kourtellis
Olivier

Van Laere
SUPPORTING
ORGANISATIONS
THANKS!
https://samoa.incubator.apache.org
@ApacheSAMOA

More Related Content

Viewers also liked

Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersAlbert Bifet
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)Nicolas Kourtellis
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
 
次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --
次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --
次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --Shu Tanaka
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Robert Evans
 
Twitterのリアルタイム分散処理システム「Storm」入門
Twitterのリアルタイム分散処理システム「Storm」入門Twitterのリアルタイム分散処理システム「Storm」入門
Twitterのリアルタイム分散処理システム「Storm」入門AdvancedTechNight
 
FIT2012招待講演「異常検知技術のビジネス応用最前線」
FIT2012招待講演「異常検知技術のビジネス応用最前線」FIT2012招待講演「異常検知技術のビジネス応用最前線」
FIT2012招待講演「異常検知技術のビジネス応用最前線」Shohei Hido
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 

Viewers also liked (10)

Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache Flink
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --
次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --
次世代量子情報技術 量子アニーリングが拓く新時代 -- 情報処理と物理学のハーモニー --
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
ストリームデータ分散処理基盤Storm
ストリームデータ分散処理基盤Stormストリームデータ分散処理基盤Storm
ストリームデータ分散処理基盤Storm
 
Twitterのリアルタイム分散処理システム「Storm」入門
Twitterのリアルタイム分散処理システム「Storm」入門Twitterのリアルタイム分散処理システム「Storm」入門
Twitterのリアルタイム分散処理システム「Storm」入門
 
FIT2012招待講演「異常検知技術のビジネス応用最前線」
FIT2012招待講演「異常検知技術のビジネス応用最前線」FIT2012招待講演「異常検知技術のビジネス応用最前線」
FIT2012招待講演「異常検知技術のビジネス応用最前線」
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 

Similar to Mining big data streams with APACHE SAMOA by Albert Bifet

Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2OScalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2OSri Ambati
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesDavid Martínez Rego
 
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsDatabricks
 
Challenges on Distributed Machine Learning
Challenges on Distributed Machine LearningChallenges on Distributed Machine Learning
Challenges on Distributed Machine Learningjie cao
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesCrossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesIsuru Suriarachchi
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit
 
Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Ilya Ganelin
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...confluent
 
Incremental and Streaming Model Transformations
Incremental and Streaming Model TransformationsIncremental and Streaming Model Transformations
Incremental and Streaming Model TransformationsIstván Dávid
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha Talagala
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexApache Apex
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Databricks
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 

Similar to Mining big data streams with APACHE SAMOA by Albert Bifet (20)

Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2OScalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2O
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data Analytics
 
Challenges on Distributed Machine Learning
Challenges on Distributed Machine LearningChallenges on Distributed Machine Learning
Challenges on Distributed Machine Learning
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesCrossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
 
Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
 
Incremental and Streaming Model Transformations
Incremental and Streaming Model TransformationsIncremental and Streaming Model Transformations
Incremental and Streaming Model Transformations
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache Apex
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 

More from J On The Beach

Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayJ On The Beach
 
Big Data On Data You Don’t Have
Big Data On Data You Don’t HaveBig Data On Data You Don’t Have
Big Data On Data You Don’t HaveJ On The Beach
 
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...J On The Beach
 
Pushing it to the edge in IoT
Pushing it to the edge in IoTPushing it to the edge in IoT
Pushing it to the edge in IoTJ On The Beach
 
Drinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsDrinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsJ On The Beach
 
How do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternHow do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternJ On The Beach
 
When Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorWhen Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorJ On The Beach
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.J On The Beach
 
Streaming to a New Jakarta EE
Streaming to a New Jakarta EEStreaming to a New Jakarta EE
Streaming to a New Jakarta EEJ On The Beach
 
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...J On The Beach
 
Pushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorPushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorJ On The Beach
 
Axon Server went RAFTing
Axon Server went RAFTingAxon Server went RAFTing
Axon Server went RAFTingJ On The Beach
 
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...J On The Beach
 
Madaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysMadaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysJ On The Beach
 
Servers are doomed to fail
Servers are doomed to failServers are doomed to fail
Servers are doomed to failJ On The Beach
 
Interaction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersInteraction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersJ On The Beach
 
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...J On The Beach
 
Leadership at every level
Leadership at every levelLeadership at every level
Leadership at every levelJ On The Beach
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesJ On The Beach
 

More from J On The Beach (20)

Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard way
 
Big Data On Data You Don’t Have
Big Data On Data You Don’t HaveBig Data On Data You Don’t Have
Big Data On Data You Don’t Have
 
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
 
Pushing it to the edge in IoT
Pushing it to the edge in IoTPushing it to the edge in IoT
Pushing it to the edge in IoT
 
Drinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsDrinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actors
 
How do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternHow do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server pattern
 
Java, Turbocharged
Java, TurbochargedJava, Turbocharged
Java, Turbocharged
 
When Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorWhen Cloud Native meets the Financial Sector
When Cloud Native meets the Financial Sector
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.
 
Streaming to a New Jakarta EE
Streaming to a New Jakarta EEStreaming to a New Jakarta EE
Streaming to a New Jakarta EE
 
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
 
Pushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorPushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and Blazor
 
Axon Server went RAFTing
Axon Server went RAFTingAxon Server went RAFTing
Axon Server went RAFTing
 
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
 
Madaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysMadaari : Ordering For The Monkeys
Madaari : Ordering For The Monkeys
 
Servers are doomed to fail
Servers are doomed to failServers are doomed to fail
Servers are doomed to fail
 
Interaction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersInteraction Protocols: It's all about good manners
Interaction Protocols: It's all about good manners
 
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
 
Leadership at every level
Leadership at every levelLeadership at every level
Leadership at every level
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind Libraries
 

Recently uploaded

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 

Recently uploaded (20)

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 

Mining big data streams with APACHE SAMOA by Albert Bifet

  • 1. MINING BIG DATA STREAMS WITH APACHE SAMOA Albert Bifet @abifet #J_OnTheBeach Malaga, 20 May 2016
  • 6. APACHE SA(MOA)VISION • Data Stream mining platform • Library of state-of-the-art algorithms
 for practitioners • Development and collaboration framework
 for researchers • Algorithms & Systems
  • 7. IMPORTANCE • Example: spam detection in comments onYahoo News • Trends change in time • Need to retrain model with new data Importance$of$O •  As$spam$trends$change retrain$the$model$with
  • 8. INTERNET OF THINGS • EMC Digital Universe, 2014 digital universe Figure 3: EMC Digital Universe, 2014 7
  • 9. BIG DATA STREAM • Volume +Velocity (+Variety) • Too large for single commodity server main memory • Too fast for single commodity server CPU • A solution should be: • Distributed • Scalable
  • 10. BIG DATA PROCESSING ENGINES • Low latency • High Latency (Not real time) apache storm Storm characteristics for real-time data processing workloads 1 Fast 2 Scalable 3 Fault-tolerant 4 Reliable 5 Easy to operate apache samza from linkedin Storm and Samza are fairly similar. Both systems provide: 1 a partitioned stream model, 2 a distributed execution environment, 3 an API for stream processing, 4 fault tolerance, 5 Kafka integration real time computation: streaming computation MapReduce Limitations Example How compute in real time (latency less than 1 second): 1 predictions 2 frequent items as Twitter hashtags 3 sentiment analysis 14 apache spark streaming
  • 11. MACHINE LEARNING • Classification • Regression • Clustering • Frequent Pattern Mining
  • 13. MOA • {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. • It is closely related to WEKA • It includes a collection of offline and online as well as tools for evaluation: • classification, regression • clustering, frequent pattern mining • Easy to extend, design and run experiments {M}assive {O}nline {A} MOA (Bifet et al. 20 {M}assive {O}nline {A}nalysis is a framework learning from data streams. It is closely related to WEKA
  • 14. STREAM SETTING • Process an example at a time,and inspect it only once (at most) • Use a limited amount of memory • Work in a limited amount of time • Be ready to predict at any point
  • 15. STREAM EVALUATION • Holdout Evaluation • InterleavedTest-Then-Train or Prequential
  • 16. STREAM EVALUATION Holdout an independent test set • Apply the current decision model to the test set, at regular time intervals • The loss estimated in the holdout is an unbiased estimator
  • 17. STREAM EVALUATION Prequential Evaluation • The error of a model is computed from the sequence of examples. • For each example in the stream, the actual model makes a prediction based only on the example attribute-values.
  • 19. COMMAND LINE • java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -l DecisionStump -s generators.WaveformGenerator -n 100000 -i 100000000 -f 1000000" > dsresult.csv • This command creates a comma separated values file: • training the DecisionStump classifier on the WaveformGenerator data, • using the first 100 thousand examples for testing, • training on a total of 100 million examples, • and testing every one million examples
  • 20. WHAT IS APACHE SAMOA?
  • 21. STREAMING MODEL • Sequence is potentially infinite • High amount of data, high speed of arrival • Change over time (concept drift) • Approximation algorithms
 (small error with high probability) • Single pass, one data item at a time • Sub-linear space and time per data item
  • 23. ARCHITECTURE An adapter for integrating Apache Flink into Apache SAMOA was implemente n scope of this master thesis, with the main parts of its implementation bein addressed in this section. With the use of our adapter, ML algorithms can b executed on top of Apache Flink. The implemented adapter will be used for th evaluation of the ML pipelines and HT algorithm variations. Figure 20: Apache SAMOA’s high level architecture.
  • 24. STATUSSTATUS • Parallel algorithms • Classification (Vertical HoeffdingTree) • Clustering (CluStream) • Regression (Adaptive Model Rules) • Execution engines
  • 25. IS SAMOA USEFUL FORYOU? • Only if you need to deal with: • Large fast data • Evolving process (model updates) • What is happening now? • Use feedback in real-time • Adapt to changes faster
  • 26. ML DEVELOPER API Processing Item Processor Stream
  • 27. ML DEVELOPER API TopologyBuilder builder; Processor sourceOne = new SourceProcessor(); builder.addProcessor(sourceOne); Stream streamOne = builder.createStream(sourceOne); Processor sourceTwo = new SourceProcessor(); builder.addProcessor(sourceTwo); Stream streamTwo = builder.createStream(sourceTwo); Processor join = new JoinProcessor()); builder.addProcessor(join) .connectInputShuffle(streamOne) .connectInputKey(streamTwo);
  • 29. DECISIONTREE • Nodes are tests on attributes • Branches are possible outcomes • Leafs are class assignments
 
 Class Instance Attributes Road Tested? Mileage? Age? NoYes High ✅ ❌ Low OldRecent ✅ ❌ Car deal?
  • 30. HOEFFDINGTREE • Sample of stream enough for near optimal decision • Estimate merit of alternatives from prefix of stream • Choose sample size based on statistical principles • When to expand a leaf? • Let x1 be the most informative attribute,
 x2 the second most informative one • Hoeffding bound: split if G(x1, x2) > ✏ = r R2 ln(1/ ) 2n P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
  • 31. PARALLEL DECISIONTREES • Which kind of parallelism? • Task • Data • Horizontal • Vertical Data Attributes Instances
  • 32. HORIZONTAL PARALLELISM Y. Ben-Haim and E.Tom-Tov,“A Streaming Parallel DecisionTree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010 Stats Stats Stats Stream Histograms Model Instances Model Updates Aggregation to compute splits Single attribute tracked in multiple node 32
  • 33. HOEFFDINGTREE PROFILING Other 6 % Split 24 % Learn 70 % CPU time for training
 100 nominal and 100 numeric attributes
  • 34. VERTICAL PARALLELISM Single attribute tracked in single node Stats Stats Stats Stream Model Attributes Splits
  • 35. ADVANTAGES OFVERTICAL • High number of attributes => high level of parallelism
 (e.g., documents) • Vs task parallelism • Parallelism observed immediately • Vs horizontal parallelism • Reduced memory usage (no model replication) • Parallelized split computation
  • 36. VERTICAL HOEFFDINGTREE Control Split Result Source (n) Model (n) Stats (n) Evaluator (1) InstanceStream Shuffle Grouping Key Grouping All Grouping
  • 37. ACCURACY No. Leaf Nodes VHT2 – tree-100 30 Very close and very high accuracy
  • 38. PERFORMANCE 35 0 50 100 150 200 250 MHT VHT2-par-3 ExecutionTime(seconds) Classifier Profiling Results for text-10000 with 100000 instances t_calc t_comm t_serial Throughput VHT2-par-3: 2631 inst/sec MHT : 507 inst/sec
  • 39. SUMMARY • Streaming is an importantV of Big Data • Mining big data streams is an open field • MOA: Massive Online Analytics • Available and open-source http://moa.cms.waikato.ac.nz/ • SAMOA:A Platform for Mining Big Data Streams • Available and open-source (incubating @ASF)
 http://samoa.incubator.apache.org
  • 40. OPEN CHALLENGES • Distributed stream mining algorithms • Active & semi-supervised learning + crowdsourcing • Millions of classes (e.g.,Wikipedia pages) • Multi-target learning • System issues (load balancing, communication) • Programming paradigms and abstractions