Jubatus is an open source software framework for distributed online machine learning on big data. It performs real-time deep analysis by running online machine learning algorithms in a distributed manner: each node updates its model locally, and the models are periodically mixed together. This allows fast, scalable, and memory-efficient learning on large streaming datasets without storing data or sharing it across nodes.
Jubatus Invited Talk at XLDB Asia
1. Distributed Online Machine Learning Framework for Big Data
Shohei Hido
Preferred Infrastructure, Inc., Japan
XLDB Asia, June 22nd, 2012
2. Preferred Infrastructure (PFI): bringing cutting-edge research advances to products
- Founded: March 2006, located in Tokyo, Japan
- Employees: 28
  - Top university graduates, including ICPC world finalists
  - Mid-career engineers from Sony, IBM, Yahoo!, Sun
- Fields: information retrieval, distributed computing, natural language processing, machine learning
4. Overview: Big Data analytics will go real-time and deeper
1. Bigger data
2. More in real-time
3. Deep analysis
- No storage
- No data sharing
- Mix only models
5. Jubatus: OSS platform for Big Data analytics
- Joint development with an NTT laboratory in Japan
- Project started April 2011
- Released as open source software; 0.3.0 just released
- You can download it from http://github.com/jubatus/
- We welcome your contribution and collaboration
6. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
7. Increasing demand in Big Data applications: real-time deeper analysis
- Current focus: aggregation and rule processing on bigger data
  - CEP (Complex Event Processing) for real-time processing
  - Hadoop/MapReduce for distributed computation
- Future: deeper analysis for rapid decisions and actions
  - Ex. 1: Defect detection on the NY power grid [Rudin+, TPAMI 2012]
  - Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]
[Figure: systems positioned by data size vs. depth of analysis; CEP and Hadoop cover shallow analysis at small and large data sizes, leaving open what will come for deep analysis on big data]
References:
http://web.mit.edu/rudin/www/TPAMIPreprint.pdf
http://www.computerworlduk.com/news/networking/3302464/
8. Key technology: machine learning
- Examples need rapid decisions under uncertainty
  - Anomaly detection from M2M sensor data
  - Energy demand forecast / smart grid optimization
  - Security monitoring on raw Internet traffic
- What is missing for fast & deep analytics on Big Data?
  - An online/real-time machine learning platform
  - + a scale-out distributed machine learning platform
1. Bigger data
2. More in real-time
3. Deep analysis
9. Online machine learning in Jubatus
- Batch learning
  - Scans all data before building a model
  - Data must be stored in memory or storage
- Online learning
  - The model is updated by each data sample
  - Sometimes with a theoretical guarantee that the online model converges to the batch model
10. Jubatus focuses on the latest online algorithms
- Advantage: fast and not memory-intensive
  - Low latency & high throughput
  - No need to store large datasets
- E.g., linear classification algorithms (note the very recent progress):
  - Perceptron (1958)
  - Passive-Aggressive (PA) (2003)
  - Confidence-Weighted Learning (CW) (2008)
  - AROW (2009)
  - Normal HERD (NHERD) (2010)
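To make the online setting concrete, below is a minimal sketch of one of the listed algorithms, a Passive-Aggressive (PA) update for binary classification. It is a toy NumPy illustration, not Jubatus code; the feature vectors and stream are made up.

```python
import numpy as np

def pa_update(w, x, y):
    """One Passive-Aggressive (PA) step for a binary classifier.
    w: weight vector, x: feature vector, y: label in {-1, +1}."""
    loss = max(0.0, 1.0 - y * w.dot(x))   # hinge loss on this one sample
    if loss > 0.0:
        tau = loss / x.dot(x)             # PA step size (closed form)
        w = w + tau * y * x               # smallest update that fixes x
    return w

# Online loop: each sample updates the model once and is then discarded,
# so nothing needs to be stored -- the property the slide emphasizes.
w = np.zeros(3)
stream = [(np.array([1.0, 0.0, 1.0]), +1),
          (np.array([0.0, 1.0, 0.0]), -1)]
for x, y in stream:
    w = pa_update(w, x, y)
```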
11. Online learning or distributed learning: no unified solution has been available
- Jubatus combines them into a unified computation framework
[Figure: tools arranged by scale (small-scale/stand-alone vs. large-scale/distributed-parallel computing) and learning style (batch vs. real-time/online); SPSS (1988-) and WEKA (1993-) are small-scale batch tools, Mahout (2006-) is large-scale batch, online ML algorithms such as PA [2003] and CW [2008] are small-scale online, and Jubatus (2011-) occupies the large-scale online quadrant]
12. What Jubatus currently supports
- Classification (multi-class)
  - Perceptron / PA / CW / AROW
- Regression
  - PA-based regression
- Nearest neighbor
  - LSH / MinHash / Euclid LSH
- Recommendation
  - Based on nearest neighbor
- Anomaly detection*
  - LOF based on nearest neighbor
- Graph analysis*
  - Shortest path / centrality (PageRank)
- Some simple statistics
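As a flavor of the nearest-neighbor family above, here is a toy MinHash sketch (not the Jubatus implementation): two sets get short signatures, and the fraction of agreeing minima estimates their Jaccard similarity.

```python
def minhash_signature(items, num_hashes=128):
    # For each hash seed, keep the minimum hash value over the set.
    return [min(hash((seed, it)) for it in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of positions where the minima agree estimates
    # the Jaccard similarity of the original sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"item1", "item2", "item3"})
b = minhash_signature({"item2", "item3", "item4"})
print(estimated_jaccard(a, b))  # approximately 2/4 = 0.5
```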
13. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
14. Hadoop and Mahout: not good for online learning
- Hadoop
  - Advantages
    - Many extensions for a variety of applications
    - Good for distributed data storage and aggregation
  - Disadvantage
    - No direct support for machine learning or online processing
- Mahout
  - Advantage
    - Popular machine learning algorithms are implemented
  - Disadvantages
    - Some implementations are less mature
    - Still not capable of online machine learning
15. Jubatus vs. Hadoop, RDB-based systems, and Storm: advantage in online AND distributed ML
- Only Jubatus satisfies both requirements at the same time

                                   Jubatus          Hadoop        RDB            Storm
Storing Big Data                   ✓ (external DB)  ✓✓ (HDFS)     ✓              ✓ (external DB)
Batch learning                     ✓                ✓✓ (Mahout)   ✓ (SPSS, etc.) ✕
Stream processing                  ✓                ✕             ✕              ✓✓
Distributed learning               ✓✓               ✓ (Mahout)    ✕              ✕
Online learning (high importance)  ✓✓               ✕             ✕              ✕
16. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
17. How to make online algorithms distributed? => Not trivial!
[Figure: timelines of batch vs. online learning; batch learning alternates long learning phases with rare model updates, making the update easy to parallelize, while online learning interleaves learning and model updates so frequently that the update is hard to parallelize]
- Online learning requires frequent model updates
- A naive distributed architecture leads to too many synchronization operations
- This causes performance problems in terms of network communication and accuracy
18. Solution: loose model sharing
- Jubatus shares only the local models, in a loose manner
  - Model size << data size
  - Jubatus DOES NOT share datasets
  - A unique approach compared to existing frameworks
- Local models can differ across the servers
  - The different models are gradually merged
[Figure: three servers, each holding its own local model, converge toward a shared mixed model]
19. Three fundamental operations in Jubatus: UPDATE, ANALYZE, and MIX
1. UPDATE
   - Receive a sample, learn from it, and update the local model
2. ANALYZE
   - Receive a sample, apply the local model, and return the result
3. MIX (called automatically in the background)
   - Exchange and merge the local models between servers
   - Cf. the Map-Shuffle-Reduce operations in Hadoop
- Algorithms can be implemented independently of
  - distribution logic
  - data sharing
  - failover
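The sketch below illustrates this separation with a hypothetical toy server: update and analyze touch only local state, while get_diff/put_diff are the only hooks a MIX round needs. It mirrors the description above, not the actual Jubatus internals.

```python
class ToyLinearServer:
    """Toy model of one Jubatus-style server: UPDATE and ANALYZE run
    locally; get_diff/put_diff are the two halves of a MIX round."""

    def __init__(self, dim):
        self.w = [0.0] * dim          # current local model
        self.base = [0.0] * dim       # model as of the last MIX

    def update(self, x, y):
        # UPDATE: learn from one sample (a simple perceptron step).
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        if y * score <= 0:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]

    def analyze(self, x):
        # ANALYZE: apply the current model; no learning, no communication.
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return +1 if score >= 0 else -1

    def get_diff(self):
        # MIX, step 1: what this server has learned since the last MIX.
        return [wi - bi for wi, bi in zip(self.w, self.base)]

    def put_diff(self, merged_diff):
        # MIX, step 2: adopt base + merged diff; purely local changes
        # made since the last MIX are discarded.
        self.w = [bi + di for bi, di in zip(self.base, merged_diff)]
        self.base = list(self.w)
```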
20. UPDATE
- Each server starts from an initial model
- Each data sample is sent to one (or two) servers, distributed randomly or consistently
- The local model is updated based on the sample
- Data samples are NEVER shared
[Figure: incoming samples are distributed randomly or consistently across two servers, each updating its own local model starting from the initial model]
21. MIX
- Each server sends its model diff
- The model diffs are merged and distributed
- Only model diffs are transmitted
[Figure: on each server, model diff = local model − initial model; the diffs from all servers are merged, and each server forms its mixed model as initial model + merged diff]
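Continuing the toy sketch from slide 19's notes, a MIX round over several such servers might look as follows. Element-wise averaging is an assumption here; the actual merge rule in Jubatus depends on the algorithm.

```python
def mix(servers):
    # Step 1: collect each server's model diff (only diffs travel).
    diffs = [s.get_diff() for s in servers]
    # Step 2: merge, here by element-wise averaging across servers.
    merged = [sum(col) / len(servers) for col in zip(*diffs)]
    # Step 3: redistribute, so every server restarts from the same
    # mixed model (initial model + merged diff).
    for s in servers:
        s.put_diff(merged)

# Example: two servers learn on disjoint samples, then one MIX round
# leaves both holding an identical mixed model.
s1, s2 = ToyLinearServer(2), ToyLinearServer(2)
s1.update([1.0, 0.0], +1)
s2.update([0.0, 1.0], -1)
mix([s1, s2])
assert s1.w == s2.w
```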
22. UPDATE (iteration)
- After a MIX, the locally updated models are discarded
- Each server resumes updating from the mixed model
- The mixed model improves gradually thanks to all of the servers
[Figure: as in slide 20, samples are distributed randomly or consistently, but each server's local model now starts from the mixed model]
23. ANALYZE
- For prediction, each sample goes to a randomly chosen server
- The server applies its current mixed model to the sample
- The prediction is returned to the client
[Figure: queries are distributed randomly across servers, each holding the mixed model and returning a prediction]
24. Why can Jubatus work in real-time?
- Focus on online machine learning
  - Make online machine learning algorithms distributed
- Update locally
  - Online training without communication with other servers
- Mix only models globally
  - Small communication cost, low latency, good performance
  - An advantage over the costly Shuffle in MapReduce
- Analyze locally
  - Each server has the mixed model
  - Low latency for making predictions
- Everything in memory
  - Process data on the fly
25. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
26. Demo: Twitter analysis using natural language processing and machine learning
Jubatus classifies each tweet from the Twitter data stream into pre-defined categories. A single Jubatus server is enough to classify over 5,000 queries per second, which is close to the rate of the raw Twitter stream. We provide a browser-based GUI.
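For readers who want to try something similar, the snippet below sketches how a tweet classifier could be driven from the Jubatus Python client. This is an assumption-laden sketch: the client API shown postdates the 0.3.0 release described in this talk, and the host, port, service name, and fields are invented for illustration.

```python
# Hypothetical usage of the later Jubatus Python client; names and
# signatures are assumptions, not the 0.3.0 API from this talk.
from jubatus.classifier.client import Classifier
from jubatus.classifier.types import LabeledDatum
from jubatus.common import Datum

client = Classifier("127.0.0.1", 9199, "tweet_demo", 10)  # host, port, name, timeout
# UPDATE: feed labeled tweets to the server.
client.train([LabeledDatum("sports", Datum({"text": "what a great game tonight"}))])
# ANALYZE: classify a new tweet; each result is a list of label/score estimates.
results = client.classify([Datum({"text": "who won the match?"})])
for estimates in results:
    best = max(estimates, key=lambda e: e.score)
    print(best.label, best.score)
```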
27. Experiment: estimation of power consumption
Jubatus learns the relationship between the power usage and the network data flow pattern of certain servers. The power consumption of individual servers can then be estimated in real-time by monitoring and analyzing packets, without having to install power measurement modules on all servers.
[Figure: packets are captured via a network TAP in a data center/office where most servers have no power meter; a scatter plot of predicted vs. actual value (W) shows the estimation quality; consumption differs for different types of packets]
28. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
29. Summary
- Jubatus is the first OSS platform for online distributed machine learning on Big Data streams
- Download it from http://github.com/jubatus/
- We welcome your contribution and collaboration
1. Bigger data
2. More in real-time
3. Deep analysis
- No storage
- No data sharing
- Mix only models