This document provides an overview of the Shogun Machine Learning Toolbox, a toolkit that supports a broad range of machine learning algorithms, including support vector machines and kernel methods. Shogun has been used in projects spanning areas such as gene prediction, splice site prediction, and sensor fusion. The document demonstrates support vector classification and covers Shogun's architecture, history, multitask learning capabilities, and Python integration.
1) The document presents a new associative memory model called SOINN-AM that is designed for online incremental learning in noisy environments.
2) SOINN-AM uses a self-organizing approach where nodes are generated and eliminated autonomously as data is learned, avoiding issues of determining node numbers beforehand like other models.
3) Experiments show SOINN-AM outperforms other associative memory models on incremental learning of new data, many-to-many association, and robustness to noise.
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules like the Perceiver that aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities through frozen weights, showing they can function as universal computation engines.
Detect helmet impacts in NFL games using videos and player tracking data. A two-stage pipeline involves helmet detection followed by classification of detections as impacts or non-impacts. Post-processing includes temporal non-maximum suppression using tracking results to reduce false positives. Multiple models are ensembled and thresholds tuned on a validation set for best performance.
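The temporal non-maximum suppression step described above can be sketched in a few lines. This is a minimal illustration with an invented detection format (track id, frame, score); the actual pipeline's data structures and window size are not given in the summary.

```python
def temporal_nms(detections, window=4):
    """Keep, per player track, local score maxima separated by more than `window` frames."""
    kept = []
    # Group detections by player track (from the tracking data).
    by_track = {}
    for det in detections:
        by_track.setdefault(det[0], []).append(det)
    for track_dets in by_track.values():
        # Greedily keep the best-scoring detection, then drop its temporal neighbours.
        remaining = sorted(track_dets, key=lambda d: d[2], reverse=True)
        while remaining:
            best = remaining.pop(0)
            kept.append(best)
            remaining = [d for d in remaining if abs(d[1] - best[1]) > window]
    return sorted(kept, key=lambda d: (d[0], d[1]))

dets = [(7, 100, 0.9), (7, 102, 0.6), (7, 120, 0.8), (3, 50, 0.7)]
print(temporal_nms(dets, window=4))  # the 0.6 detection at frame 102 is suppressed
```

The suppression removes near-duplicate impact detections on the same player within a short frame window, which is how the post-processing reduces false positives.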
Public-Key Identification Schemes Based on Multivariate Polynomials - Cassius Puodzius
The document outlines a 3-pass identification scheme based on multivariate quadratic polynomials (MQ). It begins with preliminaries on identification schemes and the MQ problem. The MQ-based scheme is then described, using a string commitment function for Peggy to commit to her secret key and for Victor to verify Peggy knows the secret key. The scheme relies on the bilinear property of G(x,y)=F(x+y)-F(x)-F(y) to split the secret key into shares using a cut-and-choose technique.
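The bilinear property of G can be illustrated with a toy example: for a homogeneous quadratic map F over a prime field, the polar form G(x, y) = F(x + y) - F(x) - F(y) is linear in each argument, which is what lets the secret key be split into random shares in the cut-and-choose step. The field size and the particular F below are arbitrary demo choices, not parameters from the scheme.

```python
p = 31  # small prime modulus (demo value)

def F(x):
    """One homogeneous quadratic polynomial in two variables over GF(p)."""
    return (x[0] * x[1] + x[0] * x[0]) % p

def G(x, y):
    """Polar form G(x, y) = F(x + y) - F(x) - F(y)."""
    s = [(a + b) % p for a, b in zip(x, y)]
    return (F(s) - F(x) - F(y)) % p

def add(x, y):
    return [(a + b) % p for a, b in zip(x, y)]

x, x2, y = [3, 7], [5, 11], [2, 9]
# Additivity in the first argument: G(x + x', y) = G(x, y) + G(x', y) mod p.
assert G(add(x, x2), y) == (G(x, y) + G(x2, y)) % p
print("G is additive in its first argument")
```

Because G is bilinear, G(r0 + r1, y) splits into G(r0, y) + G(r1, y), so knowledge of the secret can be checked share by share without revealing it.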
This document outlines a 30-hour Machine Learning with Python course divided into 5 modules. Module I covers basic Python, Pandas, Matplotlib, and linear regression. Module II focuses on logistic regression, probability, Bayes' theorem, and parallel computing. Module III includes clustering, decision trees, and random forests. Module IV presents unstructured text analysis and support vector machines. Module V deals with neural networks, convolutional neural networks for image classification, and recurrent neural networks with LSTM to build a chatbot.
This document presents a method for detecting and localizing video duplicates in large video repositories. It proposes modeling the duplicate likelihood using a Gaussian process that accounts for degradation between original and duplicate frames. It approximates the likelihood function using multi-indexed locality search to prune unlikely sequence matches. Simulation results on a 116 hour repository show the approach achieves high accuracy while scaling efficiently to large datasets. Future work aims to further improve efficiency to handle repositories with tens of thousands of hours of video.
Updated version here:
https://www.slideshare.net/xavigiro/hate-speech-in-pixels-detection-of-offensive-memes-towards-automatic-moderation-205809641
This work addresses the challenge of hate speech detection in Internet memes and, unlike any previous work to our knowledge, attempts to use visual information to detect hate speech automatically. Memes are pixel-based multimedia documents that combine photos or illustrations with phrases which, taken together, usually adopt a funny meaning. However, hate memes are also used to spread hate through social networks, so their automatic detection would help reduce their harmful societal impact. In our experiments, we built a dataset of 5,020 memes to train and evaluate a multi-layer perceptron over the visual and language representations, both independently and fused. Our results indicate that the model can learn to detect some of the memes, but that the task is far from solved with this simple architecture. While previous work focuses on linguistic hate speech, our experiments indicate that in memes the visual modality can be much more informative for hate speech detection than the linguistic one.
This document provides an overview of neural networks and backpropagation algorithms. It discusses how neural networks are inspired by biological brains and how they can be used to perform complex classification tasks. The key topics covered include perceptrons, Adaline networks, multi-layer perceptrons, backpropagation for training multi-layer networks, and an example of how backpropagation works to minimize error in a simple two-layer network.
Comparison of Semantic Similarity Measures for NDVC Detection Using Semantic ... - Wesley De Neve
This document compares semantic similarity measures for detecting near-duplicate video clips (NDVCs) using semantic features. It finds that semantic NDVC detection is most effective when similarity is measured using tag statistics from Flickr, rather than WordNet-based measures that are limited to concepts in the English WordNet. Experiments show lower NDVR (better detection) using tag co-occurrence statistics compared to semantic similarity measures based on WordNet concepts and hierarchies.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Deep learning for molecules, introduction to Chainer Chemistry - Kenta Oono
1) The document introduces machine learning and deep learning techniques for predicting chemical properties, including rule-based approaches versus learning-based approaches using neural message passing algorithms.
2) It discusses several graph neural network models like NFP, GGNN, WeaveNet and SchNet that can be applied to molecular graphs to predict characteristics. These models update atom representations through message passing and graph convolution operations.
3) Chainer Chemistry is introduced as a deep learning framework that can be used with these graph neural network models for chemical property prediction tasks. Examples of tasks include drug discovery and molecular generation.
Huge-Scale Molecular Dynamics Simulation of Multi-bubble Nuclei - Hiroshi Watanabe
This document summarizes a molecular dynamics simulation of multi-bubble nuclei using the K computer supercomputer. The simulation directly observed the interaction and Ostwald ripening of billions of bubbles without hierarchical modeling assumptions. Analysis of bubble size distributions and scaling behaviors over time validated predictions of Lifshitz-Slyozov-Wagner theory, demonstrating the simulation captured relevant multi-scale and multi-physics phenomena. The huge-scale simulation was necessary to study bubble population dynamics and obtain reliable statistical results.
The attribute that should be tested at the root of the decision tree is the attribute that results in the maximum information gain, or minimum entropy, when used to split the training data. In other words, the attribute that best separates the data according to the target classes. This attribute will create "purer" nodes with respect to the target classes.
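The criterion above can be made concrete with a small sketch: compute the information gain of each candidate attribute on a toy dataset and pick the maximum. The dataset and attribute names below are invented for illustration.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the parent node minus the weighted entropy of the splits."""
    n = len(labels)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in splits.values())
    return entropy(labels) - remainder

# Toy data: 'outlook' separates the classes perfectly, 'windy' does not.
rows = [{'outlook': 'sun', 'windy': True},
        {'outlook': 'sun', 'windy': False},
        {'outlook': 'rain', 'windy': True},
        {'outlook': 'rain', 'windy': False}]
labels = ['yes', 'yes', 'no', 'no']

root = max(['outlook', 'windy'],
           key=lambda a: information_gain(rows, labels, a))
print(root)  # → outlook
```

Here 'outlook' yields pure child nodes (gain 1 bit) while 'windy' yields no gain at all, so 'outlook' would be tested at the root.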
This document provides an overview of multi-dimensional RNNs and some architectural issues and recent results related to them. It begins with an introduction to RNNs compared to feedforward neural networks, and solutions like LSTM and GRU to address the vanishing gradient problem. It then discusses several generalizations of the simple RNN architecture, including directionality with BRNN/BLSTM, dimensionality with MDRNN/MDLSTM, and directionality + dimensionality with MDMDRNN. It also covers hierarchical subsampling with HSRNN. The document concludes by summarizing some recent examples that apply these ideas, such as 2D LSTM for scene labeling, as well as new ideas like ReNet, PyraMiD-LSTM, and Grid LSTM.
An introduction to Deep Learning (DL) concepts, such as neural networks, back propagation, activation functions, CNNs, RNNs (if time permits), and the CLT/AUT/fixed-point theorems, along with code samples in Java and TensorFlow.
This document summarizes backpropagation and multi-layer feedforward neural networks. It describes how backpropagation can be used to train multi-layer networks by propagating errors backward from the output to adjust weights. The algorithm initializes weights randomly and then iterates over training examples, propagating input forward and error backward to update weights. Cross-validation is used to avoid overfitting by stopping training when validation error stops improving. The document also provides details on the image recognition problems and how the code can be modified to implement different network structures and target outputs.
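The training loop described above (random initialization, forward pass, backward error propagation, weight update) can be condensed into a minimal sketch. The network size, learning rate, and the XOR toy task are choices made for this demo, not details taken from the document; bias terms are omitted for brevity.

```python
import math
import random

random.seed(0)
H = 4                      # hidden units
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
w2 = [random.uniform(-1, 1) for _ in range(H)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w1]
    return h, sigmoid(sum(w * hi for w, hi in zip(w2, h)))

def train_step(x, t, lr=0.5):
    h, y = forward(x)
    # Output error term (the sigmoid derivative is y * (1 - y)).
    delta_out = (y - t) * y * (1 - y)
    for j in range(H):
        # Propagate the error backward through w2[j] before updating it.
        delta_h = delta_out * w2[j] * h[j] * (1 - h[j])
        w2[j] -= lr * delta_out * h[j]
        for i in range(2):
            w1[j][i] -= lr * delta_h * x[i]
    return (y - t) ** 2

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def epoch_error():
    return sum(train_step(x, t) for x, t in data)

first = epoch_error()
for _ in range(2000):
    last = epoch_error()
print("epoch error: %.3f -> %.3f" % (first, last))
```

The total squared error per epoch shrinks as training proceeds, which is the behaviour the cross-validation stopping rule monitors on held-out data.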
A deeper talk on the Transformer architecture, from the webinar at NTR
https://www.ntr.ai/webinar/transformery
Google slides version: https://docs.google.com/presentation/d/1dIadh_nIszxXG8-672vJmvFGT6jBp0mOqzNV4g3e2Lc/edit?usp=sharing
1) Instance based learning and case-based reasoning (CBR) provide frameworks for incorporating learning into k-nearest neighbors (kNN) classification.
2) CBR formalizes kNN into five phases: preprocessing training data, retrieving similar cases, reusing solutions, revising solutions if needed, and retaining lessons.
3) Key challenges for CBR include reducing the cost of case matching, automatically generating distance functions tailored to problems, and extracting explanations from cases.
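The retrieve/reuse/retain phases listed above can be rendered as a toy sketch: the case base is a list of (features, solution) pairs, retrieval is nearest-neighbour search under Euclidean distance, reuse simply copies the retrieved solution, and retain stores the solved case. The case data are invented for the demo, and the revise phase is omitted.

```python
import math

case_base = [([1.0, 1.0], 'A'), ([1.2, 0.8], 'A'), ([5.0, 5.0], 'B')]

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def retrieve(query, k=1):
    """Phase 2: find the k most similar stored cases."""
    return sorted(case_base, key=lambda c: distance(c[0], query))[:k]

def reuse(cases):
    """Phase 3: adapt (here: simply copy) the retrieved solution."""
    return cases[0][1]

def retain(query, solution):
    """Phase 5: store the solved case for future retrieval."""
    case_base.append((query, solution))

query = [0.9, 1.1]
solution = reuse(retrieve(query))
retain(query, solution)
print(solution)  # → A
```

Even this toy version makes the first challenge above visible: retrieval scans the whole case base, so the cost of case matching grows with every retained case.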
A talk on Transformers at GDG DevParty
27.06.2020
Link to Google Slides version: https://docs.google.com/presentation/d/1N7ayCRqgsFO7TqSjN4OWW-dMOQPT5DZcHXsZvw8-6FU/edit?usp=sharing
The document summarizes Tiark Rompf's talk on using the Delite framework to build domain-specific languages (DSLs) that can be optimized and compiled to different low-level architectures. It provides examples of existing DSLs created with Delite for machine learning, data querying, graph analysis, and collections. The talk discussed how DSLs allow writing programs at a high-level that can then be optimized and generated into high-performance code.
Deep learning is a machine learning technique that uses neural networks with multiple hidden layers between the input and output layers to model high-level abstractions in data. It can perform complex pattern recognition and feature extraction through multiple transformations of the input data. Deep learning techniques like deep neural networks, convolutional neural networks, and deep belief networks have achieved significant performance improvements in areas like computer vision, speech recognition, and natural language processing compared to traditional machine learning methods.
The document introduces two approaches to chemical prediction: quantum simulation based on density functional theory and machine learning based on data. It then discusses using graph-structured neural networks for chemical prediction on datasets like QM9. It presents Neural Fingerprint (NFP) and Gated Graph Neural Network (GGNN) models for predicting molecular properties from graph-structured data. Chainer Chemistry is introduced as a library for chemical and biological machine learning that implements these graph convolutional networks.
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning)
An introduction to Deep Learning (DL) concepts, such as neural networks, back propagation, activation functions, CNNs, and GANs, along with a simple yet complete neural network.
Slides by Amaia Salvador at the UPC Computer Vision Reading Group.
Source document on GDocs with clickable links:
https://docs.google.com/presentation/d/1jDTyKTNfZBfMl8OHANZJaYxsXTqGCHMVeMeBe5o1EL0/edit?usp=sharing
Based on the original work:
Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems, pp. 91-99. 2015.
This document provides an overview and tutorial for PyTorch, a popular deep learning framework developed by Facebook. It discusses what PyTorch is, how to install it, and its core packages and concepts, such as tensors, variables, neural network modules, and optimization. The tutorial outlines how to define neural network modules and build a network, describes common layer types like convolution and linear layers, and explains how tensors and variables are used to represent data and enable automatic differentiation for training models.
The slides shown here have been used for talks given to scientists in informal contexts.
Python is introduced as a valuable tool for both producing and evaluating data.
The talk is essentially a guided tour of the author's favourite parts of the Python ecosystem. Besides the Python language itself, NumPy and SciPy as well as Matplotlib are mentioned.
A last part of the talk concerns itself with code execution speed. With this problem in mind, Cython and f2py are introduced as means of gluing different languages together and speeding Python up.
The source code for the slides, code snippets and further links are available in a git repository at
https://github.com/aeberspaecher/PythonForScientists
L Fu - Dao: a novel programming language for bioinformatics - Jan Aerts
The document introduces Dao, a new programming language for bioinformatics. It discusses Dao's key features like optional typing, native support for concurrent programming, an LLVM-based JIT compiler, simple C interfaces, and the ClangDao tool for wrapping C/C++ libraries. An example demonstrates using thread tasks and futures for concurrent programming. The document outlines future plans to develop BioDao, an open source project providing bioinformatics modules to the Dao language.
Numba: Array-oriented Python Compiler for NumPy - Travis Oliphant
Numba is a Python compiler that translates Python code into fast machine code using the LLVM compiler infrastructure. It allows Python code that works with NumPy arrays to be just-in-time compiled to native machine instructions, achieving performance comparable to C, C++ and Fortran for numeric work. Numba provides decorators like @jit that can compile functions for improved performance on NumPy array operations. It aims to make Python a compiled and optimized language for scientific computing by leveraging type information from NumPy to generate fast machine code.
Keras with Tensorflow backend can be used for neural networks and deep learning in both R and Python. The document discusses using Keras to build neural networks from scratch on MNIST data, using pre-trained models like VGG16 for computer vision tasks, and fine-tuning pre-trained models on limited data. Examples are provided for image classification, feature extraction, and calculating image similarities.
Deep Learning And Business Models (VNITC 2015-09-13) - Ha Phuong
Deep Learning and Business Models
Tran Quoc Hoan discusses deep learning and its applications, as well as potential business models. Deep learning has led to significant improvements in areas like image and speech recognition compared to traditional machine learning. Some business models highlighted include developing deep learning frameworks, building hardware optimized for deep learning, using deep learning for IoT applications, and providing deep learning APIs and services. Deep learning shows promise across many sectors but also faces challenges in fully realizing its potential.
Deep Dive on Deep Learning (June 2018) (Julien SIMON)
This document provides a summary of a presentation on deep learning concepts, common architectures, Apache MXNet, and infrastructure for deep learning. The agenda includes an overview of deep learning concepts like neural networks and training, common architectures like convolutional neural networks and LSTMs, a demonstration of Apache MXNet's symbolic and imperative APIs, and a discussion of infrastructure for deep learning on AWS like optimized EC2 instances and Amazon SageMaker.
A fast-paced introduction to Deep Learning concepts, such as activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Basic knowledge of vectors, matrices, and elementary calculus (derivatives), are helpful in order to derive the maximum benefit from this session.
Next we'll see a simple neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. (Bonus points if you know Zorn's Lemma, the Well-Ordering Theorem, and the Axiom of Choice.)
This document describes a digit recognition application that uses a convolutional neural network (CNN) to recognize handwritten digits in real time. It provides requirements to run the application, steps for development including downloading a dataset, defining and training the CNN model, and using OpenCV for real-time prediction. The CNN architecture achieves over 99% test accuracy on the MNIST dataset. The application allows users to draw digits in front of their webcam to see real-time predictions.
Monteverdi 2.0 - Remote sensing software for Pleiades images analysis (OTB)
Monteverdi 2.0 is a remote sensing software for analysis of Pleiades satellite images that has been improved over time. It began as small demonstration tools but has evolved into a full platform. The latest version, Monteverdi 2.0, has been completely reworked using QT for a modern interface and focuses on processing images through command line applications. Further updates are planned to add more advanced visualization, database management, and processing capabilities.
Secure Kernel Machines against Evasion Attacks (Pluribus One)
This document summarizes research on developing more secure machine learning classifiers. It discusses how gradient-based and surrogate model approaches can be used to evade existing classifiers. The researchers then propose several techniques for building more robust classifiers, including using infinity-norm regularization, cost-sensitive learning, and modifying kernel parameters. Experiments on handwritten digit and spam filtering datasets show the proposed approaches improve security against evasion attacks compared to standard support vector machines.
Camp IT: Making the World More Efficient Using AI & Machine Learning (Krzysztof Kowalczyk)
Slides from the introductory lecture I gave for students at Camp IT 2019. I tried to cover artificial intelligence, machine learning, the most popular algorithms, and their applications to business as broadly as possible - for in-depth material on the given topics, see the links and references in the presentation.
This document provides an overview and introduction to deep learning. It discusses key concepts such as neural networks, hidden layers, activation functions, cost functions, and gradient descent. Specific deep learning applications are highlighted, including computer vision, speech recognition, and recommendation systems. Deep learning frameworks like TensorFlow and concepts like convolutional neural networks (CNNs) and generative adversarial networks (GANs) are also explained at a high level. The document aims to introduce attendees to the main ideas and terminology within deep learning.
This is a simple internship report on machine learning, covering a face mask detection project built with machine learning and Python libraries. The slides also contain information about the internship center and what was taught during the internship. You can contact Rohan sir for further internship details. Hope this presentation helps you - thank you!
This presentation introduces Deep Learning (DL) concepts, such as neural networks, backpropagation, activation functions, and Convolutional Neural Networks, followed by an Angular application that uses TypeScript in order to replicate the TensorFlow playground.
Configuring Mahout Clustering Jobs - Frank Scholten (Lucene Revolution)
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
For more than a decade internet search engines have helped users find documents they are looking for. However, what if users aren't looking for anything specific but want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are clustered search results of Google news, or tag clouds which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large scale document clustering. This talk introduces clustering in general and shows you step-by-step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.
[html5j Robot Club, Study Session #7] Microsoft Cognitive Toolkit (CNTK) Overview (Naoki (Neo) SATO)
The Microsoft Cognitive Toolkit (CNTK) is Microsoft's open-source deep learning toolkit. It expresses neural networks as computational graphs composed of simple building blocks, supporting various network types and applications. CNTK is production-ready with state-of-the-art accuracy, efficiency, and scalability to multiple GPUs and servers.
The document discusses Acceleo, a code generation tool from Eclipse. It provides an overview of Acceleo's history and capabilities. Key points include: (1) Acceleo allows generating code from models using templates based on the MTL standard; (2) A prototype demonstrates generating an Android app from a model using Acceleo templates; (3) Templates can be customized and extended to override default generation behavior. The tutorial aims to help beginners, experienced Acceleo users, and Android developers learn how to build code generators with Acceleo.
1. Introduction | Machine Learning | Dry is all theory: Live Demo | SVMs and Kernels | Beyond Binary Classification | Python integration
The SHOGUN Machine Learning Toolbox 2.0
(and its Python interface)
Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio De Bona, Alexander Binder, Christian Gehl, and Vojtech Franc
GSoC students: Sergey Lisitsyn, Heiko Strathmann, and many more...
What is Shogun?
Machine Learning Toolkit
Broad range of ML algorithms (600 classes)
Large-scale algorithms (up to 50 million examples)
Core written in C++ (> 190,000 lines of code)
SWIG bindings (support for 8 target languages)
Used in many projects
Gene starts: ARTS [7]
Splice sites: mSplicer [5]
Sensor fusion (private sector)
...many more (see Google Scholar)!
Architecture
SWIG - Simplified Wrapper and Interface Generator
Bindings to a growing number of languages!
Typemaps!!
Shogun’s history
Project started 1999
Early focus on large-scale SVMs and Kernels
GSoC significantly pushed project forward
Machine Learning - Learning from Data
What is Machine Learning and what can it do for you?
What is ML?
AIM: Learning from empirical data!
Applications
speech and handwriting recognition
medical diagnosis, bioinformatics
computer vision, object recognition
stock market analysis
network security, intrusion detection . . .
Support Vector Machines
Support Vector Machine (SVM)
SVM primal:
$\min_{w}\ \frac{1}{2}\,\|w\|^2 \;+\; C \sum_{i=1}^{n} \max\!\big(1 - y_i\, w^\top x_i,\ 0\big)$
Regularizer = robustness
Loss = error on training data
Training: solve the optimization problem
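To make the primal objective concrete, here is a small plain-Python sketch (illustrative only, not Shogun's API; `svm_primal_objective` and the toy data are made up for the example):

```python
def svm_primal_objective(w, C, X, y):
    """SVM primal value: 0.5*||w||^2 + C * sum_i max(1 - y_i * <w, x_i>, 0)."""
    reg = 0.5 * sum(wj * wj for wj in w)                    # regularizer term
    loss = sum(max(1.0 - yi * sum(wj * xj for wj, xj in zip(w, xi)), 0.0)
               for xi, yi in zip(X, y))                     # hinge loss on training data
    return reg + C * loss

X = [[2.0, 0.0], [-2.0, 0.0]]   # two well-separated toy points
y = [1, -1]
print(svm_primal_objective([1.0, 0.0], 1.0, X, y))  # 0.5: both margins >= 1, so zero loss
```

With both points at margin 2, only the regularizer contributes; a zero weight vector would instead pay the full hinge loss of 1 per example.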
Support Vector Machines
SVM with Kernels
SVM dual:
$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, \underbrace{x_i^\top x_j}_{k(x_i,\,x_j)} \qquad \text{s.t.}\ 0 \le \alpha_i \le C\ \ \forall i \in \{1,\dots,n\}$
Kernel: similarity measure; generalization of the dot product
Corresponds to a dot product in a higher-dimensional space
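As an illustration of the kernel view, here is a minimal plain-Python sketch of a Gaussian kernel and the resulting kernel-SVM decision function $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$ (function names and the width convention are assumptions for the example, not Shogun's implementation):

```python
import math

def gaussian_kernel(x, z, width):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / width)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / width)

def decision_function(x, support_vectors, alphas, ys, b, width):
    """Kernel SVM prediction: f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return b + sum(a * yi * gaussian_kernel(sv, x, width)
                   for sv, a, yi in zip(support_vectors, alphas, ys))

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0], 2.0))  # 1.0: identical points
```

The sign of `decision_function` gives the predicted class; the kernel never requires computing the high-dimensional feature map explicitly.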
Demo: Support Vector Classification
Task: separate 2 clouds of points in 2D
Simple code example: SVM training

lab = BinaryLabels(labels)                      # +1/-1 labels
train_xt = RealFeatures(features)               # training features
gk = GaussianKernel(train_xt, train_xt, width)  # Gaussian kernel of given width
svm = LibSVM(10.0, gk, lab)                     # C = 10.0
svm.train()
test_examples = RealFeatures(test_features)
out = svm.apply(test_examples)                  # predictions on the test data
SVMs and Kernels
Provides a generic interface to 11 SVM solvers
Established implementations for solving SVMs with kernels
More recent developments: fast linear SVM solvers
Kernels for real-valued data (in demo):
Linear Kernel, Polynomial Kernel, Gaussian Kernel
String Kernels:
Applications in bioinformatics [4, 8, 10]
Intrusion detection
Heterogeneous data sources:
Combined kernel: $K(x, z) = \sum_{i=1}^{M} \beta_i\, K_i(x, z)$
The $\beta_i$ can be learned using Multiple Kernel Learning [6, 2]
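The combined kernel is just a weighted sum of base kernel evaluations; a minimal plain-Python sketch (hypothetical helper names, not Shogun's MKL API):

```python
def combined_kernel(x, z, base_kernels, betas):
    """MKL combination: K(x, z) = sum_i beta_i * K_i(x, z)."""
    return sum(beta * k(x, z) for k, beta in zip(base_kernels, betas))

linear = lambda x, z: sum(a * b for a, b in zip(x, z))  # K_1: linear kernel
quadratic = lambda x, z: linear(x, z) ** 2              # K_2: homogeneous polynomial kernel
print(combined_kernel([1.0, 2.0], [3.0, 4.0], [linear, quadratic], [0.5, 0.5]))  # 66.0
```

With non-negative weights, the combination stays a valid kernel; MKL learns the weights so that informative data sources dominate.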
Beyond Classification
(a) GP regression (b) Structured Output (c) Multitask Learning
Regression: labels are real values (think least squares)
Structured Output Learning: predict complex structures
Multitask Learning: solve several related problems simultaneously
Multitask Learning
Example: learn movie user preferences
Multitask Learning: jointly learn models for different countries
Couple related models more strongly
Regularization-based MTL
Multitask Learning is often implemented using regularization:
Graph regularizer: $\sum_{s=1}^{T} \sum_{t=1}^{T} \|w_s - w_t\|^2\, A_{s,t}$
Keeps model parameters similar
Based on a given similarity matrix $A$
$L_{2,1}$ regularizer: $\|W\|_{2,1} = \sum_{i=1}^{n} \|w^i\|$
Selects a common sub-space
Allows any $w_t$ in that sub-space
Clustered MTL:
Unknown task relationship
Identifies similar tasks
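The graph regularizer above can be sketched in a few lines of plain Python (illustrative only; `graph_regularizer`, `W`, and `A` are made-up names, not part of Shogun):

```python
def graph_regularizer(W, A):
    """sum_{s,t} A[s][t] * ||w_s - w_t||^2: penalizes differences between related tasks."""
    T = len(W)
    return sum(A[s][t] * sum((ws - wt) ** 2 for ws, wt in zip(W[s], W[t]))
               for s in range(T) for t in range(T))

W = [[1.0, 0.0], [0.0, 1.0]]   # weight vectors for two tasks
A = [[0.0, 1.0], [1.0, 0.0]]   # similarity matrix: the two tasks are related
print(graph_regularizer(W, A))  # 4.0: ||w_1 - w_2||^2 = 2, counted for (s,t) and (t,s)
```

Minimizing this term together with the per-task losses pulls the weight vectors of tasks with large $A_{s,t}$ toward each other.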
Multitask Learning: MTL Training

feat, labels = ...                 # Shogun data objects
task_one = Task(0, 10)             # tasks defined over index ranges
task_two = Task(10, 20)
group = TaskGroup()
group.append_task(task_one)
group.append_task(task_two)
mtlr = MultitaskL12(0.1, 0.1, feat, labels, group)
mtlr.train()

Efficient LibLinear-style solver for the graph-regularized SVM [9]
10 other MTL methods (based on SLEP [3] / MALSAR [1])
Structured Output Learning
Complex outputs
Similar framework, different loss function
Bundle methods: state-of-the-art solvers!
Other methods
(d) Sparse/L1 methods (e) Gaussian processes (f) Dimensionality reduction
...and much more I can't talk about!
Python integration
Serialization
Matrix integration
No-copy data wrapping
Rapid prototyping with directors
Python integration
Pythonic interaction with Shogun objects:

m_real = array(in_data, dtype=float64, order='F')
f_real = RealFeatures(m_real)
# slicing
print(f_real[0:3, 1])
# operators
f_real += f_real
f_real *= f_real
f_real -= f_real
# no copy
a = RealFeatures()
a.frombuffer(feats, False)
Python integration: Directors
Simple code example: a custom kernel defined in Python

class ExampleLinearKernel(DirectorKernel):
    def __init__(self):
        DirectorKernel.__init__(self, True)
    def kernel_function(self, idx_a, idx_b):
        seq1 = self.get_lhs().get_feature_vector(idx_a)
        seq2 = self.get_rhs().get_feature_vector(idx_b)
        return numpy.dot(seq1, seq2)

k = ExampleLinearKernel()
svm = SVMLight()
svm.set_kernel(k)
svm.train(train_data)
How to get started
Dive into Shogun:
Visit our website
Source on GitHub (fork me!)
Documentation available
Many Python examples (> 200)
Debian package, MacPorts
Active mailing list
When is SHOGUN for you?
You want to work with SVMs (11 solvers to choose from)
You want to work with kernels (35 different kernels)
⇒ Especially: string kernels / combinations of kernels
You're interested in recent ML developments (MTL, Structured Output)
You have large-scale computations to do (up to 50 million examples)
You use one of the following languages:
Python, Octave/MATLAB, R, Java, C#, Ruby, Lua, C++
Contributors
Original authors: Gunnar Raetsch, Soeren Sonnenburg, Christian Widmer,
Alexander Binder, Alexander Zien, Marius Kloft, Sebastian Henschel, Christian Gehl,
Jonas Behr.
Integrated code:
Alex Smola (pr_loqo), Antoine Bordes (LaRank), Thorsten Joachims (SVMLight), Chih-Chung Chang and Chih-Jen Lin (LIBSVM), Chih-Jen Lin (LibLinear), Vojtech Franc (LibOCAS), Leon Bottou (SGD SVM), Vikas Sindhwani (SVMLin), Jieping Ye and Jun Liu (SLEP), Jiayu Zhou and Jieping Ye (MALSAR)
GSoC alumni:
Heiko Strathmann (both 2011 and 2012), Sergey Lisitsyn (both 2011 and 2012),
Chiyuan Zhang (2012), Fernando Iglesias (2012), Viktor Gal (2012), Michal Uricar
(2012), Jacob Walker (2012), Evgeniy Andreev (2012), Baozeng Ding (2011), Alesis
Novik (2011), Shashwat Lal Das (2011)
Thank you for your attention!
For more information, visit:
Implementation http://www.shogun-toolbox.org
More machine learning software http://mloss.org
Machine Learning Data http://mldata.org
References I
[1] Jiayu Zhou, Jianhui Chen, and Jieping Ye. User Manual MALSAR: Multi-tAsk Learning via Structural Regularization. Technical report, Arizona State University, 2012.
[2] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.R. Müller, and A. Zien. Efficient and accurate lp-norm multiple kernel learning. Advances in Neural Information Processing Systems, 22(22):997-1005, 2009.
[3] Jun Liu, Shuiwang Ji, and Jieping Ye. SLEP: Sparse Learning with Efficient Projections. 2011.
References II
[4] G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C.S. Ong, P. Philips, F. De Bona, L. Hartmann, A. Bohlen, et al. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 19(11):2133, 2009.
[5] Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 19(11):2133-43, November 2009.
[6] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. The Journal of Machine Learning Research, 7:1565, 2006.
References III
[7] S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: accurate recognition of transcription starts in human. Bioinformatics, 2006.
[8] S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: accurate recognition of transcription starts in human. Bioinformatics, 22(14):e472, 2006.
[9] C. Widmer, M. Kloft, N. Görnitz, and G. Rätsch. Efficient Training of Graph-Regularized Multitask SVMs. In ECML 2012, 2012.
[10] C. Widmer, J. Leiva, Y. Altun, and G. Rätsch. Leveraging Sequence Classification by Taxonomy-based Multitask Learning. In Research in Computational Molecular Biology, pages 522-534. Springer, 2010.