Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012

Introduction to Text Mining & Support Vector Machines (SVM)

Dr. Anton Heijs
CEO
Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com
July 2012

KMX enables information and knowledge professionals
to gain faster, more reliable and more precise insights into large,
complex unstructured data sets, allowing them to make
better informed decisions.

Treparel is a leading technology solution provider in
Big Data Text Analytics & Visualization

Treparel KMX – All rights reserved 2012 www.treparel.com

Topics covered in this presentation

• Who is Treparel?
• Introduction to Text Mining
• What is Automated Classification & Clustering?
• Introducing Support Vector Machines


Nexus of Forces: Social, Cloud, Mobile, Information
IT Market shift driving Big Data challenges
Copyright: Gartner, 2011

80% of data is Unstructured (Documents, Text, Images, Graphs)


About Treparel

• Delft, The Netherlands, 2006.
• Treparel is an innovative technology solution provider in Big Data
Analytics, Text Mining and Visualization.
• KMX is an integrated data analysis toolset which provides faster,
more reliable, intelligent insights into large, complex unstructured data sets to
allow companies to make better informed decisions.
• Clients: Philips, Bayer, Abbott, European Patent Office, European
Commission
• Part of Research Centers and University ecosystem; TU Delft,
Universities of Paris and Sao Paulo
• More info: www.treparel.com


Positioning of Treparel’s KMX technology

[Diagram: text analytics pipeline in three stages]
• Text Acquisition & Preparation (‘Seek’)
– External sources: Patents, Legal, Media and publishing, Research, Media / Publishers
– Other sources: Documents, Websites, Blogs, Newsfeeds, Email, Application notes, Search results, Social networks
• Analysis and processing (‘Model’): Text preprocessing, Indexing, Clustering, Classification, Semantic Analysis, Visualization
• Output and display (‘Adapt’): Reporting & Presentation, Databases, Content management systems, Line-of-business applications, Research applications, Search engines
• Across all stages: Information extraction (entities, facts, relationships, concepts, patents); Management, Development and Configuration
Copyright: Gartner, J. Popkin 2010

Getting to know the basics

PART A: Intro to Text Mining
• The Data (text & image) Mining evolution
• What is Data Mining: inside or outside the database
• The Data Mining process
• Two types of Data Mining tasks: Predictive and Descriptive
• Two modes of Data Mining tasks: Supervised and Unsupervised
• The most important algorithms per category

PART B: SVM
• Machine Learning & Support Vector Machines (SVM)
• What makes SVM unique
• When and How to deploy SVM
• Case Studies & Examples


The Data/Text/Image mining evolution
The Road ahead
[Chart: Application Value (Low to High) plotted against Usability (Hard to use to Easy to Use). Stages shown: Query and Reporting, OLAP, Traditional Data Mining Tools, “Easy-to-Use” Data Mining, SVM Predictive Modeling, Analytical Modeling, Enterprise Text Analytics. Time markers: 1980’s, 1990’s, 1995 - 2000, Today, Future.]

Knowledge Mining
Different levels of depth in knowledge discovery

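[Diagram: knowledge discovery deepens over Time, from Data Collection (Seek) at the bottom to Visualization (Adapt) at the top. Data layers: Meta Data, Filtered data. Model layers: Models of meta data, Models of data, Models of semantic data. Techniques: Data Mining, Text Mining, Graph Mining, leading to Knowledge Discovery.]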

What is Data Mining?
Getting to know the basics
• Most businesses have an enormous amount of data, with a great deal of
information hiding within it. The data is also growing faster than the knowledge
extracted from it, which leads to a growing gap between data and knowledge.
• Data mining provides a way to automatically extract information buried in the
data.
• Data mining creates mathematical models which describe patterns in large,
complex collections of data.
• These patterns elude traditional statistical approaches to analysis because of the large
number of attributes, the complexity of the patterns, or the difficulty of performing
the analysis.
• Mining the data directly in the database has advantages:
less data movement, more data security, one source of the data.
• Basically, 2 types of data exist:
– Structured (tables & numbers) – 20% of data volume
– Unstructured (text, images) – 80% of data volume


The Data & Text Mining process
Automating the mining steps; adding new features

Understanding the knowledge mining value chain

Data Collection & Understanding → Data Preparation & Cleansing → Algorithm Selection → Model Building & Testing → Model Deployment → Model generation (All models) & Visualization coordination

Treparel's Focus & Core competence
Traditional Players


2 types of Data Mining functions
Predictive Data Mining (Supervised):
• Used to predict a value; requires the specification of a
target (known outcome)
• Targets are either binary attributes (indicating yes/no decisions) or
multi-class targets indicating a preferred alternative (color of
sweater, salary range)
• Constructs one or more models; these models are used to predict
outcomes for new data sets
Descriptive Data Mining (Unsupervised):
• Used to find the intrinsic structure, relations, or affinities in
data
• Describes a data set in a concise way and presents interesting
characteristics of the data
• The functions are: clustering, association models, and feature
extraction
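
The difference between the two function types can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data values, the nearest-centroid rule and the 2-means routine are hypothetical stand-ins chosen for brevity, not the algorithms used in KMX. The predictive function needs the known target (`labels`); the descriptive one finds the same structure without it.

```python
# Toy 1-D data: two well-separated groups (hypothetical values for illustration).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]  # known target, predictive case only


def fit_centroids(xs, ys):
    """Predictive (supervised): learn one mean per known target class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def two_means(xs, iterations=10):
    """Descriptive (unsupervised): 2-means clustering, no target needed."""
    c0, c1 = min(xs), max(xs)  # simple deterministic initialisation
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]


model = fit_centroids(values, labels)
print(predict(model, 4.5))  # predicted target for a new record
print(two_means(values))    # cluster index per record, found without labels
```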


How does Automated Classification & Clustering work?
• Classification consists of dividing the items that make up a collection into
categories or classes.
• The goal is to accurately predict the target class for each
record in new data.
• Algorithms for classification: different algorithms for
different problems
– Naïve Bayes
– Adaptive Bayes Network
– Support Vector Machine
– Decision Tree

Classification is used in: customer segmentation, sentiment
analysis, competitive analysis, business modeling, credit
analysis, smart content, fraud and terrorist detection,
diagnosis support, patent & drug discovery
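
As a minimal sketch of automated text classification, the example below trains the first algorithm in the list, Naïve Bayes, on a handful of labelled snippets and predicts the class of new text. Everything specific here is an assumption made for illustration: the training snippets, the class names, whitespace tokenisation and Laplace smoothing. A production system would use far more data and proper text preprocessing.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training snippets (illustration only).
train = [
    ("battery cell with lithium anode", "chemistry"),
    ("polymer coating for lithium battery", "chemistry"),
    ("wireless antenna signal receiver", "telecom"),
    ("mobile network signal transmission", "telecom"),
]


def fit_naive_bayes(docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def classify(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = fit_naive_bayes(train)
print(classify(model, "lithium battery electrode"))
print(classify(model, "signal antenna for mobile phone"))
```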

Text Mining algorithms and features

Feature                        Naive Bayes            Adaptive Bayes Network   Support Vector Machine      Decision Tree
Speed                          Very fast              Fast                     Fast with active learning   Fast
Accuracy                       Good in many domains   Good in many domains     Significant                 Good in many domains
Transparency                   No rules (black box)   Rules for                No rules (black box)        Rules
Missing value interpretation   Missing value          Missing value            Sparse Data                 Missing value


What is Support Vector Machine Learning?
State of the Art algorithm
• SVM is a state-of-the-art classification and regression algorithm
• The SVM optimization procedure maximizes predictive accuracy
while guarding against over-fitting the training data through built-in regularization
• SVM projects the input data into a kernel space, then builds a
linear model in this kernel space
• SVM performs well in real-world applications such as
classifying text, recognizing hand-written characters, classifying
images, as well as bioinformatics and bio sequence analysis
• SVMs are standard tools for machine learning and data mining
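
The kernel-space idea can be made concrete with a small sketch. It assumes a degree-2 polynomial kernel and XOR-style toy data, both chosen here for illustration and not taken from the slides. The weight vector `w` is set by hand rather than learned; the point is only that a linear model in kernel space separates data that no straight line separates in the input space, and that the kernel yields the kernel-space dot product without building the mapping.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2, computed in input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit kernel-space coordinates for the same kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


# XOR-style data: no straight line in the (x1, x2) plane separates the classes.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# In kernel space a linear model suffices: weight only the x1*x2 coordinate.
w = (0.0, 1.0, 0.0)  # hand-picked for illustration, not learned
predictions = [1 if dot(w, feature_map(p)) > 0 else -1 for p in points]
print(predictions == labels)

# The kernel returns the kernel-space dot product without building the map.
x, z = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, z), dot(feature_map(x), feature_map(z)))
```

The two numbers on the last line agree up to floating-point rounding, which is why an SVM can work in a high-dimensional kernel space while only ever evaluating the kernel on pairs of input records.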


What is Support Vector Machine Learning?
Classical Data Mining vs SVM

Classical Statistics
• Hypothesis on data distribution
• Large number of dimensions implies a large number of model parameters, which leads to generalization problems
• Modeling seeks to get the best fit
• Manual iterations and time are necessary

SVM - Support Vector Machines
• Study of the model family: the VC dimension
• Number of dimensions can be very high because generalization is controlled
• Modeling seeks to get the best compromise between fit and robustness
• Automation is possible


What makes SVM such a unique technology?
• Strong theoretical foundation (Vapnik-Chervonenkis theory)
• There is no upper limit on the number of attributes; the only constraint is
the hardware
• Good generalization to novel data
• SVM is the preferred algorithm for sparse data
• Algorithm of choice for challenging high-dimensional data
• SVM supports active learning.
– SVM models grow as the size of the training set increases, so big data
sets would otherwise be difficult to handle.
– Active learning forces the SVM algorithm to restrict learning to the
most informative training examples.
• The kernel can be selected automatically (this depends on the implementation)
• You can control both the model quality (accuracy) and the performance
(build time)
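
A rough sketch of how active learning picks the most informative examples is given below. It is not the KMX implementation: the trainer is a simple batch sub-gradient descent on the regularised hinge loss, used here as a stand-in for a real SVM solver, and the data, the parameter names (`lam`, `lr`, `epochs`) and the helper `most_informative` are all hypothetical. In this sketch `lam` and `epochs` play the role of the quality and build-time controls.

```python
def decision(w, b, x):
    """Score w . x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * |w|^2 + mean hinge loss by batch sub-gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * decision(w, b, x) < 1:  # inside the margin or misclassified
                for j, xj in enumerate(x):
                    grad_w[j] -= y * xj / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def most_informative(w, b, pool, k):
    """Active-learning query: the k unlabelled points closest to the boundary."""
    return sorted(pool, key=lambda x: abs(decision(w, b, x)))[:k]


# Small labelled seed set (hypothetical values), classes +1 and -1.
xs = [(2, 2), (3, 1), (1, 3), (-2, -2), (-3, -1), (-1, -3)]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)

# Unlabelled pool: only the points near the boundary are worth labelling next.
pool = [(0.1, 0.2), (4, 4), (-5, -3), (-0.3, 0.1)]
queries = most_informative(w, b, pool, k=2)
print(queries)
```

Points far from the boundary, such as (4, 4), would barely change the model if labelled, so they are skipped. That is how the training set, and with it the model size, stays small.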


What makes SVM unique?
SVM gives you control over the models
[Chart: Robustness (Low to High) plotted against Quality of fit (Low accuracy to High accuracy)]
• Under Fit Model: High Robustness, Training Error = Test Error
• Robust Model: Low Training Error, Low Test Error
• Over Fit Model: Low Robustness, No Training Error, High Test Error
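
The three regimes in the chart can be told apart by comparing training error with test error. The sketch below uses three deliberately simple stand-in models rather than SVMs, and the data (including the two noisy training labels) is hypothetical. With an SVM, the same regimes are typically reached by changing the regularization strength or the kernel settings.

```python
def error_rate(model, xs, ys):
    """Fraction of examples the model labels incorrectly."""
    return sum(model(x) != y for x, y in zip(xs, ys)) / len(xs)


# Hypothetical 1-D data: the true rule is "x > 5"; two training labels are noisy.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 1, 0, 1]
test_x = [1.5, 2.6, 3.4, 4.5, 5.5, 6.5, 7.6, 8.4]
test_y = [0, 0, 0, 0, 1, 1, 1, 1]


def under_fit(x):
    """Too simple: ignores the input and always predicts class 0."""
    return 0


def robust(x):
    """Captures the underlying rule and ignores the noise."""
    return int(x > 5)


def over_fit(x):
    """Memorises the training set, noise included (1-nearest neighbour)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [("under fit", under_fit), ("robust", robust), ("over fit", over_fit)]:
    train_err = error_rate(model, train_x, train_y)
    test_err = error_rate(model, test_x, test_y)
    print(f"{name}: training error {train_err}, test error {test_err}")
```

The under fit model has equal training and test error, the over fit model has no training error but a high test error, and the robust model keeps the test error low, matching the three labels in the chart.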

What makes SVM unique?
SVM gives you control over the models

[Chart: Robustness (Low to High) plotted against Quality (Low to High). Regions: Safe to Deploy (high robustness, high quality); Need more training data (rows); Need more variables (columns) or different model type; Need more data (rows/columns) or different model type.]


Treparel is a leading technology solution provider
in Big Data Text Analytics & Visualization

Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com

