Introduction to Text Mining & Support Vector Machines (SVM)

Dr. Anton Heijs, CEO
Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com

July 2012
KMX enables information and knowledge professionals
to gain faster, more reliable, and more precise insights into large,
complex unstructured data sets, allowing them to make
better-informed decisions.




                   Treparel is a leading technology solution provider in
                         Big Data Text Analytics & Visualization

Treparel KMX – All rights reserved 2012   www.treparel.com                 2
Topics covered in this presentation


         • Who is Treparel?
         • Introduction to Text Mining
         • What is Automated Classification & Clustering?
         • Introducing Support Vector Machines




Nexus of Forces: Social, Cloud, Mobile, Information
         IT Market shift driving Big Data challenges
                                                                                 Copyright: Gartner, 2011




                 80% of data is Unstructured (Documents, Text, Images, Graphs)



About Treparel

         • Delft, The Netherlands, 2006.
         • Treparel is an innovative technology solution provider in Big Data
           Analytics, Text Mining and Visualization.
         • KMX is an integrated data analysis toolset that provides faster,
           more reliable, intelligent insights into large, complex unstructured
           data sets, allowing companies to make better-informed decisions.
         • Clients: Philips, Bayer, Abbott, European Patent Office, European
           Commission
         • Part of a research center and university ecosystem: TU Delft and the
           Universities of Paris and São Paulo
         • More info: www.treparel.com




Positioning of Treparel’s KMX technology

         Text Acquisition & Preparation (‘Seek’)
         • External sources: Patents, Legal, Research, Media / Publishers
         • Other sources: Documents, Websites, Blogs, Newsfeeds, Email,
           Application notes, Search results, Social networks

         Analysis and processing (‘Model’)
         • Text preprocessing
         • Indexing
         • Clustering
         • Classification
         • Semantic Analysis
         • Visualization

         Output and display (‘Adapt’)
         • Reporting & Presentation
         • Media and publishing databases
         • Content management systems
         • Line-of-business applications
         • Research applications
         • Search engines

         Across all stages
         • Information extraction (entities, facts, relationships, concepts, patents)
         • Management, Development and Configuration

                                                                    Copyright: Gartner, J. Popkin 2010
Getting to know the basics

        PART A: Intro to Text Mining
        • The Data (text & image) Mining evolution
        • What is Data Mining: inside or outside the database
        • The Data Mining process
        • Two types of Data Mining tasks: Predictive and Descriptive
        • Two modes of Data Mining tasks: Supervised and Unsupervised
        • The most important algorithms per category


        PART B: SVM
        • Machine Learning & Support Vector Machines (SVM)
        • What makes SVM unique
        • When and How to deploy SVM
        • Case Studies & Examples


The Data/Text/Image mining evolution
         The Road ahead

         [Figure: application value (low to high) plotted against usability
         (hard to use to easy to use). Stages shown: OLAP; Query and Reporting;
         Traditional Data Mining; “Easy-to-Use” Data Mining Tools; SVM Predictive
         Modeling (1995 - 2000); Analytical Modeling (Today); Enterprise Text
         Analytics (Future). Other period labels on the chart: 1980’s, 1990’s.]

Knowledge Mining
         Different levels of depth in knowledge discovery

         [Figure: levels of knowledge discovery plotted against time, from
         Data Collection (Seek) up to Visualization (Adapt). Levels shown:
         Filtered data; Meta Data; Models of meta data; Models of data;
         Models of semantic data. Data Mining, Text Mining and Graph Mining
         are labeled as Knowledge Discovery.]

What is Data Mining?
           Getting to know the basics
        • Most businesses have an enormous amount of data, with a great deal of
          information hidden within it. The data is also growing faster than the
          knowledge currently extracted from it, which leads to a growing gap
          between data and knowledge.
        • Data mining provides a way to automatically extract information buried in the
          data.
        • Data mining creates mathematical models that describe patterns in large,
          complex collections of data.
        • These patterns elude traditional statistical approaches to analysis because of
          the large number of attributes, the complexity of the patterns, or the difficulty
          of performing the analysis.
        • Mining the data directly in the database has advantages:
          less data movement, more data security, and a single source of the
          data.
        • Basically, two types of data exist:
              – Structured (tables & numbers): 20% of data volume
              – Unstructured (text, images): 80% of data volume




The Data & Text Mining process
            Automating the mining steps; adding new features

                    Understanding the knowledge mining value chain

         Steps in the chain, in order:
         • Data Collection & Understanding
         • Data Preparation & Cleansing
         • Algorithm Selection
         • Model Building & Testing
         • Model Deployment
         • Model generation (All models) & coordination
         • Visualization

         Labels under the chain: Traditional Players (toward the left);
         Treparel's Focus & Core competence (toward the right)

2 types of Data Mining Functions
         Predictive Data Mining (supervised):
         •    Predictive functions are used to predict a value; they require the
              specification of a target (known outcome).
         •    Targets are either binary attributes (indicating yes/no decisions) or
              multi-class targets indicating a preferred alternative (color of
              sweater, salary range).
         •    A predictive function constructs one or more models; these models are
              used to predict outcomes for new data sets.
         Descriptive Data Mining (unsupervised):
         •    Descriptive functions are used to find the intrinsic structure,
              relations, or affinities in data.
         •    A descriptive function describes a data set in a concise way and
              presents interesting characteristics of the data.
         •    The functions are: clustering, association models, and feature
              extraction.

How does Automated Classification & Clustering
         work?
         • Classification consists of dividing the items that make up a
           collection into categories or classes.
         • The goal is to accurately predict the target class for each
           record in new data.
         • Algorithms for classification (different algorithms for
           different problems):
                  – Naïve Bayes
                  – Adaptive Bayes Network
                  – Support Vector Machine
                  – Decision Tree

         Classification is used in: customer segmentation, sentiment
         analysis, competitive analysis, business modeling, credit
         analysis, smart content, fraud and terrorist detection,
         diagnosis support, and patent & drug discovery.
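
To make the classification idea concrete, here is a minimal sketch of a linear SVM text classifier in plain Python. It is not how KMX works internally: the training routine (sub-gradient descent on the regularized hinge loss), the function names, the hyperparameters, and the four toy documents are all assumptions chosen for illustration. Documents are stored as sparse bag-of-words counts, so only the words that actually occur take up space.

```python
from collections import Counter


def bag_of_words(text):
    """Sparse term-count vector: only words that occur are stored."""
    return Counter(text.lower().split())


def score(weights, bias, features):
    """Linear decision value w.x + b computed over the sparse features."""
    return sum(weights.get(term, 0.0) * count
               for term, count in features.items()) + bias


def train_linear_svm(examples, epochs=50, learning_rate=0.1, reg=0.01):
    """Sub-gradient descent on the regularized hinge loss.

    examples: list of (features, label) pairs with label in {-1, +1}.
    All hyperparameter values are illustrative assumptions.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            margin = label * score(weights, bias, features)
            # Regularization shrinks every weight a little on each step.
            for term in weights:
                weights[term] *= 1.0 - learning_rate * reg
            # Only examples inside the margin move the hyperplane.
            if margin < 1.0:
                for term, count in features.items():
                    weights[term] = (weights.get(term, 0.0)
                                     + learning_rate * label * count)
                bias += learning_rate * label
    return weights, bias


def predict(weights, bias, text):
    """Return +1 or -1 for a raw text string."""
    return 1 if score(weights, bias, bag_of_words(text)) >= 0 else -1


# Toy, made-up training documents: +1 = patent-related, -1 = not.
TRAINING = [
    ("patent claim invention filed", 1),
    ("invention patent office examiner", 1),
    ("football match goal score", -1),
    ("match score team football", -1),
]
examples = [(bag_of_words(text), label) for text, label in TRAINING]
weights, bias = train_linear_svm(examples)
```

Calling `predict(weights, bias, "patent invention")` returns the positive class and `predict(weights, bias, "football team")` the negative one. Words the model has never seen simply contribute nothing to the score, which is one reason sparse text data suits this kind of model.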
Text Mining algorithms and features

         Feature          Naive Bayes      Adaptive Bayes    Support Vector      Decision Tree
                                           Network           Machine

         Speed            Very fast        Fast              Fast with active    Fast
                                                             learning

         Accuracy         Good in many     Good in many      Significant         Good in many
                          domains          domains                               domains

         Transparency     No rules         Rules for         No rules            Rules
                          (black box)                        (black box)

         Missing value    Missing value    Missing value     Sparse data         Missing value
         interpretation




What is Support Vector Machine Learning?
        State of the Art algorithm
        • SVM is a state-of-the-art classification and regression algorithm.
        • The SVM optimization procedure aims to maximize predictive accuracy
          while guarding against over-fitting the training data.
        • SVM projects the input data into a kernel space and then builds a
          linear model in that kernel space.
        • SVM performs well in real-world applications such as
          classifying text, recognizing handwritten characters, and classifying
          images, as well as in bioinformatics and biosequence analysis.
        • SVMs are standard tools for machine learning and data mining.
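
The kernel-space bullet can be illustrated with a small sketch. The degree-2 polynomial kernel, the explicit feature map, and the XOR-like toy points below are assumptions picked for illustration and do not come from the slides. The sketch checks that the kernel value equals an ordinary dot product in the mapped space, and that labels which no straight line separates in 2-D are separated by a single weight vector after the mapping.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2 on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit map into the 3-D space that this kernel corresponds to."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# The kernel computes the dot product in kernel space without mapping.
x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, z)
explicit_value = dot(feature_map(x), feature_map(z))

# XOR-like labels (the sign of x0 * x1) are not linearly separable in 2-D,
# but the weight vector (0, 1, 0) separates them in the mapped space.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]
w = (0.0, 1.0, 0.0)
predictions = [1 if dot(w, feature_map(p)) > 0 else -1 for p in points]
```

Here `kernel_value` and `explicit_value` agree up to rounding, and `predictions` matches `labels`. In practice the explicit map is never built; the model works with kernel values only.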




What is Support Vector Machine Learning?
                 Classical Data Mining vs SVM

                   Classical Statistics                SVM - Support Vector Machines

                   Hypothesis on data distribution     Study of the model family:
                                                       the VC dimension

                   Large number of dimensions          Number of dimensions can be
                   implies large number of model       very high because generalization
                   parameters, which leads to          is controlled
                   generalization problems

                   Modeling seeks to get the best      Modeling seeks to get the best
                   fit                                 compromise between fit and
                                                       robustness

                   Manual iterations and time          Automation is possible
                   are necessary
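
The compromise between fit and robustness in the SVM column is commonly written as the soft-margin objective. This is the standard textbook formulation rather than something stated on the slide; the symbols $w$, $b$, $\xi_i$ and $C$ are introduced here for illustration only.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

The first term favors a wide margin (robustness), the second penalizes training examples that violate the margin (fit), and the constant $C$ sets the balance between the two.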



What makes SVM such a unique technology?
         • Strong theoretical foundation (Vapnik-Chervonenkis theory)
         • There is no upper limit on the number of attributes; the only constraint is
           the hardware
         • Good generalization to novel data
         • SVM is the preferred algorithm for sparse data
         • Algorithm of choice for challenging high-dimensional data
         • SVM supports active learning.
               – SVM models grow as the size of the training set increases, so big data
                 sets would otherwise be difficult to handle.
               – Active learning forces the SVM algorithm to restrict learning to the
                 most informative training examples.
         • SVM automatically selects a kernel
         • You can control both the model quality (accuracy) and the performance
           (build time)
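
One common way to pick "the most informative training examples" is to choose the candidates that lie closest to the current decision boundary. The sketch below assumes that strategy (often called margin-based or uncertainty sampling); the slides do not say which selection rule is actually used, and the model weights and the pool of points are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed decision value w.x + b of a linear model for one example."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def most_informative(weights, bias, pool, batch_size):
    """Return the pool examples closest to the current decision boundary."""
    ranked = sorted(pool, key=lambda x: abs(decision_value(weights, bias, x)))
    return ranked[:batch_size]


# Hypothetical current model and candidate pool (2-D points for readability).
weights, bias = (1.0, -1.0), 0.0
pool = [(3.0, 0.0), (0.5, 0.25), (0.0, 2.5), (1.0, 1.5), (-2.0, 2.0)]
to_label = most_informative(weights, bias, pool, batch_size=2)
```

With these values, `to_label` holds the two points nearest the boundary, `(0.5, 0.25)` and `(1.0, 1.5)`. Points far from the boundary are already classified with confidence, so training on them would add little while making the model larger.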

What makes SVM unique?
         SVM gives you control over the models

         [Figure: Robustness (low to high) plotted against Quality of fit
         (low accuracy to high accuracy)]
         • Under Fit Model (high robustness, low accuracy): High Robustness,
           Training Error = Test Error
         • Robust Model (high robustness, high accuracy): Low Training Error,
           Low Test Error
         • Over Fit Model (low robustness, high accuracy): Low Robustness,
           No Training Error, High Test Error
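
The three regions of this chart can be told apart by comparing training error with test error. The sketch below is one possible reading of the chart; the function names and both threshold values are assumptions for illustration, not figures from the slide.

```python
def error_rate(predictions, labels):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def diagnose(train_error, test_error, max_error=0.10, max_gap=0.05):
    """Place a model in one of the three regions named on the slide.

    The thresholds are illustrative assumptions, not values from the slide.
    """
    gap = test_error - train_error
    if gap > max_gap:
        return "over fit"    # fits training data far better than test data
    if train_error > max_error:
        return "under fit"   # similar errors on both sets, but both too high
    return "robust"          # low training error and low test error
```

For example, `diagnose(0.0, 0.30)` returns `"over fit"`, `diagnose(0.25, 0.26)` returns `"under fit"`, and `diagnose(0.04, 0.06)` returns `"robust"`.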
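What makes SVM unique?
         SVM gives you control over the models

         [Figure: Robustness (low to high) plotted against Quality (low to high)]
         • High robustness, low quality: Need more training data (rows)
         • High robustness, high quality: Safe to Deploy
         • Low robustness, low quality: Need more data (rows/columns)
           or different model type
         • Low robustness, high quality: Need more variables (columns)
           or different model type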

Treparel is a leading technology solution provider
in Big Data Text Analytics & Visualization

Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com


