
Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012



Introduction to Text Analytics algorithms and Support Vector Machines (SVM) for modelling Text Analytics applications. Includes: Who is Treparel / Introduction to Text Mining / What is automated Classification and Clustering / Support Vector Machines (SVM)

Published in: Technology, Education


  1. Introduction to Text Mining & Support Vector Machines (SVM). Dr. Anton Heijs, CEO, Treparel, Delftechpark 26, 2628 XH Delft. July 2012.
  2. KMX enables information and knowledge professionals to gain faster, more reliable, more precise insights into large, complex unstructured data sets, allowing them to make better-informed decisions. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization. Treparel KMX – All rights reserved 2012
  3. Topics covered in this presentation
     • Who is Treparel?
     • Introduction to Text Mining
     • What is Automated Classification & Clustering?
     • Introducing Support Vector Machines
  4. Nexus of Forces: Social, Cloud, Mobile, Information. IT market shift driving Big Data challenges (Copyright: Gartner, 2011). 80% of data is unstructured (documents, text, images, graphs).
  5. About Treparel
     • Delft, The Netherlands, 2006.
     • Treparel is an innovative technology solution provider in Big Data Analytics, Text Mining and Visualization.
     • KMX is an integrated data analysis toolset that provides faster, more reliable, intelligent insights into large, complex unstructured data sets, so that companies can make better-informed decisions.
     • Clients: Philips, Bayer, Abbott, European Patent Office, European Commission
     • Part of a research center and university ecosystem: TU Delft, Universities of Paris and Sao Paulo
     • More info: www.treparel.com
  6. Positioning of Treparel’s KMX technology
     [Diagram: three stages, with "Management, Development and Configuration" spanning all of them. Copyright: Gartner, J. Popkin 2010]
     • Text Acquisition & Preparation (‘Seek’). External sources: Patents, Legal, Research, Media / Publishers, Other sources, Documents, Websites, Blogs, Newsfeeds, Email, Application notes, Search results, Social networks
     • Analysis and processing (‘Model’): Text preprocessing, Indexing, Clustering, Classification, Semantic Analysis, Visualization, Information extraction (entities, facts, relationships, concepts, patents)
     • Output and display (‘Adapt’): Reporting & Presentation, Media and publishing databases, Content management systems, Line-of-business applications, Research applications, Search engines
  7. Getting to know the basics
     PART A: Intro to Text Mining
     • The Data (text & image) Mining evolution
     • What is Data Mining: inside or outside the database
     • The Data Mining process
     • Two types of Data Mining tasks: Predictive and Descriptive
     • Two modes of Data Mining tasks: Supervised and Unsupervised
     • The most important algorithms per category
     PART B: SVM
     • Machine Learning & Support Vector Machines (SVM)
     • What makes SVM unique
     • When and how to deploy SVM
     • Case Studies & Examples
  8. The Data/Text/Image mining evolution: the road ahead
     [Chart: application value (low to high) against usability (hard to use to easy to use). Labels on the chart: OLAP; Query and Reporting; Traditional Data Mining; “Easy-to-Use” Data Mining Tools; SVM Predictive Modeling; Enterprise Analytical Modeling; Text Analytics. Period markers: 1980’s, 1990’s, 1995 - 2000, Today, Future.]
  9. Knowledge Mining: different levels of depth in knowledge discovery
     [Diagram: knowledge discovery over time, from Data Collection (Seek) to Visualization (Adapt). Labels on the diagram: Meta Data, Filtered data, Knowledge; Graph Mining, Text Mining, Data Mining; Models of meta data, Models of data, Models of semantic data.]
  10. What is Data Mining? Getting to know the basics
      • Most businesses have an enormous amount of data, with a great deal of information hidden within it. The data is growing faster than the knowledge extracted from it, which leads to a growing gap between data and knowledge.
      • Data mining provides a way to automatically extract information buried in the data.
      • Data mining creates mathematical models that describe patterns in large, complex collections of data.
      • These patterns elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of the patterns, or the difficulty of performing the analysis.
      • Mining the data directly in the database has advantages: less data movement, more data security, a single source of the data.
      • Basically, two types of data exist:
        – Structured (tables & numbers): 20% of data volume
        – Unstructured (text, images): 80% of data volume
  11. The Data & Text Mining process: automating the mining steps; adding new features. Understanding the knowledge mining value chain
      [Diagram: value chain of mining steps, annotated "Treparels Focus & Core competence" and "Traditional Players". Steps recoverable from the diagram: Data Collection; Data Preparation & Cleansing; Algorithm Selection; Model Building & Testing; Model Understanding; Model generation & Deployment; (All models) Visualization & coordination.]
  12. Two types of Data Mining functions
      Predictive Data Mining (supervised):
      • Used to predict a value; requires the specification of a target (known outcome).
      • Targets are either binary attributes (indicating yes/no decisions) or multi-class targets indicating a preferred alternative (color of sweater, salary range).
      • Constructs one or more models; these models are used to predict outcomes for data sets.
      Descriptive Data Mining (unsupervised):
      • Used to find the intrinsic structure, relations, or affinities in data.
      • Describes a data set in a concise way and presents interesting characteristics of the data.
      • The functions are: clustering, association models, and feature extraction.
  13. How does Automated Classification & Clustering work?
      • Classification consists of dividing the items that make up a collection into categories or classes.
      • The goal is to accurately predict the target class for each record in new data.
      • Algorithms for classification (different algorithms for different problems):
        – Naïve Bayes
        – Adaptive Bayes Network
        – Support Vector Machine
        – Decision Tree
      • Classification is used in: customer segmentation, sentiment analysis, competitive analysis, business modeling, credit analysis, smart content, fraud and terrorist detection, diagnosis support, patent & drug discovery.
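To make "predict the target class for each record in new data" concrete, here is a minimal sketch of the first algorithm on the list, Naïve Bayes, applied to short texts. It is an illustration only and not how KMX works: the function names `train_naive_bayes` and `classify`, the toy training snippets, and the use of add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log-posterior for a new document."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set: patent-like vs. news-like snippets.
training = [
    ("claim method apparatus invention", "patent"),
    ("invention claim embodiment device", "patent"),
    ("election minister vote parliament", "news"),
    ("vote result election campaign", "news"),
]
model = train_naive_bayes(training)
prediction = classify(model, "new device claim for an invention")
```

The training step only counts words per class, which is why the comparison table in this deck rates Naïve Bayes as very fast; the price is the assumption that words occur independently of each other given the class.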
  14. Text Mining algorithms and features
      | Feature | Naive Bayes | Adaptive Bayes Network | Support Vector Machine | Decision Tree |
      | Speed | Very fast | Fast | Fast with active learning | Fast |
      | Accuracy | Good in many domains | Good in many domains | Significant | Good in many domains |
      | Transparency | No rules (black box) | Rules for | No rules (black box) | Rules |
      | Missing value interpretation | Missing value | Missing value | Sparse data | Missing value |
  15. What is Support Vector Machine Learning? State of the art algorithm
      • SVM is a state-of-the-art classification and regression algorithm.
      • The SVM optimization procedure seeks high predictive accuracy while regularizing the model (maximizing the margin between classes) to limit over-fitting of the training data.
      • SVM projects the input data into a kernel space, then builds a linear model in this kernel space.
      • SVM performs well in real-world applications such as classifying text, recognizing hand-written characters, and classifying images, as well as in bioinformatics and biosequence analysis.
      • SVMs are standard tools for machine learning and data mining.
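The idea of projecting data into a kernel space and building a linear model there can be sketched without any library. The example below is deliberately not an SVM: it uses a kernel perceptron, which shares the kernel mechanism but does not maximize the margin. The one-dimensional data set, the degree-2 polynomial kernel, and all function names are assumptions chosen for illustration. The classes (points with |x| > 1 against points near zero) cannot be separated by a single threshold on x, but they become linearly separable in the space implied by the kernel.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel: an inner product in an implicit feature space."""
    return (x * z + 1.0) ** degree


def decision_score(x, xs, ys, alphas, kernel):
    """Linear model in kernel space, written as a weighted sum over training points."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alphas, xs, ys))


def train_kernel_perceptron(xs, ys, kernel, max_epochs=1000):
    """Learn one coefficient per training example (dual form of the perceptron)."""
    alphas = [0.0] * len(xs)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * decision_score(x, xs, ys, alphas, kernel) <= 0:
                alphas[i] += 1.0  # misclassified: give this example more weight
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return alphas


def predict(x, xs, ys, alphas, kernel):
    """Classify a point as +1 or -1 from the sign of the decision score."""
    return 1 if decision_score(x, xs, ys, alphas, kernel) > 0 else -1


# Toy data (assumed): label +1 when |x| > 1, otherwise -1.
xs = [-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 3.0]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]
alphas = train_kernel_perceptron(xs, ys, poly_kernel)
train_predictions = [predict(x, xs, ys, alphas, poly_kernel) for x in xs]
```

An SVM replaces the mistake-driven update with an optimization that picks the separating boundary with the largest margin, and typically ends up with non-zero coefficients for only a subset of the training points (the support vectors).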
  16. What is Support Vector Machine Learning? Classical Data Mining vs SVM
      Classical Statistics:
      • Hypothesis on data distribution
      • A large number of dimensions implies a large number of model parameters, which leads to generalization problems
      • Modeling seeks to get the best fit
      • Manual iterations and time are necessary
      SVM (Support Vector Machines):
      • Study of the model family: the VC dimension
      • The number of dimensions can be very high because generalization is controlled
      • Modeling seeks to get the best compromise between fit and robustness
      • Automation is possible
  17. What makes SVM such a unique technology?
      • Strong theoretical foundation (Vapnik-Chervonenkis theory)
      • There is no upper limit on the number of attributes; the only constraint is the hardware
      • Good generalization to novel data
      • SVM is the preferred algorithm for sparse data
      • Algorithm of choice for challenging high-dimensional data
      • SVM supports active learning.
        – Because SVM models grow as the size of the training set increases, big data sets would otherwise be difficult to handle.
        – Active learning forces the SVM algorithm to restrict learning to the most informative training examples.
      • Kernel selection can be automated
      • You can control both the model quality (accuracy) and the performance (build time)
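One common way to "restrict learning to the most informative training examples" is uncertainty sampling: among the candidate examples, pick those closest to the current decision boundary, since the model is least sure about them. The slides do not say which strategy is used, so the sketch below is an assumption, as are the function names, the linear model weights, and the candidate pool.

```python
def decision_value(weights, bias, features):
    """Signed score of a linear model; its magnitude is proportional to the distance from the boundary."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def most_informative(weights, bias, pool, k):
    """Return the indices of the k pool items closest to the decision boundary."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: abs(decision_value(weights, bias, pool[i])),
    )
    return ranked[:k]


# Hypothetical current model and pool of candidate examples.
weights, bias = [1.0, -1.0], 0.0
pool = [[5.0, 0.0], [1.0, 0.9], [0.0, 4.0], [2.0, 2.5]]
to_label = most_informative(weights, bias, pool, k=2)
```

Here the second and fourth candidates are selected because their scores are nearest to zero; the two points far from the boundary would add little to the model and can be left out of training.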
  18. What makes SVM unique? SVM gives you control over the models
      [Chart: robustness (low to high) against quality of fit (low accuracy to high accuracy)]
      • Under-fit model: high robustness; training error = test error
      • Robust model: high robustness; low training error, low test error
      • Over-fit model: low robustness; no training error, high test error
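The three cases on this slide can be told apart by comparing the error on the training data with the error on held-out test data. The helper below is a minimal sketch of that comparison; the function name `diagnose_fit` and the two thresholds (`max_error`, `max_gap`) are assumptions for illustration and would need tuning for a real application.

```python
def diagnose_fit(train_error, test_error, max_error=0.10, max_gap=0.05):
    """Label a model as 'robust', 'under-fit' or 'over-fit' from its error rates."""
    gap = test_error - train_error
    if gap > max_gap:
        return "over-fit"    # test error clearly higher than training error
    if test_error > max_error:
        return "under-fit"   # errors are similar, but both are too high
    return "robust"          # low training error and low test error


# Hypothetical error rates for three models.
overfit_case = diagnose_fit(train_error=0.0, test_error=0.30)
underfit_case = diagnose_fit(train_error=0.25, test_error=0.26)
robust_case = diagnose_fit(train_error=0.03, test_error=0.04)
```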
  19. What makes SVM unique? SVM gives you control over the models
      [Chart: robustness (low to high) against quality (low to high). Quadrant labels:]
      • Safe to deploy
      • Need more training data (rows)
      • Need more variables (columns) or different model type
      • Need more data (rows/columns) or different model type
  20. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization. Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands. www.treparel.com