Machine Learning based Text Classification introduction

Introduction to classification and clustering for modelling text analytics applications. Includes: Who is Treparel / 3 types of text classification / Why perform automated text classification / Appendix: The Genius Section / Support Vector Machines (SVM).

Published in: Technology, Education


  1. Introduction to Text Classification
     Dr. Anton Heijs, CTO, Treparel
     Delftechpark 26, 2628 XH Delft, The Netherlands
     info@treparel.com | www.treparel.com
     July 2012
  2. KMX enables information and knowledge professionals to gain faster, reliable, more precise insights into large complex unstructured data sets, allowing them to make better informed decisions.
     Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization.
     Treparel KMX – All rights reserved 2012 | www.treparel.com
  3. Topics covered in this presentation
     • Who is Treparel?
     • 3 types of text classification
     • Why perform automated text classification?
     • Appendix: The Genius Section
       • Evaluating classification performance
       • Improving the classifier
       • Multiclass classification
  4. Nexus of Forces: Social, Cloud, Mobile, Information
     IT market shift driving Big Data challenges. (Copyright: Gartner, 2011)
     80% of data is unstructured (documents, text, images, graphs).
  5. About Treparel
     • Founded in Delft, The Netherlands, in 2006.
     • Treparel is an innovative technology solution provider in Big Data Analytics, Text Mining and Visualization.
     • KMX is an integrated data analysis toolset which provides faster, reliable, intelligent insights into large complex unstructured data sets, allowing companies to make better informed decisions.
     • Clients: Philips, Bayer, Abbott, European Patent Office, European Commission.
     • Part of a research centre and university ecosystem: TU Delft, the Universities of Paris and São Paulo.
     • More info: www.treparel.com
  6. Positioning of Treparel's KMX technology
     Diagram (Copyright: Gartner, J. Popkin 2010) with three stages:
     • Text acquisition & preparation ("Seek"): external sources such as patents, legal, research, media/publishers, documents, websites, blogs, newsfeeds, email, application notes, search results, social networks.
     • Analysis and processing ("Model"): text preprocessing, clustering, classification, semantic analysis, visualization, information extraction (entities, facts, relationships, concepts, patents).
     • Output and display ("Adapt"): reporting & presentation, media and publishing, indexing databases, content management systems, line-of-business applications, research applications, search engines.
     • Underpinned by management, development and configuration.
  7. Why perform document classification?
     Example: patent documents are already classified in one or more of the many classification schemes (IPC, USPC, ECLA, Derwent, etc.). Why would we want to classify them again?
     • Classification systems are commonly based on a single viewpoint, and are limited in detail.
     • Placement within a classification is often inconsistent.
     • Related documents may be placed in very different locations.
     Tasks can rarely be resolved using the classification system alone!
  8. Why perform document classification?
     Challenge:
     • Find relationships, e.g. between competitor activity and detailed technology, in large collections of documents.
     Example:
     • Given a collection of Optical Recording patent families spanning 15 years (ca. 10,000 documents), make a subdivision into 10 predefined detailed technologies that do not match any available classification system.
  9. Why perform document classification?
     Option 1: Manual classification
     • The speed of manual classification varies considerably.
     • Even at an optimistic 10 s per document, classifying 10,000 documents would take over 27 hours of continuous work.
     Big Data paradox: data volumes and the need for data-driven decisions keep growing, while the (human) resources available for in-depth analysis are limited or decreasing.
  10. Why perform document classification?
      Option 2: Boolean search queries
      • Research shows that most users use only one or two terms per query.
      • Expert users make significantly longer queries.
      • Finding the right combination of keywords is very difficult, and easily results in queries that are either too generic (poor precision) or over-specific (poor recall).
      Huge risk of inaccuracy, with many pitfalls including synonyms, homonyms, lack of context and intentionally vague language.
  11. Why perform document classification?
      Option 3: Automated classification
      • Automated classification = text data mining.
      • Find useful patterns in text by extracting data from text.
      • Apply algorithms and methods to text, using expertise from machine learning and statistics.
      Automated classification: nothing remains unstructured.
  12. Automated text classification: how does it work?
      Diagram: original data → data preprocessing → classification → presentation & deployment.
      • TRAINING DATA (documents with known output, e.g. Yes/No labels) is preprocessed and used to build the classifier.
      • TEST DATA (new documents with unknown output) is passed through the trained classifier, which produces a predicted output (Yes/No) for each document.
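The training/test flow on this slide can be sketched in a few lines. The deck does not describe KMX's internals, so the toy training data and the word-weight scorer below are illustrative assumptions, not Treparel's implementation:

```python
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]

def train(labeled_docs):
    """Build per-word weights: words frequent in positive training
    documents get positive weight, words frequent in negative ones
    get negative weight (a crude stand-in for a learned classifier)."""
    pos, neg = Counter(), Counter()
    for text, label in labeled_docs:
        (pos if label else neg).update(tokenize(text))
    vocab = set(pos) | set(neg)
    return {w: pos[w] - neg[w] for w in vocab}

def classify(weights, text):
    # Predicted output: True ("Yes") if the summed word weights are positive.
    return sum(weights.get(w, 0) for w in tokenize(text)) > 0

# TRAINING DATA: documents with known Yes/No output (invented examples).
training = [
    ("optical disc recording laser", True),
    ("laser pickup optical head", True),
    ("combustion engine fuel injection", False),
    ("fuel pump engine control", False),
]
model = train(training)

# TEST DATA: new documents receive a predicted output.
print(classify(model, "a new optical laser recording method"))  # True
print(classify(model, "improved fuel injection for engines"))   # False
```

Real systems replace the word-weight scorer with a trained SVM (as the later slides explain), but the preprocessing/training/prediction split is the same.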
  13. The process of Automated Text Classification
  14. Appendix: The Genius Section
      • A small lecture in machine learning: from text to vectors.
  15. Automated text classification
      • In order to perform data mining operations we convert the text documents into a vector space model.
      • The vector space model represents documents as vectors in n-dimensional space. Each document is described by a numerical feature vector.
      • Documents can be compared by use of vector operations. This enables us to perform computations on text data.
  16. Machine Learning: From Text to Vectors
      Original text: "Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans"
      • Tokenization: sing; o; goddess; the; anger; of; achilles; son; of; peleus; that; brought; countless; ills; upon; the; achaeans
      • Stopword removal: sing; goddess; anger; achilles; son; peleus; brought; countless; ills; achaeans
      • Stemming: sing; god; anger; achilles; son; peleus; brin; count; ill; achae
      • Vectorization: (0, 0, 1, 0, 1, 0, 0, 0, …)
      The resulting vectors are very high-dimensional (d ≈ 1000) and very sparse.
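The four steps on this slide can be sketched directly. The tiny stopword list and the naive suffix-stripping "stemmer" below are simplifying assumptions for illustration, not the components KMX actually uses:

```python
import re

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "of", "o", "that", "upon", "a", "an"}

def tokenize(text):
    # Lowercase and split into alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Naive suffix stripping; real systems use e.g. the Porter stemmer.
    for suffix in ("less", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def vectorize(tokens, vocabulary):
    # Binary bag-of-words vector over a fixed vocabulary; in practice
    # d runs into the thousands and the vector is very sparse.
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

text = ("Sing, O goddess, the anger of Achilles son of Peleus, "
        "that brought countless ills upon the Achaeans")
tokens = remove_stopwords(tokenize(text))
stems = [stem(t) for t in tokens]
vocab = sorted(set(stems))
vector = vectorize(stems, vocab)
print(stems)
print(vector)
```

Over this tiny single-document vocabulary every entry is 1; the zeros on the slide appear once the vocabulary is built from a whole corpus, of which any one document contains only a small fraction.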
  17. Automated text classification: why SVM?
      • SVM enables us to find a good trade-off between accuracy and robustness.
      • SVM is well-suited for sparse data.
      • SVM is well-suited for high-dimensional data.
      • SVM is generally applicable, performing very well on a wide variety of tasks.
      Diagram (robustness vs quality of fit):
      • Under-fit model: high robustness, low accuracy; training error = test error.
      • Robust model: high robustness, high accuracy; low training error, low test error.
      • Over-fit model: low robustness; no training error, but high test error.
  18. What is Support Vector Machine learning? Classical data mining vs SVM
      Classical statistics:
      • Hypothesis on the data distribution.
      • A large number of dimensions implies a large number of model parameters, which leads to generalization problems.
      • Modeling seeks to get the best fit.
      • Manual iterations and time are necessary.
      SVM (Support Vector Machines):
      • Study of the model family: the VC dimension.
      • The number of dimensions can be very high because generalization is controlled.
      • Modeling seeks to get the best compromise between fit and robustness.
      • Automation is possible.
  19. Automated classification uses the SVM algorithm
      • Classes are separated by a line (d = 2), a plane (d = 3) or a hyperplane (d > 3).
      • The Support Vector Machines (SVM) algorithm is used to determine the optimal separating (hyper-)plane between the vectors of documents of class A and the vectors of documents not of class A.
      • Unknown examples (the red dot in the diagram) are classified according to their position with respect to the hyperplane.
      • In the diagram, documents are scored from 0 (deep on the negative side) through 50 (near the hyperplane) to 100 (deep on the positive side).
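Classifying by position relative to a hyperplane reduces to the sign of w·x + b. The weights, bias and 0–100 score mapping below are hand-picked assumptions standing in for what SVM training would actually produce; the training step itself (finding the optimal w and b) is outside this sketch:

```python
def decision(w, b, x):
    # Signed distance-like value: positive means the class-A side
    # of the hyperplane w.x + b = 0, negative means the other side.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def score(w, b, x, scale=10.0):
    # Map the decision value onto the 0-100 scale shown on the slide:
    # 50 on the hyperplane, higher on the class-A side (clipped).
    return max(0.0, min(100.0, 50.0 + scale * decision(w, b, x)))

w = [1.0, -1.0]   # assumed normal vector of the hyperplane (d = 2: a line)
b = 0.0

doc_in_class_a = [3.0, 1.0]   # lands on the positive side
doc_not_in_a = [1.0, 3.0]     # lands on the negative side

for x in (doc_in_class_a, doc_not_in_a):
    side = "class A" if decision(w, b, x) > 0 else "not class A"
    print(x, "->", side, "score", score(w, b, x))
```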
  20. Evaluating classification performance
      Confusion matrix (hypothesized class vs true class):
      • Hypothesized Y, true P: true positives (TP)
      • Hypothesized Y, true N: false positives (FP)
      • Hypothesized N, true P: false negatives (FN)
      • Hypothesized N, true N: true negatives (TN)
      The result set retrieved from the total collection contains the TP and FP; the FN and TN lie outside it.
  21. Evaluating classification performance (should you question the quality of the results…)
      • true positive (TP): eqv. with hit
      • true negative (TN): eqv. with correct rejection
      • false positive (FP): eqv. with false alarm, Type I error
      • false negative (FN): eqv. with miss, Type II error
      • true positive rate (TPR), eqv. with hit rate, recall, sensitivity: TPR = TP / P = TP / (TP + FN)
      • false positive rate (FPR), eqv. with false alarm rate, fall-out: FPR = FP / N = FP / (FP + TN)
      • accuracy (ACC): ACC = (TP + TN) / (P + N)
      • specificity (SPC): SPC = TN / (FP + TN) = 1 − FPR
      • positive predictive value (PPV), eqv. with precision: PPV = TP / (TP + FP)
      • negative predictive value (NPV): NPV = TN / (TN + FN)
      • false discovery rate (FDR): FDR = FP / (FP + TP)
      • Matthews correlation coefficient (MCC)
  22. Evaluating classification performance
      • Recall = number of relevant records retrieved / total number of relevant records = TP / (TP + FN)
      • Precision = number of relevant records retrieved / total number of records retrieved = TP / (TP + FP)
      • We want P > 0.8 and R > 0.8.
      • F1 measure: F1 = 2PR / (P + R)
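The formulas on slides 21 and 22 are easy to check on concrete counts. The confusion-matrix numbers below are made up for illustration:

```python
def evaluate(tp, fp, fn, tn):
    """Compute the slide-21/22 metrics from raw confusion-matrix counts."""
    p, n = tp + fn, fp + tn            # true positives and negatives overall
    return {
        "recall":      tp / (tp + fn),        # TPR, sensitivity, hit rate
        "precision":   tp / (tp + fp),        # PPV
        "fpr":         fp / (fp + tn),        # fall-out
        "accuracy":    (tp + tn) / (p + n),
        "specificity": tn / (fp + tn),        # = 1 - FPR
        "f1":          2 * tp / (2 * tp + fp + fn),  # algebraically 2PR/(P+R)
    }

# Assumed counts: 90 hits, 15 false alarms, 10 misses, 885 correct rejections.
m = evaluate(tp=90, fp=15, fn=10, tn=885)
print(m)
print(m["precision"] > 0.8 and m["recall"] > 0.8)  # the P>0.8, R>0.8 target: True
```

Here recall is 0.90 and precision is about 0.857, so this classifier would meet the deck's P > 0.8, R > 0.8 target.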
  23. Evaluating classification performance
      • Start with the threshold at 100: all documents are then classified as negatives.
      • Lower the threshold and check whether the true positives grow faster than the false positives; the diagonal "monkey line" marks the performance of random guessing.
      • The area between the gray lines is the 95% confidence interval. To reduce the width of the confidence interval we need to increase the size of the labeled test set.
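The threshold sweep this slide describes can be sketched on a scored test set. The scores and true labels below are invented for illustration; each document scoring at or above the threshold is called positive:

```python
# (classifier score on the 0-100 scale, true label) for a labeled test set.
scored_test_set = [
    (95, True), (88, True), (80, False), (72, True),
    (60, True), (55, False), (40, False), (20, False),
]

def counts_at(threshold):
    # TP and FP among documents called positive at this threshold.
    tp = sum(1 for s, y in scored_test_set if s >= threshold and y)
    fp = sum(1 for s, y in scored_test_set if s >= threshold and not y)
    return tp, fp

# Start at 100 (everything negative) and lower the threshold, watching
# whether TP grows faster than FP, as the slide suggests.
for threshold in (100, 80, 60, 40, 0):
    tp, fp = counts_at(threshold)
    print(f"threshold {threshold:3d}: TP={tp} FP={fp}")
```

Plotting TP against FP over the sweep traces the curve on the slide; a classifier hugging the "monkey line" gains false positives as fast as true positives.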
  24. Improving the Classifier™
      Iterate to improve classification performance.
  25. Improving the Classifier™
      Once we have created the first classifier and used it to classify the rest of the available documents, we can use the classification results to suggest additional training documents.
      Suggestion → Labeling → Improved classifier
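The slide does not say how KMX picks its suggestions; a common active-learning heuristic is to suggest the unlabeled documents whose scores sit closest to the decision boundary (score 50 on the deck's 0–100 scale), since the classifier is least certain about them. The document ids and scores below are assumed:

```python
# Assumed classifier scores (0-100 scale) for still-unlabeled documents.
unlabeled = {
    "doc1": 97, "doc2": 52, "doc3": 12, "doc4": 47, "doc5": 71,
}

def suggest(scores, k=2, boundary=50):
    # Rank documents by distance to the decision boundary; the nearest k
    # are suggested to the analyst for labeling in the next iteration.
    ranked = sorted(scores, key=lambda d: abs(scores[d] - boundary))
    return ranked[:k]

print(suggest(unlabeled))  # ['doc2', 'doc4'] -- scores 52 and 47, nearest to 50
```

After the analyst labels the suggested documents, they join the training set and the classifier is retrained, closing the Suggestion → Labeling → Improved loop.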
  26. Multiclass classification using a binary Classifier™
      • Classify your data set into multiple non-overlapping classes.
      • Positive examples for one class are automatically used as negative examples for the other classes.
  27. Multiclass classification using a binary Classifier™
      Diagram: each document receives a motorcycle score, an automobile score and a lorry score from the three binary classifiers; a class determination step then assigns the document to a class based on those scores.
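The one-vs-rest scheme of slides 26–27 comes down to running one binary scorer per class and letting class determination pick the highest score. The per-class scores below are assumed values on the deck's 0–100 scale rather than the output of trained classifiers:

```python
def determine_class(scores):
    """Class determination: given one binary-classifier score per class
    for a single document, assign the class with the highest score."""
    return max(scores, key=scores.get)

# Assumed scores for one document from the three binary classifiers.
doc_scores = {"motorcycle": 35, "automobile": 80, "lorry": 55}
print(determine_class(doc_scores))  # automobile
```

Because the classes are non-overlapping, taking the argmax yields exactly one class per document, even though each underlying classifier is only a binary yes/no model.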
  28. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization
      Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands
      www.treparel.com
