Natural Language Processing and Machine Learning for Discovery

This presentation was given as a guest lecture by Bommarito Consulting for the Michigan State University College of Law eDiscovery seminar on October 29th, 2012. The goal of the presentation is to provide students with the ability to understand and communicate with their discovery software vendors and service providers with respect to the underlying mechanics of predictive coding.


Transcript

  • 1. NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING FOR DISCOVERY
    MSU Law, Electronic Discovery, Fall 2012, Week 9
  • 2. GOALS
    Understand the BLACK BOX.
      • Natural language processing
          • Mathematical and linguistic concepts
          • Models of representation
          • Real-world application
      • Machine learning
          • Common pre-processing and learning algorithms
          • Real-world application
    Communicate with software and service vendors!
    © Bommarito Consulting
  • 3. BLACK BOX
    How do we characterize a black box?
    [Diagram labels: 3 English medium; Inputs, Parameters, Outputs]
  • 4. BLACK BOX
      • Secret: Most black boxes are very similar inside.
      • We’re going to learn to identify the common parts.
  • 5. NATURAL LANGUAGE PROCESSING
    Definition: Dealing with real-world text in an automated, reproducible way.
    Often referred to as NLP. Used somewhat interchangeably with computational linguistics.
  • 6. NATURAL LANGUAGE PROCESSING
    Let’s start with some text.
    “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy)
  • 7. NATURAL LANGUAGE PROCESSING: What kind of questions can we ask?
    Basic
      • What is the structure of the text?
          • Paragraphs
          • Sentences
          • Tokens/words
      • What are the words that appear in this text?
          • Nouns
          • Subjects
          • Direct objects
          • Verbs
    Advanced
      • What are the concepts that appear in this text?
      • How does this text compare to other text?
  • 8. NATURAL LANGUAGE PROCESSING: Segmentation and Tokenization
    “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
    Segment types:
      • Paragraphs
      • Sentences
      • Tokens
  • 9. NATURAL LANGUAGE PROCESSING: Segmentation and Tokenization
    But how does it work?
      • Paragraphs
          • Two consecutive line breaks
          • A hard line break followed by an indent
      • Sentences
          • Period, except abbreviation, ellipsis within quotation, etc.
      • Tokens and Words
          • Whitespace
          • Punctuation
    Remember what real-world text looks like: think text and email.
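
The segmentation rules on slide 9 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of how any particular discovery product works. The function names, the tiny abbreviation list, and the sample text are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real tools ship far larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "e.g.", "i.e.", "etc."}

def split_paragraphs(text):
    """Split on consecutive line breaks, or on a line break followed by an indent."""
    parts = re.split(r"\n\s*\n|\n(?=[ \t]+)", text)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(paragraph):
    """End a sentence at '.', '!' or '?' unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in paragraph.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Split on whitespace and punctuation, keeping numbers like 3,200 or $18 whole."""
    return re.findall(r"\$?\d[\d,.]*\d|\$?\d|\w+(?:[-']\w+)*|[^\w\s]", sentence)

sample = ("Dr. Smith grounded 3,200 flights today.\n\n"
          "Damage may reach $18 billion. Power could be out for a week!")

for paragraph in split_paragraphs(sample):
    for sentence in split_sentences(paragraph):
        print(tokenize(sentence))
```

Even this toy version shows why vendors differ: every rule (what counts as an abbreviation, whether "3,200" is one token or three) is a choice, and messy email or text-message input breaks simple rules quickly.
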
  • 10. NATURAL LANGUAGE PROCESSING: Segmentation and Tokenization
    “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
    Paragraphs: 2
    Sentences: 2
    Words: 561
      • [Hurricane, Sandy, grounded, 3,200, flights, scheduled, for, today, and, tomorrow, …]
  • 11. NATURAL LANGUAGE PROCESSING: What kind of questions can we ask?
    We now have an ordered list of tokens.
    [Hurricane, Sandy, grounded, 3,200, flights, scheduled, for, today, and, tomorrow, …]
      • Does the phrase “quote stuffing” occur in the text?
      • How many times does “Sandy” occur?
      • How often does “outage” occur after “power”?
      • What percentage of tokens are numbers?
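
Each of the questions on slide 11 reduces to a simple pass over the ordered token list. The sketch below is illustrative only: the short token list is made up for the example and the helper names are hypothetical.

```python
import re

# Hypothetical token list, loosely based on the Sandy passage.
tokens = ["Hurricane", "Sandy", "grounded", "3,200", "flights", "and", "may",
          "knock", "out", "power", "to", "millions", ".", "Sandy", "killed",
          "65", "people"]

def contains_phrase(tokens, phrase):
    """True if the words of `phrase` appear consecutively in `tokens`."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def count_after(tokens, first, second):
    """Count how often `second` immediately follows `first`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def percent_numbers(tokens):
    """Percentage of tokens that look like numbers, e.g. 65 or 3,200."""
    numeric = [t for t in tokens if re.fullmatch(r"\$?\d[\d,]*(\.\d+)?", t)]
    return 100.0 * len(numeric) / len(tokens)

print(contains_phrase(tokens, "quote stuffing"))   # phrase search
print(tokens.count("Sandy"))                       # term count
print(count_after(tokens, "power", "outage"))      # one word following another
print(round(percent_numbers(tokens), 1))           # share of numeric tokens
```
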
  • 12. NATURAL LANGUAGE PROCESSING: An Aside on Storage
    Data: The word ‘the’ ten times and the word ‘a’ ten times.
      • Representation 1 - Ordered List:
          • [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
      • Representation 2 - Term Frequency:
          • [(‘the’, 10), (‘a’, 10)]
  • 13. NATURAL LANGUAGE PROCESSING: An Aside on Storage
      • Representation 1 - Ordered List:
          • [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
      • Representation 2 - Frequency Map:
          • [(‘the’, 10), (‘a’, 10)]
    Tradeoffs
      • Total space
      • Ease of answering certain questions
      • Information about context
    Not all software makes the same choice!
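
The tradeoff between the two representations is easy to see in code. This is a minimal sketch using Python's standard `collections.Counter`; the variable names are chosen for illustration.

```python
from collections import Counter

# Representation 1: ordered list ('the', 'a' alternating, ten times each).
ordered = ["the", "a"] * 10

# Representation 2: frequency map built from the same data.
frequency = Counter(ordered)

# Space: the list stores every occurrence, the map stores one entry per distinct word.
print(len(ordered), len(frequency))

# Counting is a direct lookup in the frequency map.
print(frequency["the"])

# Context survives only in the ordered list, e.g. which token follows each 'the'.
followers = [b for a, b in zip(ordered, ordered[1:]) if a == "the"]
print(Counter(followers))
```

A tool that stores only frequencies cannot answer phrase or proximity questions, which is one reason to ask a vendor which representation their index keeps.
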
  • 14. NATURAL LANGUAGE PROCESSING: Stopwording, Stemming, Parsing, and Tagging
      • Stopwording
          • Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.
      • Stemming
          • Matching declined nouns like dog/dogs or child/children.
          • Matching conjugated verbs like run/ran.
      • Parsing
          • Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we’ll skip).
      • Tagging
          • Identifying the part of speech of each token in a sentence.
  • 15. NATURAL LANGUAGE PROCESSING: Stopwording
    Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

    Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain. System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
  • 16. NATURAL LANGUAGE PROCESSING: Stopwording + Stemming
    Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

    Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain. System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
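
A toy version of stopwording plus stemming is sketched below. The stopword set and the four suffix rules are assumptions made for this example; production stemmers such as the Porter algorithm apply many more rules, so their output will not match this sketch on every word. Note also that stripping suffixes only handles regular forms; irregular pairs such as child/children or run/ran generally need a dictionary-based approach (lemmatization).

```python
# Hypothetical, very small stopword list; real lists contain hundreds of words.
STOPWORDS = {"a", "an", "and", "as", "for", "in", "it", "of", "on", "the", "to", "with"}

# Suffixes are checked in this order, so longer endings win over a bare "s".
SUFFIXES = ("ing", "ed", "es", "s")

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def stem(token):
    """Lowercase the token and strip one suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["flights", "scheduled", "for", "today", "and", "tomorrow"]
stems = [stem(t) for t in remove_stopwords(tokens)]
print(stems)
```
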
  • 17. NATURAL LANGUAGE PROCESSING: Tagging
    Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

    [(Hurricane, NNP), (Sandy, NNP), (grounded, VBD), (3,200, CD), (flights, NNS), (scheduled, VBN), (for, IN), (today, NN), (and, CC), (tomorrow, NN), …]
  • 18. NATURAL LANGUAGE PROCESSING: Back to the black box.
    [Diagram labels: 3 English medium; Inputs, Parameters, Outputs]
  • 19. NATURAL LANGUAGE PROCESSING
    Let’s say that we’re investigating Enron for accounting fraud related to its reserve reporting and transfers.
    We want to look for any material that discusses reserves and profits in the same sentence. However, we want cases where these words are used as nouns; we’re not interested in dinner reservations.

    Inputs            Parameters       Output
    Memos             Stopword: No     Memos
    Research          Stem: Yes        Research
    Emails            Tag: Yes         Emails
    Texts             Search: …        Texts
    Transcriptions                     Transcriptions
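
The Enron search on slide 19 (stem: yes, tag: yes) could look roughly like the sketch below. It is a hedged illustration: the sentences are invented and tagged by hand with Penn Treebank tags, since automatic tagging requires a trained model, and `simple_stem` is a hypothetical stand-in for a real stemmer.

```python
# Penn Treebank noun tags (singular, plural, proper, proper plural).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def simple_stem(token):
    """Hypothetical stand-in stemmer: lowercase and drop a trailing 's'."""
    word = token.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def noun_stems(tagged_sentence):
    """Collect the stems of tokens tagged as nouns."""
    return {simple_stem(tok) for tok, tag in tagged_sentence if tag in NOUN_TAGS}

def matches(tagged_sentence, targets=("reserve", "profit")):
    """True if every target stem occurs as a noun in the same sentence."""
    stems = noun_stems(tagged_sentence)
    return all(t in stems for t in targets)

# Invented, hand-tagged example sentences.
sentences = [
    [("Reserves", "NNS"), ("inflated", "VBD"), ("reported", "VBN"), ("profits", "NNS"), (".", ".")],
    [("She", "PRP"), ("reserved", "VBD"), ("a", "DT"), ("table", "NN"), ("for", "IN"), ("dinner", "NN"), (".", ".")],
    [("He", "PRP"), ("reserves", "VBZ"), ("the", "DT"), ("profits", "NNS"), (".", ".")],
]

hits = [s for s in sentences if matches(s)]
print(len(hits))
```

Only the first sentence qualifies: the second has no target nouns, and in the third "reserves" is tagged as a verb, which is exactly the distinction the tagging parameter is meant to capture.
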
  • 20. NATURAL LANGUAGE PROCESSING
    In general, all document search and discovery software combines the elements discussed above.
      • Segment
      • Tokenize
      • Stopword
      • Stem
      • Parse
      • Tag
      • Store
      • Search
      • Retrieve
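
The last three steps (store, search, retrieve) are commonly built on an inverted index, which maps each term to the documents containing it. The slides do not say which storage structure any given product uses, so the following is only one plausible sketch; the document ids and texts are invented.

```python
from collections import defaultdict

def build_index(documents):
    """Store: map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token.strip(".,;:!?\"'")].add(doc_id)
    return index

def search(index, *terms):
    """Search: return ids of documents that contain every term (boolean AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# Invented example collection.
documents = {
    "memo-1": "Reserves were transferred before profits were reported.",
    "email-7": "Dinner reservations are confirmed.",
}

index = build_index(documents)

# Retrieve: look up the matching documents by id.
for doc_id in sorted(search(index, "reserves", "profits")):
    print(doc_id, "->", documents[doc_id])
```
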
  • 21. NATURAL LANGUAGE PROCESSING
    How do they differ?
      • Interface and ease-of-use
      • De-duplication and versioning
      • Supported languages
      • Optical character recognition (OCR)
      • File formats, e.g., Word, WordPerfect, PDF, HTML
      • Ability to scale to large databases
  • 22. MACHINE LEARNING
    Definition: Automated classification and prediction on data.
    Examples:
      • Product recommenders, a la Amazon
      • Computer vision: is it a cat?
      • Sentiment analysis
      • Topic classification
      • Document clustering
    At least two stages to machine learning:
      • Training
      • Classification
  • 23. MACHINE LEARNING: Learning
    Machine learning requires “learning” or “training.”
    There are two types of training:
      • Supervised
      • Unsupervised
    The goal of training is to determine a mapping from input features to a set of target classes.
  • 24. MACHINE LEARNING: Learning
    Imagine a student given a small list of organisms and descriptions. The student is tasked to assign the organisms into groups based on these descriptions. Where do the groups come from?
      • Supervised: The teacher provides the answers.
      • Unsupervised: The teacher provides nothing.
    When the student is done with the task, the teacher checks the student’s responses and decides if the student has learned.
    In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.
  • 25. MACHINE LEARNING: Learning
    What if the teacher gave the student some of the answers? This is semi-supervised learning.
      • Supervised: The teacher provides the answers.
      • Semi-supervised: The teacher provides some answers.
      • Unsupervised: The teacher provides nothing.
  • 26. MACHINE LEARNING: Classification
    The student has now learned to map from an organism’s description to a group.
    Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
  • 27. MACHINE LEARNING
    This is exactly how predictive coding works!
      • Organisms : Documents
      • Descriptions : Natural language features or models
      • Semi-supervised : Sample coding
    The goal of predictive coding in discovery is to learn to classify documents based on natural language features, typically into relevant/irrelevant or privileged/unprivileged.
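
To make the train-then-classify loop concrete, here is a small multinomial Naive Bayes classifier written with only the standard library. Naive Bayes is chosen purely for illustration; the slides do not state which algorithm any given predictive coding product uses. The coded sample documents, labels, and function names are all invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(coded_docs):
    """Training: count labels and words from (tokens, label) pairs coded by a reviewer."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in coded_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def classify(model, tokens):
    """Classification: pick the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)          # prior from the coded sample
        for tok in tokens:
            if tok in vocabulary:                      # ignore words never seen in training
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented sample coding: stemmed tokens plus a reviewer's label.
coded_sample = [
    (["reserve", "transfer", "profit"], "relevant"),
    (["reserve", "report", "profit"], "relevant"),
    (["dinner", "reservation", "friday"], "irrelevant"),
    (["lunch", "friday", "menu"], "irrelevant"),
]

model = train(coded_sample)
print(classify(model, ["profit", "reserve", "memo"]))
print(classify(model, ["friday", "lunch"]))
```

In the slide's analogy, `train` is the student studying the teacher's answers and `classify` is the student in the field applying what was learned to documents nobody has coded yet.
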
  • 28. MACHINE LEARNING: Some Machine Learning Algorithms
    Supervised
      • Statistical models
          • Bayesian, e.g., Naïve Bayes Classification
          • Frequentist, e.g., Ordinary Least Squares
      • Neural Networks (NN)
      • Support Vector Machines (SVM)
      • Random Forests (RF)
      • Genetic Algorithms (GA)
    Semi/unsupervised
      • Neural Networks (NN)
      • Clustering
      • K-means
      • Hierarchical
      • Radial Basis (RBF)
      • Graph
  • 29. MACHINE LEARNING: Notes on Algorithm Diversity
    Not all algorithms return scores; some are binary.
      • True, True, False
      • 0.9, 0.7, 0.1
    Not all algorithms support more than two classes.
      • Cat, Dog, Mouse
      • Cat, Not Cat
    Not all algorithms scale similarly.
      • 1M documents = 1 day
      • 10M documents = {10 days, 100 days, 1000 days}
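
Two of the points on slide 29 can be checked with a few lines of arithmetic. The sketch below assumes a cutoff of 0.5 for turning scores into binary calls, and it reads the three run-time figures as linear, quadratic, and cubic growth; that reading is an interpretation of the slide, not something it states. Function names are hypothetical.

```python
def to_binary(scores, threshold=0.5):
    """Turn scores into binary calls; the threshold is a choice, not a given."""
    return [s >= threshold for s in scores]

def projected_days(n_docs, exponent, base_docs=1_000_000, base_days=1.0):
    """Extrapolate run time assuming cost grows as n_docs ** exponent."""
    return base_days * (n_docs / base_docs) ** exponent

# The slide's scores map onto the slide's binary outputs at a 0.5 cutoff.
print(to_binary([0.9, 0.7, 0.1]))

# Ten times the documents under linear, quadratic, and cubic scaling.
print([projected_days(10_000_000, k) for k in (1, 2, 3)])
```

The practical consequence is that a pilot on one million documents says little about a ten-million-document matter unless the vendor can explain how their algorithm scales.
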
  • 30. THANKS!
    You can get these slides on my blog: http://bommaritollc.com/blog/.
      • Michael J Bommarito II
      • CEO, Bommarito Consulting, LLC
      • Email: michael@bommaritollc.com
      • Web: http://bommaritollc.com/
  • 31. REFERENCES
    Books and Wiki Pages
      • A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß.
        http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
      • Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya, Zhang, Damerau.
        http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
      • The Elements of Statistical Learning.
        http://www-stat.stanford.edu/~tibs/ElemStatLearn/
      • Wiki: Machine Learning.
        http://en.wikipedia.org/wiki/Machine_learning
      • Wiki: Machine Learning Algorithms.
        http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
    Software
      • Natural Language Toolkit (NLTK).
        http://nltk.org/
      • Stanford NLP Group.
        http://nlp.stanford.edu/software/
      • Weka.
        http://www.cs.waikato.ac.nz/ml/weka/
      • R.
        http://www.r-project.org/
      • SAS Predictive Analytics and Data Mining.
        http://www.sas.com/technologies/analytics/datamining/index.html