ICPSR - Complex Systems Models in the Social Sciences - Lecture 7 - Professor Daniel Martin Katz (Guest Speaker: Michael J Bommarito II)
 


    • ICPSR July 30, 2013 NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING
    • Let’s start with some text. “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy) NATURAL LANGUAGE PROCESSING © Bommarito Consulting
    • Real Data
        • When we work with real data, we often need to pre-process and clean it before we can segment and tokenize.
        • Consider, for example:
            • Hand-written documents: OCR
            • Digital formats: PDF, Word, WordPerfect, HTML
            • Typesetting remnants, e.g., page breaks and line-break hyphens
        • Pre-processing is very important! The quality of all subsequent work depends on it.
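As a rough illustration of the typesetting remnants mentioned on this slide, here is a minimal Python sketch that removes page breaks and rejoins words hyphenated across line breaks. The function name `clean_extracted_text` and the sample string are assumptions made for illustration, not part of the lecture. Note that the hyphen rule is naive: it would also merge a genuine compound such as "life-threatening" if it happened to break across lines.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Undo a few common typesetting remnants in text extracted from PDF or OCR."""
    text = raw.replace("\f", "\n")                 # a form feed marks a page break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # rejoin words hyphenated across a line break
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    return text.strip()

# Hypothetical extraction output with a line-break hyphen and a trailing page break.
raw = "Hurricane Sandy grounded 3,200 flights and forced the evacu-\nation of the New Jersey shore.\f"
cleaned = clean_extracted_text(raw)
print(cleaned)
```

Real pipelines usually need format-specific handling (OCR confidence, HTML tag stripping, and so on) before a step like this is useful.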
    • What kind of questions can we ask?
        • Basic
            • What is the structure of the text?
                • Paragraphs
                • Sentences
                • Tokens/words
            • What are the “words” that appear in this text?
                • Nouns
                    • Subjects
                    • Direct objects
                    • …
                • Verbs
        • Advanced
            • What are the concepts that appear in this text?
            • How does this text compare to other text?
    • Segmentation and Tokenization
        “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
        • Segment types
            • Paragraphs
            • Sentences
            • Tokens
    • Segmentation and Tokenization: But how does it work?
        • Paragraphs
            • Two consecutive line breaks
            • A hard line break followed by an indent
        • Sentences
            • A period, except for abbreviations, ellipses within quotations, etc.
        • Tokens and Words
            • Whitespace
            • Punctuation
        Remember what real-world text looks like: think text messages and email.
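The rules on this slide can be sketched with a few regular expressions. This is a minimal sketch under stated assumptions, not the method used in the lecture: the abbreviation list, the function names, and the sample text are all hypothetical, and the sentence rule handles only a small fraction of the exceptions real text contains.

```python
import re

# Hypothetical and far from complete; real systems use much larger lists or trained models.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

def split_paragraphs(text):
    # Two or more consecutive line breaks separate paragraphs.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # End a sentence at ., ! or ? unless the word is a known abbreviation.
    sentences, current = [], []
    for word in paragraph.split():
        current.append(word)
        if word.endswith((".", "!", "?")) and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    # Keep numbers like 3,200 and hyphenated words together; other punctuation is its own token.
    return re.findall(r"\d+(?:,\d+)*|\w+(?:-\w+)*|[^\w\s]", sentence)

text = ("Sandy grounded 3,200 flights. Dr. Smith expects life-threatening wind.\n\n"
        "New York suspended subway service.")
paragraphs = split_paragraphs(text)
sentences = [s for p in paragraphs for s in split_sentences(p)]
tokens = tokenize(sentences[0])
print(len(paragraphs), len(sentences), tokens)
```

Informal text such as email or text messages breaks these rules quickly (missing periods, emoticons, inconsistent line breaks), which is why the slide's warning matters.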
    • Segmentation and Tokenization
        “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
        • Paragraphs: 2
        • Sentences: 2
        • Words: 561
            • ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
    • What kind of questions can we ask?
        We now have an ordered list of tokens.
        ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
        • Does the phrase “quote stuffing” occur in the text?
        • How many times does “Sandy” occur?
        • How often does “outage” occur after “power”?
        • What percentage of tokens are numbers?
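Each of the four questions on this slide reduces to a short operation on the ordered token list. The sketch below uses only the ten tokens shown on the slide; the helper names (`contains_phrase`, `count_following`, `is_number`) are hypothetical and chosen for illustration.

```python
tokens = ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled',
          'for', 'today', 'and', 'tomorrow']

def contains_phrase(tokens, phrase):
    """Does the whitespace-split phrase occur as consecutive tokens?"""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def count_following(tokens, first, second):
    """How many times does `second` occur immediately after `first`?"""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def is_number(token):
    """Treat digits with optional thousands separators and one decimal point as a number."""
    return token.replace(",", "").replace(".", "", 1).isdigit()

has_quote_stuffing = contains_phrase(tokens, "quote stuffing")
sandy_count = tokens.count("Sandy")
outage_after_power = count_following(tokens, "power", "outage")
percent_numbers = 100 * sum(is_number(t) for t in tokens) / len(tokens)
print(has_quote_stuffing, sandy_count, outage_after_power, percent_numbers)
```

All four answers depend on token order or exact token identity, so they are only as good as the tokenization that produced the list.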
    • An Aside on Storage
        Data: The word ‘the’ ten times and the word ‘a’ ten times.
        • Representation 1 (Ordered List):
            • [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
        • Representation 2 (Term Frequency):
            • [(‘the’, 10), (‘a’, 10)]
    • An Aside on Storage
        • Representation 1 (Ordered List):
            • [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
        • Representation 2 (Frequency Map):
            • [(‘the’, 10), (‘a’, 10)]
        • Tradeoffs
            • Total space
            • Ease of answering certain questions
            • Information about context
        • Not all software makes the same choice!
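The tradeoff between the two representations can be seen in a few lines of Python. This sketch assumes the alternating order shown on the slide and uses the standard library's `collections.Counter` as the frequency map; the variable names are illustrative.

```python
from collections import Counter

ordered = ['the', 'a'] * 10        # Representation 1: 20 tokens, order preserved
frequency = Counter(ordered)       # Representation 2: 2 entries, order discarded

# The frequency map answers "how many times?" with a single lookup.
the_count = frequency['the']

# Only the ordered list can answer a context question such as "what follows 'the'?"
followers_of_the = Counter(b for a, b in zip(ordered, ordered[1:]) if a == 'the')

print(len(ordered), len(frequency), the_count, followers_of_the)
```

The frequency map is far smaller for repetitive text, but once the order is discarded, no question about neighboring words can be answered from it.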
    • Stopwording, Stemming, Parsing, and Tagging
        • Stopwording
            • Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.
        • Stemming
            • Matching declined nouns like dog/dogs or child/children.
            • Matching conjugated verbs like run/ran.
        • Parsing
            • Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we’ll skip).
        • Tagging
            • Identifying the part of speech of each token in a sentence.
    • Stopwording
        Original:
        Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
        After stopwording:
        Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain. System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
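Stopword removal is essentially a filter against a word list. The sketch below is illustrative only: the `STOPWORDS` set is a small hypothetical list assembled from words that are dropped on this slide, not the list used to produce the slide's output.

```python
# Hypothetical stopword list; real lists (and the one behind the slide) are longer.
STOPWORDS = {"for", "and", "to", "the", "of", "as", "it", "with", "which",
             "in", "on", "its", "be", "when", "into", "out", "a", "or", "more"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow".split()
kept = remove_stopwords(tokens)
print(" ".join(kept))
```

Which words count as "filler" depends on the task; a list that helps topic classification can remove words that matter for sentiment or for phrase queries.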
    • Stopwording + Stemming
        Original:
        Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
        After stopwording and stemming:
        Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain. System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
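To make the idea of stemming concrete, here is a deliberately naive suffix stripper. It is not the stemmer that produced the slide's output: the suffix list, the `min_stem` parameter, and the function name are assumptions for illustration, and the function lowercases its input, which the slide's output does not.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(token, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(t) for t in ["flights", "grounded", "prompted", "barrels", "bus"]]
print(stems)
```

A rule set like this cannot map irregular forms such as child/children or run/ran to a common stem; handling those generally requires a dictionary-based lemmatizer rather than suffix rules.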
    • Tagging
        Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
        [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]
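The output format on this slide is a list of (token, tag) pairs using Penn Treebank tags. The toy tagger below only illustrates that format with a lookup table and a few crude rules; the `LEXICON` and the rules are hypothetical, and real taggers are trained on annotated text and use sentence context.

```python
import re

# Hypothetical lookup table for a handful of closed-class and common words.
LEXICON = {"for": "IN", "and": "CC", "the": "DT", "today": "NN", "tomorrow": "NN"}

def toy_tag(tokens):
    tagged = []
    for token in tokens:
        if token.lower() in LEXICON:
            tag = LEXICON[token.lower()]
        elif re.fullmatch(r"\d+(?:,\d+)*", token):
            tag = "CD"     # cardinal number
        elif token[0].isupper():
            tag = "NNP"    # crude guess: capitalized means proper noun
        elif token.endswith("s"):
            tag = "NNS"    # crude guess: plural noun
        elif token.endswith("ed"):
            tag = "VBD"    # crude guess: past tense; cannot tell VBD from VBN
        else:
            tag = "NN"
        tagged.append((token, tag))
    return tagged

tagged = toy_tag(['Hurricane', 'Sandy', 'grounded', '3,200', 'flights',
                  'scheduled', 'for', 'today', 'and', 'tomorrow'])
print(tagged)
```

The one disagreement with the slide is "scheduled", which the slide tags VBN (past participle) and this sketch tags VBD; telling the two apart needs the surrounding words, which is exactly what context-free rules lack.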
    • Machine Learning
        • Definition: Automated classification and prediction on data.
        • Examples:
            • Product recommenders, a la Amazon
            • Computer vision: is it a cat?
            • Sentiment analysis
            • Topic classification
            • Document clustering
        • At least two stages to a classification problem:
            • Training
            • Classification
        MACHINE LEARNING © Bommarito Consulting
    • Learning ¡  Machine learning requires “learning” or “training.” ¡  There are two types of training: §  Supervised §  Unsupervised ¡  The goal of training is to determine a mapping from input features to a set of target classes. MACHINE LEARNING © Bommarito Consulting
    • Learning
        Imagine a student given a small list of organisms and descriptions. The student is tasked with assigning the organisms to groups based on these descriptions. Where do the groups come from?
        • Supervised: The teacher provides the answers while learning.
        • Unsupervised: The teacher provides nothing while learning.
        In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.
    • Learning
        What if the teacher gave the student some of the answers? This is semi-supervised learning.
        • Supervised: The teacher provides the answers while learning.
        • Semi-supervised: The teacher provides some answers while learning.
        • Unsupervised: The teacher provides nothing while learning.
    • Classification
        The student has now learned to map from an organism’s description to a group. Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
    • Replace the student with an algorithm and we have machine learning.
        • Sentiment Analysis Example
            • Organisms: Restaurant reviews
            • Descriptions:
                • Number of positive phrases
                • Number of negative phrases
                • Number of times visited
                • Number of restaurants reviewed
                • Recency of review
            • Target: 1-5 stars for restaurant sentiment
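The two stages, training and classification, can be sketched with a nearest-centroid classifier on two of the features listed on this slide (positive and negative phrase counts). The training data is made up, and nearest-centroid is just one simple supervised method chosen for brevity; the lecture does not prescribe it.

```python
from collections import defaultdict

def train(examples):
    """Training: average the feature vectors seen for each star rating."""
    grouped = defaultdict(list)
    for features, stars in examples:
        grouped[stars].append(features)
    return {stars: tuple(sum(col) / len(rows) for col in zip(*rows))
            for stars, rows in grouped.items()}

def classify(centroids, features):
    """Classification: pick the rating whose centroid is closest to the new review."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda stars: sq_dist(centroids[stars], features))

# Made-up labeled data: (positive phrases, negative phrases) -> stars.
labeled_reviews = [((5, 0), 5), ((4, 1), 5), ((2, 2), 3),
                   ((3, 3), 3), ((0, 4), 1), ((1, 5), 1)]
centroids = train(labeled_reviews)
predicted = classify(centroids, (4, 0))
print(centroids, predicted)
```

Here the "teacher" is the star rating attached to each training review, which makes this a supervised setup; the other features on the slide would simply extend each tuple.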
    • Retailers are doing this every day.
        • Purchasing Example
            • Organism: Consumer
            • Descriptions:
                • How many products purchased of category A, B, …
                • How many dollars spent on brand A, B, …
                • How recently was an item purchased from category A, B, …
                • How many visits to web pages in category A, B, …
            • Target: Will they purchase in the next 30 days?
            • Training: Look out-of-sample at purchasing database
    • Some Machine Learning Algorithms
        • Supervised
            • Statistical models
                • Bayesian, e.g., Naïve Bayes Classification
                • Frequentist, e.g., Ordinary Least Squares
            • Neural Networks (NN)
            • Support Vector Machines (SVM)
            • Random Forests (RF)
            • Genetic Algorithms (GA)
        • Semi/unsupervised
            • Neural Networks (NN)
            • Clustering
                • K-means
                • Hierarchical
                • Radial Basis (RBF)
                • Graph
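As an example from the unsupervised side of this list, here is a minimal one-dimensional k-means sketch (Lloyd's algorithm). The data, the starting centers, and the function name `kmeans_1d` are assumptions for illustration; practical implementations work in many dimensions and choose starting centers more carefully.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm in one dimension with caller-supplied starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

dollars_spent = [10, 12, 11, 95, 100, 105]   # made-up data
centers, clusters = kmeans_1d(dollars_spent, centers=[0, 50])
print(centers, clusters)
```

No labels are supplied anywhere: the two groups emerge from the data alone, which is what distinguishes this from the supervised methods in the first half of the list.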
    • Notes on Algorithm Diversity
        • Not all algorithms return scores; some return only binary labels.
            • True, True, False
            • 0.9, 0.7, 0.1
        • Not all algorithms support more than two classes.
            • Cat, Dog, Mouse
            • Cat, Not Cat
        • Not all algorithms scale similarly.
            • 1M documents = 1 day
            • 10M documents = {10 days, 100 days, 1000 days}, depending on whether running time grows linearly, quadratically, or cubically with the number of documents
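Two of these notes can be checked with simple arithmetic. The sketch below converts the slide's scores into binary labels using a threshold of 0.5 (an assumption; the slide does not state one) and projects running time for a tenfold increase in documents under linear, quadratic, and cubic scaling.

```python
scores = [0.9, 0.7, 0.1]
labels = [s >= 0.5 for s in scores]   # the 0.5 threshold is an assumption

def projected_days(base_days, growth_factor, exponent):
    """Running time after the input grows by growth_factor, if cost scales as n**exponent."""
    return base_days * growth_factor ** exponent

# 1M documents take 1 day; 10M documents is a growth factor of 10.
projections = {name: projected_days(1, 10, k)
               for name, k in [("linear", 1), ("quadratic", 2), ("cubic", 3)]}
print(labels, projections)
```

Scores can always be reduced to labels this way, but labels cannot be turned back into scores, so an algorithm that returns scores leaves more options open downstream.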
    • THANKS!
        • Michael J Bommarito II
            • CEO, Bommarito Consulting, LLC
            • Email: michael@bommaritollc.com
            • Web: http://bommaritollc.com/
        You can get these slides on my blog: http://bommaritollc.com/blog/
    • REFERENCES
        • Books and Wiki Pages
            • A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß.
                http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
            • Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya, Zhang, Damerau.
                http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
            • The Elements of Statistical Learning.
                http://www-stat.stanford.edu/~tibs/ElemStatLearn/
            • Wikipedia: Machine Learning.
                http://en.wikipedia.org/wiki/Machine_learning
            • Wikipedia: Machine Learning Algorithms.
                http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
        • Software
            • Natural Language Toolkit (NLTK).
                http://nltk.org/
            • Stanford NLP Group.
                http://nlp.stanford.edu/software/
            • Weka.
                http://www.cs.waikato.ac.nz/ml/weka/
            • R.
                http://www.r-project.org/
            • SAS Predictive Analytics and Data Mining.
                http://www.sas.com/technologies/analytics/datamining/index.html