A brief overview of common data analysis problems and algorithms.

    1. Have Data? What now?! Hilary Mason @hmason
    2. (Focused) Data == Intelligence
    3. Common Problems <ul><li>Gathering data </li></ul><ul><li>Parsing, Entity Extraction and Disambiguation </li></ul><ul><li>Clustering </li></ul><ul><li>Document classification </li></ul><ul><li>NLP </li></ul>
    4. Text is MESSY
    5. Do you need to parse it? <ul><li>Parsing unstructured data is hard. (we’ll get to this) </li></ul><ul><li>CHEAT. </li></ul><ul><li>Open Calais ( www.opencalais.com ) currently supports: </li></ul><ul><li>Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Position, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL </li></ul>
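One way to cheat without any external service: a few of the entity types above (EmailAddress, URL, PhoneNumber) fall out of plain regular expressions. A minimal sketch (the patterns and sample text are illustrative, not RFC-complete):

```python
import re

# Deliberately simple patterns -- illustrative, not production-grade.
PATTERNS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "URL": re.compile(r"https?://\S+"),
    "PhoneNumber": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def extract_entities(text):
    """Return {entity_type: [matches]} for the cheap entity types."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

text = ("Mail hilary@example.com or call 555-867-5309; "
        "slides at http://www.slideshare.net/hmason")
print(extract_entities(text))
```

Anything beyond these mechanical patterns (people, companies, places) is where a service like Open Calais earns its keep.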
    6. Entity Disambiguation <ul><li>This is important. </li></ul>
    7. ME UGLY HAG
    8. Entity Disambiguation <ul><li>This is important. </li></ul><ul><li>Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company? </li></ul>
    9. A Practical Approach – Path101: human classification and data APIs feeding an automatic classification model. Example: company name, enriched with external data from Open Calais and Freebase. Based on industry, location, and type of job, we can differentiate between MS Volt (Microsoft) and Volt (Volt Information Sciences, Inc.)
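The normalization step in such a pipeline can be sketched in a few lines. This is a hypothetical first pass, not the actual Path101 code: strip legal suffixes, then consult a hand-curated alias table; the real system layers external data and a classification model on top.

```python
# Hypothetical first pass at company-name disambiguation.
LEGAL_SUFFIXES = {"corporation", "corp", "inc", "co", "ltd", "llc"}
ALIASES = {"ms": "microsoft"}  # hand-curated alias table (illustrative)

def canonical_company(name):
    """Map a raw company string to a canonical lowercase key."""
    tokens = [t.strip(".,") for t in name.lower().split()]
    tokens = [t for t in tokens if t not in LEGAL_SUFFIXES]
    key = " ".join(tokens)
    return ALIASES.get(key, key)

for raw in ["Microsoft", "Microsoft Corporation", "MS"]:
    print(raw, "->", canonical_company(raw))
```

All three variants collapse to the same key; names the table has never seen simply pass through, which is where the human-classification loop comes in.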
    10. SPAM sucks
    11. Supervised Classification [Diagram: training data → feature extractor → trained classifier; new text → feature extractor → classifier → label: cats / dogs / fire]
    12. Classification Example: Movie Reviews! <ul><li>[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...] </li></ul><ul><li>…tagged ‘positive’ and ‘negative’. </li></ul>
    13. classification_example.py:

        #!/usr/bin/env python3
        """classification_example.py -- Naive Bayes over NLTK's movie_reviews corpus."""
        import nltk
        from nltk.corpus import movie_reviews

        def document_features(document, word_features):
            # binary "contains(word)" features over the most frequent words
            document_words = set(document)
            return {'contains(%s)' % word: (word in document_words)
                    for word in word_features}

        def main():
            all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
            word_features = [w for w, _ in all_words.most_common(2000)]
            documents = [(list(movie_reviews.words(fileid)), category)
                         for category in movie_reviews.categories()
                         for fileid in movie_reviews.fileids(category)]
            featuresets = [(document_features(d, word_features), c)
                           for (d, c) in documents]
            train_set, test_set = featuresets[100:], featuresets[:100]
            classifier = nltk.NaiveBayesClassifier.train(train_set)
            print(nltk.classify.accuracy(classifier, test_set))
            classifier.show_most_informative_features(20)

        if __name__ == '__main__':
            main()
    14. Clustering: immunity, ultrasound, medical imaging, medical devices, thermoelectric devices, fault-tolerant circuits, low power devices
    15. Hierarchical Clustering
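A toy version of hierarchical clustering over the phrases from slide 14: single-linkage agglomerative merging with Jaccard distance on word sets. Purely illustrative; real work would use TF-IDF vectors and a library implementation.

```python
# Single-linkage agglomerative clustering of short phrases,
# using Jaccard distance on their word sets (illustrative only).
phrases = ["immunity", "ultrasound", "medical imaging", "medical devices",
           "thermoelectric devices", "fault-tolerant circuits",
           "low power devices"]

def jaccard_dist(a, b):
    a, b = set(a.split()), set(b.split())
    return 1.0 - len(a & b) / len(a | b)

def cluster_dist(c1, c2):
    # single linkage: distance between the closest pair of members
    return min(jaccard_dist(a, b) for a in c1 for b in c2)

def agglomerate(items, k):
    """Merge the two closest clusters until only k remain."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

for c in agglomerate(phrases, 4):
    print(c)
```

Running the merge history from k = len(phrases) down to 1 yields the full dendrogram; stopping early, as here, yields a flat clustering, and phrases sharing words ("medical imaging", "medical devices") end up together.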
    16.  
    17. <3 Data <ul><li>Thank you! </li></ul>
