Have data? What now?!

    Have data? What now?! - Presentation Transcript

    1. Have Data? What now?! Hilary Mason @hmason
    2. (Focused) Data == Intelligence
    3. Common Problems
      • Gathering data
      • Parsing, Entity Extraction and Disambiguation
      • Clustering
      • Document classification
      • NLP
    4. Text is MESSY
    5. Do you need to parse it?
      • Parsing unstructured data is hard. (We’ll get to this.)
      • CHEAT.
      • Open Calais (www.opencalais.com) currently supports:
      • Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Position, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL
    6. Entity Disambiguation
      • This is important.
    7. ME UGLY HAG
    8. Entity Disambiguation
      • This is important.
      • Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
    9. A Practical Approach – Path101
      • Human classification
      • Data APIs
      • Automatic classification model
      • Example: Company Name
      • External data from Open Calais, Freebase
      • Based on industry, location, and type of job, we can differentiate between MS Volt (Microsoft) and Volt (Volt Information Sciences, Inc.)
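The company-name example can be illustrated with a minimal sketch. This is not Path101's actual model: the alias table, the context profiles, and the function names `normalize` and `disambiguate` are all hypothetical, and the profile values are invented for illustration rather than taken from Open Calais or Freebase. The sketch resolves known aliases directly and uses industry, location, and job type to break ties when a name is ambiguous.

```python
import re

# Hypothetical alias table: spellings that map to exactly one company.
ALIASES = {
    "microsoft": "Microsoft",
    "microsoft corporation": "Microsoft",
    "ms": "Microsoft",
    "ms volt": "Microsoft",
}

# Hypothetical names that could refer to more than one company.
AMBIGUOUS = {
    "volt": ["Microsoft", "Volt Information Sciences, Inc."],
}

# Hypothetical context profiles (invented values, for illustration only).
PROFILES = {
    "Microsoft": {"industry": "software", "location": "redmond"},
    "Volt Information Sciences, Inc.": {"industry": "staffing", "location": "new york"},
}


def normalize(name):
    """Lowercase the name and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


def disambiguate(name, context):
    """Return a canonical company name, or None if it cannot be decided."""
    key = normalize(name)
    if key in ALIASES:
        return ALIASES[key]
    candidates = AMBIGUOUS.get(key, [])
    # Score each candidate by how many context fields match its profile.
    scores = {
        c: sum(1 for field, value in PROFILES[c].items()
               if context.get(field, "").lower() == value)
        for c in candidates
    }
    if not scores:
        return None
    ranked = sorted(scores.values(), reverse=True)
    # Refuse to guess when nothing matches or the top two candidates tie.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)
```

Returning `None` on a tie leaves room for the human classification step to handle cases the context cannot settle.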
    10. SPAM sucks
    11. Supervised Classification
      • Training Data → Feature Extractor → Trained Classifier
      • Text → Feature Extractor → Trained Classifier → Cats / Dogs / Fire
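The pipeline on this slide can be sketched in plain Python. The version below is a simplified naive Bayes classifier that scores only the words present in the input; it is an assumption about what a "trained classifier" might look like, not the implementation used in the talk. The training sentences and the function names `extract_features`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: the set of lowercase word tokens in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Count, per label, how many training texts contain each word."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        features = extract_features(text)
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary |= features
    return label_counts, feature_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability score."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    # Words never seen in training carry no information, so drop them.
    features = extract_features(text) & vocabulary
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        n = label_counts[label]
        score = math.log(n / total)  # prior
        for f in features:
            # Add-one smoothing keeps unseen (word, label) pairs nonzero.
            score += math.log((feature_counts[label][f] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for the slide's three categories.
TRAINING_DATA = [
    ("the cat purred on the sofa", "cats"),
    ("a kitten chased the cat toy", "cats"),
    ("the dog barked at the mailman", "dogs"),
    ("a puppy fetched the dog ball", "dogs"),
    ("smoke and flames spread the fire", "fire"),
    ("the fire alarm rang as flames rose", "fire"),
]

model = train(TRAINING_DATA)
```

The same `extract_features` function is used for both training and prediction, which mirrors the two Feature Extractor boxes in the diagram.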
    12. Classification Example: Movie Reviews!
      • [['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]
      • …tagged ‘positive’ and ‘negative’.
      • classification_example.py:
        #!/usr/bin/env python
        # encoding: utf-8
        """classification_example.py"""
        import random

        import nltk
        from nltk.corpus import movie_reviews  # needs nltk.download('movie_reviews')


        def document_features(document, word_features):
            # One boolean feature per vocabulary word: does the review contain it?
            document_words = set(document)
            features = {}
            for word in word_features:
                features['contains(%s)' % word] = (word in document_words)
            return features


        def main():
            all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
            # In NLTK 3, keys() is not frequency-ordered; most_common() is.
            word_features = [w for (w, _) in all_words.most_common(2000)]
            documents = [(list(movie_reviews.words(fileid)), category)
                         for category in movie_reviews.categories()
                         for fileid in movie_reviews.fileids(category)]
            # Shuffle so the 100 held-out reviews are not all from one category.
            random.shuffle(documents)
            featuresets = [(document_features(d, word_features), c)
                           for (d, c) in documents]
            train_set, test_set = featuresets[100:], featuresets[:100]
            classifier = nltk.NaiveBayesClassifier.train(train_set)
            print(nltk.classify.accuracy(classifier, test_set))
            classifier.show_most_informative_features(20)


        if __name__ == '__main__':
            main()
    13. Clustering
      • immunity
      • ultrasound
      • medical imaging
      • medical devices
      • thermoelectric devices
      • fault-tolerant circuits
      • low power devices
    14. Hierarchical Clustering
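As a rough illustration of hierarchical clustering, the sketch below runs single-link agglomerative clustering over the terms from the previous slide. The distance measure (Jaccard distance on shared words) is an assumption chosen to keep the example self-contained; the talk does not say which features or linkage were used, and the function names are hypothetical.

```python
from itertools import combinations

TERMS = [
    "immunity",
    "ultrasound",
    "medical imaging",
    "medical devices",
    "thermoelectric devices",
    "fault-tolerant circuits",
    "low power devices",
]


def jaccard_distance(a, b):
    """1 minus the fraction of words the two terms share."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / len(ta | tb)


def single_link(c1, c2):
    """Distance between clusters = distance of their closest members."""
    return min(jaccard_distance(a, b) for a in c1 for b in c2)


def agglomerate(items, max_distance=1.0):
    """Merge the closest pair of clusters until none is closer than max_distance.

    Returns the final clusters and the merge history (the dendrogram, bottom up).
    """
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), single_link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist >= max_distance:
            break  # remaining clusters share no words at all
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


clusters, merges = agglomerate(TERMS)
```

With this word-overlap distance, the terms that share "medical" or "devices" end up in one cluster and the rest stay as singletons. A real system would more likely cluster on document co-occurrence or other richer features.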
    15.  
    16. <3 Data
      • Thank you!

    Hilary Mason, 5 months ago


    A brief overview of common data analysis problems a more
