Have data? What now?!

A brief overview of common data analysis problems and algorithms.


    1. Have Data? What now?! Hilary Mason @hmason
    2. (Focused) Data == Intelligence
    3. Common Problems
         • Gathering data
         • Parsing, Entity Extraction and Disambiguation
         • Clustering
         • Document classification
         • NLP
    4. Text is MESSY
    5. Do you need to parse it?
         • Parsing unstructured data is hard. (We’ll get to this.)
         • CHEAT.
         • Open Calais (www.opencalais.com) currently supports: Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Position, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL
    6. Entity Disambiguation
         • This is important.
    7. ME UGLY HAG
    8. Entity Disambiguation
         • This is important.
         • Company disambiguation is a very common problem: are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
    9. A Practical Approach: Path101
         • Human classification
         • Data APIs
         • Automatic classification model
         • Example: Company Name
         • External data from Open Calais, Freebase
         • Based on industry, location, and type of job, we can differentiate between MS Volt (Microsoft) and Volt (Volt Information Sciences, Inc.)
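The idea on slide 9 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Path101's actual system: the `CANDIDATES` table, its `industry` and `location` values, and the functions `normalize` and `disambiguate_company` are all hypothetical. The hand-written table stands in for the human-classified and external data the slide mentions.

```python
# Hypothetical sketch: pick a canonical company for a raw name by combining
# a name-token match with context from the job posting.

# Illustrative entries only; a real system would load these from
# human-classified data and external sources.
CANDIDATES = [
    {
        "canonical": "Microsoft",
        "alias_tokens": {"microsoft", "ms"},
        "industry": {"software"},
        "location": {"redmond"},
    },
    {
        "canonical": "Volt Information Sciences, Inc.",
        "alias_tokens": {"volt"},
        "industry": {"staffing"},
        "location": {"new york"},
    },
]

# Corporate suffixes that carry no identifying information.
SUFFIXES = {"inc", "corp", "corporation", "llc"}


def normalize(name):
    """Lowercase, drop punctuation, and remove corporate suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)


def disambiguate_company(raw_name, industry=None, location=None):
    """Return the best-scoring canonical name, or None if no alias matches."""
    tokens = set(normalize(raw_name).split())
    best, best_score = None, 0
    for cand in CANDIDATES:
        if not tokens & cand["alias_tokens"]:
            continue  # the name must match before context is considered
        score = 1
        if industry and industry.lower() in cand["industry"]:
            score += 1
        if location and location.lower() in cand["location"]:
            score += 1
        if score > best_score:  # on a tie, the earlier candidate wins
            best, best_score = cand["canonical"], score
    return best
```

A name such as "MS Volt" matches both candidates on tokens alone, so the context fields decide: `disambiguate_company("MS Volt", industry="software")` resolves to the first entry, while the same name with `industry="staffing"` resolves to the second.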
    10. SPAM sucks
    11. Supervised Classification
         • Training: Training Data → Feature Extractor → Trained Classifier
         • Classifying: Text → Feature Extractor → Trained Classifier → label (Cats, Dogs, Fire)
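The pipeline on slide 11 can be illustrated with a dependency-free sketch. It assumes a naive Bayes classifier with add-one smoothing over word-presence features, which is one common choice and not necessarily what the diagram had in mind. The training sentences and the names `extract_features`, `train`, and `classify` are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical), using the three labels from the slide.
TRAINING_DATA = [
    ("the cat purred on the sofa", "cats"),
    ("kittens and cats nap all day", "cats"),
    ("the dog barked at the mailman", "dogs"),
    ("puppies and dogs love to fetch", "dogs"),
    ("smoke and flames filled the building", "fire"),
    ("the fire alarm rang as flames spread", "fire"),
]


def extract_features(text):
    """Feature extractor: the set of lowercase words in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        feats = extract_features(text)
        label_counts[label] += 1
        word_counts[label].update(feats)
        vocab |= feats
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    feats = extract_features(text) & vocab  # ignore words never seen in training
    best_label, best_logp = None, -math.inf
    for label, n_docs in label_counts.items():
        logp = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in feats:
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            logp += math.log((word_counts[label][word] + 1)
                             / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


model = train(TRAINING_DATA)
print(classify(model, "the dog loves to fetch"))
```

The same `extract_features` function is used for training and for classification, which is the point of the two "Feature Extractor" boxes in the diagram.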
    12. Classification Example: Movie Reviews!
         • [['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]
         • …tagged ‘positive’ and ‘negative’.
    13. classification_example.py

        #!/usr/bin/env python
        # encoding: utf-8
        """classification_example.py"""
        import random

        import nltk
        from nltk.corpus import movie_reviews  # requires nltk.download('movie_reviews')


        def document_features(document, word_features):
            """Map a tokenized document to binary 'contains(word)' features."""
            document_words = set(document)
            features = {}
            for word in word_features:
                features['contains(%s)' % word] = (word in document_words)
            return features


        def main():
            # Use the 2000 most frequent corpus words as the feature vocabulary.
            all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
            word_features = [w for w, _ in all_words.most_common(2000)]

            documents = [(list(movie_reviews.words(fileid)), category)
                         for category in movie_reviews.categories()
                         for fileid in movie_reviews.fileids(category)]
            # Shuffle so the held-out set contains both categories.
            random.shuffle(documents)

            featuresets = [(document_features(d, word_features), c)
                           for (d, c) in documents]
            # Hold out the first 100 documents for testing.
            train_set, test_set = featuresets[100:], featuresets[:100]

            classifier = nltk.NaiveBayesClassifier.train(train_set)
            print(nltk.classify.accuracy(classifier, test_set))
            classifier.show_most_informative_features(20)


        if __name__ == '__main__':
            main()
    14. Clustering
         • immunity
         • ultrasound
         • medical imaging
         • medical devices
         • thermoelectric devices
         • fault-tolerant circuits
         • low power devices
    15. Hierarchical Clustering
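As a rough illustration of hierarchical clustering, the terms from slide 14 can be grouped bottom-up. The sketch below assumes Jaccard distance over word tokens and average linkage. Both are choices made for this example, not something the slides specify, and the function names and `threshold` parameter are hypothetical.

```python
from itertools import combinations

# Terms taken from slide 14.
TERMS = [
    "immunity",
    "ultrasound",
    "medical imaging",
    "medical devices",
    "thermoelectric devices",
    "fault-tolerant circuits",
    "low power devices",
]


def jaccard_distance(a, b):
    """1 minus the share of word tokens the two terms have in common."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / len(ta | tb)


def agglomerate(items, distance, threshold):
    """Repeatedly merge the two closest clusters (average linkage).

    Stops when the closest pair is farther apart than `threshold`.
    """
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            d = sum(distance(a, b) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


for cluster in agglomerate(TERMS, jaccard_distance, threshold=0.9):
    print(cluster)
```

Lowering `threshold` stops the merging earlier and leaves more, smaller clusters. Recording each merge and its distance instead of stopping would give the full tree (dendrogram) that hierarchical clustering is named for.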
    16.  
    17. <3 Data
         • Thank you!
