Have data? What now?!
 

A brief overview of common data analysis problems and algorithms.

Presentation Transcript

  • Have Data? What now?! Hilary Mason @hmason
  • (Focused) Data == Intelligence
  • Common Problems
    • Gathering data
    • Parsing, Entity Extraction and Disambiguation
    • Clustering
    • Document classification
    • NLP
  • Text is MESSY
  • Do you need to parse it?
    • Parsing unstructured data is hard. (We’ll get to this.)
    • CHEAT.
    • Open Calais (www.opencalais.com) currently supports:
    • Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Position, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL
  • Entity Disambiguation
    • This is important.
  • ME UGLY HAG
  • Entity Disambiguation
    • This is important.
    • Company disambiguation is a very common problem: are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
  • A Practical Approach: Path101
    • Human classification
    • Data APIs
    • Automatic classification model
    • Example: Company Name
    • External data from Open Calais, Freebase
    • Based on industry, location, and type of job, we can differentiate between MS Volt (Microsoft) and Volt (Volt Information Sciences, Inc.)
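A minimal sketch of the two ideas on the disambiguation slides: collapsing name variants to one key, then using context (industry, location, type of job) to choose between candidates. Everything here is an assumption made for illustration: the `SUFFIXES` and `ALIASES` tables, the function names, and the attribute profiles are hypothetical, not Path101's actual model or data.

```python
import re

# Hypothetical suffix and alias tables, for illustration only.
SUFFIXES = {"corporation", "corp", "inc", "ltd", "llc", "co"}
ALIASES = {"ms": "microsoft"}


def normalize(name):
    """Lowercase a company name, strip punctuation and corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [t for t in tokens if t not in SUFFIXES]
    key = " ".join(tokens)
    # Map known abbreviations onto their canonical key.
    return ALIASES.get(key, key)


def disambiguate(candidates, context):
    """Pick the candidate whose profile matches the most context values.

    candidates maps a company name to a dict of expected attributes;
    context is the dict of attributes observed around the mention.
    """
    def overlap(profile):
        return sum(1 for k, v in context.items() if profile.get(k) == v)

    return max(candidates, key=lambda name: overlap(candidates[name]))
```

With these tables, `normalize` maps “Microsoft”, “Microsoft Corporation”, and “MS” to the same key. `disambiguate` takes whatever attribute profiles you supply; in practice those would come from human classification and external data rather than a hand-written dict.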
  • SPAM sucks
  • Supervised Classification
    • Training: Training Data → Feature Extractor → Trained Classifier
    • Classification: Text → Feature Extractor → Trained Classifier → label (Cats, Dogs, Fire)
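The pipeline on this slide can be sketched in a few lines of plain Python. This is a simplified naive Bayes that only scores the words present in the text, written for illustration; the function names and the bag-of-words feature extractor are assumptions, not the talk's code.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: the set of lowercase words in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Count how often each feature appears under each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for text, label in labeled_texts:
        label_counts[label] += 1
        for feature in extract_features(text):
            feature_counts[label][feature] += 1
    return label_counts, feature_counts


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    features = extract_features(text)
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        for feature in features:
            # Add-one smoothing so unseen words do not zero out a label.
            seen = feature_counts[label][feature]
            score += math.log((seen + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training data goes through `extract_features` inside `train`; new text goes through the same extractor inside `classify`, which is the point of the diagram: both paths must share one feature extractor.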
  • Classification Example: Movie Reviews!
    • [['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]
    • …tagged ‘positive’ and ‘negative’.
    • #!/usr/bin/env python
      # encoding: utf-8
      """classification_example.py

      Assumes Python 3 and NLTK 3 with the movie_reviews corpus installed
      (nltk.download('movie_reviews')).
      """
      import random

      import nltk
      from nltk.corpus import movie_reviews


      def document_features(document, word_features):
          """Map a tokenized document to {'contains(word)': bool} features."""
          document_words = set(document)
          features = {}
          for word in word_features:
              features['contains(%s)' % word] = (word in document_words)
          return features


      def main():
          # Use the 2000 most frequent words in the corpus as features.
          all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
          word_features = [word for word, _ in all_words.most_common(2000)]

          documents = [(list(movie_reviews.words(fileid)), category)
                       for category in movie_reviews.categories()
                       for fileid in movie_reviews.fileids(category)]
          # Shuffle so the held-out set mixes both categories.
          random.shuffle(documents)

          featuresets = [(document_features(d, word_features), c)
                         for (d, c) in documents]
          # Hold out the first 100 documents for testing.
          train_set, test_set = featuresets[100:], featuresets[:100]

          classifier = nltk.NaiveBayesClassifier.train(train_set)
          print(nltk.classify.accuracy(classifier, test_set))
          classifier.show_most_informative_features(20)


      if __name__ == '__main__':
          main()
  • Clustering
    • immunity
    • ultrasound
    • medical imaging
    • medical devices
    • thermoelectric devices
    • fault-tolerant circuits
    • low power devices
  • Hierarchical Clustering
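A minimal sketch of bottom-up (agglomerative) hierarchical clustering, applied to phrases like the ones on the clustering slide. The distance measure (Jaccard distance over words), the single-linkage rule, and the stopping threshold are all assumptions chosen to keep the example small; the slides do not say which variant was used.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the words of two phrases."""
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / len(wa | wb)


def single_linkage(c1, c2):
    """Distance between two clusters: their closest pair of members."""
    return min(jaccard_distance(a, b) for a in c1 for b in c2)


def agglomerate(items, threshold):
    """Merge the two closest clusters until none are within threshold.

    Returns the final clusters and the merge history, which is the
    hierarchy read bottom-up.
    """
    clusters = [[item] for item in items]
    merges = []
    while len(clusters) > 1:
        pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > threshold:
            break
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges
```

Calling `agglomerate(terms, threshold=0.8)` on the slide's seven phrases groups the ones that share a word (“medical”, “devices”) and leaves the rest as singletons. Word overlap is a crude similarity; a real system would more likely cluster on document or co-occurrence vectors.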
  • <3 Data
    • Thank you!