Practical Data Analysis in Python

    Notes on slide 1

    1) Access to the data, and 2) CPU power/algorithms that are robust enough to analyze it

    NLTK – in development since 2001

    Practical Data Analysis in Python - Presentation Transcript

    1. Practical Data Analysis in Python
      Hilary Mason
      @hmason
      www.hilarymason.com
      hilary@path101.com
    2. Data is ubiquitous.
      The ability and tools to use it are not.
    3. (Focused) Data == Intelligence
    4. Data Analysis on the Web
      Data items change rapidly.
      Data items are not independent.
      There’s a lot of semi-structured data around.
      There’s a LOT of data around.
      ==
      Too many problems, few tools, and few experts.
    5. Entity Disambiguation
      This is important.
    6. UGLY HAG
      ME
    7. Entity Disambiguation
      This is important.
      Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
      This is a hard problem.
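One simple way to start on the company-name case above is to normalize each name and look it up in a hand-maintained alias table. This is only a sketch: the suffix list, the `ALIASES` table, and the function names are assumptions for illustration, and real disambiguation needs far more context than string cleanup.

```python
import re

# Hypothetical suffix list and alias table, for illustration only.
CORPORATE_SUFFIXES = {"corporation", "corp", "inc", "incorporated", "ltd", "llc", "co"}
ALIASES = {"ms": "microsoft"}


def normalize_company(name):
    """Lowercase, strip punctuation, drop corporate suffixes, apply aliases."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    core = [w for w in words if w not in CORPORATE_SUFFIXES]
    key = " ".join(core)
    return ALIASES.get(key, key)


def same_company(a, b):
    """True when both names normalize to the same canonical key."""
    return normalize_company(a) == normalize_company(b)


names = ["Microsoft", "Microsoft Corporation", "MS"]
canonical = {normalize_company(n) for n in names}
print(canonical)
```

The alias table is exactly where this approach breaks down: "MS" could just as well mean Morgan Stanley, which is why the slide calls this a hard problem.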
    8. SPAM sucks
    9. Classification
      Document classification.
      Image recognition.
      Topic recognition.
    10. Text Parsing
    11. Recommendation Systems
      Product recommendations.
      Disease predictions.
      Behavior analysis.
    12. IEEE Tag Clustering
      immunity
      fault-tolerant circuits
      medical imaging
      ultrasound
      low power devices
      thermoelectric devices
      medical devices
    13. Python for Data Analysis
      import why_python_is_awesome
      Python is readable.
      Easy to transition from Matlab or R.
      Numerical computing support.
      Growing set of machine learning libraries.
    14. Libraries
      NLTK (Natural Language Toolkit) – www.nltk.org
      mlpy (Machine Learning PY) – mlpy.fbk.eu
      numpy & scipy – scipy.org
    15. An EC2 AMI provisioned with all of the toys you need:
      http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/
      MachetEC2
    16. Demo: Classifying Tweets
    17. Supervised Classification
      Training Data → Feature Extractor → Trained Classifier
      Text → Feature Extractor → Trained Classifier → Spam / Not Spam
    18. Data: Tweets
      Hand-classified. For example, some spam:
      | don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha |
      | oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb |
      | My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht |
      | Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8 |
      | man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4 |
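Hand-classified data like this is typically stored as (text, label) pairs and split into a training set and a held-out test set. A minimal sketch, assuming the labels `spam` and `not_s` (the names that appear in the feature output on slide 22); the non-spam tweets, the 80/20 split, and the fixed seed are made up for illustration.

```python
import random

# (text, label) pairs. The spam examples are copied from the slide;
# the not_s examples are invented placeholders.
labeled_tweets = [
    ("oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb", "spam"),
    ("My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht", "spam"),
    ("man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4", "spam"),
    ("Heading to the Python meetup tonight, who else is going?", "not_s"),
    ("Finally finished the slides for tomorrow's talk", "not_s"),
]


def split_train_test(pairs, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_tweets, test_tweets = split_train_test(labeled_tweets)
print(len(train_tweets), "training tweets,", len(test_tweets), "test tweets")
```

Keeping the test tweets out of training is what makes the later accuracy figure meaningful.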
    19. Features
      def document_features(self, document):
          document_words = set(document)
          features = {}
          for word in self.word_features:
              features['contains(%s)' % word] = (word in document_words)
          return features
      Break tweets into lists of relevant words.
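The method above depends on `self.word_features`, which the slide does not show. A standalone sketch of the same idea, assuming a whitespace tokenizer and a small hand-picked vocabulary (both are placeholders, not the talk's actual choices):

```python
def tokenize(tweet):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return tweet.lower().split()


def document_features(document, word_features):
    """Map a token list to {'contains(word)': bool} for each vocabulary word."""
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


# Hypothetical vocabulary; in practice it might be the most frequent corpus words.
word_features = ["followers", "more", "meetup"]

tokens = tokenize("oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb")
features = document_features(tokens, word_features)
print(features)
```

Whitespace splitting leaves `followers..hiii...if` as a single token, so `contains(followers)` comes out False here. That is the kind of miss a more careful tokenizer would fix.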
    20. Naïve Bayesian Classifier
      P(A|B) = the conditional probability of A given B
      http://yudkowsky.net/rational/bayes
      http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/
      classifier = nltk.NaiveBayesClassifier.train(train_set)
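Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), is what lets a classifier turn "how often does this word appear in spam" into "how likely is a tweet with this word to be spam". A worked sketch with made-up numbers (the probabilities below are not from the talk's data):

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) by Bayes' theorem, expanding P(B) over A and not-A."""
    p_not_a = 1.0 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b


# Made-up inputs: 20% of tweets are spam; the word "followers" appears in
# 60% of spam tweets and 5% of non-spam tweets.
p_spam_given_word = posterior(p_b_given_a=0.6, p_a=0.2, p_b_given_not_a=0.05)
print(round(p_spam_given_word, 3))  # 0.75
```

The "naive" part is the assumption that features are independent given the label, so per-feature probabilities can simply be multiplied together.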
    21. Classifier Accuracy
      Use a hand-classified test set to see the accuracy of the classifier:
      nltk.classify.accuracy(classifier, test_set)
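Accuracy here is the fraction of test examples whose predicted label matches the hand-assigned one. A pure-Python sketch of that calculation, assuming the test set is a list of (features, label) pairs and using a stand-in `toy_classify` rule in place of a trained classifier:

```python
def accuracy(classify, test_set):
    """Fraction of (features, label) pairs where classify(features) == label."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return correct / len(test_set)


def toy_classify(features):
    """Stand-in rule, not a trained model: call it spam if it mentions followers."""
    return "spam" if features.get("contains(followers)") else "not_s"


# Invented test set for illustration.
test_set = [
    ({"contains(followers)": True}, "spam"),
    ({"contains(followers)": True}, "not_s"),
    ({"contains(followers)": False}, "not_s"),
    ({"contains(followers)": False}, "not_s"),
]
print(accuracy(toy_classify, test_set))  # 0.75
```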
    22. Feature Relevance
      contains(') = True not_s : spam = 53.6 : 1.4
      contains(") = True not_s : spam = 32.2 : 1.1
      contains(#) = True not_s : spam = 22.0 : 1.0
      contains(!) = True not_s : spam = 10.8 : 1.0
      contains(*) = True spam : not_s = 7.4 : 1.0
      contains(=) = True not_s : spam = 5.5 : 1.0
      contains(i) = False spam : not_s = 5.2 : 1.0
      contains(?) = True not_s : spam = 2.4 : 1.0
      contains(:) = True spam : not_s = 2.3 : 1.0
      contains(&) = True not_s : spam = 1.8 : 1.0
      contains(;) = True not_s : spam = 1.6 : 1.0
      contains($) = True spam : not_s = 1.5 : 1.0
      contains(u) = True spam : not_s = 1.5 : 1.0
      contains(2.0) = False not_s : spam = 1.4 : 1.0
      contains(saw) = False not_s : spam = 1.4 : 1.0
      contains(noble) = False not_s : spam = 1.4 : 1.0
      contains(sound) = False not_s : spam = 1.3 : 1.0
      contains(approach) = False not_s : spam = 1.3 : 1.0
      contains(finally) = False not_s : spam = 1.3 : 1.0
      contains(more) = False spam : not_s = 1.3 : 1.0
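Each row above is a likelihood ratio: how much more often a feature value shows up under one label than under the other. The layout looks like what NLTK's `show_most_informative_features` prints. A sketch of the underlying ratio from raw counts, with invented counts and add-0.5 smoothing chosen as an assumption so that a zero count does not cause a division by zero:

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio of smoothed P(feature | label A) to P(feature | label B)."""
    # The feature is binary (present or absent), hence 2 * smoothing below.
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b


# Invented counts: the feature is present in 30 of 99 not_s tweets
# and in 2 of 99 spam tweets.
ratio = likelihood_ratio(count_a=30, total_a=99, count_b=2, total_b=99)
print("not_s : spam = %.1f : 1.0" % ratio)
```

A ratio far from 1 marks a feature that separates the classes well. Ratios near 1, like the rows at the bottom of the slide, carry little signal.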
    23. Kitchen Sink
      wash, rinse, repeat
    24. Results
      90% accuracy on spam tweets – not bad!
      Other possibilities:
      categorization – what do you tweet about?
      human vs. bot?
      which celebrity tweeter are you?
    25. <3 Data
      Thank you!

    Hilary Mason, 3 months ago

    These are the slides from my presentation to the NY…

    © All Rights Reserved