Practical Data Analysis in Python

These are the slides from my presentation to the NYC Python Meetup on July 28, 2009. The presentation was an overview of data analysis techniques and various Python tools and libraries, along with a practical example (with code and algorithms) of a Twitter spam filter implemented with NLTK.

Published in: Technology, Education
1 Comment
  • How to write a function for python 2.6 that would analyze text data?
    I have been given a text file which reads
    'HELLOMYNAMEISSANDYANDIWANTTOGOSOMEWHE…
    HOWAREYOUDOINGSWEETIEHELLOMYNAMEISSAND…
    Now I need to design, write and test a Python script which analyzes the text given above. What I need to do is find the percentage of EACH OF THE ALPHABETS in the given text (% of A, % of B...and so on). I do not know where to start. File I/O maybe.
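
    One possible starting point for the question above is sketched here. It assumes Python 3 rather than the 2.6 mentioned in the comment (collections.Counter is not available in 2.6), counts only the letters A-Z, and ignores case. The function name letter_percentages and the shortened sample string are illustrative choices, not part of the original question; for a real file, the text would first be read with open(path).read().

```python
from collections import Counter
import string


def letter_percentages(text):
    """Return {letter: percentage of all A-Z letters in text}, ignoring case."""
    # Keep only the 26 ASCII letters; digits, spaces and punctuation are dropped.
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    # Letters that never occur get 0.0 so the result always has 26 entries.
    return {letter: 100.0 * counts[letter] / total
            for letter in string.ascii_uppercase}


if __name__ == "__main__":
    # Shortened stand-in for the file contents (the original text is truncated).
    sample = "HELLOMYNAMEISSANDY"
    for letter, pct in sorted(letter_percentages(sample).items()):
        if pct:
            print("%s: %.1f%%" % (letter, pct))
```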

  • 1) Access to the data, and 2) CPU power/algorithms that are robust enough to analyze it
  • NLTK – in development since 2001
  • Practical Data Analysis in Python

    1. Practical Data Analysis in Python. Hilary Mason, @hmason, www.hilarymason.com, hilary@path101.com
    2. Data is ubiquitous. The ability and tools to use it are not.
    3. (Focused) Data == Intelligence
    4. Data Analysis on the Web: Data items change rapidly. Data items are not independent. There’s a lot of semi-structured data around. There’s a LOT of data around. == Too many problems, few tools, and few experts.
    5. Entity Disambiguation: This is important.
    6. ME UGLY HAG
    7. Entity Disambiguation: This is important. Company disambiguation is a very common problem – are “Microsoft”, “Microsoft Corporation”, and “MS” the same company? This is a hard problem.
    8. SPAM sucks
    9. Classification: Document classification. Image recognition. Topic recognition.
    10. Text Parsing
    11. Recommendation Systems: Product recommendations. Disease predictions. Behavior analysis.
    12. IEEE Tag Clustering: immunity, ultrasound, medical imaging, medical devices, thermoelectric devices, fault-tolerant circuits, low power devices
    13. Python for Data Analysis: import why_python_is_awesome. Python is readable. Easy to transition from Matlab or R. Numerical computing support. Growing set of machine learning libraries.
    14. Libraries: NLTK (Natural Language Toolkit) – www.nltk.org; mlpy (Machine Learning PY) – mlpy.fbk.eu; numpy & scipy – scipy.org
    15. MachetEC2: An EC2 AMI provisioned with all of the toys you need: http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/
    16. Supervised Classification. Training: Training Data → Feature Extractor → Trained Classifier. Classification: Text → Feature Extractor → Trained Classifier → Spam / Not Spam.
    17. Data: Tweets. Hand-classified. For example, some spam:
          • don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha
          • oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb
          • My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht
          • Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8
          • man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4
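    18. Features: Break tweets into lists of relevant words.

            def document_features(self, document):
                # Distinct tokens in this tweet, for fast membership tests.
                document_words = set(document)
                features = {}
                # One boolean feature per word in the classifier's vocabulary.
                for word in self.word_features:
                    features['contains(%s)' % word] = (word in document_words)
                return features
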
    19. Naïve Bayesian Classifier: P(A|B) = the conditional probability of A given B.
          http://yudkowsky.net/rational/bayes
          http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/

            classifier = nltk.NaiveBayesClassifier.train(train_set)

    20. Classifier Accuracy: Use a hand-classified test set to see the accuracy of the classifier:

            nltk.classify.accuracy(classifier, test_set)

    21. Feature Relevance:

            contains(') = True             not_s : spam   =   53.6 : 1.4
            contains(") = True             not_s : spam   =   32.2 : 1.1
            contains(#) = True             not_s : spam   =   22.0 : 1.0
            contains(!) = True             not_s : spam   =   10.8 : 1.0
            contains(*) = True              spam : not_s  =    7.4 : 1.0
            contains(=) = True             not_s : spam   =    5.5 : 1.0
            contains(i) = False             spam : not_s  =    5.2 : 1.0
            contains(?) = True             not_s : spam   =    2.4 : 1.0
            contains(:) = True              spam : not_s  =    2.3 : 1.0
            contains(&) = True             not_s : spam   =    1.8 : 1.0
            contains(;) = True             not_s : spam   =    1.6 : 1.0
            contains($) = True              spam : not_s  =    1.5 : 1.0
            contains(u) = True              spam : not_s  =    1.5 : 1.0
            contains(2.0) = False          not_s : spam   =    1.4 : 1.0
            contains(saw) = False          not_s : spam   =    1.4 : 1.0
            contains(noble) = False        not_s : spam   =    1.4 : 1.0
            contains(sound) = False        not_s : spam   =    1.3 : 1.0
            contains(approach) = False     not_s : spam   =    1.3 : 1.0
            contains(finally) = False      not_s : spam   =    1.3 : 1.0
            contains(more) = False          spam : not_s  =    1.3 : 1.0

    22. Kitchen Sink: wash, rinse, repeat
    23. Results: 90% accuracy on spam tweets – not bad! Other possibilities: categorization – what do you tweet about? Human vs. bot? Which celebrity tweeter are you?
    24. <3 Data Thank you!
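
The company-name question on slide 7 can be made concrete with a very rough sketch. This is not the approach from the talk, which does not describe one; it is a naive baseline that lowercases a mention, strips trailing corporate suffixes, and looks the result up in an alias table. The names CORPORATE_SUFFIXES, ALIASES, and canonical_company, and the contents of both tables, are hypothetical.

```python
import re

# Hypothetical suffix list and alias table, for illustration only.
CORPORATE_SUFFIXES = {"corporation", "corp", "inc", "incorporated", "ltd", "llc", "co"}
ALIASES = {"ms": "microsoft", "msft": "microsoft"}


def canonical_company(name):
    """Map a raw company mention to a rough canonical key."""
    # Lowercase and split on anything that is not a letter or digit.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Drop trailing suffixes such as "Corporation" or "Inc.", but never
    # strip the last remaining token.
    while len(tokens) > 1 and tokens[-1] in CORPORATE_SUFFIXES:
        tokens.pop()
    key = " ".join(tokens)
    # Resolve known abbreviations; unknown keys pass through unchanged.
    return ALIASES.get(key, key)


if __name__ == "__main__":
    for mention in ["Microsoft", "Microsoft Corporation", "MS", "Microsoft Corp."]:
        print(mention, "->", canonical_company(mention))
```

A fixed alias table cannot tell when an abbreviation such as "MS" refers to a different entity in a different context, which is one reason the slide calls this a hard problem.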
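
Slides 16 to 20 describe a pipeline: extract boolean word features, train a naive Bayes classifier on hand-labeled tweets, then measure accuracy on a held-out labeled set. The sketch below walks through those steps in plain Python so the arithmetic is visible. It is a from-scratch illustration with add-one smoothing, not NLTK's implementation and not the code from the talk. It relies on Bayes' theorem, P(label | features) ∝ P(label) · Π P(feature | label), with the "naive" assumption that features are independent given the label. The function names train_naive_bayes, classify, and accuracy, the four-word vocabulary, and the toy tweets are all made up for the example; the labels "spam" and "not_s" follow slide 21.

```python
import math
from collections import defaultdict


def document_features(document, word_features):
    """Slide 18's extractor as a plain function: one boolean per known word."""
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in word_features}


def train_naive_bayes(train_set):
    """Count labels and (label, feature, value) triples in the training set."""
    label_counts = defaultdict(int)
    value_counts = defaultdict(int)
    for features, label in train_set:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name, value)] += 1
    return {"labels": dict(label_counts),
            "values": dict(value_counts),
            "total": len(train_set)}


def classify(model, features):
    """Return the label with the highest log-posterior score."""
    best_label, best_score = None, float("-inf")
    for label, n in model["labels"].items():
        # log P(label) + sum of log P(feature = value | label).
        score = math.log(n / model["total"])
        for name, value in features.items():
            count = model["values"].get((label, name, value), 0)
            # Add-one smoothing over the two possible values (True / False).
            score += math.log((count + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of held-out examples whose predicted label matches the hand label."""
    correct = sum(1 for features, label in test_set
                  if classify(model, features) == label)
    return correct / len(test_set)


# Toy, made-up data: tokenized tweets with hand-assigned labels.
word_features = ["followers", "http", "coffee", "meeting"]
train_docs = [
    (["get", "more", "followers", "http"], "spam"),
    (["new", "followers", "tool", "http"], "spam"),
    (["coffee", "before", "the", "meeting"], "not_s"),
    (["great", "meeting", "today"], "not_s"),
]
test_docs = [
    (["more", "followers", "http"], "spam"),
    (["coffee", "time"], "not_s"),
]
train_set = [(document_features(d, word_features), label) for d, label in train_docs]
test_set = [(document_features(d, word_features), label) for d, label in test_docs]

model = train_naive_bayes(train_set)

if __name__ == "__main__":
    print("accuracy on the toy test set:", accuracy(model, test_set))
```

With NLTK installed, the same train_set and test_set lists of (features, label) pairs could be passed to the calls shown on slides 19 and 20. The table on slide 21 has the shape of the output NLTK prints from show_most_informative_features().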