Analyzing Data With Python

Given at OSCON 2014

http://www.oscon.com/oscon2014/public/schedule/detail/34255

Published in Technology

Transcript

  • 1. Sarah Guido @sarah_guido Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON
  • 2. Data scientist at Reonomy University of Michigan graduate NYC Python organizer PyGotham organizer ABOUT ME
  • 3. Bird’s-eye overview: not comprehensive explanation of these tools! Take data from start-to-finish Preprocessing: Pandas Analysis: scikit-learn Analysis: nltk Data pipeline: MRjob Visualization: matplotlib What next? ABOUT THIS TALK
  • 4. So many tools Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability Community support “Easy” language to learn Both a scripting and production-ready language WHY PYTHON?
  • 5. How to find the best tool(s)? The 90/10 rule Simple is better than complex FROM POINT A TO POINT…X?
  • 6. Available resources Documentation, tutorials, books, videos Ease of use (with a grain of salt) Community support and continuous development Widely used WHY I CHOSE THESE TOOLS
  • 7. The importance of data preprocessing AKA wrangling, munging, manipulating, and so on Preprocessing is also getting to know your data Missing values? Categorical/continuous? Distribution? PREPROCESSING
  • 8. Data analysis and modeling Similar to R and Excel Easy-to-use data structures DataFrame Data wrangling tools Merging, pivoting, etc PANDAS
  • 9. Keep everything in Python Community support/resources Use for preprocessing File I/O, cleaning, manipulation, etc Combinable with other modules NumPy, SciPy, statsmodels, matplotlib PANDAS
  • 10. File I/O PANDAS
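The code shown on this slide is not part of the transcript. As a minimal sketch of file I/O with Pandas, the example below reads CSV data and writes it back out. The column names and values are hypothetical, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,borough,units
Alpha,Brooklyn,12
Beta,Queens,
Gamma,Brooklyn,30
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back out as CSV, leaving out the index column.
out = io.StringIO()
df.to_csv(out, index=False)
```

With a real file, the same calls would take a path, for example `pd.read_csv("data.csv")` and `df.to_csv("clean.csv", index=False)`.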
  • 11. Finding missing values PANDAS
  • 12. Removing missing values PANDAS
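The two slides above showed code that the transcript does not preserve. One possible way to find and remove missing values with Pandas is sketched below. The DataFrame contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", None, "Bronx"],
    "units": [12, np.nan, 30, 8],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Select the rows that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

# Drop every row that contains a missing value.
cleaned = df.dropna()
```

Dropping rows is the simplest option. `fillna` is the usual alternative when the rows are too valuable to discard.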
  • 13. Pivoting PANDAS
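The pivoting code from this slide is also missing from the transcript. A small sketch using `pivot_table` follows, with hypothetical column names and numbers.

```python
import pandas as pd

# Hypothetical long-format data: one row per (borough, year) pair.
sales = pd.DataFrame({
    "borough": ["Brooklyn", "Brooklyn", "Queens", "Queens"],
    "year": [2013, 2014, 2013, 2014],
    "price": [100, 120, 80, 90],
})

# Reshape to wide format: one row per borough, one column per year.
pivoted = sales.pivot_table(
    index="borough", columns="year", values="price", aggfunc="sum"
)
```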
  • 14. Other things Statistical methods Merge/join like SQL Time series Has some visualization functionality PANDAS
  • 15. Application of algorithms that learn from examples Representation and generalization Useful in everyday life Especially useful in data analysis MACHINE LEARNING
  • 16. Supervised learning Classification and regression Unsupervised learning Clustering and dimensionality reduction MACHINE LEARNING
  • 17. Machine learning module Open-source Built-in datasets Good resources for learning SCIKIT-LEARN
  • 18. Scikit-learn: your data has to be numeric (continuous) Here’s what one observation/label looks like: SCIKIT-LEARN
  • 19. Transform categorical values/labels SCIKIT-LEARN
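The transcript does not show how the talk transformed categorical values. One common approach, sketched here on the assumption that string labels need to become integers, is scikit-learn's `LabelEncoder`. The label values are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels.
labels = ["residential", "commercial", "residential", "industrial"]

encoder = LabelEncoder()

# Learn the set of classes and map each label to an integer code.
encoded = encoder.fit_transform(labels)

# The mapping is reversible, which helps when reporting predictions.
decoded = encoder.inverse_transform(encoded)
```

`LabelEncoder` is intended for target labels. For categorical input features, one-hot encoding (for example `pandas.get_dummies`) avoids implying an order among the categories.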
  • 20. Classification SCIKIT-LEARN
  • 21. Classification SCIKIT-LEARN
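The classification code on these two slides is not in the transcript. The sketch below shows the general scikit-learn workflow on the built-in iris dataset. The choice of a k-nearest-neighbors classifier and of this dataset is an assumption, not necessarily what the talk used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load one of scikit-learn's built-in datasets.
iris = load_iris()

# Hold out a quarter of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit the classifier on the training data only.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict labels for unseen data and measure accuracy.
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
```

Most scikit-learn estimators follow the same `fit`, `predict`, `score` pattern, so swapping in another classifier changes only the two lines that import and construct it.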
  • 22. Other things Very comprehensive collection of machine learning algorithms Preprocessing tools Methods for testing the accuracy of your model SCIKIT-LEARN
  • 23. Concerned with interactions between computers and human languages Derive meaning from text Many NLP algorithms are based on machine learning NATURAL LANGUAGE PROCESSING
  • 24. Natural Language ToolKit Access to over 50 corpora Corpus: body of text NLP tools Stemming, tokenizing, etc Resources for learning NLTK
  • 25. Stopword removal NLTK
  • 26. Stopword removal NLTK
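The stopword-removal code from these slides is not preserved. The idea can be sketched in plain Python with a tiny hand-picked stopword set. The set and the sample sentence are assumptions for illustration. NLTK ships a much fuller list, available as `nltk.corpus.stopwords.words("english")` once the corpus has been fetched with `nltk.download("stopwords")`.

```python
# A tiny hand-picked stopword set, used only for illustration.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "was", "in"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


filtered = remove_stopwords("Miss Taylor was the first to wed")
```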
  • 27. Stemming NLTK
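The stemming code is likewise missing from the transcript. A minimal sketch with NLTK's `PorterStemmer` is shown below. The choice of the Porter stemmer is an assumption, since NLTK provides several stemmers, and the word list is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical words to reduce to their stems.
words = ["running", "wedding", "fathers", "connected"]

# Stem each word independently; no corpus download is needed for this.
stems = [stemmer.stem(word) for word in words]
```

Stems are not always dictionary words. Lemmatizing is the alternative when real words are required.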
  • 28. Other things Lemmatizing, tokenization, tagging, parse trees Classification Chunking Sentence structure NLTK
  • 29. Data that takes too long to process on your machine Not “big data” but larger data Solution: MapReduce! Processing large datasets with a parallel, distributed algorithm Map step Reduce step PROCESSING LARGE DATA
  • 30. Map step Takes series of key/value pairs Ex. Word counts: break line into words, return word and count within line Reduce step Once for each unique key: iterates through values associated with that key Ex. Word counts: returns word and sum of all counts PROCESSING LARGE DATA
  • 31. Write MapReduce jobs in Python Test code locally without installing Hadoop Lots of thorough documentation A few things to know Keep everything in one class MRJob program in a separate file Output to new file if doing something like word counts MRJOB
  • 32. Stemmed file Line 1: (‘miss’, 2), (‘taylor’, 1) Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1) And so on… MRJOB
  • 33. Map: Line 1: (‘miss’, 2), (‘taylor’, 1); Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1); Line 3: (‘first’, 1), (‘wed’, 1); Line 4: (‘father’, 1); Line 5: (‘father’, 1) Reduce: (‘miss’, 2); (‘taylor’, 2); (‘first’, 2); (‘wed’, 2); (‘father’, 2) MRJOB
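The map and reduce steps in this example can be imitated in plain Python, without Hadoop or MRJob. The input lines below are hypothetical, chosen only so that the per-line counts match the slide. The function names are also made up for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical stemmed lines that produce the per-line counts on the slide.
lines = [
    "miss taylor miss",
    "taylor first wed",
    "first wed",
    "father",
    "father",
]


def map_line(line):
    """Map step: emit (word, count within this line) pairs."""
    return list(Counter(line.split()).items())


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each unique word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Run the map step on every line, then reduce all emitted pairs.
mapped = [pair for line in lines for pair in map_line(line)]
totals = reduce_counts(mapped)
```

In a real MapReduce run the map calls happen in parallel on different machines, and the framework groups the pairs by key before the reduce step sees them.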
  • 34. Let’s count all words in the Gutenberg file Map step MRJOB
  • 35. Reduce (and run) step MRJOB
  • 36. Results Mapped counts reduced Key/val pairs MRJOB
  • 37. Other things Run on Hadoop clusters Can write highly complex jobs Works with Elasticsearch MRJOB
  • 38. The “final step” Conveying your results in a meaningful way Literally see what’s going on DATA VISUALIZATION
  • 39. 2D visualization library Very VERY widely used Wide variety of plots Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc) MATPLOTLIB
  • 40. Remember this? MATPLOTLIB
  • 41. Bar chart of distribution MATPLOTLIB
  • 42. Let’s graph our word count frequencies (Hint: It’s a power law distribution!) MATPLOTLIB
  • 43. High frequency of low numbers, low frequency of high numbers MATPLOTLIB
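The plots themselves are not in the transcript. One way to chart word count frequencies with matplotlib is sketched below as a rank versus frequency plot on log-log axes, where a power law shows up as a roughly straight line. The sample text is hypothetical, and the talk may have used a different kind of plot.

```python
import io
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # Render off-screen so no display is required.
import matplotlib.pyplot as plt

# Hypothetical text; a real run would use the word counts from the job.
text = "the cat and the dog and the bird saw the cat"
counts = Counter(text.split())

# Sort frequencies from most to least common and assign ranks 1..n.
frequencies = sorted(counts.values(), reverse=True)
ranks = list(range(1, len(frequencies) + 1))

fig, ax = plt.subplots()
ax.loglog(ranks, frequencies, marker="o")
ax.set_xlabel("Rank")
ax.set_ylabel("Frequency")
ax.set_title("Word frequencies")

# Save the figure to an in-memory PNG; a file path works the same way.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```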
  • 44. Other things Many different kinds of graphs Customizable Time series MATPLOTLIB
  • 45. Phew! Which tool to choose depends on your needs Workflow: Preprocess Analyze Visualize WHAT NEXT?
  • 46. Pandas http://pandas.pydata.org/ scikit-learn http://scikit-learn.org/ NLTK http://www.nltk.org/ MRJob http://mrjob.readthedocs.org/ matplotlib http://matplotlib.org/ RESOURCES
  • 47. Twitter @sarah_guido LinkedIn https://www.linkedin.com/in/sarahguido NYC Python http://www.meetup.com/nycpython/ CONTACT ME!
  • 48. AND FINALLY…
  • 49. Questions? THE END!