Analyzing Data With Python
Given at OSCON 2014


Published in Technology
  • 1. Sarah Guido @sarah_guido Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON
  • 2. Data scientist at Reonomy University of Michigan graduate NYC Python organizer PyGotham organizer ABOUT ME
  • 3. Bird’s-eye overview: not comprehensive explanation of these tools! Take data from start-to-finish Preprocessing: Pandas Analysis: scikit-learn Analysis: nltk Data pipeline: MRjob Visualization: matplotlib What next? ABOUT THIS TALK
  • 4. So many tools Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability Community support “Easy” language to learn Both a scripting and production-ready language WHY PYTHON?
  • 5. How to find the best tool(s)? The 90/10 rule Simple is better than complex FROM POINT A TO POINT…X?
  • 6. Available resources Documentation, tutorials, books, videos Ease of use (with a grain of salt) Community support and continuous development Widely used WHY I CHOSE THESE TOOLS
  • 7. The importance of data preprocessing AKA wrangling, munging, manipulating, and so on Preprocessing is also getting to know your data Missing values? Categorical/continuous? Distribution? PREPROCESSING
  • 8. Data analysis and modeling Similar to R and Excel Easy-to-use data structures DataFrame Data wrangling tools Merging, pivoting, etc PANDAS
  • 9. Keep everything in Python Community support/resources Use for preprocessing File I/O, cleaning, manipulation, etc Combinable with other modules NumPy, SciPy, statsmodels, matplotlib PANDAS
  • 10. File I/O PANDAS
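The code shown on this slide did not survive in the transcript. As a minimal sketch of pandas file I/O, assuming a small hypothetical CSV (the column names and values below are invented for illustration), the following reads the data from an in-memory string so that it runs without a file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV content; with a real file you would pass its path instead.
csv_text = """name,borough,units
A,Manhattan,10
B,Brooklyn,
C,Queens,4
"""

# read_csv parses the text into a DataFrame; the empty "units" cell becomes NaN.
df = pd.read_csv(io.StringIO(csv_text))

# to_csv with no path returns the CSV as a string; index=False omits the row index.
round_trip = df.to_csv(index=False)

print(df.shape)
print(round_trip)
```

With a real dataset, pd.read_csv("some_file.csv") and df.to_csv("out.csv", index=False) are the usual equivalents; both file names here are placeholders.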
  • 11. Finding missing values PANDAS
  • 12. Removing missing values PANDAS
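The two slides above (finding and removing missing values) can be sketched roughly as follows. The DataFrame is hypothetical, and isnull/dropna are one common way to do this in pandas, not necessarily the calls shown in the talk:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values (NaN).
df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, 300.0],
    "rooms": [2.0, 3.0, np.nan, 4.0],
})

# Finding: count the missing values in each column.
missing_per_column = df.isnull().sum()

# Removing: drop every row that contains at least one missing value.
cleaned = df.dropna()

print(missing_per_column)
print(cleaned)
```

Dropping rows is the bluntest option; df.fillna(value) is the usual alternative when losing rows would discard too much data.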
  • 13. Pivoting PANDAS
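A minimal sketch of pivoting, assuming hypothetical long-format data with one row per (city, year) pair; the column names are invented for illustration:

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year) observation.
df = pd.DataFrame({
    "city": ["NYC", "NYC", "Boston", "Boston"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [10, 12, 7, 9],
})

# Pivot to wide format: one row per city, one column per year.
wide = df.pivot(index="city", columns="year", values="sales")

print(wide)
```

pivot requires each (index, columns) pair to be unique; when duplicates exist, pivot_table with an aggregation function is the usual choice.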
  • 14. Other things Statistical methods Merge/join like SQL Time series Has some visualization functionality PANDAS
  • 15. Application of algorithms that learn from examples Representation and generalization Useful in everyday life Especially useful in data analysis MACHINE LEARNING
  • 16. Supervised learning Classification and regression Unsupervised learning Clustering and dimensionality reduction MACHINE LEARNING
  • 17. Machine learning module Open-source Built-in datasets Good resources for learning SCIKIT-LEARN
  • 18. Scikit-learn: your data has to be continuous Here’s what one observation/label looks like: SCIKIT-LEARN
  • 19. Transform categorical values/labels SCIKIT-LEARN
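One way to turn categorical labels into numbers is scikit-learn's LabelEncoder. This is a sketch with made-up labels; the slide's own code is not in the transcript, so the choice of encoder is an assumption:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["cat", "dog", "cat", "bird"]

encoder = LabelEncoder()

# fit_transform learns the distinct classes (stored sorted) and maps each label to its index.
encoded = encoder.fit_transform(labels)

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
```

LabelEncoder is aimed at target labels. For categorical input features, OneHotEncoder is often a better fit because it does not impose an ordering on the categories.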
  • 20. Classification SCIKIT-LEARN
  • 21. Classification SCIKIT-LEARN
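The classification code from these slides is not in the transcript. As a hedged sketch of the general fit/predict workflow, the following uses one of scikit-learn's built-in datasets (iris) and a k-nearest-neighbors classifier; both choices are assumptions made for illustration, not necessarily what the talk used:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: X holds numeric features, y holds integer class labels.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data so accuracy is measured on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed classifier; any scikit-learn estimator follows the same fit/predict pattern.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)  # fraction of test labels predicted correctly

print(accuracy)
```

Swapping in another estimator changes only the line that constructs clf.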
  • 22. Other things Very comprehensive of machine learning algorithms Preprocessing tools Methods for testing the accuracy of your model SCIKIT-LEARN
  • 23. Concerned with interactions between computers and human languages Derive meaning from text Many NLP algorithms are based on machine learning NATURAL LANGUAGE PROCESSING
  • 24. Natural Language Toolkit Access to over 50 corpora Corpus: body of text NLP tools Stemming, tokenizing, etc Resources for learning NLTK
  • 25. Stopword removal NLTK
  • 26. Stopword removal NLTK
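Stopword removal amounts to filtering out very common words before analysis. The sketch below uses plain Python with a tiny, hypothetical stopword set and an invented sentence so that it runs without downloading any corpus; NLTK provides a fuller English list through nltk.corpus.stopwords.words("english") once the stopwords corpus has been downloaded.

```python
# A tiny, hypothetical stopword set for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "was", "in"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


# Invented example sentence.
filtered = remove_stopwords("The wedding of Miss Taylor was in the village")
print(filtered)  # ['wedding', 'miss', 'taylor', 'village']
```

Splitting on whitespace leaves punctuation attached to words; a real pipeline would normally tokenize first (for example with NLTK's tokenizers).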
  • 27. Stemming NLTK
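Stemming reduces words to a common root form. A minimal sketch using NLTK's PorterStemmer follows; the transcript does not say which stemmer the talk used, so the Porter stemmer and the word list are assumptions:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical input words.
words = ["wedding", "running", "missed"]

# stem() strips common suffixes according to the Porter rules.
stems = [stemmer.stem(word) for word in words]

print(stems)
```

The Porter stemmer works from suffix rules alone, so it needs no corpus download. Stems are not always dictionary words, which is the main difference from lemmatizing.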
  • 28. Other things Lemmatizing, tokenization, tagging, parse trees Classification Chunking Sentence structure NLTK
  • 29. Data that takes too long to process on your machine Not “big data” but larger data Solution: MapReduce! Processing large datasets with a parallel, distributed algorithm Map step Reduce step PROCESSING LARGE DATA
  • 30. Map step Takes series of key/value pairs Ex. Word counts: break line into words, return word and count within line Reduce step Once for each unique key: iterates through values associated with that key Ex. Word counts: returns word and sum of all counts PROCESSING LARGE DATA
  • 31. Write MapReduce jobs in Python Test code locally without installing Hadoop Lots of thorough documentation A few things to know Keep everything in one class MRJob program in a separate file Output to new file if doing something like word counts MRJOB
  • 32. Stemmed file Line 1: (‘miss’, 2), (‘taylor’, 1) Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1) And so on… MRJOB
  • 33. Map: Line 1: (‘miss’, 2), (‘taylor’, 1); Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1); Line 3: (‘first’, 1), (‘wed’, 1); Line 4: (‘father’, 1); Line 5: (‘father’, 1). Reduce: (‘miss’, 2), (‘taylor’, 2), (‘first’, 2), (‘wed’, 2), (‘father’, 2) MRJOB
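The map and reduce steps in the slide above can be sketched in plain Python, with no Hadoop or MRJob involved. The function names are hypothetical, and the five input lines are reconstructed so that they produce the same pairs as the slide:

```python
from collections import defaultdict


def map_step(line):
    """Map: return (word, count within this line) for each distinct word."""
    counts = defaultdict(int)
    for word in line.split():
        counts[word] += 1
    return list(counts.items())


def reduce_step(word, values):
    """Reduce: called once per unique key; sums all counts for that word."""
    return word, sum(values)


def word_count(lines):
    # Group mapped values by key. A MapReduce framework does this
    # between the map and reduce steps.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_step(line):
            grouped[word].append(count)
    return dict(reduce_step(word, values) for word, values in grouped.items())


# Stemmed lines reconstructed from the slide's example.
lines = ["miss miss taylor", "taylor first wed", "first wed", "father", "father"]
totals = word_count(lines)
print(totals)
```

In a real framework the map calls run in parallel across machines and grouping happens over the network; this single-process version only shows the data flow.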
  • 34. Let’s count all words in the Gutenberg file Map step MRJOB
  • 35. Reduce (and run) step MRJOB
  • 36. Results Mapped counts reduced Key/val pairs MRJOB
  • 37. Other things Run on Hadoop clusters Can write highly complex jobs Works with Elasticsearch MRJOB
  • 38. The “final step” Conveying your results in a meaningful way Literally see what’s going on DATA VISUALIZATION
  • 39. 2D visualization library Very VERY widely used Wide variety of plots Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc) MATPLOTLIB
  • 40. Remember this? MATPLOTLIB
  • 41. Bar chart of distribution MATPLOTLIB
  • 42. Let’s graph our word count frequencies (Hint: It’s a power law distribution!) MATPLOTLIB
  • 43. High frequency of low numbers, low frequency of high numbers MATPLOTLIB
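A rough sketch of this kind of plot: given word counts, tally how many words occur once, twice, and so on, then draw the result as a bar chart. The word counts below are invented, and the Agg backend and in-memory PNG buffer are assumptions that let the script run without a display or writing a file:

```python
import io
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

# Hypothetical word counts (word -> number of occurrences).
word_counts = {
    "emma": 8, "miss": 4, "taylor": 2, "first": 2,
    "wed": 1, "father": 1, "house": 1, "day": 1,
}

# For each count value, how many distinct words occur that many times.
frequency_of_counts = Counter(word_counts.values())
xs = sorted(frequency_of_counts)
ys = [frequency_of_counts[x] for x in xs]

fig, ax = plt.subplots()
ax.bar(xs, ys)
ax.set_xlabel("Times a word occurs")
ax.set_ylabel("Number of words")
ax.set_title("Word count frequencies")

# Render to an in-memory PNG; fig.savefig("plot.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Even on this toy data, many words occur once and few occur often. On a real corpus, ax.set_xscale("log") and ax.set_yscale("log") make a power-law shape easier to see.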
  • 44. Other things Many different kinds of graphs Customizable Time series MATPLOTLIB
  • 45. Phew! Which tool to choose depends on your needs Workflow: Preprocess Analyze Visualize WHAT NEXT?
  • 46. Pandas http://pandas.pydata.org/ scikit-learn http://scikit-learn.org/ NLTK http://www.nltk.org/ MRJob http://mrjob.readthedocs.org/ matplotlib http://matplotlib.org/ RESOURCES
  • 47. Twitter @sarah_guido LinkedIn https://www.linkedin.com/in/sarahguido NYC Python http://www.meetup.com/nycpython/ CONTACT ME!
  • 48. AND FINALLY…
  • 49. Questions? THE END!