Analyzing Data With Python


Given at OSCON 2014 and PyTennessee 2015.

Published in: Technology

  1. ANALYZING DATA WITH PYTHON. Sarah Guido (@sarah_guido), Reonomy. PYTN 2015.
  2. ABOUT ME: Data scientist at Reonomy. University of Michigan graduate. NYC Python organizer. PyGotham organizer.
  3. ABOUT THIS TALK: Bird’s-eye overview, not a comprehensive explanation of these tools! Take data from start to finish. Preprocessing: Pandas. Analysis: scikit-learn. Analysis: NLTK. Data pipeline: MRJob. Visualization: ggplot. What next?
  4. WHY PYTHON? So many tools: preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability. Community support. “Easy” language to learn. Both a scripting and a production-ready language.
  5. FROM POINT A TO POINT…X? How to find the best tool(s)? The 90/10 rule. Simple is better than complex.
  6. WHY I CHOSE THESE TOOLS: Available resources (documentation, tutorials, books, videos). Ease of use (with a grain of salt). Community support and continuous development. Widely used.
  7. PREPROCESSING: The importance of data preprocessing, AKA wrangling, cleaning, manipulating, and so on. Preprocessing is also getting to know your data: Missing values? Categorical/continuous? Distribution?
  8. PANDAS: Data analysis and modeling. Similar to R and Excel. Easy-to-use data structures (DataFrame). Data wrangling tools (merging, pivoting, etc.).
  9. PANDAS: Keep everything in Python. Community support/resources. Use for preprocessing (file I/O, cleaning, manipulation, etc.). Combinable with other modules (NumPy, SciPy, statsmodels, matplotlib).
  10. PANDAS: File I/O
  11. PANDAS: Finding missing values
  12. PANDAS: Removing missing values
  13. PANDAS: Pivoting
  14. PANDAS: Other things: statistical methods, merge/join like SQL, time series, some visualization functionality.
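
The code shown on the Pandas slides (File I/O, finding missing values, removing missing values, pivoting) did not survive in this transcript. The following is a minimal sketch of those four steps, not the speaker's original code. The CSV content and the column names `region`, `product`, and `units` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """region,product,units
east,apples,10
east,pears,
west,apples,7
west,pears,3
"""

# File I/O: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Finding missing values: count the NaN entries in each column.
missing_per_column = df.isnull().sum()

# Removing missing values: drop every row that contains a NaN.
clean = df.dropna()

# Pivoting: one row per region, one column per product, summed units.
pivoted = clean.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

print(missing_per_column)
print(pivoted)
```

Dropping rows is only one way to handle missing values; `fillna()` is the usual alternative when the rows are worth keeping.
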
  15. MACHINE LEARNING: Application of algorithms that learn from examples. Representation and generalization. Useful in everyday life. Especially useful in data analysis.
  16. MACHINE LEARNING: Supervised learning (classification and regression). Unsupervised learning (clustering and dimensionality reduction).
  17. SCIKIT-LEARN: Machine learning module. Open-source. Built-in datasets. Good resources for learning.
  18. SCIKIT-LEARN: Your data has to be continuous. Here’s what one observation/label looks like:
  19. SCIKIT-LEARN: Transform categorical values/labels
  20. SCIKIT-LEARN: Classification
  21. SCIKIT-LEARN: Classification
  22. SCIKIT-LEARN: Other things: a very comprehensive set of machine learning algorithms, preprocessing tools, methods for testing the accuracy of your model.
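
The scikit-learn slides on transforming categorical values and on classification were code screenshots that are missing here. As a rough sketch of the same idea (encode the categories as numbers, then fit a classifier), the example below uses `OrdinalEncoder`, `LabelEncoder`, and `DecisionTreeClassifier`. The toy data, the feature meanings, and the choice of a decision tree are assumptions for illustration, not taken from the talk.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each row is [price, size]; the label depends only on price.
X_raw = [["low", "small"], ["low", "big"], ["high", "small"], ["high", "big"]] * 5
y_raw = ["good", "good", "bad", "bad"] * 5

# Transform categorical feature values into numeric codes.
feature_encoder = OrdinalEncoder()
X = feature_encoder.fit_transform(X_raw)

# Transform the string labels into integer classes.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_raw)

# Classification: fit a decision tree on the encoded data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict for a new observation and map the class back to its label.
new_observation = feature_encoder.transform([["low", "big"]])
predicted = label_encoder.inverse_transform(clf.predict(new_observation))

# Accuracy on the training data only; a real evaluation needs held-out data.
training_accuracy = clf.score(X, y)

print(predicted[0], training_accuracy)
```

Ordinal codes impose an arbitrary order on the categories. Tree-based models tolerate that, but for linear models a one-hot encoding is usually the safer choice.
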
  23. NATURAL LANGUAGE PROCESSING: Concerned with interactions between computers and human languages. Derive meaning from text. Many NLP algorithms are based on machine learning.
  24. NLTK: Natural Language ToolKit. Access to over 50 corpora (corpus: body of text). NLP tools (stemming, tokenizing, etc.). Resources for learning.
  25. NLTK: Stopword removal
  26. NLTK: Stopword removal
  27. NLTK: Stemming
  28. NLTK: Other things: lemmatizing, tokenization, tagging, parse trees; classification; chunking; sentence structure.
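
The stopword-removal and stemming slides were also code screenshots. The sketch below shows both steps with NLTK's `PorterStemmer`. To stay self-contained it uses a tiny hand-written stopword set and `str.split()` for tokenizing; both are assumptions made here. NLTK ships a fuller stopword list in `nltk.corpus.stopwords`, which requires a one-time data download.

```python
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword set; NLTK's own list is much larger.
STOPWORDS = {"the", "were", "and", "a", "of"}

text = "the cats were running and jumped"

# Tokenize naively on whitespace (a simplification of NLTK's tokenizers).
tokens = text.lower().split()

# Stopword removal: keep only the words that carry meaning.
content_words = [word for word in tokens if word not in STOPWORDS]

# Stemming: reduce each remaining word to its stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in content_words]

print(stems)
```
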
  29. PROCESSING LARGE DATA: Data that takes too long to process on your machine. Not “big data” but larger data. Solution: MapReduce! Processing large datasets with a parallel, distributed algorithm: map step, reduce step.
  30. PROCESSING LARGE DATA: Map step: takes a series of key/value pairs. Ex. word counts: break line into words, return word and count within line. Reduce step: runs once for each unique key and iterates through the values associated with that key. Ex. word counts: returns word and sum of all counts.
  31. MRJOB: Write MapReduce jobs in Python. Test code locally without installing Hadoop. Lots of thorough documentation. A few things to know: keep everything in one class; MRJob program in a separate file; output to a new file if doing something like word counts.
  32. MRJOB: Stemmed file. Line 1: (‘miss’, 2), (‘taylor’, 1). Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1). And so on…
  33. MRJOB: Map: Line 1: (‘miss’, 2), (‘taylor’, 1); Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1); Line 3: (‘first’, 1), (‘wed’, 1); Line 4: (‘father’, 1); Line 5: (‘father’, 1). Reduce: (‘miss’, 2), (‘taylor’, 2), (‘first’, 2), (‘wed’, 2), (‘father’, 2).
  34. MRJOB: Let’s count all words in the Gutenberg file. Map step
  35. MRJOB: Reduce (and run) step
  36. MRJOB: Results. Mapped counts reduced. Key/val pairs.
  37. MRJOB: Other things: run on Hadoop clusters, can write highly complex jobs, works with Elasticsearch.
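
The map and reduce steps described in the word-count slides can be sketched in plain Python. This is not MRJob code and runs in a single process. It only illustrates the data flow that an MRJob mapper and reducer would implement. The function names and the five input lines are hypothetical, chosen so that the output matches the pairs listed on the Map/Reduce slide.

```python
from collections import Counter, defaultdict


def map_line(line):
    """Map step: break a line into words, emit (word, count within line)."""
    return list(Counter(line.split()).items())


def reduce_word(word, counts):
    """Reduce step: called once per unique word, sums all of its counts."""
    return word, sum(counts)


def run_word_count(lines):
    # Shuffle phase: group the mapped counts by word before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_line(line):
            grouped[word].append(count)
    return dict(reduce_word(word, counts) for word, counts in grouped.items())


# Hypothetical stemmed input lines matching the slide's example pairs.
lines = ["miss miss taylor", "taylor first wed", "first wed", "father", "father"]
totals = run_word_count(lines)
print(totals)
```

In a real MapReduce framework the grouping happens between machines, which is what lets the map and reduce calls run in parallel.
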
  38. DATA VISUALIZATION: The “final step”. Conveying your results in a meaningful way. Literally see what’s going on.
  39. DATA VISUALIZATION: Remember this?
  40. DATA VISUALIZATION: Bar chart of distribution
  41. GGPLOT: 2D visualization library. Based on the similar R library. Wrapper around matplotlib. Wide variety of nice-looking plots. Easy to feed in Pandas.
  42. GGPLOT: Bar chart of distribution
  43. GGPLOT: Layers. Aesthetics: aes(). No need for value_counts()!
  44. GGPLOT: Breakdown of class per maintenance type
  45. GGPLOT: Other things: many different kinds of graphs; customizable; smoothing, facets; time series; themes!
  46. GGPLOT: theme_xkcd()
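
The charts on the visualization slides are not included in this transcript. As a hedged sketch of the `value_counts()` route that the ggplot slides contrast themselves with, the example below computes a class distribution and a class-per-maintenance breakdown with pandas, and defines a helper that draws the bar chart through matplotlib rather than ggplot. The DataFrame contents, the column names `maint` and `class`, and the helper name `save_bar_chart` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: a maintenance level and an evaluation class per row.
cars = pd.DataFrame(
    {
        "maint": ["high", "high", "low", "low", "med"],
        "class": ["unacc", "unacc", "acc", "unacc", "acc"],
    }
)

# Distribution of the class column: how often each value occurs.
class_counts = cars["class"].value_counts()

# Breakdown of class per maintenance type as a contingency table.
per_maint = pd.crosstab(cars["maint"], cars["class"])


def save_bar_chart(counts, path):
    """Draw the counts as a bar chart and write it to an image file."""
    # Imported here so the counting code above works without matplotlib.
    import matplotlib

    matplotlib.use("Agg")  # render off-screen, no display required
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="bar")
    ax.set_ylabel("count")
    ax.figure.savefig(path)
    plt.close(ax.figure)


print(class_counts)
print(per_maint)
```

Calling `save_bar_chart(class_counts, "class_distribution.png")` would write the chart to disk, assuming matplotlib is installed.
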
  47. WHAT NEXT? Phew! Which tool to choose depends on your needs. Workflow: preprocess, analyze, visualize.
  48. RESOURCES: Pandas, scikit-learn, NLTK, MRJob, ggplot.
  49. CONTACT ME! Twitter: @sarah_guido. LinkedIn. NYC Python.
  50. AND FINALLY…
  51. THE END! Questions?