Analyzing Data With Python
Sarah Guido (@sarah_guido), Reonomy
PYTN 2015

Given at OSCON 2014 and PyTennessee 2015.

  1. ANALYZING DATA WITH PYTHON: Sarah Guido, @sarah_guido, Reonomy, PYTN 2015
  2. ABOUT ME: Data scientist at Reonomy; University of Michigan graduate; NYC Python organizer; PyGotham organizer
  3. ABOUT THIS TALK: Bird’s-eye overview, not a comprehensive explanation of these tools! Take data from start to finish. Preprocessing: Pandas; analysis: scikit-learn; analysis: NLTK; data pipeline: MRJob; visualization: ggplot. What next?
  4. WHY PYTHON? So many tools (preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability); community support; “easy” language to learn; both a scripting and production-ready language
  5. FROM POINT A TO POINT…X? How to find the best tool(s)? The 90/10 rule; simple is better than complex
  6. WHY I CHOSE THESE TOOLS: Available resources (documentation, tutorials, books, videos); ease of use (with a grain of salt); community support and continuous development; widely used
  7. PREPROCESSING: The importance of data preprocessing, AKA wrangling, cleaning, manipulating, and so on. Preprocessing is also getting to know your data: missing values? Categorical/continuous? Distribution?
  8. PANDAS: Data analysis and modeling; similar to R and Excel; easy-to-use data structures (DataFrame); data wrangling tools (merging, pivoting, etc.)
  9. PANDAS: Keep everything in Python; community support/resources; use for preprocessing (file I/O, cleaning, manipulation, etc.); combinable with other modules (NumPy, SciPy, statsmodels, matplotlib)
  10. PANDAS: File I/O
  11. PANDAS: Finding missing values
  12. PANDAS: Removing missing values
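
The code shown on the three slides above is not part of this transcript. As a rough sketch of the same steps (file I/O, finding missing values, removing missing values), assuming a small made-up CSV whose column names are purely illustrative, it might look like this:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a file on disk; with a real file,
# pass its path to pd.read_csv instead of a StringIO buffer.
csv_text = """buying,maint,doors,label
high,low,2,acc
low,,4,good
med,high,,unacc
low,low,4,good
"""

# File I/O: parse the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Finding missing values: count the NaN entries in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Removing missing values: drop every row that contains at least one NaN.
clean = df.dropna()
print(clean)
```

Dropping rows is only one option; whether it is appropriate depends on how much data is missing and why.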
  13. PANDAS: Pivoting
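
The pivoting slide's code is likewise missing from the transcript. A minimal sketch, assuming a made-up long-format table (the `region`, `quarter`, and `revenue` names are hypothetical), could be:

```python
import pandas as pd

# Hypothetical long-format data: one row per (region, quarter) observation.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [10, 12, 7, 9],
})

# Pivot: one row per region, one column per quarter, revenue as the cell values.
wide = sales.pivot(index="region", columns="quarter", values="revenue")
print(wide)
```

`DataFrame.pivot` expects each index/column pair to appear once; when pairs repeat, `pivot_table` with an aggregation function is the usual alternative.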
  14. PANDAS: Other things: statistical methods; merge/join like SQL; time series; some visualization functionality
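
To illustrate the "merge/join like SQL" point, here is a small sketch on two made-up tables (all names and values are hypothetical, not taken from the talk):

```python
import pandas as pd

# Two hypothetical tables that share an "owner_id" key.
owners = pd.DataFrame({"owner_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
cars = pd.DataFrame({"owner_id": [1, 1, 3], "model": ["sedan", "coupe", "van"]})

# Inner join on the shared key, comparable to
# SELECT ... FROM owners JOIN cars ON owners.owner_id = cars.owner_id
joined = pd.merge(owners, cars, on="owner_id", how="inner")
print(joined)
```

Changing `how` to `"left"`, `"right"`, or `"outer"` gives the corresponding SQL-style joins.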
  15. MACHINE LEARNING: Application of algorithms that learn from examples; representation and generalization; useful in everyday life; especially useful in data analysis
  16. MACHINE LEARNING: Supervised learning (classification and regression); unsupervised learning (clustering and dimensionality reduction)
  17. SCIKIT-LEARN: Machine learning module; open-source; built-in datasets; good resources for learning
  18. SCIKIT-LEARN: Your data has to be numeric. Here’s what one observation/label looks like:
  19. SCIKIT-LEARN: Transform categorical values/labels
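
The slide's transformation code is not in the transcript. One way to turn categorical labels into numbers, sketched here with scikit-learn's `LabelEncoder` and a made-up list of labels (an assumption, not necessarily the approach shown in the talk):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["unacc", "acc", "good", "acc", "unacc"]

encoder = LabelEncoder()
# fit_transform learns the distinct classes (sorted) and maps each to an integer.
encoded = encoder.fit_transform(labels)
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)
```

`LabelEncoder` is intended for target labels; for input features, encoders such as `OrdinalEncoder` or `OneHotEncoder` in the same module are usually the better fit.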
  20. SCIKIT-LEARN: Classification
  21. SCIKIT-LEARN: Classification
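
The classification code from these two slides is missing. As a hedged sketch of a typical workflow, the example below uses the built-in iris dataset and a decision tree; both choices are assumptions made for illustration and are not taken from the talk:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load one of scikit-learn's built-in datasets (features are already numeric).
iris = load_iris()

# Hold out a quarter of the observations to test the model.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit the classifier on the training split only.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predict on the held-out split and measure accuracy.
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"held-out accuracy: {accuracy:.2f}")
```

Other scikit-learn classifiers follow the same `fit` / `predict` pattern, so swapping the estimator usually requires changing only one line.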
  22. SCIKIT-LEARN: Other things: very comprehensive set of machine learning algorithms; preprocessing tools; methods for testing the accuracy of your model
  23. NATURAL LANGUAGE PROCESSING: Concerned with interactions between computers and human languages; derive meaning from text; many NLP algorithms are based on machine learning
  24. NLTK: Natural Language Toolkit; access to over 50 corpora (corpus: body of text); NLP tools (stemming, tokenizing, etc.); resources for learning
  25. NLTK: Stopword removal
  26. NLTK: Stopword removal
  27. NLTK: Stemming
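
The stopword-removal and stemming code from the slides above is not in the transcript. A minimal sketch of both steps follows; the sentence and the three-word stopword set are made up so that the example runs without downloading any NLTK corpora:

```python
from nltk.stem import PorterStemmer

# Hypothetical sentence and a tiny hand-written stopword set (assumptions
# made for illustration; not NLTK's full English stopword list).
text = "Miss Taylor missed the first wedding of her father"
stopwords = {"the", "of", "her"}

# Naive whitespace tokenization keeps the sketch free of extra downloads.
tokens = text.lower().split()

# Stopword removal: keep only the tokens that are not in the stopword set.
filtered = [token for token in tokens if token not in stopwords]

# Stemming: reduce each remaining token to its stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in filtered]
print(stems)
```

With the stopwords corpus installed via `nltk.download("stopwords")`, `nltk.corpus.stopwords.words("english")` can replace the hand-written set.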
  28. NLTK: Other things: lemmatizing, tokenization, tagging, parse trees; classification; chunking; sentence structure
  29. PROCESSING LARGE DATA: Data that takes too long to process on your machine; not “big data” but larger data. Solution: MapReduce! Processing large datasets with a parallel, distributed algorithm: map step, reduce step
  30. PROCESSING LARGE DATA: Map step: takes a series of key/value pairs (e.g., word counts: break each line into words, return each word and its count within the line). Reduce step: runs once for each unique key and iterates through the values associated with that key (e.g., word counts: returns each word and the sum of all its counts)
  31. MRJOB: Write MapReduce jobs in Python; test code locally without installing Hadoop; lots of thorough documentation. A few things to know: keep everything in one class; MRJob program in a separate file; output to a new file if doing something like word counts
  32. MRJOB: Stemmed file. Line 1: (‘miss’, 2), (‘taylor’, 1). Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1). And so on…
  33. MRJOB: Map. Line 1: (‘miss’, 2), (‘taylor’, 1); Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1); Line 3: (‘first’, 1), (‘wed’, 1); Line 4: (‘father’, 1); Line 5: (‘father’, 1). Reduce. (‘miss’, 2), (‘taylor’, 2), (‘first’, 2), (‘wed’, 2), (‘father’, 2)
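
The map and reduce steps described above can be sketched in plain Python, without MRJob or Hadoop. The five input lines below are constructed to reproduce the per-line counts on the slide; the order of words within each line and the helper names `map_step` and `reduce_step` are assumptions for illustration:

```python
from collections import Counter, defaultdict

# Stemmed lines chosen to match the slide's per-line counts (word order assumed).
lines = [
    "miss taylor miss",
    "taylor first wed",
    "first wed",
    "father",
    "father",
]


def map_step(line):
    """Emit (word, count-within-line) pairs for one line."""
    return list(Counter(line.split()).items())


def reduce_step(word, counts):
    """Sum every count emitted for one unique word."""
    return word, sum(counts)


# Group the mapped values by key so each word is reduced exactly once.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_step(line):
        grouped[word].append(count)

totals = dict(reduce_step(word, counts) for word, counts in grouped.items())
print(totals)
```

In a real MapReduce framework the grouping between the two steps is handled for you, and the map and reduce calls can run in parallel across machines.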
  34. MRJOB: Let’s count all words in the Gutenberg file. Map step
  35. MRJOB: Reduce (and run) step
  36. MRJOB: Results: mapped counts reduced; key/val pairs
  37. MRJOB: Other things: run on Hadoop clusters; can write highly complex jobs; works with Elasticsearch
  38. DATA VISUALIZATION: The “final step”; conveying your results in a meaningful way; literally see what’s going on
  39. DATA VISUALIZATION: Remember this?
  40. DATA VISUALIZATION: Bar chart of distribution
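
The chart itself is not in the transcript. The numbers behind a bar chart of a distribution can be computed with pandas' `value_counts()`; the label values below are made up for illustration:

```python
import pandas as pd

# Hypothetical class labels, one per observation.
labels = pd.Series(["unacc", "acc", "unacc", "good", "unacc", "acc"])

# value_counts() tallies each distinct value, most frequent first.
counts = labels.value_counts()
print(counts)
```

With matplotlib installed, `counts.plot(kind="bar")` draws these tallies as a bar chart.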
  41. GGPLOT: 2D visualization library; based on the similar R library (ggplot2); wrapper around matplotlib; wide variety of nice-looking plots; easy to feed in Pandas DataFrames
  42. GGPLOT: Bar chart of distribution
  43. GGPLOT: Layers; aesthetics: aes(); no need for value_counts()!
  44. GGPLOT: Breakdown of class per maintenance type
  45. GGPLOT: Other things: many different kinds of graphs; customizable; smoothing, facets; time series; themes!
  46. GGPLOT: theme_xkcd()
  47. WHAT NEXT? Phew! Which tool to choose depends on your needs. Workflow: preprocess, analyze, visualize
  48. RESOURCES: Pandas (http://pandas.pydata.org/), scikit-learn (http://scikit-learn.org/), NLTK (http://www.nltk.org/), MRJob (http://mrjob.readthedocs.org/), ggplot (http://ggplot.yhathq.com/)
  49. CONTACT ME! Twitter (@sarah_guido), LinkedIn (https://www.linkedin.com/in/sarahguido), NYC Python (http://www.meetup.com/nycpython/)
  50. AND FINALLY…
  51. THE END! Questions?
