2. Data scientist at Reonomy
University of Michigan graduate
NYC Python organizer
PyGotham organizer
ABOUT ME
3. Bird’s-eye overview: not a comprehensive
explanation of these tools!
Take data from start to finish
Preprocessing: Pandas
Analysis: scikit-learn
Analysis: nltk
Data pipeline: MRJob
Visualization: ggplot
What next?
ABOUT THIS TALK
4. So many tools
Preprocessing, analysis, statistics, machine learning,
natural language processing, network analysis,
visualization, scalability
Community support
“Easy” language to learn
Both a scripting language and a
production-ready language
WHY PYTHON?
5. How to find the best tool(s)?
The 90/10 rule
Simple is better than complex
FROM POINT A TO POINT…X?
7. The importance of data preprocessing
AKA wrangling, cleaning, manipulating, and so on
Preprocessing is also getting to know your
data
Missing values? Categorical/continuous?
Distribution?
PREPROCESSING
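
A minimal sketch of the "getting to know your data" questions above, using Pandas. The column names and values are invented purely for illustration and are not from the talk.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", None, "Queens", "Brooklyn"],
    "price": [1200.0, 850.0, 990.0, None, 910.0],
})

# Missing values? Count the nulls in each column.
missing = df.isna().sum()

# Categorical/continuous? The column dtypes give a first hint.
kinds = df.dtypes

# Distribution? Summary statistics for the numeric column,
# value counts for the categorical one.
price_summary = df["price"].describe()
borough_counts = df["borough"].value_counts()

print(missing, kinds, price_summary, borough_counts, sep="\n\n")
```

On real data these few calls are usually a starting point; which follow-up checks matter depends on the dataset.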
8. Data analysis and modeling
Similar to R and Excel
Easy-to-use data structures
DataFrame
Data wrangling tools
Merging, pivoting, etc.
PANDAS
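
As a rough illustration of the DataFrame and the wrangling tools named above, the sketch below merges two small tables and pivots the result. The tables are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})
stores = pd.DataFrame({
    "store": ["A", "B"],
    "city": ["New York", "Ann Arbor"],
})

# Merging: a SQL-style left join on the shared "store" column.
merged = sales.merge(stores, on="store", how="left")

# Pivoting: one row per city, one column per quarter.
wide = merged.pivot_table(
    index="city", columns="quarter", values="revenue", aggfunc="sum"
)

print(wide)
```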
9. Keep everything in Python
Community support/resources
Use for preprocessing
File I/O, cleaning, manipulation, etc.
Combinable with other modules
NumPy, SciPy, statsmodels, matplotlib
PANDAS
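
One possible shape for a file I/O and cleaning pass, sketched with an in-memory CSV so it runs without any files on disk. The CSV content is made up, and the cleaning choices (dropping duplicates, upper-casing, median fill) are assumptions for illustration, not recommendations from the talk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path on disk.
raw = "name,age,city\nann,34,NYC\nbob,,nyc\nann,34,NYC\n"

# File input: read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(raw))

# Cleaning: drop exact duplicate rows and normalize the city labels.
df = df.drop_duplicates()
df["city"] = df["city"].str.upper()

# Manipulation: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# File output: write the cleaned table back out as CSV text.
out = io.StringIO()
df.to_csv(out, index=False)
cleaned_csv = out.getvalue()

print(cleaned_csv)
```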
15. Application of algorithms that learn from
examples
Representation and generalization
Useful in everyday life
Especially useful in data analysis
MACHINE LEARNING
22. Other things
Very comprehensive set of machine learning
algorithms
Preprocessing tools
Methods for testing the accuracy of your model
SCIKIT-LEARN
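
A small sketch of the three pieces listed above (preprocessing, an algorithm, and accuracy testing), assuming a recent scikit-learn release where these helpers live in `sklearn.model_selection`. The iris dataset and logistic regression are stand-ins chosen for brevity, not the examples used in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the examples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (feature scaling) chained with a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Testing accuracy: held-out score plus 5-fold cross-validation.
accuracy = accuracy_score(y_test, model.predict(X_test))
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"held-out accuracy: {accuracy:.2f}")
print(f"cross-validation mean: {cv_scores.mean():.2f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the test folds do not leak into the preprocessing step.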
23. Concerned with interactions between
computers and human languages
Derive meaning from text
Many NLP algorithms are based on machine
learning
NATURAL LANGUAGE PROCESSING
24. Natural Language Toolkit
Access to over 50 corpora
Corpus: body of text
NLP tools
Stemming, tokenizing, etc.
Resources for learning
NLTK
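
Tokenizing and stemming might look roughly like this. The sketch uses `RegexpTokenizer` and `PorterStemmer` because neither needs an extra data download; other NLTK tokenizers and the corpora do require fetching data first. The sample sentence is invented.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample text, invented for illustration only.
text = "the cats were running"

# Tokenizing: split the text into runs of word characters.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text)

# Stemming: reduce each token to a crude root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```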
29. Data that takes too long to process on your
machine
Not “big data” but larger data
Solution: MapReduce!
Processing large datasets with a parallel,
distributed algorithm
Map step
Reduce step
PROCESSING LARGE DATA
30. Map step
Takes a series of key/value pairs
Ex. Word counts: break line into words, return
word and count within line
Reduce step
Once for each unique key: iterates through values
associated with that key
Ex. Word counts: returns word and sum of all
counts
PROCESSING LARGE DATA
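
The word-count example above can be sketched in plain Python on a single machine. This is a toy illustration of the map and reduce steps only: the function names are hypothetical, the `shuffle` helper stands in for the grouping a MapReduce framework does between the two steps, and nothing here is distributed.

```python
from collections import defaultdict


def map_step(line):
    """Break a line into words; return (word, count within line) pairs."""
    counts = defaultdict(int)
    for word in line.lower().split():
        counts[word] += 1
    return list(counts.items())


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(word, counts):
    """Called once per unique key: sum all counts for that word."""
    return word, sum(counts)


# Hypothetical input lines, invented for illustration only.
lines = ["the quick brown fox", "the lazy dog and the fox"]

mapped = [pair for line in lines for pair in map_step(line)]
word_counts = dict(
    reduce_step(word, counts) for word, counts in shuffle(mapped).items()
)

print(word_counts)
```

In a real job the map calls would run in parallel across machines, and each reduce call would receive only the values for its own key.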
31. Write MapReduce jobs in Python
Test code locally without installing Hadoop
Lots of thorough documentation
A few things to know
Keep everything in one class
MRJob program in a separate file
Output to new file if doing something like word
counts
MRJOB
41. 2D visualization library
Based on R's ggplot2 library
Wrapper around matplotlib
Wide variety of nice-looking plots
Easy to feed in Pandas DataFrames
GGPLOT