Analyzing Data With Python
Given at OSCON 2014 and PyTennessee 2015.

Published in: Technology
Transcript

  • 1. Sarah Guido @sarah_guido Reonomy PYTN 2015 ANALYZING DATA WITH PYTHON
  • 2. Data scientist at Reonomy University of Michigan graduate NYC Python organizer PyGotham organizer ABOUT ME
  • 3. Bird’s-eye overview: not a comprehensive explanation of these tools! Take data from start to finish Preprocessing: Pandas Analysis: scikit-learn Analysis: nltk Data pipeline: MRjob Visualization: ggplot What next? ABOUT THIS TALK
  • 4. So many tools Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability Community support “Easy” language to learn Both a scripting and production-ready language WHY PYTHON?
  • 5. How to find the best tool(s)? The 90/10 rule Simple is better than complex FROM POINT A TO POINT…X?
  • 6. Available resources Documentation, tutorials, books, videos Ease of use (with a grain of salt) Community support and continuous development Widely used WHY I CHOSE THESE TOOLS
  • 7. The importance of data preprocessing AKA wrangling, cleaning, manipulating, and so on Preprocessing is also getting to know your data Missing values? Categorical/continuous? Distribution? PREPROCESSING
  • 8. Data analysis and modeling Similar to R and Excel Easy-to-use data structures DataFrame Data wrangling tools Merging, pivoting, etc PANDAS
  • 9. Keep everything in Python Community support/resources Use for preprocessing File I/O, cleaning, manipulation, etc Combinable with other modules NumPy, SciPy, statsmodels, matplotlib PANDAS
  • 10. File I/O PANDAS
  • 11. Finding missing values PANDAS
  • 12. Removing missing values PANDAS
  • 13. Pivoting PANDAS
  • 14. Other things Statistical methods Merge/join like SQL Time series Has some visualization functionality PANDAS
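The code shown on slides 10 to 13 did not survive in this transcript. The following is a minimal sketch of those four steps (file I/O, finding missing values, removing them, pivoting), not the speaker's original code. The CSV contents and the column names `city`, `year`, and `price` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real file on disk.
raw_csv = io.StringIO(
    "city,year,price\n"
    "Nashville,2014,100\n"
    "Nashville,2015,\n"
    "Portland,2014,150\n"
    "Portland,2015,175\n"
)

# File I/O: read_csv accepts a path or any file-like object.
df = pd.read_csv(raw_csv)

# Finding missing values: count NaNs per column.
missing_per_column = df.isnull().sum()

# Removing missing values: drop every row that contains a NaN.
clean = df.dropna()

# Pivoting: one row per city, one column per year.
pivoted = clean.pivot(index="city", columns="year", values="price")

print(missing_per_column)
print(pivoted)
```

Dropping rows is only one way to handle missing values; depending on the data, filling them with `fillna` may be the better choice.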
  • 15. Application of algorithms that learn from examples Representation and generalization Useful in everyday life Especially useful in data analysis MACHINE LEARNING
  • 16. Supervised learning Classification and regression Unsupervised learning Clustering and dimensionality reduction MACHINE LEARNING
  • 17. Machine learning module Open-source Built-in datasets Good resources for learning SCIKIT-LEARN
  • 18. Scikit-learn: your data has to be numeric Here’s what one observation/label looks like: SCIKIT-LEARN
  • 19. Transform categorical values/labels SCIKIT-LEARN
  • 20. Classification SCIKIT-LEARN
  • 21. Classification SCIKIT-LEARN
  • 22. Other things Very comprehensive set of machine learning algorithms Preprocessing tools Methods for testing the accuracy of your model SCIKIT-LEARN
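Slides 19 to 21 transform categorical values and then fit a classifier, but the code itself is missing from the transcript. Below is a rough sketch of that workflow under stated assumptions: the feature names `maint` and `safety`, their values, and the labels are all hypothetical, and the decision tree is just one possible classifier, not necessarily the one used in the talk.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical observations: two categorical features and a label.
maint = ["low", "high", "med", "high", "low", "med"]
safety = ["high", "low", "med", "high", "low", "high"]
labels = ["acc", "unacc", "acc", "acc", "unacc", "acc"]

# One encoder per column; each maps its strings to integer codes.
maint_enc, safety_enc, label_enc = LabelEncoder(), LabelEncoder(), LabelEncoder()
X = np.column_stack([
    maint_enc.fit_transform(maint),
    safety_enc.fit_transform(safety),
])
y = label_enc.fit_transform(labels)

# Fit a classifier on the encoded data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Encode a new observation with the same encoders, then decode the prediction.
new_obs = np.array([[maint_enc.transform(["low"])[0],
                     safety_enc.transform(["high"])[0]]])
predicted = label_enc.inverse_transform(clf.predict(new_obs))
print(predicted[0])
```

Integer codes imply an ordering between categories. Tree-based models tolerate that reasonably well, but for linear models a one-hot encoding is usually the safer choice.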
  • 23. Concerned with interactions between computers and human languages Derive meaning from text Many NLP algorithms are based on machine learning NATURAL LANGUAGE PROCESSING
  • 24. Natural Language Toolkit Access to over 50 corpora Corpus: body of text NLP tools Stemming, tokenizing, etc Resources for learning NLTK
  • 25. Stopword removal NLTK
  • 26. Stopword removal NLTK
  • 27. Stemming NLTK
  • 28. Other things Lemmatizing, tokenization, tagging, parse trees Classification Chunking Sentence structure NLTK
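The stopword removal and stemming code from slides 25 to 27 is not in the transcript. The following sketch shows the general idea using NLTK's `PorterStemmer`. The stopword set here is a deliberately tiny, hypothetical list so the example runs without downloading any corpus, and the `preprocess` helper and sample sentence are illustrative names, not taken from the talk.

```python
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword list. NLTK ships a fuller English
# list via nltk.corpus.stopwords, which needs a one-time corpus download.
STOPWORDS = {"the", "a", "an", "and", "of", "was", "to", "in", "had"}


def preprocess(text):
    """Lowercase, split on whitespace, drop stopwords, stem the rest."""
    stemmer = PorterStemmer()
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    kept = [t for t in tokens if t and t not in STOPWORDS]
    return [stemmer.stem(t) for t in kept]


stems = preprocess("The wedding of Miss Taylor was the first.")
print(stems)
```

Whitespace splitting stands in for a real tokenizer here; NLTK's own tokenizers handle punctuation and contractions more carefully.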
  • 29. Data that takes too long to process on your machine Not “big data” but larger data Solution: MapReduce! Processing large datasets with a parallel, distributed algorithm Map step Reduce step PROCESSING LARGE DATA
  • 30. Map step Takes series of key/value pairs Ex. Word counts: break line into words, return word and count within line Reduce step Once for each unique key: iterates through values associated with that key Ex. Word counts: returns word and sum of all counts PROCESSING LARGE DATA
  • 31. Write MapReduce jobs in Python Test code locally without installing Hadoop Lots of thorough documentation A few things to know Keep everything in one class MRJob program in a separate file Output to new file if doing something like word counts MRJOB
  • 32. Stemmed file Line 1: (‘miss’, 2), (‘taylor’, 1) Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1) And so on… MRJOB
  • 33. Map: Line 1: (‘miss’, 2), (‘taylor’, 1); Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1); Line 3: (‘first’, 1), (‘wed’, 1); Line 4: (‘father’, 1); Line 5: (‘father’, 1) Reduce: (‘miss’, 2), (‘taylor’, 2), (‘first’, 2), (‘wed’, 2), (‘father’, 2) MRJOB
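The map and reduce steps described on slides 30 to 33 can be sketched in plain Python. This is not MRJob code: it uses only the standard library to show the flow of key/value pairs, and the function names and input lines are hypothetical, chosen to reproduce the pairs listed on slide 33.

```python
from collections import Counter, defaultdict


def map_step(line):
    """Emit (word, count-within-line) pairs for one line of text."""
    return list(Counter(line.split()).items())


def reduce_step(word, counts):
    """Called once per unique word with all of its per-line counts."""
    return word, sum(counts)


def word_count(lines):
    # Shuffle: group every mapped value under its key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_step(line):
            grouped[word].append(count)
    # Reduce: one call per unique key.
    return dict(reduce_step(w, c) for w, c in grouped.items())


# Hypothetical stemmed lines chosen to reproduce the pairs on slide 33.
lines = ["miss taylor miss", "taylor first wed", "first wed", "father", "father"]
totals = word_count(lines)
print(totals)
```

In a real MapReduce framework the grouping between the two steps is done by the framework across machines; here a dictionary plays that role on a single machine.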
  • 34. Let’s count all words in the Gutenberg file Map step MRJOB
  • 35. Reduce (and run) step MRJOB
  • 36. Results Mapped counts reduced Key/val pairs MRJOB
  • 37. Other things Run on Hadoop clusters Can write highly complex jobs Works with Elasticsearch MRJOB
  • 38. The “final step” Conveying your results in a meaningful way Literally see what’s going on DATA VISUALIZATION
  • 39. Remember this? DATA VISUALIZATION
  • 40. Bar chart of distribution DATA VISUALIZATION
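The chart on slide 40 is not in the transcript. One common way to draw a bar chart of a distribution with pandas and matplotlib is sketched below. The class labels are hypothetical, and this may differ from the code the speaker showed.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical labels standing in for the class column of a dataset.
classes = pd.Series(["unacc", "acc", "unacc", "good", "unacc", "acc"])

# value_counts() tallies each distinct value, most frequent first.
distribution = classes.value_counts()

fig, ax = plt.subplots()
distribution.plot(kind="bar", ax=ax)
ax.set_xlabel("class")
ax.set_ylabel("count")

# Write the PNG to an in-memory buffer; pass a filename instead to save to disk.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```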
  • 41. 2D visualization library Based on the similar R library Wrapper around matplotlib Wide variety of nice-looking plots Easy to feed in Pandas GGPLOT
  • 42. Bar chart of distribution GGPLOT
  • 43. Layers Aesthetics aes() No need for value_counts()! GGPLOT
  • 44. Breakdown of class per maintenance type GGPLOT
  • 45. Other things Many different kinds of graphs Customizable Smoothing, facets Time series Themes! GGPLOT
  • 46. theme_xkcd() GGPLOT
  • 47. Phew! Which tool to choose depends on your needs Workflow: Preprocess Analyze Visualize WHAT NEXT?
  • 48. Pandas http://pandas.pydata.org/ scikit-learn http://scikit-learn.org/ NLTK http://www.nltk.org/ MRJob http://mrjob.readthedocs.org/ ggplot http://ggplot.yhathq.com/ RESOURCES
  • 49. Twitter @sarah_guido LinkedIn https://www.linkedin.com/in/sarahguido NYC Python http://www.meetup.com/nycpython/ CONTACT ME!
  • 50. AND FINALLY…
  • 51. Questions? THE END!
