
DC Python Meetup



  1. PANDAS: not your ordinary fuzzy bear. Agenda: 6:45-6:50 Why do we need pandas; 6:50-7:00 General information (download, etc.); 7:00-7:15 Quick tour and use cases using Python notebooks; 7:15-7:30 Q&A
  2. Data science has been around for much longer than it has been buzzworthy. At its core we simply apply the scientific method to data: (data) scientists ask questions, do research, construct hypotheses, test, analyze, interpret results, and repeat.
  3. Too much testing, too little observation and preparation. We think data looks clean and pretty, but the reality is that it looks... well... like...
  4. But, "ETL is not as fun as statistics, machine learning..." Cue the panda! Time to show you that it is.
  5. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. You can download pandas from standard PyPI. I recommend, however, that you try the free Anaconda distribution from Continuum Analytics: it includes pandas, NumPy, SciPy, Python notebooks, and the Spyder editor. Quick tour using an open financial data set: 1. data cleansing, manipulation, merging, etc.; 2. web crawling and data preparation using a Scrapy-pandas pipeline.
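The quick tour starts from loading data into a DataFrame. A minimal sketch of that first step, using an inline stand-in for an open financial data set (the column names and values here are hypothetical, not the talk's actual file):

```python
import io
import pandas as pd

# Stand-in for a downloaded financial CSV file.
csv_data = io.StringIO(
    "date,ticker,close\n"
    "2013-01-02,AAPL,19.61\n"
    "2013-01-03,AAPL,19.36\n"
)

# read_csv parses the file and infers column types;
# parse_dates converts the date column to datetime64.
df = pd.read_csv(csv_data, parse_dates=["date"])
print(df.head())
```

In practice you would pass a file path or URL to `read_csv` instead of the in-memory buffer.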
  6. Features:
     - A fast and efficient DataFrame object for data manipulation with integrated indexing;
     - Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
     - Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
     - Flexible reshaping and pivoting of data sets;
     - Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
     - Columns can be inserted and deleted from data structures for size mutability;
     - Aggregating or transforming data with a powerful group-by engine allowing split-apply-combine operations on data sets;
     - High-performance merging and joining of data sets;
     - Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
     - Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data.
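A couple of the features above can be sketched in a few lines: integrated missing-data handling and split-apply-combine with `groupby` (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"region": ["East", "East", "West", "West"],
     "sales": [100.0, np.nan, 250.0, 300.0]}
)

# Missing data: replace the NaN with the column mean
# (mean() skips NaN by default).
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Split-apply-combine: split by region, apply sum, combine results.
totals = df.groupby("region")["sales"].sum()
print(totals)
```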
  7. Use Case 1: Determine metrics for sales data for a company/client using a common mapping (geography, age, etc.). Data: the data set is spread across numerous files with different file names, columns, data types, null values, and numeric and text data. How can you: (a) combine all the files quickly, (b) identify and clean erroneous data, (c) create a mapping that will link the data sets (all data links to location, age, etc.)? Solution: pandas! With Excel and an RDBMS this usually takes days or weeks; in pandas it took me a day, and with practice it could probably take a few hours. Added benefit: the code can be repurposed for all input data, whereas with an RDBMS you would need to rewrite the SQL (or related queries) for each particular database.* * There is also pandasql, which connects to a SQL instance; please see the pandas docs.
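Step (a), combining files with inconsistent column names, can be sketched like this; the "files", column names, and rename mapping are hypothetical placeholders for the real per-client inputs:

```python
import io
import pandas as pd

# Two "files" with differing column names for the same fields.
file_a = io.StringIO("cust_id,zip,amount\n1,20001,9.5\n")
file_b = io.StringIO("customer,zipcode,amt\n2,20002,4.0\n")

# Normalize column names, then stack everything into one frame.
rename_map = {"customer": "cust_id", "zipcode": "zip", "amt": "amount"}
frames = [pd.read_csv(f).rename(columns=rename_map) for f in (file_a, file_b)]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Because the cleanup lives in one `rename_map` dict, pointing the same script at a new batch of files only means extending the mapping, which is the repurposing benefit mentioned above.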
  8. This is what I got: files that were neither here nor there, files that were everywhere. You could not tell which was which, and the data quality was a B***.
  9. Demo. Original files: a sales txt file, a customer file, and an address book file. No files had headers; they had to be inserted using a main data definitions file. Outcome: one file showing sales by zip, region, province, and country, mapped to specific column types for Apache Hive.
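The demo's flow can be approximated as: assign headers from a definitions mapping, merge the three frames on their keys, then aggregate by location. All file contents, column names, and keys below are illustrative, not the talk's actual data:

```python
import io
import pandas as pd

# Stand-in for the main data definitions file: column names per file.
defs = {"sales": ["cust_id", "amount"],
        "customer": ["cust_id", "addr_id"],
        "address": ["addr_id", "zip", "region", "province", "country"]}

# Headerless "files"; header=None plus names= inserts the headers.
sales = pd.read_csv(io.StringIO("1,9.5\n2,4.0\n"),
                    header=None, names=defs["sales"])
customer = pd.read_csv(io.StringIO("1,10\n2,11\n"),
                       header=None, names=defs["customer"])
address = pd.read_csv(io.StringIO("10,20001,East,DC,US\n11,20002,East,DC,US\n"),
                      header=None, names=defs["address"])

# Join the three sources, then total sales by zip.
merged = sales.merge(customer, on="cust_id").merge(address, on="addr_id")
by_zip = merged.groupby("zip")["amount"].sum()
print(by_zip)
```

From `merged` you could also group by region, province, or country, and write the result out with `to_csv` for loading into Hive.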
  10. Exciting libraries: geo mapping, machine learning, and a growing set of external data sources.
  11. What else? You can write in Cython to maximize performance: up to 10x faster than pure Python, because pure Python calls a function on a Series once per row, whereas with Cython and NumPy you can pass whole ndarrays instead. Development roadmap:
     - (0.13) Improved SQL / relational database tools
     - Tools for working with data sets that do not fit into memory
     - (0.10) Better memory usage and performance when reading very large CSV files
     - Better statistical graphics using matplotlib
     - Integration with D3.js
     - Better support for integer NA values
     - Extend GroupBy functionality to regular ndarrays, record arrays
     - ✔ numpy.datetime64 integration, scikits.timeseries codebase integration; substantially improved time series functionality
     - ✔ Improved PyTables (HDF5) integration
     - ✔ NDFrame data structure for arbitrarily high-dimensional labeled data
     - ✔ Better support for NumPy dtype hierarchy without sacrificing usability
     - ✔ Add a Factor data type (in R parlance)
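The row-by-row vs. ndarray point above can be demonstrated without Cython at all, by contrasting `apply` along rows with a single vectorized operation on the underlying arrays (the data is made up; the speedup factor will vary by machine and data size):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1000.0), "b": np.arange(1000.0)})

# Slow path: one Python-level function call per row.
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Fast path: one vectorized operation over whole ndarrays.
fast = df["a"].values + df["b"].values

# Both paths compute the same result.
assert (slow.values == fast).all()
```

Cython pushes this further by compiling the per-element loop itself, but the principle is the same: operate on ndarrays, not on one row at a time.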