Building Data Apps with Python

District Data Labs Workshop
Current Workshop: August 23, 2014

Previous Workshops:
- April 5, 2014

Data products are usually software applications that derive their value from data by leveraging the data science pipeline, and that generate more data through their operation. They aren’t apps with data, nor are they one-time analyses that produce insights; they are operational and interactive. The rise of these applications has directly contributed to the rise of the data scientist and to the idea that data scientists are professionals “who are better at statistics than any software engineer and better at software engineering than any statistician.”

These applications have largely been built with Python. Python is flexible enough to support very fast development on many different types of servers, and it has a rich tradition in web applications. Python contributes to every stage of the data science pipeline, including real-time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll produce a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.

Published in: Technology

Transcript

  • 1. Building Data Products with Python District Data Labs
  • 2. Links to various resources ● Introduction to Python: http://bit.ly/1gJ73Tt ● Github Repository: http://bit.ly/1eLBzki
  • 3. About the Instructor Benjamin Bengfort Data Science: ● MS Computer Science from North Dakota State ● PhD Candidate in CS at the University of Maryland ● Data Scientist at Cobrain Company in Bethesda, MD ● Board member of Data Community DC ● Lecturer at Georgetown University Python Programmer: ● Python developer for 7 years ● Open source contributor ● My work on Github: https://github.com/bbengfort
  • 4. About the Instructor Benjamin Bengfort I am available to collaborate and answer questions for all of my students. Twitter: twitter.com/bbengfort LinkedIn: linkedin.com/in/bbengfort Github: github.com/bbengfort Email: benjamin@bengfort.com
  • 5. About the Teaching Assistant Keshav Magge ● MS Computer Science from University of Houston ● Lead Data/Software Engineer at Cobrain Company in Bethesda, MD Python Programmer: ● Python developer for 7 years ● Plone/Zope for 2 years, Django for 5 years ● My work on Github: https://github.com/keshavmagge
  • 6. About the Teaching Assistant Keshav Magge Reach out to me to talk about all things Python/data or just about life. Twitter: twitter.com/keshavmagge LinkedIn: linkedin.com/pub/keshav-magge/12/a2a/324/ Github: github.com/keshavmagge Email: keshav@keshavmagge.com
  • 7. Building Data Products
  • 8. Hilary Mason: “A data product is a product that is based on the combination of data and algorithms.”
  • 9. Mike Loukides: “A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.”
  • 10. The Data Science Pipeline
  • 11. Data Ingestion → Data Munging and Wrangling → Computation and Analyses → Modeling and Application → Reporting and Visualization
  • 12. Data Ingestion ● There is a world of data out there. How do we get it? Web crawlers, APIs, sensors? Python and other web scripting languages are custom-made for this task. ● The real question is how we can deal with such a giant volume and velocity of data. ● Big Data and Data Science often require ingestion specialists!
  • 13. Data Wrangling ● Warehousing the data means storing the data in as raw a form as possible. ● Extract, transform, and load operations move data to operational storage locations. ● Filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on. ● Annotated training sets must be created for ML tasks.
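The filtering, aggregation, and normalization steps named on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `RAW_REVIEWS` records, the field names, and the 1 to 5 rating scale are hypothetical and do not come from the workshop code.

```python
from collections import defaultdict

# Hypothetical raw records, stored as close to their ingested form as possible.
RAW_REVIEWS = [
    {"user": "ann", "book": "Dune", "rating": "5"},
    {"user": "bob", "book": "Dune", "rating": "3"},
    {"user": "ann", "book": "Emma", "rating": ""},  # missing rating
    {"user": "bob", "book": "Emma", "rating": "4"},
]


def transform(records):
    """Filter out rows without a rating and cast ratings to integers."""
    clean = []
    for row in records:
        if not row["rating"]:
            continue
        clean.append(
            {"user": row["user"], "book": row["book"], "rating": int(row["rating"])}
        )
    return clean


def mean_rating_by_book(records):
    """Aggregate: compute the average rating for each book."""
    totals = defaultdict(lambda: [0, 0])  # book -> [sum of ratings, count]
    for row in records:
        totals[row["book"]][0] += row["rating"]
        totals[row["book"]][1] += 1
    return {book: total / count for book, (total, count) in totals.items()}


def normalize(records, low=1, high=5):
    """Rescale ratings from the assumed [low, high] scale to [0, 1]."""
    return [dict(row, rating=(row["rating"] - low) / (high - low)) for row in records]
```

Calling `transform(RAW_REVIEWS)` first and then passing the result to `mean_rating_by_book` or `normalize` mirrors the move from raw storage to a form that can be computed on.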
  • 14. Computation and Analyses ● Hypothesis-driven computation includes design and development of predictive models. ● Many models have to be trained or constrained into a computational form like a graph database, and this is time-consuming. ● Other data products like indices, relations, classifications, and clusters may be computed.
  • 15. Modeling and Application This is the part we’re most familiar with: supervised classification and unsupervised clustering, using Bayes, logistic regression, decision trees, and other models. This is also where the money is.
  • 16. Reporting and Visualization ● Often overlooked, this part is crucial, even if we have data products. ● Humans recognize patterns better than machines. Human feedback is crucial in Active Learning and remodeling (error detection). ● Mashups and collaborations generate more data, and therefore more value!
  • 17. Don’t forget feedback! (Active Learning for Data Products)
  • 18. What we’re going to build today: SCIENCE BOOKCLUB!! ● A book club that chooses what to read via a recommender system. ● Uses GoodReads data to ingest and return feedback on books. ● The statistical model is a non-negative matrix factorization. ● Reporting using Jinja (almost a web app).
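The slide names non-negative matrix factorization as the statistical model but does not show how it is computed. The sketch below is one common way to do it with numpy, using the Lee and Seung multiplicative update rules; it is not necessarily the method used in the workshop. The ratings matrix, the function names `nmf` and `recommend`, and the choice to treat a 0 as "not yet read" are all assumptions made for illustration. Note that this simple version fits the zeros as if they were real ratings, which a production recommender would usually avoid.

```python
import numpy as np


def nmf(V, rank, iterations=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (users x books) into W @ H.

    Uses multiplicative updates, which keep W and H non-negative
    as long as they start non-negative.
    """
    rng = np.random.default_rng(seed)
    n_users, n_books = V.shape
    W = rng.random((n_users, rank)) + eps
    H = rng.random((rank, n_books)) + eps
    for _ in range(iterations):
        # eps in the denominators guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def recommend(V, W, H, user):
    """Return the index of the unrated book with the highest predicted score."""
    scores = W[user] @ H
    scores = np.where(V[user] == 0, scores, -np.inf)  # mask books already rated
    return int(np.argmax(scores))


# Hypothetical ratings: rows are readers, columns are books, 0 means unrated.
RATINGS = np.array(
    [
        [5, 5, 0, 1],
        [4, 5, 1, 0],
        [0, 1, 5, 4],
        [1, 0, 4, 5],
    ],
    dtype=float,
)
```

A typical call would be `W, H = nmf(RATINGS, rank=2)` followed by `recommend(RATINGS, W, H, user=0)`. The `rank` parameter sets how many latent "taste" factors the model uses, and it is a tuning choice rather than something the slides specify.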
  • 19. Workflow 1. Setting up a Python skeleton 2. Creating and Running Tests 3. Wading in with a configuration 4. Ingestion with urllib and requests 5. Creating a command line admin with argparse 6. Wrangling with BeautifulSoup and SQLAlchemy 7. Modeling with numpy 8. Reporting with Jinja2
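Step 5 of the workflow, a command line admin built with argparse, might look roughly like the sketch below. The program name `octavo-admin`, the subcommand names, their options, and the stub handlers are hypothetical; only the use of `argparse` comes from the slide. The handlers return strings instead of doing real work so the structure stays visible.

```python
import argparse


def handle_ingest(args):
    """Stub: a real handler would download and store the raw data."""
    return f"would fetch {args.url}"


def handle_recommend(args):
    """Stub: a real handler would load the model and rank books."""
    return f"would recommend {args.top} books for {args.user}"


def build_parser():
    """Build a parser with one subcommand per pipeline task."""
    parser = argparse.ArgumentParser(
        prog="octavo-admin",  # hypothetical program name
        description="Administrative commands for the book club pipeline.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    ingest = subparsers.add_parser("ingest", help="fetch raw data")
    ingest.add_argument("--url", required=True, help="source to fetch")
    ingest.set_defaults(handler=handle_ingest)

    recommend = subparsers.add_parser("recommend", help="suggest books")
    recommend.add_argument("--user", required=True, help="reader to recommend for")
    recommend.add_argument("--top", type=int, default=5, help="number of books")
    recommend.set_defaults(handler=handle_recommend)

    return parser


def main(argv=None):
    """Parse arguments and dispatch to the chosen subcommand's handler."""
    args = build_parser().parse_args(argv)
    return args.handler(args)
```

Passing an explicit list, as in `main(["recommend", "--user", "ann", "--top", "3"])`, makes the admin easy to exercise from tests (step 2 of the workflow) without touching `sys.argv`.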
  • 20. Octavo Architecture (really clear DSP) ● Ingestion Module (requests.py) ● Raw Data Storage ● Wrangling Module (BeautifulSoup, SQLAlchemy) ● Computational Data Storage ● Recommender Module (Numpy) ● Reporting Module (Jinja2, Matplotlib)
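To make the flow between the modules concrete, here is a minimal sketch of the architecture as four chained functions. It assumes the diagram is read as ingestion, raw storage, wrangling, computational storage, recommender, and reporting, in that order. Everything in it is a stand-in: the comma-separated input format is invented, a simple "best average rating among unread books" rule replaces the numpy model, and `string.Template` from the standard library replaces Jinja2.

```python
from string import Template


def ingest(source):
    """Ingestion module: collect raw payloads without modifying them."""
    return list(source)


def wrangle(raw_rows):
    """Wrangling module: parse hypothetical 'user,book,rating' lines."""
    rows = []
    for line in raw_rows:
        user, book, rating = line.strip().split(",")
        rows.append((user, book, int(rating)))
    return rows


def recommend(rows, user):
    """Recommender stand-in: best average rating among books the user has not rated."""
    seen = {book for reader, book, _ in rows if reader == user}
    ratings = {}
    for _, book, rating in rows:
        if book not in seen:
            ratings.setdefault(book, []).append(rating)
    if not ratings:
        return None
    return max(ratings, key=lambda book: sum(ratings[book]) / len(ratings[book]))


def report(user, book):
    """Reporting stand-in: render a one-line report from a template."""
    return Template("$user should read: $book").substitute(user=user, book=book)


def run_pipeline(source, user):
    """Chain the modules, with plain variables standing in for the two stores."""
    raw_store = ingest(source)
    computational_store = wrangle(raw_store)
    return report(user, recommend(computational_store, user))
```

Keeping each stage behind its own function means any one of them can be swapped for the real library (requests, BeautifulSoup, SQLAlchemy, numpy, Jinja2) without changing the others.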
  • 22. How to tackle this course ...
  • 23. How to tackle this course ... Lean into it: absorb as much as possible, and don’t worry about falling behind; it will be in your head! Afterwards, let’s all digest it together (keep in touch).
