There is often a divide between data teams, engineering, and product managers in organizations, but with the dawn of data-driven companies and applications, it is more pressing now than ever to be able to automate your analyses to personalize your users' experiences. LinkedIn's People You May Know, Netflix's and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.
As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general-purpose, performant programming language, its rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), its web frameworks and community, and its utilities and libraries for handling data at scale. In this talk I will walk through a fictional company bringing its first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, point out some of the pitfalls of putting models into production, and show how to make sure your users (and engineers) are as happy as they can be.
https://github.com/Jay-Oh-eN/pydatasv2014
Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014
1. Jonathan Dinu
Co-Founder, Zipfian Academy
jonathan@zipfianacademy.com
@clearspandex
@ZipfianAcademy
Data Engineering 101: Building your first
data product
May 4th, 2014
2. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
6. Disclaimer:
All characters appearing in this presentation are
fictitious. Any resemblance to real persons, living
or dead, is purely coincidental.
7. Today Disclaimer:
This presentation contains strong opinions that
you may or may not agree with. All thoughts are
my own.
Jonathan Dinu
Co-Founder, Zipfian Academy
jonathan@zipfianacademy.com
@clearspandex
8. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Creating Value for Users
• Q&A
9. nwsrdr (News Reader)
Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png
getnews.com/bookmarklet
Get nwsrdr on your desktop
When browsing the web, simply click +nwsrdr to save any page to nwsrdr
10. nwsrdr
• Auto-categorize Articles
• Find Similar Articles
• Recommend articles
• Suggest Feeds to Follow
• No Ads!
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
11. nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model!
12. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
20. Data Generating Products
Products that enhance a user’s experience the more “data” a user provides
Ex: Recommender Systems
21. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
30. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
31. What
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model
32. nwsrdr
• Auto-categorize Articles
• Find Similar Articles
• Recommend articles
• No Ads!
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
33. nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Naive Bayes (classification)
• Clustering (unsupervised learning)
• Collaborative Filtering
• Triangle Closing
• Real Business Model!
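44. # parse resulting JSON and insert into a MongoDB collection
# (api and collection come from earlier steps not shown here)
for content in api.json()['response']['docs']:
    # skip documents that are already stored
    if not collection.find_one(content):
        # insert_one replaces the deprecated collection.insert
        collection.insert_one(content)

# only returns 10 per page
print("There are only %i documents returned 0_o" %
      len(api.json()['response']['docs']))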
Acquire
45. # there are many more than 10 articles however
total_art = articles_left = api.json()['response']['meta']['hits']

print("There are currently %s articles in the NYT archive" % total_art)

#=> There are currently 15277775 articles in the NYT archive
Acquire
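Since each response carries only 10 documents while the hit count is far larger, collecting everything means walking the pages. The following is a minimal sketch of that loop, not the talk's own code: `fetch_page` is a hypothetical callable that returns the parsed JSON for one page in the same `response` / `docs` / `meta` / `hits` shape used above, and `fake_fetch_page` is a stand-in so the example runs without a network call.

```python
import math

PAGE_SIZE = 10  # the slide notes that only 10 documents come back per page


def iter_docs(fetch_page, page_size=PAGE_SIZE):
    """Yield every document by walking pages until the hit count is covered.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict.
    """
    first = fetch_page(0)
    total = first['response']['meta']['hits']
    yield from first['response']['docs']
    for page in range(1, math.ceil(total / page_size)):
        yield from fetch_page(page)['response']['docs']


def fake_fetch_page(page, total=25):
    """Stand-in for the real API call, returning the same response shape."""
    start = page * PAGE_SIZE
    docs = [{'_id': i} for i in range(start, min(start + PAGE_SIZE, total))]
    return {'response': {'docs': docs, 'meta': {'hits': total}}}


all_docs = list(iter_docs(fake_fetch_page))
print(len(all_docs))
```

A real client would likely also need rate limiting and error handling around each page request.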
65. Tokenize article text and
create feature vectors with NLTK
Vectorize
66. Vectorize
import nltk
from nltk import tokenize
from nltk.corpus import stopwords

wnl = nltk.WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once, not per word

def tokenize_and_normalize(chunks):
    # split the joined text into sentences, then each sentence into words
    words = [tokenize.word_tokenize(sent)
             for sent in tokenize.sent_tokenize("".join(chunks))]
    flatten = [inner for sublist in words for inner in sublist]
    stripped = []

    for word in flatten:
        if word not in stop_words:
            try:
                stripped.append(word.encode('latin-1').decode('utf8').lower())
            except UnicodeError:
                print("Cannot encode: " + word)

    # drop single-character tokens (mostly punctuation)
    no_punks = [word for word in stripped if len(word) > 1]
    return [wnl.lemmatize(t) for t in no_punks]
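The slide title mentions feature vectors, while the function above stops at normalized tokens. One common next step is a bag-of-words count vector. The sketch below uses only the standard library and is an assumption about what "feature vector" could mean here, not the talk's own code; `build_vocabulary` and `vectorize` are hypothetical names.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(tokens, vocabulary):
    """Return a list of term counts aligned with the vocabulary's columns."""
    counts = Counter(tokens)
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # ignore tokens unseen when the vocabulary was built
            vec[vocabulary[tok]] = n
    return vec


# token lists as they might come out of a tokenizer
docs = [['data', 'product', 'data'], ['news', 'reader', 'product']]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)
print(vectors)
```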
89. Immutable, append-only set of raw data
Computation is a view on data
*Lambda Architecture by Nathan Marz
Pipeline
90. Functional Data Science
• Modularity
• Define interfaces
• Separate data from computation
• Data Lineage
Functional
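One way to picture these ideas: raw events are only ever appended, and every derived result is a pure function (a "view") recomputed from them. This is a minimal sketch under that reading, not the pipeline from the talk; `Event`, `RawLog`, `clean`, and `count_by_category` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """One immutable raw record."""
    user: str
    category: str


class RawLog:
    """Append-only store: records can be added but never changed or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def snapshot(self):
        # hand out an immutable copy so computations cannot mutate raw data
        return tuple(self._events)


def clean(events):
    """Pure step: normalize categories without touching the input."""
    return tuple(e._replace(category=e.category.strip().lower()) for e in events)


def count_by_category(events):
    """Pure step: a view computed entirely from its input."""
    return Counter(e.category for e in events)


log = RawLog()
log.append(Event('ann', ' Tech'))
log.append(Event('bob', 'tech'))
log.append(Event('ann', 'Sports '))

raw = log.snapshot()
view = count_by_category(clean(raw))
print(view)
```

Because each step takes data in and returns new data, a step can be swapped or re-run on the full raw log without side effects, which is what makes lineage traceable.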
91. Need Robust and Flexible Pipeline!
Pipeline
92. Whatever you do, DO NOT cross the streams
Pipeline
94. Gotchas!
• Only have a static subset of articles
• Pipeline not automated for re-training
Gotchas
95. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
98. How to Scale
1. Start small (data) and fast (development)
2. Increase size of data set
3. Optimize and productionize
4. PROFIT! $$$
(testing between steps)
99. How to Scale
1. Develop locally
2. Distribute computation (run on cluster)
3. Tune parameters
4. PROFIT! $$$
(testing between steps)
Can also use a streaming algorithm or single-machine, disk-based “medium data” technologies (i.e. database or memory-mapped files)
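As an illustration of the streaming option, here is a sketch of Welford's online algorithm, which computes a mean and variance in one pass while holding only three numbers in memory. It is one example of a streaming algorithm, not something named in the talk, and the `word_counts` stream is hypothetical.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def word_counts():
    """Hypothetical stream; in practice this could read a file line by line."""
    yield from (2, 4, 4, 4, 5, 5, 7, 9)


stats = RunningStats()
for value in word_counts():
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Because the values are consumed one at a time, the same loop works whether the stream has eight items or far more than fit in memory.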
102. Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
107. Data Sources
Obtain
(ranked by ease of use)
1. DaaS -- Data as a service
2. Bulk Download
3. APIs
4. Web Scraping
108. DaaS
(Data as a Service)
• Time Series/Numeric: Quandl
• Financial Modeling: Quantopian
• Email Contextualization: Rapleaf
• Location and POI: Factual
Data Sources
109. Bulk Download
(just like the good ol’ days)
• File Transfer Protocol (FTP): CDC
• Amazon Web Services: Public Datasets
• Infochimps: Data Marketplace
• Academia: UCI Machine Learning Repository
Data Sources
110. APIs
(if it’s not RESTed, I’m not buying)
• Geographic: Foursquare
• Social: Facebook
• Audio: Rdio
• Content: Tumblr
• Realtime: Twitter
• Hidden: Yahoo Finance
Data Sources
111. Web Scraping
(DIY for life)
1. wget and curl
2. Web Spider/Crawler
3. API scraping
4. Manual Download
Data Sources
112. • Delimited Values
• TSV
• CSV
• WSV
• JSON
• XML
• Ad Hoc Formats (avoid these if you can)
Data Formats
113. • JSON is made up of hash tables and arrays
• Hash tables: { "foo": 1, "bar": 2, "baz": "3" }
• Arrays: [1, 2, 3]
• Arrays of arrays: [[1, 2, 3], ["foo", "bar", "baz"]]
• Array of hashes: [{"foo": 1, "bar": 2}, {"baz": 3}]
• Hashes of hashes: {"foo": {"bar": 2, "baz": 3}}
Data Formats
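In Python, the standard `json` module maps these structures directly: a JSON hash table becomes a `dict` and a JSON array becomes a `list`. A brief sketch with a made-up document (the keys and values are illustrative only):

```python
import json

text = '{"foo": {"bar": 2, "baz": [1, 2, 3]}, "items": [{"id": 1}, {"id": 2}]}'

data = json.loads(text)          # JSON object -> dict, JSON array -> list
print(type(data).__name__)
print(data['foo']['baz'][0])     # index into a hash of hashes, then an array
ids = [item['id'] for item in data['items']]  # walk an array of hashes

# keys must be double-quoted strings; single quotes are not valid JSON
try:
    json.loads("{'foo': 1}")
    valid = True
except json.JSONDecodeError:
    valid = False

round_trip = json.dumps(data, sort_keys=True)  # back to JSON text
print(ids, valid)
```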
121. Programming languages like
Python, Ruby, and R have built-in
or readily available parsers for
data formats such as JSON and CSV.
For other esoteric formats you will
probably have to write your own.
Data Formats
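For delimited files, Python's standard `csv` module handles the cases that a plain `split(',')` gets wrong, such as a quoted field containing a comma. A small sketch with made-up rows; `parse_delimited` is a hypothetical helper name:

```python
import csv
import io

csv_text = 'title,section\n"Data, Engineered",tech\nLocal News,metro\n'
tsv_text = 'title\tsection\nData Products\ttech\n'


def parse_delimited(text, delimiter=','):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = parse_delimited(csv_text)
tsv_rows = parse_delimited(tsv_text, delimiter='\t')
print(csv_rows[0]['title'])
print(tsv_rows[0]['section'])
```

The same reader covers CSV and TSV by changing only the delimiter, so one code path can serve several of the delimited formats.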