Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014
Upcoming SlideShare
Loading in...5
×
 

Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

on

  • 906 views

Often times there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data driven companies/applications, it is more prescient now than ever to ...

Often times there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data driven companies/applications, it is more prescient now than ever to be able to automate your analyses to personalize your users experiences. LinkedIn's People you May Know, Netflix and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.

As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general purpose performant programming language, rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), web frameworks/community, and utilities/libraries for handling data at scale. In this talk I will walk through a fictional company bringing it's first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, cover some of the pitfalls of what happens when you put models into production, and how to make sure your users (and engineers) are as happy as they can be.

https://github.com/Jay-Oh-eN/pydatasv2014

Statistics

Views

Total Views
906
Views on SlideShare
877
Embed Views
29

Actions

Likes
6
Downloads
21
Comments
0

2 Embeds 29

https://twitter.com 27
https://www.linkedin.com 2

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014 Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014 Presentation Transcript

  • Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex @ZipfianAcademy Data Engineering 101: Building your first data product May 4th, 2014
  • Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  • Formerly Questions? tweet @zipfianacademy #pydata
  • Formerly Questions? tweet @zipfianacademy #pydata
  • Currently Questions? tweet @zipfianacademy #pydata
  • Today Disclaimer: All characters appearing in this presentation are fictitious. Any resemblance to real persons, living or dead, is purely coincidental. Questions? tweet @zipfianacademy #pydata
  • Today Disclaimer: This presentation contains strong opinions that you may or may not agree with. All thoughts are my own. Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex Questions? tweet @zipfianacademy #pydata
  • Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • CreatingValue for Users • Q&A Questions? tweet @zipfianacademy #pydata
  • nwsrdr (News Reader) Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark- Button.png OR nwsrdr + nwrsrdr + nwrsrdr + nwrsrdr nwsrdr getnews.com/bookmarklet When browsing the web simply click the +nwsrdr to save any page to nwsrdr Get nwsrdr on your desktop Questions? tweet @zipfianacademy #pydata
  • nwsrdr • Auto-categorize Articles • Find Similar Articles • Recommend articles • Suggest Feeds to Follow • No Ads! It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! Questions? tweet @zipfianacademy #pydata
  • nwsrdr It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! • Naive Bayes (classification) • Clustering (unsupervised learning) • Collaborative Filtering • Triangle Closing • Real Business Model! Questions? tweet @zipfianacademy #pydata
  • Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  • OR Data Products Product Built on Data (that you sell) Questions? tweet @zipfianacademy #pydata
  • OR Data Products Product that Generates Data Questions? tweet @zipfianacademy #pydata
  • OR Data Products Product that Generates Data (that you sell) Questions? tweet @zipfianacademy #pydata
  • OR Data Products Product that Generates Data (that you sell) i.e. Facebook Questions? tweet @zipfianacademy #pydata
  • OR Data Products Questions? tweet @zipfianacademy #pydata Source: http://gifgif.media.mit.edu/
  • OR Data Products Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata
  • OR Data Generating Products Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata
  • Products that enhance a users’ experience the more “data” a user provides Data Generating Products Ex: Recommender Systems Questions? tweet @zipfianacademy #pydata
  • Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  • OR Data Science Questions? tweet @zipfianacademy #pydata
  • i.e. solve more problems than you create Data Science Questions? tweet @zipfianacademy #pydata
  • Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan- Meme.jpg But.... How?!?!?!!? Data Science Questions? tweet @zipfianacademy #pydata
  • Data Engineering Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer- It-1350063721.PNG Questions? tweet @zipfianacademy #pydata
  • Data Engineering Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer- It-1350063721.PNG ! Questions? tweet @zipfianacademy #pydata
  • OR Data Engineering Questions? tweet @zipfianacademy #pydata
  • Prepared Data Test Set Training Set Train Model Sampling Evaluate Cross Validation Data Science Questions? tweet @zipfianacademy #pydata
  • Raw Data Cleaned Data Scrubbing Prepared DataVectorization New Data Test Set Training Set Train Model Sampling Evaluate Cross Validation Cleaned Data Prepared DataVectorizationScrubbing Predict Labels/ Classes Data Engineering Questions? tweet @zipfianacademy #pydata
  • Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  • What • Naive Bayes (classification) • Clustering (unsupervised learning) • Collaborative Filtering • Triangle Closing • Real Business Model Questions? tweet @zipfianacademy #pydata
  • nwsrdr • Auto-categorize Articles • Find Similar Articles • Recommend articles • No Ads! It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! Questions? tweet @zipfianacademy #pydata
  • nwsrdr It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! • Naive Bayes (classification) • Clustering (unsupervised learning) • Collaborative Filtering • Triangle Closing • Real Business Model! Questions? tweet @zipfianacademy #pydata
  • Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg Abstraction (Cake) How (ABK) Questions? tweet @zipfianacademy #pydata
  • Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  • Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  • Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo Flask yHat scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  • Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo Flask yHat At Scale Locally scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) Flask yHat scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  • Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  • Pipeline Iteration 0: • Find out how much data • Run locally • Experiment Questions? tweet @zipfianacademy #pydata
  • Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Acquire
  • Retrieve Meta-data for ALL NYT articles Questions? tweet @zipfianacademy #pydata Acquire
  • api_key='xxxxxxxxxxxxx'! ! ! ! url = 'http://api.nytimes.com/svc/search/v2/ articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key! ! ! ! # make an API request! api = requests.get(url)! Questions? tweet @zipfianacademy #pydata Acquire
  • # parse resulting JSON and insert into a mongoDB collection! for content in api.json()['response']['docs']:! if not collection.find_one(content):! collection.insert(content)! ! ! # only returns 10 per page! "There are only %i docuemtns returned 0_o" % ! ! len(api.json()[‘response']['docs'])! Questions? tweet @zipfianacademy #pydata Acquire
  • # there are many more than 10 articles however! total_art = articles_left = api.json()['response']['meta']['hits']! ! ! print "There are currently %s articles in the NYT archive" % total_art! ! ! #=> There are currently 15277775 articles in the NYT archive Questions? tweet @zipfianacademy #pydata Acquire
  • Gotchas! • Rate Limiting • Page Limiting Questions? tweet @zipfianacademy #pydata Acquire
  • Iterate Iteration 1: • (Meaningful) Sample of Data • Prototype — “Close the Loop” Questions? tweet @zipfianacademy #pydata
  • Retrieve Meta-data for ALL NYT articles Questions? tweet @zipfianacademy #pydata Acquire (take 2)
  • # let us loop (and hopefully not hit our rate limit)! while articles_left > 0 and page_count < max_pages:! more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))! print "Inserting page " + str(page)! # make sure it was successful! if more_articles.status_code == 200:! for content in more_articles.json()['response']['docs']:! latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")! if not collection.find_one(content) and content['document_type'] == 'article':! print "No dups"! try:! print "Inserting article " + str(content['headline'])! collection.insert(content)! except errors.DuplicateKeyError:! print "Duplicates"! continue! else:! print "In collection already”! ! ! … Iteration 0.5 Questions? tweet @zipfianacademy #pydata Acquire
  • articles_left -= 10! page += 1! page_count += 1! cursor_count += 1! final_page = max(final_page, page)! else:! if more_articles.status_code == 403:! print "Sleepy..."! # account for rate limiting! time.sleep(2)! elif cursor_count > 100:! print "Adjusting date”! ! ! ! ! # account for page limiting! cursor_count = 0! page = 0! last_date = latest_article! else:! print "ERRORS: " + str(more_articles.status_code)! cursor_count = 0! page = 0! last_date = latest_article! Questions? tweet @zipfianacademy #pydata Acquire
  • Download HTML content of articles from NYT.com Questions? tweet @zipfianacademy #pydata Acquire (and store in MongoDB™)
  • Acquire # now we can get some content!! #limit = 100! limit = 10000! ! for article in collection.find({'html' : {'$exists' : False} }):! if limit and limit > 0:! if not article.has_key('html') and article['document_type'] == 'article':! limit -= 1! print article['web_url']! html = requests.get(article['web_url'] + "?smid=tw-nytimes")! ! if html.status_code == 200:! soup = BeautifulSoup(html.text)! ! # serialize html! collection.update({ '_id' : article['_id'] }, { '$set' : ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'html' : unicode(soup), 'content' : [] } ! ! ! ! ! ! ! ! ! ! ! ! ! } )! ! for p in soup.find_all('div', class_='articleBody'):! collection.update({ '_id' : article['_id'] }, { '$push' : ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() ! ! ! ! ! ! ! ! ! ! ! ! ! ! } })! Questions? tweet @zipfianacademy #pydata
  • Parse Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo scikit-learn/NLTK Flask yHat Locally Questions? tweet @zipfianacademy #pydata
  • Parse HTML with BeautifulSoup and Extract the article Body Questions? tweet @zipfianacademy #pydata (and store in MongoDB™) Parse
  • # parse HTML content of articles! for article in collection.find({'html' : {'$exists' : True} }):! print article['web_url']! soup = BeautifulSoup(article['html'], 'html.parser')! arts = soup.find_all('div', class_='articleBody')! ! if len(arts) == 0:! arts = soup.find_all('p', class_=‘story-body-text')! ! ! ! … Questions? tweet @zipfianacademy #pydata Parse
  • Store Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  • for p in arts:! collection.update({ '_id' : article['_id'] }, { '$push' : ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() } ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! })! Questions? tweet @zipfianacademy #pydata Store
  • Explore Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  • Exploratory Data Analysis with pandas Questions? tweet @zipfianacademy #pydata Explore
  • articles.describe()! # ! ! text section! # count 1405 1405! # unique 1397 10! ! fig = plt.figure()! # histogram of section counts! articles['section'].value_counts().plot(kind='bar') Questions? tweet @zipfianacademy #pydata Explore
  • Questions? tweet @zipfianacademy #pydata Explore
  • error with NYT API Questions? tweet @zipfianacademy #pydata Explore
  • api_key='xxxxxxxxxxxxx'! ! ! ! url = 'http://api.nytimes.com/svc/search/v2/ articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key! ! ! ! # make an API request! api = requests.get(url)! Questions? tweet @zipfianacademy #pydata Explore error with NYT API
  • Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Vectorize
  • Tokenize article text and create feature vectors with NLTK Questions? tweet @zipfianacademy #pydata Vectorize
  • Vectorize wnl = nltk.WordNetLemmatizer()! ! def tokenize_and_normalize(chunks):! words = [ tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize("".join(chunks)) ]! flatten = [ inner for sublist in words for inner in sublist ]! stripped = [] ! ! for word in flatten: ! if word not in stopwords.words('english'):! try:! stripped.append(word.encode('latin-1').decode('utf8').lower())! except:! print "Cannot encode: " + word! ! no_punks = [ word for word in stripped if len(word) > 1 ] ! return [wnl.lemmatize(t) for t in no_punks]! Questions? tweet @zipfianacademy #pydata
  • Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Train
  • Train and score a model with scikit-learn Questions? tweet @zipfianacademy #pydata Train
  • # cross validate! from sklearn.cross_validation import train_test_split! ! xtrain, xtest, ytrain, ytest = ! ! ! ! ! ! ! ! train_test_split(X, labels, test_size=0.3)! ! # train a model! alpha = 1! multi_bayes = MultinomialNB(alpha=alpha)! ! multi_bayes.fit(xtrain, ytrain)! multi_bayes.score(xtest, ytest) Questions? tweet @zipfianacademy #pydata Train
  • Gotchas! • Model only exists locally on Laptop • Not Automated for realtime prediction Questions? tweet @zipfianacademy #pydata Train
  • Exposé Questions? tweet @zipfianacademy #pydata
  • Iteration 2: • Expose your model • Automate your processes Questions? tweet @zipfianacademy #pydata Exposé
  • Getting that model off your lap(top) Questions? tweet @zipfianacademy #pydata Exposé
  • Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/ a_560x0.jpg Questions? tweet @zipfianacademy #pydata Exposé
  • A model is just a function Questions? tweet @zipfianacademy #pydata Exposé
  • Inputs... Questions? tweet @zipfianacademy #pydata Exposé
  • Outputs... Questions? tweet @zipfianacademy #pydata Exposé
  • Serialize your model with pickle (or cPickle or joblib) Questions? tweet @zipfianacademy #pydata Persistence
  • Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/ g-6mevh13be74mgnc9i8qifa0 Persistence Questions? tweet @zipfianacademy #pydata
  • Persistence SerDes • Disk • Database • Memory Questions? tweet @zipfianacademy #pydata
  • Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Exposé
  • Deploy your Model to yHat Questions? tweet @zipfianacademy #pydata Exposé
  • class DocumentClassifier(YhatModel):! @preprocess(in_type=dict, out_type=dict)! def execute(self, data):! featureBody = vectorizer.transform([data['content']])! result = multi_bayes.predict(featureBody)! list_res = result.tolist()! return {"section_name": list_res}! ! clf = DocumentClassifier()! yh = Yhat("jonathan@zipfianacademy.com", “xxxxxx",! ! ! ! ! ! ! ! ! ! ! ! ! ! "http://cloud.yhathq.com/")! yh.deploy("documentClassifier", DocumentClassifier, globals()) Questions? tweet @zipfianacademy #pydata Exposé
  • Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask (on Heroku) yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Present
  • Create a Flask application to expose your model on the web Questions? tweet @zipfianacademy #pydata Present
  • yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/") ! @app.route('/') def index(): return app.send_static_file('index.html') ! @app.route('/predict', methods=['POST']) def predict(): article = request.form['article'] results = yh.predict("documentClf", { 'content': article }) return jsonify({"results": results}) Questions? tweet @zipfianacademy #pydata Present
  • Pipeline Only Data should Flow Questions? tweet @zipfianacademy #pydata
  • Data Remember to Remember (Lineage) Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation Questions? tweet @zipfianacademy #pydata Pipeline
  • Immutable append only set of Raw Data Computation is a view on data *Lambda Architecture by Nathan MarzQuestions? tweet @zipfianacademy #pydata Pipeline
  • Functional Data Science • Modularity • Define interfaces • Separate data from computation • Data Lineage Functional Questions? tweet @zipfianacademy #pydata
  • Need Robust and Flexible Pipeline! Questions? tweet @zipfianacademy #pydata Pipeline
  • Whatever you do, DO NOT cross the streams Questions? tweet @zipfianacademy #pydata Pipeline
  • NYT API MongoDB BeautifulSoup Feature Matrixscikit-learn Web App Model Deploy yHat Heroku POST Predict Predicted Section Where we are NLTK scikit-learn Questions? tweet @zipfianacademy #pydata
  • Gotchas! • Only have a static subset of articles • Pipeline not automated for re-training Questions? tweet @zipfianacademy #pydata Gotchas
  • Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  • Iteration 3: Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.pngQuestions? tweet @zipfianacademy #pydata Iterate
  • NYT API MongoDB cron Feature Matrixscikit-learn Web App Model Deploy yHat Heroku POST Predict Predicted Section Where we are NLTK scikit-learn Questions? tweet @zipfianacademy #pydata Amazon EC2
  • testing Start small (data) and fast (development) testing Increase size of data set Optimize and productionize PROFIT! $$$ Questions? tweet @zipfianacademy #pydata How to Scale
  • How to Scale testing Develop locally testing Distribute computation (run on cluster) Tune parameters PROFIT! $$$ Questions? tweet @zipfianacademy #pydata Can also use a streaming algorithm or single machine disk based “medium data” technologies (i.e. database or memory mapped files)
  • Products If you build it... Questions? tweet @zipfianacademy #pydata
  • Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg Products Questions? tweet @zipfianacademy #pydata
  • Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  • Q & A Q&A Questions? tweet @zipfianacademy #pydata
  • Zipfian Academy @ZipfianAcademy Data Science & Data Engineering 12-week Bootcamp (May 12th & Sep 8th) Weekend Workshops http://zipfianacademy.com/apply http://zipfianacademy.com/workshops Next: InteractiveVisualizations w/ d3.js (June 7th) Questions? tweet @zipfianacademy #pydata
  • Thank You! Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex @ZipfianAcademy http://zipfianacademy.com Questions? tweet @zipfianacademy #pydata
  • Appendix Questions? tweet @zipfianacademy #pydata
  • Data Sources Obtain (ranked by ease of use) 1. DaaS -- Data as a service 2. Bulk Download 3. APIs 4. Web Scraping Questions? tweet @zipfianacademy #pydata
  • DaaS (Data as a Service) •Time Series/Numeric: Quandl • Financial Modeling: Quantopian • Email Contextualization: Rapleaf • Location and POI: Factual Data Sources Questions? tweet @zipfianacademy #pydata
  • Bulk Download (just like the good ol’ days) • File Transfer Protocol (FTP): CDC •Amazon Web Services: Public Datasets • Infochimps: Data Marketplace •Academia: UCI Machine Learning Repository Data Sources Questions? tweet @zipfianacademy #pydata
  • APIs (if it’s not RESTed, I’m not buying) • Geographic: Foursquare • Social: Facebook •Audio: Rdio • Content:Tumblr • Realtime:Twitter • Hidden:Yahoo Finance Data Sources Questions? tweet @zipfianacademy #pydata
  • Web Scraping 1. wget and curl 2. Web Spider/Crawler 3. API scraping 4. Manual Download (DIY for life) Data Sources Questions? tweet @zipfianacademy #pydata
  • • DelimitedValues • TSV • CSV • WSV • JSON • XML • Ad Hoc Formats (avoid these if you can) Data Formats Questions? tweet @zipfianacademy #pydata
  • • JSON is made up of hash tables and arrays • Hash tables: { “foo” : 1, “bar” : 2, baz : “3” } • Arrays: [1, 2, 3] • Arrays of arrays: [[1, 2, 3], [‘foo’, ‘bar’, ‘baz’]] • Array of hashes: [{‘foo’:1, ‘bar’:2}, {‘baz’:3}] • Hashes of hashes: {‘foo’: {‘bar’: 2, ‘baz’: 3}} Questions? tweet @zipfianacademy #pydata Data Formats
  • {"widget": {! "debug": "on",! "window": {! "title": "Sample Konfabulator Widget",! "name": "main_window",! "width": 500,! "height": 500! },! "image": { ! "src": "Images/Sun.png",! "name": "sun1",! "hOffset": 250,! "vOffset": 250,! "alignment": "center"! },! "text": {! "data": "Click Here",! "size": 36,! "style": "bold",! "name": "text1",! "hOffset": 250,! "vOffset": 100,! "alignment": "center",! "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"! }! }} ! Questions? tweet @zipfianacademy #pydata Data Formats
  • • XML is a recursive self-describing container <container> <item>Foo</item> <item>Bar</item> <container> <item attr=”SomethingAboutBaz”>Baz</item> </container> </item> <container> Questions? tweet @zipfianacademy #pydata Data Formats
  • <widget>! <debug>on</debug>! <window title="Sample Konfabulator Widget">! <name>main_window</name>! <width>500</width>! <height>500</height>! </window>! <image src="Images/Sun.png" name="sun1">! <hOffset>250</hOffset>! <vOffset>250</vOffset>! <alignment>center</alignment>! </image>! <text data="Click Here" size="36" style="bold">! <name>text1</name>! <hOffset>250</hOffset>! <vOffset>100</vOffset>! <alignment>center</alignment>! <onMouseUp>! sun1.opacity = (sun1.opacity / 100) * 90;! </onMouseUp>! </text>! </widget>! Questions? tweet @zipfianacademy #pydata Data Formats
  • • Ad hoc data formats • Fixed-width (Census data) • Graph Edgelists • Voting records • etc. Questions? tweet @zipfianacademy #pydata Data Formats
  • • 7-5-5 format •Sam foo bar! •Roger baz 6! •Jane 314 99 Questions? tweet @zipfianacademy #pydata Data Formats
  • • Directed Graph Format 1 2! 1 3! 1 4! 2 3! 4 4 Questions? tweet @zipfianacademy #pydata Data Formats
  • • Directed Graph Format 1 2! 1 3! 1 4! 2 3! 4 4 Questions? tweet @zipfianacademy #pydata Data Formats
  • Programming languages like Python, Ruby, and R have built in parsers for data formats such as JSON and CSV. For other esoteric formats you will probably have to write your own Questions? tweet @zipfianacademy #pydata Data Formats