Successfully reported this slideshow.

Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

3,737 views

Published on

Often times there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data driven companies/applications, it is more prescient now than ever to be able to automate your analyses to personalize your users experiences. LinkedIn's People you May Know, Netflix and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.

As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general purpose performant programming language, rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), web frameworks/community, and utilities/libraries for handling data at scale. In this talk I will walk through a fictional company bringing it's first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, cover some of the pitfalls of what happens when you put models into production, and how to make sure your users (and engineers) are as happy as they can be.

https://github.com/Jay-Oh-eN/pydatasv2014

Published in: Technology, Education

Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014

  1. 1. Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex @ZipfianAcademy Data Engineering 101: Building your first data product May 4th, 2014
  2. 2. Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  3. 3. Formerly Questions? tweet @zipfianacademy #pydata
  4. 4. Formerly Questions? tweet @zipfianacademy #pydata
  5. 5. Currently Questions? tweet @zipfianacademy #pydata
  6. 6. Today Disclaimer: All characters appearing in this presentation are fictitious. Any resemblance to real persons, living or dead, is purely coincidental. Questions? tweet @zipfianacademy #pydata
  7. 7. Today Disclaimer: This presentation contains strong opinions that you may or may not agree with. All thoughts are my own. Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex Questions? tweet @zipfianacademy #pydata
  8. 8. Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • CreatingValue for Users • Q&A Questions? tweet @zipfianacademy #pydata
  9. 9. nwsrdr (News Reader) Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark- Button.png OR nwsrdr + nwrsrdr + nwrsrdr + nwrsrdr nwsrdr getnews.com/bookmarklet When browsing the web simply click the +nwsrdr to save any page to nwsrdr Get nwsrdr on your desktop Questions? tweet @zipfianacademy #pydata
  10. 10. nwsrdr • Auto-categorize Articles • Find Similar Articles • Recommend articles • Suggest Feeds to Follow • No Ads! It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! Questions? tweet @zipfianacademy #pydata
  11. 11. nwsrdr It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! • Naive Bayes (classification) • Clustering (unsupervised learning) • Collaborative Filtering • Triangle Closing • Real Business Model! Questions? tweet @zipfianacademy #pydata
  12. 12. Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  13. 13. OR Data Products Product Built on Data (that you sell) Questions? tweet @zipfianacademy #pydata
  14. 14. OR Data Products Product that Generates Data Questions? tweet @zipfianacademy #pydata
  15. 15. OR Data Products Product that Generates Data (that you sell) Questions? tweet @zipfianacademy #pydata
  16. 16. OR Data Products Product that Generates Data (that you sell) i.e. Facebook Questions? tweet @zipfianacademy #pydata
  17. 17. OR Data Products Questions? tweet @zipfianacademy #pydata Source: http://gifgif.media.mit.edu/
  18. 18. OR Data Products Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata
  19. 19. OR Data Generating Products Source: http://www.adamlaiacano.com/post/57703317453/data-generating-productsQuestions? tweet @zipfianacademy #pydata
  20. 20. Products that enhance a users’ experience the more “data” a user provides Data Generating Products Ex: Recommender Systems Questions? tweet @zipfianacademy #pydata
  21. 21. Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  22. 22. OR Data Science Questions? tweet @zipfianacademy #pydata
  23. 23. i.e. solve more problems than you create Data Science Questions? tweet @zipfianacademy #pydata
  24. 24. Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan- Meme.jpg But.... How?!?!?!!? Data Science Questions? tweet @zipfianacademy #pydata
  25. 25. Data Engineering Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer- It-1350063721.PNG Questions? tweet @zipfianacademy #pydata
  26. 26. Data Engineering Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer- It-1350063721.PNG ! Questions? tweet @zipfianacademy #pydata
  27. 27. OR Data Engineering Questions? tweet @zipfianacademy #pydata
  28. 28. Prepared Data Test Set Training Set Train Model Sampling Evaluate Cross Validation Data Science Questions? tweet @zipfianacademy #pydata
  29. 29. Raw Data Cleaned Data Scrubbing Prepared DataVectorization New Data Test Set Training Set Train Model Sampling Evaluate Cross Validation Cleaned Data Prepared DataVectorizationScrubbing Predict Labels/ Classes Data Engineering Questions? tweet @zipfianacademy #pydata
  30. 30. Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  31. 31. What • Naive Bayes (classification) • Clustering (unsupervised learning) • Collaborative Filtering • Triangle Closing • Real Business Model Questions? tweet @zipfianacademy #pydata
  32. 32. nwsrdr • Auto-categorize Articles • Find Similar Articles • Recommend articles • No Ads! It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! Questions? tweet @zipfianacademy #pydata
  33. 33. nwsrdr It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious! • Naive Bayes (classification) • Clustering (unsupervised learning) • Collaborative Filtering • Triangle Closing • Real Business Model! Questions? tweet @zipfianacademy #pydata
  34. 34. Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg Abstraction (Cake) How (ABK) Questions? tweet @zipfianacademy #pydata
  35. 35. Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  36. 36. Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  37. 37. Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo Flask yHat scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  38. 38. Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pandas pymongo Flask yHat At Scale Locally scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) Flask yHat scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  39. 39. Obligatory Name Drop Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  40. 40. Pipeline Iteration 0: • Find out how much data • Run locally • Experiment Questions? tweet @zipfianacademy #pydata
  41. 41. Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Acquire
  42. 42. Retrieve Meta-data for ALL NYT articles Questions? tweet @zipfianacademy #pydata Acquire
  43. 43. api_key='xxxxxxxxxxxxx'! ! ! ! url = 'http://api.nytimes.com/svc/search/v2/ articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key! ! ! ! # make an API request! api = requests.get(url)! Questions? tweet @zipfianacademy #pydata Acquire
  44. 44. # parse resulting JSON and insert into a mongoDB collection! for content in api.json()['response']['docs']:! if not collection.find_one(content):! collection.insert(content)! ! ! # only returns 10 per page! "There are only %i docuemtns returned 0_o" % ! ! len(api.json()[‘response']['docs'])! Questions? tweet @zipfianacademy #pydata Acquire
  45. 45. # there are many more than 10 articles however! total_art = articles_left = api.json()['response']['meta']['hits']! ! ! print "There are currently %s articles in the NYT archive" % total_art! ! ! #=> There are currently 15277775 articles in the NYT archive Questions? tweet @zipfianacademy #pydata Acquire
  46. 46. Gotchas! • Rate Limiting • Page Limiting Questions? tweet @zipfianacademy #pydata Acquire
  47. 47. Iterate Iteration 1: • (Meaningful) Sample of Data • Prototype — “Close the Loop” Questions? tweet @zipfianacademy #pydata
  48. 48. Retrieve Meta-data for ALL NYT articles Questions? tweet @zipfianacademy #pydata Acquire (take 2)
  49. 49. # let us loop (and hopefully not hit our rate limit)! while articles_left > 0 and page_count < max_pages:! more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))! print "Inserting page " + str(page)! # make sure it was successful! if more_articles.status_code == 200:! for content in more_articles.json()['response']['docs']:! latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")! if not collection.find_one(content) and content['document_type'] == 'article':! print "No dups"! try:! print "Inserting article " + str(content['headline'])! collection.insert(content)! except errors.DuplicateKeyError:! print "Duplicates"! continue! else:! print "In collection already”! ! ! … Iteration 0.5 Questions? tweet @zipfianacademy #pydata Acquire
  50. 50. articles_left -= 10! page += 1! page_count += 1! cursor_count += 1! final_page = max(final_page, page)! else:! if more_articles.status_code == 403:! print "Sleepy..."! # account for rate limiting! time.sleep(2)! elif cursor_count > 100:! print "Adjusting date”! ! ! ! ! # account for page limiting! cursor_count = 0! page = 0! last_date = latest_article! else:! print "ERRORS: " + str(more_articles.status_code)! cursor_count = 0! page = 0! last_date = latest_article! Questions? tweet @zipfianacademy #pydata Acquire
  51. 51. Download HTML content of articles from NYT.com Questions? tweet @zipfianacademy #pydata Acquire (and store in MongoDB™)
  52. 52. Acquire # now we can get some content!! #limit = 100! limit = 10000! ! for article in collection.find({'html' : {'$exists' : False} }):! if limit and limit > 0:! if not article.has_key('html') and article['document_type'] == 'article':! limit -= 1! print article['web_url']! html = requests.get(article['web_url'] + "?smid=tw-nytimes")! ! if html.status_code == 200:! soup = BeautifulSoup(html.text)! ! # serialize html! collection.update({ '_id' : article['_id'] }, { '$set' : ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'html' : unicode(soup), 'content' : [] } ! ! ! ! ! ! ! ! ! ! ! ! ! } )! ! for p in soup.find_all('div', class_='articleBody'):! collection.update({ '_id' : article['_id'] }, { '$push' : ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() ! ! ! ! ! ! ! ! ! ! ! ! ! ! } })! Questions? tweet @zipfianacademy #pydata
  53. 53. Parse Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo scikit-learn/NLTK Flask yHat Locally Questions? tweet @zipfianacademy #pydata
  54. 54. Parse HTML with BeautifulSoup and Extract the article Body Questions? tweet @zipfianacademy #pydata (and store in MongoDB™) Parse
  55. 55. # parse HTML content of articles! for article in collection.find({'html' : {'$exists' : True} }):! print article['web_url']! soup = BeautifulSoup(article['html'], 'html.parser')! arts = soup.find_all('div', class_='articleBody')! ! if len(arts) == 0:! arts = soup.find_all('p', class_=‘story-body-text')! ! ! ! … Questions? tweet @zipfianacademy #pydata Parse
  56. 56. Store Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  57. 57. for p in arts:! collection.update({ '_id' : article['_id'] }, { '$push' : ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! { 'content' : p.get_text() } ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! })! Questions? tweet @zipfianacademy #pydata Store
  58. 58. Explore Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata
  59. 59. Exploratory Data Analysis with pandas Questions? tweet @zipfianacademy #pydata Explore
  60. 60. articles.describe()! # ! ! text section! # count 1405 1405! # unique 1397 10! ! fig = plt.figure()! # histogram of section counts! articles['section'].value_counts().plot(kind='bar') Questions? tweet @zipfianacademy #pydata Explore
  61. 61. Questions? tweet @zipfianacademy #pydata Explore
  62. 62. error with NYT API Questions? tweet @zipfianacademy #pydata Explore
  63. 63. api_key='xxxxxxxxxxxxx'! ! ! ! url = 'http://api.nytimes.com/svc/search/v2/ articlesearch.json?fq=section_name.contains:("Arts" "Business Day" "Opinion" "Sports" "U.S." "World")&sort=newest&api-key=' + api_key! ! ! ! # make an API request! api = requests.get(url)! Questions? tweet @zipfianacademy #pydata Explore error with NYT API
  64. 64. Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Vectorize
  65. 65. Tokenize article text and create feature vectors with NLTK Questions? tweet @zipfianacademy #pydata Vectorize
  66. 66. Vectorize wnl = nltk.WordNetLemmatizer()! ! def tokenize_and_normalize(chunks):! words = [ tokenize.word_tokenize(sent) for sent in tokenize.sent_tokenize("".join(chunks)) ]! flatten = [ inner for sublist in words for inner in sublist ]! stripped = [] ! ! for word in flatten: ! if word not in stopwords.words('english'):! try:! stripped.append(word.encode('latin-1').decode('utf8').lower())! except:! print "Cannot encode: " + word! ! no_punks = [ word for word in stripped if len(word) > 1 ] ! return [wnl.lemmatize(t) for t in no_punks]! Questions? tweet @zipfianacademy #pydata
  67. 67. Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Train
  68. 68. Train and score a model with scikit-learn Questions? tweet @zipfianacademy #pydata Train
  69. 69. # cross validate! from sklearn.cross_validation import train_test_split! ! xtrain, xtest, ytrain, ytest = ! ! ! ! ! ! ! ! train_test_split(X, labels, test_size=0.3)! ! # train a model! alpha = 1! multi_bayes = MultinomialNB(alpha=alpha)! ! multi_bayes.fit(xtrain, ytrain)! multi_bayes.score(xtest, ytest) Questions? tweet @zipfianacademy #pydata Train
  70. 70. Gotchas! • Model only exists locally on Laptop • Not Automated for realtime prediction Questions? tweet @zipfianacademy #pydata Train
  71. 71. Exposé Questions? tweet @zipfianacademy #pydata
  72. 72. Iteration 2: • Expose your model • Automate your processes Questions? tweet @zipfianacademy #pydata Exposé
  73. 73. Getting that model off your lap(top) Questions? tweet @zipfianacademy #pydata Exposé
  74. 74. Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/ a_560x0.jpg Questions? tweet @zipfianacademy #pydata Exposé
  75. 75. A model is just a function Questions? tweet @zipfianacademy #pydata Exposé
  76. 76. Inputs... Questions? tweet @zipfianacademy #pydata Exposé
  77. 77. Outputs... Questions? tweet @zipfianacademy #pydata Exposé
  78. 78. Serialize your model with pickle (or cPickle or joblib) Questions? tweet @zipfianacademy #pydata Persistence
  79. 79. Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/ g-6mevh13be74mgnc9i8qifa0 Persistence Questions? tweet @zipfianacademy #pydata
  80. 80. Persistence SerDes • Disk • Database • Memory Questions? tweet @zipfianacademy #pydata
  81. 81. Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Exposé
  82. 82. Deploy your Model to yHat Questions? tweet @zipfianacademy #pydata Exposé
  83. 83. class DocumentClassifier(YhatModel):! @preprocess(in_type=dict, out_type=dict)! def execute(self, data):! featureBody = vectorizer.transform([data['content']])! result = multi_bayes.predict(featureBody)! list_res = result.tolist()! return {"section_name": list_res}! ! clf = DocumentClassifier()! yh = Yhat("jonathan@zipfianacademy.com", “xxxxxx",! ! ! ! ! ! ! ! ! ! ! ! ! ! "http://cloud.yhathq.com/")! yh.deploy("documentClassifier", DocumentClassifier, globals()) Questions? tweet @zipfianacademy #pydata Exposé
  84. 84. Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation At Scale Flask yHat scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) Snakebite (HDFS) MLlib (pySpark) requests BeautifulSoup4 pandas pymongo Flask (on Heroku) yHat Locally scikit-learn/NLTK Questions? tweet @zipfianacademy #pydata Present
  85. 85. Create a Flask application to expose your model on the web Questions? tweet @zipfianacademy #pydata Present
  86. 86. yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/") ! @app.route('/') def index(): return app.send_static_file('index.html') ! @app.route('/predict', methods=['POST']) def predict(): article = request.form['article'] results = yh.predict("documentClf", { 'content': article }) return jsonify({"results": results}) Questions? tweet @zipfianacademy #pydata Present
  87. 87. Pipeline Only Data should Flow Questions? tweet @zipfianacademy #pydata
  88. 88. Data Remember to Remember (Lineage) Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation Questions? tweet @zipfianacademy #pydata Pipeline
  89. 89. Immutable append only set of Raw Data Computation is a view on data *Lambda Architecture by Nathan MarzQuestions? tweet @zipfianacademy #pydata Pipeline
  90. 90. Functional Data Science • Modularity • Define interfaces • Separate data from computation • Data Lineage Functional Questions? tweet @zipfianacademy #pydata
  91. 91. Need Robust and Flexible Pipeline! Questions? tweet @zipfianacademy #pydata Pipeline
  92. 92. Whatever you do, DO NOT cross the streams Questions? tweet @zipfianacademy #pydata Pipeline
  93. 93. NYT API MongoDB BeautifulSoup Feature Matrixscikit-learn Web App Model Deploy yHat Heroku POST Predict Predicted Section Where we are NLTK scikit-learn Questions? tweet @zipfianacademy #pydata
  94. 94. Gotchas! • Only have a static subset of articles • Pipeline not automated for re-training Questions? tweet @zipfianacademy #pydata Gotchas
  95. 95. Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  96. 96. Iteration 3: Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.pngQuestions? tweet @zipfianacademy #pydata Iterate
  97. 97. NYT API MongoDB cron Feature Matrixscikit-learn Web App Model Deploy yHat Heroku POST Predict Predicted Section Where we are NLTK scikit-learn Questions? tweet @zipfianacademy #pydata Amazon EC2
  98. 98. testing Start small (data) and fast (development) testing Increase size of data set Optimize and productionize PROFIT! $$$ Questions? tweet @zipfianacademy #pydata How to Scale
  99. 99. How to Scale testing Develop locally testing Distribute computation (run on cluster) Tune parameters PROFIT! $$$ Questions? tweet @zipfianacademy #pydata Can also use a streaming algorithm or single machine disk based “medium data” technologies (i.e. database or memory mapped files)
  100. 100. Products If you build it... Questions? tweet @zipfianacademy #pydata
  101. 101. Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg Products Questions? tweet @zipfianacademy #pydata
  102. 102. Today • whoami • Nws Rdr (News Reader) • The What,Why, and How of Data Products • Data Engineering • Building a Pipeline • Productionizing the Products • Q&A Questions? tweet @zipfianacademy #pydata
  103. 103. Q & A Q&A Questions? tweet @zipfianacademy #pydata
  104. 104. Zipfian Academy @ZipfianAcademy Data Science & Data Engineering 12-week Bootcamp (May 12th & Sep 8th) Weekend Workshops http://zipfianacademy.com/apply http://zipfianacademy.com/workshops Next: InteractiveVisualizations w/ d3.js (June 7th) Questions? tweet @zipfianacademy #pydata
  105. 105. Thank You! Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex @ZipfianAcademy http://zipfianacademy.com Questions? tweet @zipfianacademy #pydata
  106. 106. Appendix Questions? tweet @zipfianacademy #pydata
  107. 107. Data Sources Obtain (ranked by ease of use) 1. DaaS -- Data as a service 2. Bulk Download 3. APIs 4. Web Scraping Questions? tweet @zipfianacademy #pydata
  108. 108. DaaS (Data as a Service) •Time Series/Numeric: Quandl • Financial Modeling: Quantopian • Email Contextualization: Rapleaf • Location and POI: Factual Data Sources Questions? tweet @zipfianacademy #pydata
  109. 109. Bulk Download (just like the good ol’ days) • File Transfer Protocol (FTP): CDC •Amazon Web Services: Public Datasets • Infochimps: Data Marketplace •Academia: UCI Machine Learning Repository Data Sources Questions? tweet @zipfianacademy #pydata
  110. 110. APIs (if it’s not RESTed, I’m not buying) • Geographic: Foursquare • Social: Facebook •Audio: Rdio • Content:Tumblr • Realtime:Twitter • Hidden:Yahoo Finance Data Sources Questions? tweet @zipfianacademy #pydata
  111. 111. Web Scraping 1. wget and curl 2. Web Spider/Crawler 3. API scraping 4. Manual Download (DIY for life) Data Sources Questions? tweet @zipfianacademy #pydata
  112. 112. • DelimitedValues • TSV • CSV • WSV • JSON • XML • Ad Hoc Formats (avoid these if you can) Data Formats Questions? tweet @zipfianacademy #pydata
  113. 113. • JSON is made up of hash tables and arrays • Hash tables: { “foo” : 1, “bar” : 2, baz : “3” } • Arrays: [1, 2, 3] • Arrays of arrays: [[1, 2, 3], [‘foo’, ‘bar’, ‘baz’]] • Array of hashes: [{‘foo’:1, ‘bar’:2}, {‘baz’:3}] • Hashes of hashes: {‘foo’: {‘bar’: 2, ‘baz’: 3}} Questions? tweet @zipfianacademy #pydata Data Formats
  114. 114. {"widget": {! "debug": "on",! "window": {! "title": "Sample Konfabulator Widget",! "name": "main_window",! "width": 500,! "height": 500! },! "image": { ! "src": "Images/Sun.png",! "name": "sun1",! "hOffset": 250,! "vOffset": 250,! "alignment": "center"! },! "text": {! "data": "Click Here",! "size": 36,! "style": "bold",! "name": "text1",! "hOffset": 250,! "vOffset": 100,! "alignment": "center",! "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"! }! }} ! Questions? tweet @zipfianacademy #pydata Data Formats
  115. 115. • XML is a recursive self-describing container <container> <item>Foo</item> <item>Bar</item> <container> <item attr=”SomethingAboutBaz”>Baz</item> </container> </item> <container> Questions? tweet @zipfianacademy #pydata Data Formats
  116. 116. <widget>! <debug>on</debug>! <window title="Sample Konfabulator Widget">! <name>main_window</name>! <width>500</width>! <height>500</height>! </window>! <image src="Images/Sun.png" name="sun1">! <hOffset>250</hOffset>! <vOffset>250</vOffset>! <alignment>center</alignment>! </image>! <text data="Click Here" size="36" style="bold">! <name>text1</name>! <hOffset>250</hOffset>! <vOffset>100</vOffset>! <alignment>center</alignment>! <onMouseUp>! sun1.opacity = (sun1.opacity / 100) * 90;! </onMouseUp>! </text>! </widget>! Questions? tweet @zipfianacademy #pydata Data Formats
  117. 117. • Ad hoc data formats • Fixed-width (Census data) • Graph Edgelists • Voting records • etc. Questions? tweet @zipfianacademy #pydata Data Formats
  118. 118. • 7-5-5 format •Sam foo bar! •Roger baz 6! •Jane 314 99 Questions? tweet @zipfianacademy #pydata Data Formats
  119. 119. • Directed Graph Format 1 2! 1 3! 1 4! 2 3! 4 4 Questions? tweet @zipfianacademy #pydata Data Formats
  120. 120. • Directed Graph Format 1 2! 1 3! 1 4! 2 3! 4 4 Questions? tweet @zipfianacademy #pydata Data Formats
  121. 121. Programming languages like Python, Ruby, and R have built in parsers for data formats such as JSON and CSV. For other esoteric formats you will probably have to write your own Questions? tweet @zipfianacademy #pydata Data Formats

×