Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Architecting for Data Science

959 views

Published on

Presented at the O'Reilly Software Architecture Conference, Boston, March 19, 2015.

Published in: Software
  • Be the first to comment

Architecting for Data Science

  1. 1. Architecting for Data Science johann@ifwe.co@jssmith github.com/ifwe Johann Schleier-Smith CTO, if(we) O’Reilly Software Architecture Conference
 Boston, March 19, 2015
  2. 2. Data Science
  3. 3. value from data
  4. 4. Alternative Definitions extraction of knowledge from data making discoveries in the world of big data statistics + machine learning + scalable computation + visualization + computer science + business acumen + skilled communication
  5. 5. Related and Alternative Language business intelligence statistics data mining forecasting business reporting predictive modeling analyticsknowledge extraction
  6. 6. value
  7. 7. Types of Value understanding revenue product improvements projections new inspirations predictions customer satisfaction
  8. 8. Today’s Examples
  9. 9. • >10 million candidates to draw from • >1000 updates/sec • Must be responsive to current activity • Users expect instant query results Recommendation engine
 for dating product
  10. 10. • Real-time is challenging • Human behavior is complicated, especially in social context • Previous interactions are perhaps our best hope for predicting future interactions
  11. 11. • Human connections • User engagement ecosystem • Subscription and other revenues Value ♥
  12. 12. Kaggle competition
 with Best Buy data https://www.kaggle.com/c/acm-sf-chapter-hackathon-small
  13. 13. Kaggle competition
 with Best Buy data
  14. 14. “outgoing and social (heavy messaging --- especially distant recipients and opposite gender, many outgoing comments, many friend requests to distant people), doesn’t play Pets much” “receives many messages, active user, views many profiles, doesn't use meet me, sends many messages to distant people” Heavy user overall, (pets, meet me, messaging)! “heavy user overall, (pets, meet me, messaging)”
  15. 15. value
  16. 16. data
  17. 17. • Vote history • Social interaction history • Profile information Dating product data
  18. 18. { “timestamp” : “2011-10-31 09:48:46”, “query” : “Assassin’s Creed”, “skuSelected” : “2670133” } product views
  19. 19. { “sku” : “1032361”, “regularPrice” : “19.99”, “name” : “Need for Speed: Hot Pursuit”, “description” : “Fasten your seatbelt and
 get ready to drive like your life depends
 on it...” ... } product updates
  20. 20. Formats for Data log files web services relational databases unstructured documents spreadsheets xml files
  21. 21. Types of Data technical data government data usage records sensor data academic data reference data yet uncollected data
  22. 22. Vasant Dhar. 2013. Data science and prediction. Commun. ACM 56, 12 (December 2013), 64-73.
 And International Telecommunication Union (ITU) and United Nations Population Division via www.internetlivestats.com/internet-users/
  23. 23. Trends data quantity machine learning maturing data variety data velocity
  24. 24. Machine Learning classification decision trees supervised methods unsupervised methodsclustering what matters most is mapping of data
  25. 25. Machine Learning Techniques • Classification - (Logistic Regression, Decision Trees, Random Forests) • Prediction - (Generalized Linear Models, Support Vector Regression, …) • Clustering - (K-Means, Hierarchical, Latent Dirichlet Allocation, …) • Collaborative filtering, … Features often matter more than choice of algorithm
  26. 26. data
  27. 27. tools of the trade
  28. 28. • Created in 1993 • Implementation of S language but also inherits from Scheme • Object oriented code is possible but not encouraged • Vast high-quality package ecosystem • Data is vectors and data frames
  29. 29. Demo
  30. 30. • Statistics • Visualization • Machine learning • REPL, scripts, interactive IDE • In-memory data sets
  31. 31. http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
  32. 32. • More of a general purpose language than R • Arrays and matrices as basic data structures • Supports data frames through Pandas • Sophisticated machine learning libraries • Generally limited to in-memory data sets
  33. 33. • Leverages commodity hardware to store large data sets at low cost • Vibrant and diverse ecosystem • Popular but not always best solution • Probably best viewed as marketing terminology, as opposed to technology
  34. 34. https://hadoopecosystemtable.github.io/ Category Number of projects Distributed Filesystem 7 Distributed Programming 18 NoSQL Database 4 Document Data Model 3 Stream Data Model 1 Key-Value Data Model 4 Graph Data Model 3 NewSQL 9 SQL-On-Hadoop 11 Data Ingestion 11 Service Programming 7 Scheduling 3 Machine Learning 6 Benchmarking 5 Security 3 System Deployment 12 Applications 5 Development Frameworks 2 Categorize Pending 16 130 freely licensed open source projects listed in the Hadoop Ecosystem Table
  35. 35. Hadoop for Data Scientists • Pulling data from repository (SQL, Hive) • MapReduce programming (Java, Scala, Pig, Python) • Spark in-memory framework is gaining adoption rapidly
  36. 36. tools rarely used in data science
  37. 37. version control automated testing automated deployment shared code agile methodology code review software architecture
  38. 38. the cycle
 of data science
  39. 39. Data ProcessingData Collection Models, Algorithms Data-driven Product Features Data Analysis & Understanding Reports & Visualizations Product Improvements
  40. 40. data science
 at
  41. 41. • Profitable startup actively pursuing big opportunities in social apps • Millions of users on existing products • Thousands of social contacts per second
  42. 42. what it should look like
  43. 43. 1. Gain understanding of the product usage 2. See opportunity to make the product better 3. Create training data 4. Train predictive models 5. Put models in production 6. See improvements
  44. 44. what it often looks like
  45. 45. 1. Gain understanding of the product usage 2. See opportunity to make the product better 3. Pull records from relational database to create interesting features (usually aggregates) 4. Train predictive models 5. Go implement models for production 6. See improvements
  46. 46. 1. Gain understanding of the product usage 2. See opportunity to make the product better 3. Pull records from relational database to create interesting features (usually aggregates) 4. Train predictive models 5. Go implement models for production 6. See improvements 3-6
 months
  47. 47. 1. Gain understanding of the product usage 2. See opportunity to make the product better 3. Pull records from relational database to create interesting features (usually aggregates) 4. Train predictive models 5. Go implement models for production 6. See improvements Cool!
 Was it worth it?
  48. 48. implementation
 pain points
  49. 49. • Data scientist hands model description to software engineer • May need to translate features from SQL to Java • Aggregate features require batch processing • May need to adjust features and model to achieve real-time updates • Fast scoring requires high-performance in- memory data structures
  50. 50. new thinking
  51. 51. new architecture
  52. 52. one right way to data
  53. 53. event history one right way to data
  54. 54. everything is an event
  55. 55. Bob registers Alice registers Alice updates profile Bob opens app Bob sees Alice in recommendations Bob swipes yes on Alice Alice receives push notification Alice sees Bob swiped yes Alice swipes yes Alice sends message to Bob
  56. 56. architecture comparison
  57. 57. Database' Applica-on' Web'API' Ranking' Solr'Search' Indexing' Service' Change'logs' Occasional' index'rebuilds' Change'logs' Produc'on) Development) Exploratory' Analysis' Training'&' Backtes-ng'
  58. 58. Database' Applica-on' Web'API' Ranking' Solr'Search' Indexing' Service' Change'logs' Occasional' index'rebuilds' Change'logs' Produc'on) Development) Exploratory' Analysis' Training'&' Backtes-ng'
  59. 59. Database' Applica-on' Web'API' Ranking' Solr'Search' Indexing' Service' Change'logs' Occasional' index'rebuilds' Change'logs' Produc'on) Development) Exploratory' Analysis' Training'&' Backtes-ng'
  60. 60. Database' Applica-on' Web'API' Ranking' Solr'Search' Indexing' Service' Change'logs' Occasional' index'rebuilds' Change'logs' Produc'on) Development) Exploratory' Analysis' Training'&' Backtes-ng'
  61. 61. Database' Applica-on' Web'API' Ranking' Solr'Search' Indexing' Service' Change'logs' Occasional' index'rebuilds' Change'logs' Produc'on) Development) Exploratory' Analysis' Training'&' Backtes-ng'
  62. 62. Database' Applica-on' Web'API' Ranking' Event'History'Repository' Solr'Search' Real=-me' State'Updates' Change'logs'Real=-me'events' Occasional' index'rebuilds' Exploratory' Analysis'&' Visualiza-on' Produc'on) Development) Training'&' Backtes-ng' State'Updates' Monitoring'
  63. 63. Database' Applica-on' Web'API' Ranking' Event'History'Repository' Solr'Search' Real=-me' State'Updates' Change'logs'Real=-me'events' Occasional' index'rebuilds' Exploratory' Analysis'&' Visualiza-on' Produc'on) Development) Training'&' Backtes-ng' State'Updates' Monitoring'
  64. 64. Database' Applica-on' Web'API' Ranking' Event'History'Repository' Solr'Search' Real=-me' State'Updates' Change'logs'Real=-me'events' Occasional' index'rebuilds' Exploratory' Analysis'&' Visualiza-on' Produc'on) Development) Training'&' Backtes-ng' State'Updates' Monitoring'
  65. 65. Database' Applica-on' Web'API' Ranking' Event'History'Repository' Solr'Search' Real=-me' State'Updates' Change'logs'Real=-me'events' Occasional' index'rebuilds' Exploratory' Analysis'&' Visualiza-on' Produc'on) Development) Training'&' Backtes-ng' State'Updates' Monitoring'
  66. 66. Database' Applica-on' Web'API' Ranking' Event'History'Repository' Solr'Search' Real=-me' State'Updates' Change'logs'Real=-me'events' Occasional' index'rebuilds' Exploratory' Analysis'&' Visualiza-on' Produc'on) Development) Training'&' Backtes-ng' State'Updates' Monitoring'
  67. 67. Event History API trait EventHistory { def publishEvent(e: Event) " def getEvents( startTime: Date, endTime: Date, eventFilter: EventFilter, eventHandler: EventHandler ) }
  68. 68. Event History API trait EventHistory { def publishEvent(e: Event) " def getEvents( startTime: Date, endTime: Date, eventFilter: EventFilter, eventHandler: EventHandler ) }
  69. 69. Event History API trait EventHistory { def publishEvent(e: Event) " def getEvents( startTime: Date, endTime: Date, eventFilter: EventFilter, eventHandler: EventHandler ) } +∞ for
 real-time
 streaming
  70. 70. training data comparison
  71. 71. Database' Applica-on' Web'API' Ranking' Solr'Search' Indexing' Service' Change'logs' Occasional' index'rebuilds' Change'logs' Produc'on) Development) Exploratory' Analysis' Training'&' Backtes-ng'
  72. 72. Events'State'Snapshots'' Training' Features' Training' Outcomes' Time'
  73. 73. Database' Applica-on' Web'API' Ranking' Event'History'Repository' Solr'Search' Real=-me' State'Updates' Change'logs'Real=-me'events' Occasional' index'rebuilds' Exploratory' Analysis'&' Visualiza-on' Produc'on) Development) Training'&' Backtes-ng' State'Updates' Monitoring'
  74. 74. Updates( Training( features( Training( outcome( Time( Online(State( Events( Training(Data(
  75. 75. 1. Gain understanding of the product usage 2. See opportunity to make the product better 3. Create training data 4. Train predictive models 5. Put models in production 6. See improvements Fast cycles!!
  76. 76. Live Demo https://www.kaggle.com/c/acm-sf-chapter-hackathon-small https://github.com/ifwe/antelope
  77. 77. • Open source implementation derived from if(we)’s proprietary platform • Not ready scale or production, but useful for demonstration purposes • Seeking collaborators
  78. 78. product update events { “timestamp” : “2012-05-03 6:43:15”, “eventType” : “ProductUpdate”, “eventProperties” : { “sku” : “1032361”, “regularPrice” : “19.99”, “name” : “Need for Speed: Hot Pursuit”, “description” : “Fasten your seatbelt and
 get ready to drive like your life depends
 on it...” ... } }
  79. 79. product view events { “timestamp” : “2011-10-31 09:48:46”, “eventType” : “ProductView”, “eventProperties” : { “query” : “Modern warfare”, “skuSelected” : “2670133” } }
  80. 80. demo Try it yourself, code and instructions at:
 https://github.com/ifweco/antelope/blob/master/doc/demo.md
  81. 81. new tools :)
  82. 82. Architecture for
 Data Science
  83. 83. data warehousing
  84. 84. Database' Applica-on' Web'API' Ranking' Solr'Search' Indexing' Service' Change'logs' Occasional' index'rebuilds' Change'logs' Produc'on) Development) Exploratory' Analysis' Training'&' Backtes-ng'
  85. 85. Data$ Warehouse$ Extract$ Load$ Transform$ Opera7onal$ Data$Store$ Staging$
  86. 86. log transform use extract transform load
  87. 87. Data Architecture dimensional modeling relational modeling
  88. 88. Warehouse Design normalization slowly changing
 dimensions denormalization star schema
  89. 89. slowly changing dimensions
  90. 90. type 1: overwrite the old data type 2: multiple rows with versioning type 3: extra columns for older versions
  91. 91. slowly changing dimensions event history
  92. 92. Database' Applica-on' Web'API' Ranking' Event'History'Repository' Solr'Search' Real=-me' State'Updates' Change'logs'Real=-me'events' Occasional' index'rebuilds' Exploratory' Analysis'&' Visualiza-on' Produc'on) Development) Training'&' Backtes-ng' State'Updates' Monitoring'
  93. 93. Event& Indexing& Applica1on& Web&API& Ranking& Event&History&Repository& Solr&Search& Real>1me& State&Updates& Change&logs&Real>1me&events& Exploratory& Analysis&&& Visualiza1on& Produc'on) Development) Training&&& Backtes1ng& State&Updates& Monitoring&
  94. 94. trait EventHistory { def publishEvent(e: Event) " def getEvents( startTime: Date, endTime: Date, eventFilter: EventFilter, eventHandler: EventHandler ) }
  95. 95. event history design
  96. 96. Data Architecture ok to denormalize log a lot think about the types
  97. 97. • Make sure that events are simple facts • Files are ok for event history, don’t really need a database • Use an object hierarchy to model events in code • Use online features that are efficient to update incrementally • Write efficient implementations before than scaling out • Functional style makes it easier • Encourage reactive processing
  98. 98. Data Quality • Matters more than transformations, more than algorithms • Data that doesn’t make sense often indicates an application bug • Do assertions, e.g., make sure things aren’t happening out of order
  99. 99. • All data in form of events – no exceptions! • Same feature code in production and development • Emphasis on creative feature engineering • Quick cycles between ideas and production github.com/ifwe/antelope @jssmith Try the Antelope Demo:
 https://github.com/ifwe/antelope/blob/master/doc/demo.md

×