
Human-centric Machine Learning Infrastructure @Netflix


Video and slides are synchronized; an MP3 and the slides are available for download at https://bit.ly/2AfrJ94.

Ville Tuulos discusses the choices and reasoning behind the tooling the Netflix Machine Learning Infrastructure team built for data scientists, and some of the challenges and solutions involved in creating a paved road to production for machine learning models. Filmed at qconsf.com.

Ville Tuulos is a software architect on the Machine Learning Infrastructure team at Netflix. He has been building ML systems for over 15 years at various startups, including one that he founded, and at large corporations.


  1. Human-Centric Machine Learning Infrastructure @ Netflix. Ville Tuulos, QCon SF, November 2018
  2. InfoQ.com: News & Community Site. Over 1,000,000 software developers, architects and CTOs read the site worldwide every month; 250,000 senior developers subscribe to our weekly newsletter. Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese). We post content from our QCon conferences and run 2 dedicated podcast channels: The InfoQ Podcast, with a focus on architecture, and The Engineering Culture Podcast, with a focus on building. 96 deep dives on innovative topics packed as downloadable emags and minibooks; over 40 new content items per week. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/netflix-ml-infrastructure
  3. Presented at QCon San Francisco (www.qconsf.com). Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: a practitioner-driven conference designed for you, the influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide.
  4. Meet Alex, a new chief data scientist at Caveman Cupcakes. "You are hired!"
  5. 5. We need a dynamic pricing model.
  6. 6. We need a dynamic pricing model. Optimal pricing model
  7. 7. Great job! The model works perfectly!
  8. 8. Could you predict churn too?
  9. Alex's model: optimal pricing model · optimal churn model
  10. Alex's model: optimal pricing model · optimal churn model
  11. 11. Good job again! Promising results!
  12. 12. Can you include a causal attribution model for marketing?
  13. Alex's model: optimal pricing model · optimal churn model · attribution model
  14. 14. Are you sure these results make sense?
  15. Take two
  16. Meet the new data science team at Caveman Cupcakes. "You are hired!"
  17. Pricing model · Churn model · Attribution model
  18. 18. VS
  19. the human is the bottleneck VS
  20. the human is the bottleneck VS the human is the bottleneck
  21. Build
  22. Build: Data Warehouse
  23. Build: Data Warehouse, Compute Resources
  24. Build: Data Warehouse, Compute Resources, Job Scheduler
  25. Build: Data Warehouse, Compute Resources, Job Scheduler, Versioning
  26. Build: Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools
  27. Build: Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment
  28. Build: Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment, Feature Engineering
  29. Build: Data Warehouse, Compute Resources, Job Scheduler, Versioning, Collaboration Tools, Model Deployment, Feature Engineering, ML Libraries
  30. Build stack, annotated: how much the data scientist cares
  31. Build stack, annotated: how much the data scientist cares vs. how much infrastructure is needed
  32. Deploy
  33. Deploy. No plan survives contact with the enemy.
  34. Deploy. No plan survives contact with the enemy. No model survives contact with reality.
  35. Our ML infra supports two human activities: building and deploying data science workflows.
  36. 36. Screenplay Analysis Using NLP Fraud Detection Title Portfolio Optimization Estimate Word-of-Mouth Effects Incremental Impact of Marketing Classify Support Tickets Predict Quality of Network Content Valuation Cluster Tweets Intelligent Infrastructure Machine Translation Optimal CDN Caching Predict Churn Content Tagging Optimize Production Schedules
  37. Notebooks: Nteract · Job Scheduler: Meson · Compute Resources: Titus · Query Engine: Spark · Data Lake: S3 · ML Libraries: R, XGBoost, TF, etc.
  38. Notebooks: Nteract · Job Scheduler: Meson · Compute Resources: Titus · Query Engine: Spark · Data Lake: S3 · ML Libraries: R, XGBoost, TF, etc.
  39. The stack grouped by concern: data (Spark, S3) · compute (Titus) · prototyping (Nteract) · models (ML libraries)
  40. Bad Old Days
  41. Bad Old Days. Data scientist built an NLP model in Python. Easy and fun!
  42. Bad Old Days. Data scientist built an NLP model in Python. Easy and fun! How to run at scale? Custom Titus executor.
  43. Bad Old Days. Data scientist built an NLP model in Python. Easy and fun! How to run at scale? Custom Titus executor. How to access data at scale? Slow!
  44. Bad Old Days. Data scientist built an NLP model in Python. Easy and fun! How to run at scale? Custom Titus executor. How to access data at scale? Slow! How to schedule the model to update daily? Learn about the job scheduler.
  45. Bad Old Days. Data scientist built an NLP model in Python. Easy and fun! How to run at scale? Custom Titus executor. How to access data at scale? Slow! How to schedule the model to update daily? Learn about the job scheduler. How to expose the model to a custom UI? Custom web backend.
  46. Bad Old Days. Data scientist built an NLP model in Python. Easy and fun! How to run at scale? Custom Titus executor. How to access data at scale? Slow! How to schedule the model to update daily? Learn about the job scheduler. How to expose the model to a custom UI? Custom web backend. Time to production: 4 months.
  47. Bad Old Days, continued. Time to production: 4 months. And still: How to monitor models in production? How to iterate on a new version without breaking the production version? How to let another data scientist iterate on her version of the model safely? How to debug yesterday's failed production run? How to backfill historical data? How to make this faster?
  48. Notebooks: Nteract · Job Scheduler: Meson · Compute Resources: Titus · Query Engine: Spark · Data Lake: S3 · ML Libraries: R, XGBoost, TF, etc. · ML Wrapping: Metaflow
  49. Metaflow
  50. Build
  51. How to get started? def compute(input): output = my_model(input) return output (diagram: input → compute → output)
  52. How to get started? def compute(input): output = my_model(input) return output. # python myscript.py
  53. How to structure my code? (DAG: start → A, B → join → end) from metaflow import FlowSpec, step class MyFlow(FlowSpec): @step def start(self): self.next(self.a, self.b) @step def a(self): self.next(self.join) @step def b(self): self.next(self.join) @step def join(self, inputs): self.next(self.end) MyFlow()
  54. How to structure my code? (DAG: start → A, B → join → end) from metaflow import FlowSpec, step class MyFlow(FlowSpec): @step def start(self): self.next(self.a, self.b) @step def a(self): self.next(self.join) @step def b(self): self.next(self.join) @step def join(self, inputs): self.next(self.end) MyFlow() # python myscript.py run
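The FlowSpec above declares a DAG by having each step name its successors with self.next. A rough pure-Python stand-in (not the real Metaflow runner; GRAPH and run are illustrative names) shows how such a graph can be walked, deferring the join until both branches have finished:

```python
# Hypothetical miniature of the flow's DAG: each step lists its successors,
# and the runner visits a step only once all of its parents are done.

GRAPH = {
    "start": ["a", "b"],   # fan-out into two branches
    "a": ["join"],
    "b": ["join"],
    "join": ["end"],       # fan-in: waits for both a and b
    "end": [],
}

def run(graph):
    """Walk the DAG from 'start', returning the execution order."""
    order, done = [], set()
    pending = ["start"]
    while pending:
        step = pending.pop(0)
        if step in done:
            continue
        parents = [s for s, nxt in graph.items() if step in nxt]
        if not all(p in done for p in parents):
            pending.append(step)  # not ready yet; revisit later
            continue
        done.add(step)
        order.append(step)
        pending.extend(graph[step])
    return order

print(run(GRAPH))  # → ['start', 'a', 'b', 'join', 'end']
```

The real scheduler runs the two branches concurrently; here the order only illustrates that the join cannot fire before both parents complete.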
  55. How to deal with models written in R? (DAG: start → A, B → join → end) metaflow("MyFlow") %>% step(step = "start", next_step = c("a", "b")) %>% step(step = "A", r_function = r_function(a_func), next_step = "join") %>% step(step = "B", r_function = r_function(b_func), next_step = "join") %>% step(step = "Join", r_function = r_function(join, join_step = TRUE),
  56. How to deal with models written in R? Same flow as above. # RScript myscript.R
  57. Metaflow adoption at Netflix: 134 projects on Metaflow as of November 2018
  58. How to prototype and test my code locally? @step def start(self): self.x = 0 self.next(self.a, self.b) @step def a(self): self.x += 2 self.next(self.join) @step def b(self): self.x += 3 self.next(self.join) @step def join(self, inputs): self.out = max(i.x for i in inputs) self.next(self.end) (DAG: start x=0 → A x+=2, B x+=3 → join max(A.x, B.x) → end)
  59. How to prototype and test my code locally? Same flow as above. # python myscript.py resume B
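The `resume B` command reruns the flow from a chosen step, reusing artifacts persisted by the previous run. A hypothetical sketch of that idea (store, ORDER, and resume are illustrative stand-ins, not Metaflow internals), using the same x arithmetic as the slide:

```python
# Hypothetical sketch of 'resume': artifacts from a previous run are kept in
# a store, and only the requested step and its downstream steps re-execute.

store = {}  # step name -> artifacts written by that step

def start():
    store["start"] = {"x": 0}

def a():
    store["a"] = {"x": store["start"]["x"] + 2}

def b():
    store["b"] = {"x": store["start"]["x"] + 3}

def join():
    store["join"] = {"out": max(store["a"]["x"], store["b"]["x"])}

ORDER = ["start", "a", "b", "join"]

def resume(from_step):
    """Reuse cached artifacts for earlier steps; re-run from from_step on."""
    idx = ORDER.index(from_step)
    for step in ORDER[idx:]:
        globals()[step]()

# First full run:
for step in ORDER:
    globals()[step]()

# 'resume b' re-executes only b and join, reusing start's cached artifact:
resume("b")
print(store["join"]["out"])  # → 3
```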
  60. How to get access to more CPUs, GPUs, or memory? @titus(cpu=16, gpu=1) @step def a(self): tensorflow.train() self.next(self.join) @titus(memory=200000) @step def b(self): massive_dataframe_operation() self.next(self.join) (A: 16 cores, 1 GPU; B: 200GB RAM)
  61. How to get access to more CPUs, GPUs, or memory? Same flow as above. # python myscript.py run
  62. How to distribute work over many parallel jobs? @step def start(self): self.grid = ['x','y','z'] self.next(self.a, foreach='grid') @titus(memory=10000) @step def a(self): self.x = ord(self.input) self.next(self.join) @step def join(self, inputs): self.out = max(i.x for i in inputs) self.next(self.end)
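The foreach above fans out one task per grid item and reduces the branch artifacts in the join. A rough local stand-in for that pattern (a thread pool instead of Titus containers; the names are illustrative), using the same grid and the same ord/max logic as the slide:

```python
from concurrent.futures import ThreadPoolExecutor

def a(item):
    # each parallel task receives one grid element as its input
    return ord(item)

def run():
    grid = ["x", "y", "z"]
    # Metaflow would launch one container per item on Titus; a thread pool
    # stands in for that fan-out locally
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(a, grid))   # fan-out
    return max(results)                     # join reduces branch artifacts

print(run())  # → 122, i.e. ord('z')
```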
  63. 40% of projects run steps outside their dev environment. How quickly do they start using Titus?
  64. How to access large amounts of input data? from metaflow import Table @titus(memory=200000, network=20000) @step def b(self): # Load data from S3 to a dataframe at 10Gbps df = Table('vtuulos', 'input_table') self.next(self.end)
  65. Case Study: Marketing Cost per Incremental Watcher. 1. Build a separate model for every new title with marketing spend: parallel foreach. 2. Load and prepare input data for each model: download Parquet directly from S3; total amount of model input data: 890GB. 3. Fit a model: train each model on an instance with 400GB of RAM and 16 cores; the model is written in R. 4. Share updated results: collect results of individual models and write them to a table; results shown on a Tableau dashboard.
  66. Deploy
  67. How to version my results and access results by others? # Access Savin's runs namespace('user:savin') run = Flow('MyFlow').latest_run print(run.id) # = 234 print(run.tags) # = ['unsampled_model'] # Access David's runs namespace('user:david') run = Flow('MyFlow').latest_run print(run.id) # = 184 print(run.tags) # = ['sampled_model'] # Access everyone's runs namespace(None) run = Flow('MyFlow').latest_run print(run.id) # = 184 (david: sampled_model; savin: unsampled_model)
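The namespace calls above scope latest_run to one user's runs, and namespace(None) widens the view to everyone's. A hypothetical in-memory sketch (runs and latest_run are illustrative stand-ins, not the Metaflow client) that reproduces the ids from the slide:

```python
# Hypothetical stand-in for namespaced run lookup: every run is stamped
# with its owner and tags; latest_run only sees runs in the given
# namespace, and namespace None widens the view to everyone's runs.

runs = [
    {"id": 234, "user": "savin", "tags": ["unsampled_model"], "finished_at": 10},
    {"id": 184, "user": "david", "tags": ["sampled_model"],   "finished_at": 20},
]

def latest_run(namespace=None):
    """Return the most recent run visible in the namespace (None = all)."""
    visible = [r for r in runs if namespace is None or r["user"] == namespace]
    return max(visible, key=lambda r: r["finished_at"])

print(latest_run("savin")["id"])  # → 234
print(latest_run("david")["id"])  # → 184
print(latest_run(None)["id"])     # → 184 (david's run is the most recent)
```

Run ids here are per-flow identifiers, not timestamps, which is why the globally latest run (184) can have a smaller id than savin's (234).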
  68. How to deploy my workflow to production?
  69. How to deploy my workflow to production? # python myscript.py meson create
  70. 26% of projects get deployed to the production scheduler. How quickly does the first deployment happen?
  71. How to monitor models and examine results? (DAG: start x=0 → A x+=2, B x+=3 → join max(A.x, B.x) → end)
  72. How to deploy results as a microservice? Metaflow hosting: from metaflow import WebServiceSpec from metaflow import endpoint class MyWebService(WebServiceSpec): @endpoint def show_data(self, request_dict): # TODO: real-time predict here result = self.artifacts.flow.x return {'result': result}
  73. How to deploy results as a microservice? Same service as above. # curl http://host/show_data → {"result": 3}
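Metaflow hosting, as sketched on the slide, attaches a flow's artifacts to a WebServiceSpec and exposes decorated methods over HTTP. A minimal hypothetical stand-in for that dispatch (Artifacts, MyWebService, and handle are illustrative; the real service speaks HTTP rather than direct calls):

```python
import json

# Hypothetical stand-in for Metaflow hosting: the service holds a snapshot
# of the flow's artifacts and routes request paths to endpoint methods.

class Artifacts:
    """Pretend snapshot of the flow's final artifacts (x from the join)."""
    x = 3

class MyWebService:
    artifacts = Artifacts()

    def show_data(self, request_dict):
        # a real endpoint might run a real-time prediction here
        return {"result": self.artifacts.x}

def handle(service, path, request_dict=None):
    """Route '/show_data' to the method of the same name; JSON-encode reply."""
    handler = getattr(service, path.lstrip("/"))
    return json.dumps(handler(request_dict or {}))

print(handle(MyWebService(), "/show_data"))  # → {"result": 3}
```

This mirrors the slide's curl example: the served value is the artifact computed by the last production run, so redeploying the flow refreshes the service's results.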
  74. Case Study: Launch Date Schedule Optimization. 1. Batch-optimize launch date schedules for new titles daily: batch optimization deployed on Meson. 2. Serve results through a custom UI: results deployed on Metaflow Hosting. 3. Support arbitrary what-if scenarios in the custom UI: run the optimizer in real time in a custom web endpoint.
  75. Metaflow
  76. diverse problems
  77. diverse problems · diverse people
  78. diverse problems · diverse people · diverse models
  79. diverse problems · diverse people · help people build diverse models
  80. diverse problems · diverse people · help people build and deploy diverse models
  81. diverse problems · diverse people · help people build and deploy diverse models · happy people, healthy business
  82. Thank you! @vtuulos vtuulos@netflix.com
  83. Photo Credits: Bruno Coldiori https://www.flickr.com/photos/br1dotcom/8900102170/ https://www.maxpixel.net/Isolated-Animal-Hundeportrait-Dog-Nature-3234285
