
AI pipelines powered by Jupyter notebooks


The Jupyter Notebook has become the de facto platform used by data scientists and AI engineers to build interactive applications and develop their AI/ML models. In this scenario, it’s very common to decompose various phases of the development into multiple notebooks to simplify the development and management of the model lifecycle.

Luciano Resende details how to schedule these multiple notebooks, which correspond to different phases of the model lifecycle, into notebook-based AI pipelines, and walks you through scenarios that demonstrate how to reuse notebooks via parameterization.

Published in: Data & Analytics


  1. 1. AI pipelines powered by Jupyter notebooks Luciano Resende Open Source AI Platform Architect @lresende1975
  2. 2. About me - Luciano Resende Open Source AI Platform Architect – IBM – CODAIT • Senior Technical Staff Member at IBM, contributing to open source for over 10 years • Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, Apache Spark among other projects related to AI/ML platforms lresende@us.ibm.com https://www.linkedin.com/in/lresende @lresende1975 https://github.com/lresende IBM Developer / © 2019 IBM Corporation 2
  3. 3. IBM Open Source Participation IBM Developer / © 2019 IBM Corporation Learn Open Source @ IBM Program touches 78,000 IBMers annually Consume Virtually all IBM products contain some open source • 40,363 pkgs Per Year Contribute • >62K OS Certs per year • ~10K IBM commits per month Connect > 1000 active IBM Contributors Working in key OS projects 3
  4. 4. IBM Open Source Participation IBM generated open source innovation • 137 IBM Open Code projects w/1000+ Github projects • Projects graduates into full open governance: Node-Red, OpenWhisk, SystemML, Blockchain fabric among others • developer.ibm.com/code/open/code/ Community • IBM focused on 18 strategic communities • Drive open governance in “Centers of Gravity” • IBM Leaders drive key technologies and assure freedom of action The IBM OS Way is now open sourced • Training, Recognition, Tooling • Organization, Consuming, Contributing 4IBM Developer / © 2019 IBM Corporation
  5. 5. Technology leaders do more than just consume OSS 19 1998 “For more than 20 years, IBM and Red Hat have paved the way for open communities to power innovative IT solutions.” – Red Hat Long IBM history of actively fostering balanced community participation 5 © 2019 IBM Corporation
  6. 6. Center for Open Source Data and AI Technologies 6 CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission 6IBM Developer / © 2019 IBM Corporation CODAIT codait.org codait (French) = coder/coded https://m.interglot.com/fr/en/codait
  7. 7. IBM Data Asset eXchange (DAX) 7 • Curated free and open datasets under open data licenses • Standardized dataset formats and metadata • Ready for use in enterprise AI applications • Complement to the Model Asset eXchange (MAX) Data Asset eXchange ibm.biz/data-asset-exchange Model Asset eXchange ibm.biz/model-exchange
  8. 8. AGENDA Jupyter Notebooks Analytic Workloads Pipelines • IPython %run magic • Jupyter NBConvert • Papermill • Apache Airflow AI/Deep Learning Workloads Pipelines • AI Platforms • Kubeflow and Kubeflow Pipelines Announcements Resources IBM Developer / © 2019 IBM Corporation 8
  9. 9. Jupyter Notebooks 9IBM Developer / © 2019 IBM Corporation
  10. 10. Jupyter Notebooks Notebooks are interactive computational environments, in which you can combine code execution, rich text, mathematics, plots and rich media. 10IBM Developer / © 2019 IBM Corporation
  11. 11. Jupyter Notebook 11 Simple, but Powerful As simple as opening a web page, with the capabilities of a powerful, multilingual development environment. Interactive widgets Code can produce rich outputs such as images, videos, markdown, LaTeX and JavaScript. Interactive widgets can be used to manipulate and visualize data in real time. Language of choice Jupyter Notebooks have support for over 50 programming languages, including those popular in Data Science, Data Engineering, and AI such as Python, R, Julia and Scala. Big Data Integration Leverage Big Data platforms such as Apache Spark from Python, R and Scala. Explore the same data with pandas, scikit-learn, ggplot2, dplyr, etc. Share Notebooks Notebooks can be shared with others using e-mail, Dropbox, Google Drive, GitHub, etc.
  12. 12. Jupyter Notebook Platform Architecture Notebook UI runs on the browser The Notebook Server serves the 'Notebooks' Kernels interpret/execute cell contents – Are responsible for code execution – Abstract different languages – 1:1 relationship with Notebook – Run and consume resources as long as the notebook is running 12IBM Developer / © 2019 IBM Corporation
  13. 13. Jupyter Notebook Analytic Workloads 13IBM Developer / © 2019 IBM Corporation
  14. 14. Analytic Workloads Large amount of data Shared across organization in Data Lakes Multiple workload types – Data cleansing – Data Warehouse – Machine Learning and Insights 14IBM Developer / © 2019 IBM Corporation
  15. 15. Analytic Workloads Decompose Schedule/Run
  16. 16. Homegrown pipelines 16IBM Developer / © 2019 IBM Corporation
  17. 17. Notebook Pipelines using %run %run built-in IPython magic - Enables execution of notebooks or Python scripts IBM Developer / © 2019 IBM Corporation 17 Notebook Orchestrator %run %run %run
  18. 18. Notebook Pipelines using %run %run built-in IPython magic - Enables execution of notebooks or Python scripts Limitations - Available in the IPython kernel only - Static - No command line integration IBM Developer / © 2019 IBM Corporation 18
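The orchestrator pattern from the slides above can be sketched as a small "driver" notebook; the notebook names below are hypothetical. Because %run executes each target in the current IPython session, variables defined by an earlier notebook stay in scope for the later ones:

```python
# Orchestrator notebook cell (works in the IPython kernel only).
# Notebook names are illustrative, not from the talk.
%run ./data_prep.ipynb      # load and clean the raw data
%run ./train_model.ipynb    # train using variables left in scope by data_prep
%run ./evaluate.ipynb       # evaluate the trained model
```

This shared-namespace behavior is also why %run pipelines are "static": the execution order and file names are hard-coded in the orchestrator cells.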
  19. 19. Notebook Pipelines using NBConvert IBM Developer / © 2019 IBM Corporation 19 input notebook(s) orchestrator result_1.ipynb result_2.ipynb result_3.html result_4.pdf output file(s) ipynb, html, pdf NBConvert Jupyter NBConvert https://nbconvert.readthedocs.io/en/latest/ Jupyter NBConvert enables executing and converting notebooks to different file formats.
  20. 20. Notebook Pipelines using NBConvert $ pip install nbconvert $ jupyter nbconvert --to html --execute overview_with_run.ipynb [NbConvertApp] Converting notebook overview_with_run.ipynb to html [NbConvertApp] Executing notebook with kernel: python3 [NbConvertApp] Writing 300558 bytes to overview_with_run.html $ open overview_with_run.html IBM Developer / © 2019 IBM Corporation 20 Jupyter NBConvert https://nbconvert.readthedocs.io/en/latest/ Jupyter NBConvert enables executing and converting notebooks to different file formats. Advantages – Support notebook chaining – Convert results to immutable formats Limitations – No support for parameters
  21. 21. Notebook Pipelines with Papermill 21IBM Developer / © 2019 IBM Corporation
  22. 22. Papermill Papermill is an open source tool contributed by Netflix which enables parameterizing, executing, and analyzing Jupyter Notebooks. Papermill lets you: - Parameterize notebooks - Execute notebooks IBM Developer / © 2019 IBM Corporation 22 input notebook orchestrator result_1.ipynb result_2.ipynb result_3.html result_4.pdf output file(s) ipynb, html, pdf
  23. 23. Papermill Papermill provides programmatic interface so you can integrate with your applications IBM Developer / © 2019 IBM Corporation 23 import papermill as pm pm.execute_notebook('input_nb.ipynb', 'outputs/20190402_run.ipynb') ... # Each run can be placed in a unique / sortable path pprint(files_in_directory('outputs')) outputs/ ... 20190401_run.ipynb 20190402_run.ipynb
  24. 24. Papermill Papermill provides a CLI that enables easy integration with external tools and simple schedulers as crontab. IBM Developer / © 2019 IBM Corporation 24 $ papermill input_notebook.ipynb outputs/{run_id}_out.ipynb $ papermill input.ipynb report.ipynb -y '{"foo":"bar"}' && jupyter nbconvert --to html report.ipynb
  25. 25. Notebook Pipelines with Apache Airflow 25IBM Developer / © 2019 IBM Corporation
  26. 26. Apache Airflow Airflow is a platform to programmatically author, schedule and monitor workflows. It’s enterprise ready and used to build large and complex workload pipelines. IBM Developer / © 2019 IBM Corporation 26 Python Code DAG (Workflow)
  27. 27. Apache Airflow Airflow is a platform to programmatically author, schedule and monitor workflows. It’s enterprise ready and used to build large and complex workload pipelines. Airflow Papermill operator enables Jupyter Notebooks to be integrated into Airflow workflows/pipelines. IBM Developer / © 2019 IBM Corporation 27 More information à https://airflow.readthedocs.io/en/latest/howto/operator/papermill.html
  28. 28. Analytic Workloads Decompose Schedule/Run
  29. 29. Analytic Workloads
  30. 30. Analytic Workloads Pipelines Summary

                                         %run      NBConvert   Papermill   Apache Airflow
         Notebook Kernels                IPython   Multiple    Multiple    Multiple
         Static versus Dynamic           Static    Dynamic     Dynamic     Dynamic
         Programmatic APIs               –         –           Yes         Yes
         Notebook Parameters             –         –           Yes         Yes
         Heterogeneous pipelines         –         –           –           Yes
  31. 31. Jupyter Notebook AI / Deep Learning Workloads 31IBM Developer / © 2019 IBM Corporation
  32. 32. AI / Deep Learning Workloads Resource intensive workloads Requires expensive hardware (GPU, TPU) Long Running training jobs – A simple MNIST model takes over one hour to train WITHOUT a decent GPU – Training other, even non-complex, deep learning models can easily take over a day WITH GPUs 32IBM Developer / © 2019 IBM Corporation
  33. 33. Training/Deploying Models requires a lot of DevOps 33 Model Serving Monitoring Resource Management Configuration Hyperparameter Optimization Reproducibility IBM Developer / © 2019 IBM Corporation
  34. 34. AI / Deep Learning Workloads Challenges • How to isolate training environments so that multiple jobs, based on different deep learning frameworks (and/or releases), can be submitted/trained at the same time. • Ability to allocate individual system-level resources such as GPUs, TPUs, etc. to different kernels for a period of time. • Ability to allocate and free up system-level resources such as GPUs, TPUs, etc. as they stop being used or when they are idle for a period of time. IBM Developer / © 2019 IBM Corporation 34
  35. 35. AI / Deep Learning Workloads Source: https://github.com/Langhalsdino/Kubernetes-GPU-Guide IBM Developer / © 2019 IBM Corporation 35 Containers and Kubernetes Platform - Containers simplify management of complicated and heterogeneous AI/Deep Learning infrastructure, providing the required isolation layer for different pods running different Deep Learning frameworks - Containers provide a flexible way to deploy applications and are here to stay - Kubernetes enables easy management of containerized applications and resources with the benefit of Elasticity and Quality of Service
  36. 36. AI Platforms AI/Deep Learning Platforms aim to abstract the DevOps tasks away from the Data Scientist, providing a consistent way to develop AI models independent of the toolkit/framework being used. IBM Developer / © 2019 IBM Corporation 36 FfDL
  37. 37. Kubeflow • ML Toolkit for Kubernetes • Open source and community driven • Support multiple ML Frameworks • End-to-end workflows that can be shared, scaled and deployed IBM Developer / © 2019 IBM Corporation 37
  38. 38. Kubeflow Pipelines Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. • End-to-end orchestration: enabling and simplifying the orchestration of machine learning pipelines. • Easy experimentation: making it easy for you to try numerous ideas and techniques and manage your various trials/experiments. • Easy re-use: enabling you to re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time. IBM Developer / © 2019 IBM Corporation 38
  39. 39. Kubeflow Pipelines IBM Developer / © 2019 IBM Corporation 39 Two key takeaways : A Pipeline and a Pipeline Component A pipeline is a description of a machine learning (ML) workflow, including all of the components of the workflow and how they work together.
  40. 40. Kubeflow Pipelines IBM Developer / © 2019 IBM Corporation 40 A pipeline component is an implementation of a pipeline task. A component represents a step in the workflow.
  41. 41. Kubeflow Pipelines IBM Developer / © 2019 IBM Corporation 41 Each pipeline component is a container that contains a program to perform the task required for that particular step of your workflow.
  42. 42. Kubeflow Pipelines IBM Developer / © 2019 IBM Corporation 42
  43. 43. AI Workloads and Kubeflow Pipelines Decompose Schedule/Run
  44. 44. Learn more about Kubeflow Pipelines IBM Developer / © 2019 IBM Corporation 44 Building a secure and transparent ML pipeline using open source technologies Animesh Singh (IBM), Svetlana Levitan (IBM), Tommy Li (IBM) 1:30pm–5:00pm Tuesday, July 16, 2019 Incorporating Artificial Intelligence Location: C123-124
  45. 45. Community Announcements IBM Developer / © 2019 IBM Corporation 45 Jupyter Notebook 6.0 Release Availability pip install --upgrade notebook
  46. 46. Community Resources IBM Developer / © 2019 IBM Corporation 46 Jupyter.org https://jupyter.org/ JupyterLab https://jupyterlab.readthedocs.io/en/stable/ Papermill https://github.com/nteract/papermill Kubeflow https://kubeflow.org https://github.com/kubeflow/
  47. 47. Thank you! @lresende1975 47IBM Developer / © 2019 IBM Corporation
  48. 48. Fabric for Deep Learning FfDL provides a scalable, resilient, and fault-tolerant deep-learning framework • Fabric for Deep Learning or FfDL (pronounced 'fiddle') is an open source project which aims at making Deep Learning easily accessible to the people to whom it matters most, i.e. Data Scientists and AI developers. • FfDL provides a consistent way to deploy, train and visualize Deep Learning jobs across multiple frameworks like TensorFlow, Caffe, PyTorch, Keras, etc. • FfDL is being developed in close collaboration with IBM Research and IBM Watson. It forms the core of Watson's Deep Learning service in open source. IBM Developer / © 2019 IBM Corporation 48 FfDL Github Page https://github.com/IBM/FfDL FfDL Technical Architecture Blog http://developer.ibm.com/code/2018/03/20/democratize-ai-with-fabric-for-deep-learning Deep Learning as a Service within Watson Studio https://www.ibm.com/cloud/deep-learning Research paper: "Scalable Multi-Framework Management of Deep Learning Training Jobs" http://learningsys.org/nips17/assets/papers/paper_29.pdf
  49. 49. 49 FfDL: Architecture 2018 / © 2018 IBM Corporation
  50. 50. 50 https://arxiv.org/abs/1709.05871 FfDL: Research Papers 2018 / © 2018 IBM Corporation
