
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017

IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale. Behind this service, various components power the platform, including Jupyter Notebooks, an enterprise gateway that manages the execution of the Jupyter kernels, and an Apache Spark cluster that powers the computation. In this session we will describe our experience and best practices in putting together this analytical platform as a service based on Jupyter Notebooks and Apache Spark, in particular how we built the Enterprise Gateway that enables all the notebooks to share the Spark cluster’s computational resources.



  1. IBM Spark Technology Center | Big Data Spain – Nov 2017
     The Analytic Platform behind IBM’s Watson Data Platform
     Luciano Resende, IBM | Spark Technology Center
  2. About me - Luciano Resende
     Data Science Platform Architect – IBM – Spark Technology Center
     • Have been contributing to open source at the ASF for over 10 years
     • Currently contributing to the Jupyter Notebook ecosystem, Apache Bahir, Apache Spark and Apache Toree, among other projects related to the Apache Spark ecosystem
     • lresende@apache.org
     • http://lresende.blogspot.com/
     • https://www.linkedin.com/in/lresende
     • @lresende1975
     • https://github.com/lresende
  3. Open Source Community Leadership
     • Spark Technology Center
     • Founding Partner
     • 188+ Project Committers
     • 77+ Projects
     • Key open source steering committee memberships
     • OSS Advisory Board
  4. IBM Spark Technology Center
     Founded in 2015.
     Location:
     • Physical: 505 Howard St., San Francisco, CA
     • Web: http://spark.tc
     • Twitter: @apachespark_tc
     Mission:
     • Contribute intellectual and technical capital to the Apache Spark community.
     • Make the core technology enterprise- and cloud-ready.
     • Build data science skills to drive intelligence into business applications (http://bigdatauniversity.com)
     Key statistics:
     • About 40 developers, co-located with 25 IBM designers.
     • Major contributions to Apache Spark: http://jiras.spark.tc
     • Apache SystemML is now a top-level Apache project!
     • Founding member of UC Berkeley AMPLab and RISE Lab
     • Member of R Consortium and Scala Center
  5. Agenda
     • IBM Data Science Experience
     • IBM Analytics Engine
     • Challenges faced building an Analytic Platform
     • Jupyter Enterprise Gateway
     • References
  6. IBM Data Science Experience (DSX): Be a better data scientist
     IBM Data Science Experience is an environment that brings together everything a data scientist needs to be more productive, including tools, data and content.
  7. DSX is built on a foundation of open source, primarily Jupyter notebooks
     Notebooks are interactive computational environments in which you can combine code execution, rich text, mathematics, plots and rich media.
  8. Jupyter Notebook Platform Architecture
     • The Notebook UI runs in the browser
     • The Notebook Server serves the ’Notebooks’
     • Kernels interpret and execute cell contents
       • They are responsible for code execution
       • They abstract different languages
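As a rough illustration of how kernels abstract different languages: the Notebook Server discovers kernels through small `kernel.json` kernel specs that tell it which command to launch. The sketch below builds one for the IPython kernel. The display name and connection file path are illustrative, and `launch_command` is a hypothetical helper that simplifies what a server does.

```python
import json

# A kernel spec as stored in a kernel.json file. The server substitutes
# {connection_file} with the path of a file describing the kernel's ports.
python_kernel_spec = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 3",
    "language": "python",
}


def launch_command(spec, connection_file):
    """Hypothetical helper: expand the argv template the way a server would."""
    return [arg.format(connection_file=connection_file) for arg in spec["argv"]]


kernel_json = json.dumps(python_kernel_spec, indent=2)
command = launch_command(python_kernel_spec, "/tmp/kernel-1234.json")
print(kernel_json)
print(command)
```

A kernel for another language would differ only in `argv` and `language`, which is why the notebook UI does not need to know what runs behind it.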
  9. Follow-ups
     • TRY IT: datascience.ibm.com
     • Event registration URL: https://ibm.biz/BdjJUw
  10. IBM Analytics Engine
  11. IBM Analytics Engine - Characteristics
     • IBM Analytics Engine is built on open source Apache Hadoop and Apache Spark. It gives users the flexibility of open source and an opportunity to expand on their existing open source investments.
     • IBM Analytics Engine helps data scientists, data engineers and developers focus on building data models and business solutions, while simplifying cluster administration through easy-to-use interfaces for management and integration.
     • IBM Analytics Engine deploys clusters in minutes with enterprise-level security, reliability, and powerful integration capabilities for data management, monitoring and dashboards.
  12. Capabilities
     Separation of compute and storage
     • Scale compute and storage independently for better economics
     • Separating compute and storage ensures no data loss in case of cluster failure
     • Ease of incorporating patches or upgrades by creating new clusters
     • Spin up use-case-specific clusters using different instance sizes for different use cases
     • Uniform governance and collaboration through WDP services
     Ease of use and administration
     • Access and administer through multiple interfaces: Cloud Foundry CLI, REST APIs on a public interface, and GUI
     • Enhanced flexibility for configuring clusters, including installing third-party libraries through bootstrap scripts
     • Deploy and scale clusters within minutes, in a few clicks, including propagating libraries and configurations to all nodes of the cluster
  13. Capabilities (* = roadmap item)
     Enhanced reliability and security
     • ‘Auto-heal’ capability recovers processes from failure *
     • Geo-replicated object store for disaster avoidance
     • Encrypted object store, data-at-rest, and data-in-motion encryption * provide enhanced levels of security
     Flexibility and innovation of open source
     • Built on an ODPi-compliant Apache Spark and Apache Hadoop stack for portability between open source environments
     • Integrate analytics tools using standard, open source libraries and drivers
  14. Enterprise/Cloud Analytics Platform Characteristics
     Large pool of shared computing resources
     • Enterprise cloud, public cloud or hybrid
     • Data in the cloud (data lakes/object storage)
     Distributed consumers
     • Notebooks running locally (on the user’s laptop) or as a service
     Different resource utilization patterns
     • High number of idle resources
  15. Analytics Platform – Current state of the art
     Open source Jupyter-based notebook platform
     • Single user sharing the same distributed filesystem and privileges
     • Jupyter kernels run as local processes
     • Resources are limited to what is available on the single node that runs all kernels and their associated Spark drivers
     • No security: users can see and control each other’s processes using Jupyter’s administration utilities
  16. Analytics Platform Today – Shared Cluster
     Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels inside the cluster, sharing its resources.
     • All Jupyter kernels run under a shared “service” user ID.
     • Users can see and control each other’s kernels using Jupyter’s administration utilities.
     • All kernels and their associated Spark drivers run on a single (configurable) node of the cluster.
     [Diagram labels: Bob’s Desktop; Multiple Notebooks; Jupyter Notebook Server (with NB2KG); Security Layer; Spark Cluster; Jupyter Kernel Gateway (sandboxed by service user privileges); Kernel [Spark Driver] (yarn-client mode as JNBG Service User); Spark Executors (as JNBG Service User); YARN Workers]
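To make the notebook-to-gateway hop concrete: a Notebook Server extended with NB2KG forwards kernel operations to the gateway over HTTP and WebSockets, and the Jupyter REST API starts a kernel with `POST /api/kernels`. The sketch below only builds that request and does not send it. The gateway address is a hypothetical placeholder.

```python
import json
import urllib.request

# Hypothetical gateway address; a real deployment configures its own.
GATEWAY_URL = "http://gateway.example.com:8888"


def start_kernel_request(kernel_name):
    """Build (but do not send) the request that asks the gateway for a kernel."""
    body = json.dumps({"name": kernel_name}).encode("utf-8")
    return urllib.request.Request(
        url=GATEWAY_URL + "/api/kernels",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = start_kernel_request("python3")
print(request.get_method(), request.full_url)
# Sending it with urllib.request.urlopen(request) would return JSON
# describing the new kernel, including its id.
```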
  17. Analytics Platform Today – Single User Cluster
     Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels in a cluster created specifically for the user.
     • Expensive, as a cluster is created for every individual user
     [Diagram labels: Bob’s Desktop; Multiple Notebooks; Jupyter Notebook Server (with NB2KG); Spark Cluster; Jupyter Kernel Gateway (sandboxed by service user privileges); Kernel [Spark Driver] (yarn-client mode as JNBG Service User); Spark Executors (as JNBG Service User); YARN Workers]
  18. Jupyter Enterprise Gateway
  19. Jupyter Enterprise Gateway
     A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across an Apache Spark cluster, aimed at Enterprise/Cloud requirements and use cases.
  20. Jupyter Enterprise Gateway – Goals
     Optimized resource allocation
     • Run Spark in YARN cluster mode to better utilize cluster resources.
     • Pluggable architecture for additional resource managers
     Enhanced security
     • Enable TLS for all socket communications
     • Any HTTP communication should be encrypted (SSL)
     Multiuser support with user impersonation
     • Enhance security and sandboxing by enabling user impersonation when running kernels.
     • Individual HDFS home folder for each notebook user.
     • Use the same user ID for notebook and batch jobs.
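The first and last goals map onto standard `spark-submit` options: `--master yarn --deploy-mode cluster` places the driver inside the cluster, and `--proxy-user` runs the application as another user. The sketch below assembles such a command line. It is not how Enterprise Gateway itself launches kernels, and `launch_kernel.py` is a hypothetical launcher script.

```python
def spark_submit_args(launcher_script, user=None):
    """Assemble a spark-submit command line for a kernel launcher.

    launcher_script is a hypothetical script that starts the kernel.
    """
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if user is not None:
        # --proxy-user asks Spark to run the application as this user.
        args += ["--proxy-user", user]
    args.append(launcher_script)
    return args


command = spark_submit_args("launch_kernel.py", user="alice")
print(" ".join(command))
```

Impersonation through `--proxy-user` also requires the submitting service account to be allowed to act as a proxy user in the Hadoop configuration.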
  21. Jupyter Enterprise Gateway: Supported Platforms
     • Python / Spark 2.x using the IPython kernel
       • With Spark Context delayed initialization
     • Scala 2.11 / Spark 2.x using the Apache Toree kernel
       • With Spark Context delayed initialization
     • R / Spark 2.x with IRkernel
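Delayed initialization means the kernel becomes usable before the Spark context is ready, and only the first cell that needs the context waits for it. The sketch below shows the general pattern with a background thread. `LazyContext` and `fake_spark_context` are hypothetical stand-ins, not code from the kernels listed above.

```python
import threading


class LazyContext:
    """Start building an expensive object in the background; block on first use."""

    def __init__(self, factory):
        self._value = None
        self._thread = threading.Thread(target=self._build, args=(factory,))
        self._thread.start()

    def _build(self, factory):
        self._value = factory()

    def get(self):
        # Wait for the background initialization only when the value is needed.
        self._thread.join()
        return self._value


def fake_spark_context():
    """Stand-in for real Spark context creation, which can take many seconds."""
    return {"app_name": "notebook-kernel"}


sc = LazyContext(fake_spark_context)  # returns immediately
print(sc.get()["app_name"])
```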
  22. Jupyter Enterprise Gateway
     [Chart: Kernel scalability comparison, cluster mode vs. client mode]
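The reasoning behind such a comparison can be sketched with simple arithmetic: in client mode every Spark driver shares the memory of one node, while in cluster mode drivers are spread across all workers. The numbers below are illustrative assumptions, not measurements from the talk, and the model ignores executor memory entirely.

```python
def max_kernels_client_mode(gateway_node_memory_gb, driver_memory_gb):
    """In client mode every driver shares the single gateway node."""
    return gateway_node_memory_gb // driver_memory_gb


def max_kernels_cluster_mode(node_count, node_memory_gb, driver_memory_gb):
    """In cluster mode drivers are spread across all worker nodes."""
    return node_count * (node_memory_gb // driver_memory_gb)


# Illustrative numbers only, not measurements from the talk.
client = max_kernels_client_mode(gateway_node_memory_gb=64, driver_memory_gb=2)
cluster = max_kernels_cluster_mode(node_count=4, node_memory_gb=64, driver_memory_gb=2)
print(client, cluster)
```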
  23. Jupyter Enterprise Gateway: Functionality
     • Enables running kernels remotely in a cluster
     • Pluggable kernel lifecycle management
     • Enhanced security
     • Multiuser support leveraging user impersonation
     [Diagram labels: Jupyter Notebook Server; Jupyter Kernel Gateway; Jupyter Enterprise Gateway]
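One way to picture pluggable kernel lifecycle management is an interface that each resource manager implements. The sketch below is a hypothetical design, not the Enterprise Gateway API: `KernelLifecycleManager` and `InMemoryManager` are invented names, and the in-memory implementation only records state where a real plug-in would talk to YARN or another resource manager.

```python
from abc import ABC, abstractmethod


class KernelLifecycleManager(ABC):
    """Hypothetical plug-in interface; not the Enterprise Gateway API."""

    @abstractmethod
    def launch(self, kernel_id, user): ...

    @abstractmethod
    def is_alive(self, kernel_id): ...

    @abstractmethod
    def kill(self, kernel_id): ...


class InMemoryManager(KernelLifecycleManager):
    """Toy implementation that only records state, standing in for YARN."""

    def __init__(self):
        self._running = {}

    def launch(self, kernel_id, user):
        self._running[kernel_id] = user

    def is_alive(self, kernel_id):
        return kernel_id in self._running

    def kill(self, kernel_id):
        self._running.pop(kernel_id, None)


manager = InMemoryManager()
manager.launch("kernel-1", user="alice")
alive_before = manager.is_alive("kernel-1")
manager.kill("kernel-1")
alive_after = manager.is_alive("kernel-1")
print(alive_before, alive_after)
```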
  24. Jupyter Enterprise Gateway
     • Multitenancy
     • Remote kernels and kernel lifecycle management
     • Impersonation: Alice’s kernel runs under Alice’s user ID.
     [Diagram labels: Spark Cluster; Security Layer; Jupyter Enterprise Gateway; YARN Workers; three YARN containers, each with a Jupyter Kernel (Spark Driver) and its Spark Executors]
  25. Jupyter Enterprise Gateway – Roadmap
     • Kernel configuration profiles
       • Enable clients to request different resource configurations for kernels (e.g. small, medium, large)
       • Profiles should be defined by administrators and enabled for a user or group of users
     • Administration UI
       • Dashboard with running kernels and administration actions
       • Time running, stop/kill, profile management, etc.
     • Add support for other resource managers
     • User environments
     • High availability
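Since kernel configuration profiles are a roadmap item, the following is only a guess at how they might look: named profiles that expand to standard Spark properties (`spark.executor.instances`, `spark.executor.memory`), with an administrator-defined list of who may use which. The profile values, user names and `spark_conf_args` helper are all hypothetical.

```python
# Hypothetical profiles; the values are illustrative, not recommendations.
PROFILES = {
    "small": {"spark.executor.instances": 2, "spark.executor.memory": "2g"},
    "medium": {"spark.executor.instances": 4, "spark.executor.memory": "4g"},
    "large": {"spark.executor.instances": 8, "spark.executor.memory": "8g"},
}

# Hypothetical administrator-defined mapping of users to allowed profiles.
ALLOWED = {"alice": {"small", "medium"}, "bob": {"small"}}


def spark_conf_args(user, profile):
    """Return --conf arguments for a profile, or raise if it is not permitted."""
    if profile not in ALLOWED.get(user, set()):
        raise PermissionError(f"{user} may not use profile {profile!r}")
    args = []
    for key, value in PROFILES[profile].items():
        args += ["--conf", f"{key}={value}"]
    return args


print(spark_conf_args("alice", "medium"))
```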
  26. Jupyter Enterprise Gateway
     • Jupyter Enterprise Gateway at IBM Code: https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/
     • Jupyter Enterprise Gateway on GitHub: https://github.com/jupyter-incubator/enterprise_gateway
     • Jupyter Enterprise Gateway documentation: http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
     • Jupyter Enterprise Gateway 0.7 release coming out today
