Introducing Krylov
eBay AI Platform - Machine Learning Made Easy
GPU Technology Conference, 2018
Henry Saputra
Technical Lead for Krylov - eBay Unified AI Platform
1. Data Science and Machine Learning at eBay
2. Introducing Krylov
3. Compute Cluster and Accelerator Support with NVIDIA GPU
4. Quickstart Example
5. Future Roadmap
6. Q & A
Agenda
Data Science and Machine Learning at eBay
eBay Patterns - Tools and Frameworks
Tools
• Languages: R, Python, Scala, C++
• IDE-like: RStudio, Notebooks (Jupyter), Python IDE
• Frameworks: NumPy, SciPy, matplotlib, Scikit-learn, Spark MLlib, H2O, Weka, XGBoost, Moses
• Pipelines: Cron, Luigi, Apache Airflow, Apache Oozie
Patterns for ML Training
• Single node
• Distributed training
• Deep learning (GPUs)
Deep Learning / Distributed Training
Key takeaway = CHOICE
1. Flexibility of software
2. Flexibility of hardware configuration
1. 50%-70% is plumbing work
a. Accessing and moving secured data
b. Environment and tools setup
c. Sub-optimal compute instances - NVIDIA GPUs and high-memory/CPU instances
d. Long wait time from platform and infrastructure
2. Loss of productivity and opportunities
a. ML lifecycle management of models and features
b. Building robust model training pipelines: data preparation, algorithm, hyperparameter tuning, cross-validation
3. Collaboration almost impossible
4. Research vs Applied ML
Problems and Challenges
Introducing Krylov: Unified eBay AI Platform
● Krylov is the core project of the eBay unified AI Platform initiative, which aims to provide an easy-to-use and powerful cloud-based data science and machine learning platform.
● The objective of the project is to enable machine learning jobs with easy access to
secured-data and eBay cloud computing resources.
● The main goals for the Krylov initiative are:
○ Easy and secure access to training datasets
○ Access to compute on high-performance machines, such as GPUs, or clusters of machines
○ Familiar tools and flexible software to run machine learning model training jobs
○ Interactive data analysis and visualization, with multi-tenancy support to allow quick
prototyping of algorithms and data access
○ Sharing and collaboration of ML work between teams in eBay
Overview
ML Lifecycle Management
Lifecycle
MODEL INFERENCING: deployable, scalable
MODEL BUILDING: interactive, iterative
MODEL RE-FITTING: interactive, iterative
MODEL RE-TRAINING: interactive, iterative
Data + Lifecycle Management
MODEL TRAINING: automatable, repeatable, scalable
Krylov Staircase Design for AI Platform
eBay AI Platform Components
Infrastructure - Krylov
AI Engine - Krylov
Learning Pipelines
Model Experimentation
Data Scientist Workspaces
Model Lifecycle Management
GPU Tall instances
Fast Storage
Data: Preparation, Movement, Discovery, Access
AI Hub (Shared Repository)
AI Modules: Speech Recognition, Machine Translation, Computer Vision, Information Retrieval, Natural Language Understanding, …
Inferencing
Krylov High Level Architecture
1. Client Command Line Interface (CLI) via krylovctl program
2. ML Application and Run Specification
3. ML Pipelines: Workflow and Workspace
4. Namespaces - For quota and data isolation
5. Jobs and Runs - Managed by Krylov Tools and Minions
6. Secure Data Access - HDFS, NFS, OpenStack Swift, Custom
Krylov Main Features and Concepts
Krylov CLI - krylovctl
● A Krylov ML Application is a versioned unit of deployment that contains the declarations of the developers' programs
● Implemented as a client project that is used as the source to build a deployment artifact
● Three main parts:
○ mlapplication.json and artifact.json configuration files
○ Source code of the programs
○ Dependency management via Dockerfile
● Supported types of programs: JVM languages (Java, Scala), Python, Shell script
● Using the ML Application as the source, developers can build a deployment artifact that the Run Specification file uses to deploy it onto one of the nodes in the cluster
Krylov ML Application
{
  "tasks": {
    "prepare_data": {
      "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
      "parameters": {
        "className": "com.ebay.krylov.helloai.HelloWorld"
      }
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {
        "file": "helloai-python/helloai/helloworld.py",
        "args": []
      }
    },
    ...
Krylov ML Application Example
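A minimal sketch of how a script might read the task declarations shown in the example above. The `tasks`, `program`, and `parameters` keys come from the slide; the `list_task_programs` helper and its validation rule are assumptions made for illustration and are not part of krylovctl.

```python
import json

# Task declarations copied from the slide; the elided remainder ("...") is left out.
ML_APPLICATION = json.loads("""
{
  "tasks": {
    "prepare_data": {
      "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
      "parameters": {"className": "com.ebay.krylov.helloai.HelloWorld"}
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {"file": "helloai-python/helloai/helloworld.py", "args": []}
    }
  }
}
""")


def list_task_programs(ml_application):
    """Return {task name: program class name}.

    Hypothetical helper: requiring a "program" key per task is an assumed
    rule for this sketch, not Krylov's own validation.
    """
    programs = {}
    for name, task in ml_application["tasks"].items():
        if "program" not in task:
            raise ValueError(f"task {name!r} does not declare a program")
        programs[name] = task["program"]
    return programs


if __name__ == "__main__":
    for name, program in list_task_programs(ML_APPLICATION).items():
        print(f"{name}: {program}")
```

Each Task name maps to the class that runs it, which is how the same file can mix a JVM program and a Python program.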
● The Krylov Run Specification is a runtime configuration that adds configuration overrides and parameter passing for each Task in ML Application job submissions
● It tells the Krylov master API server which artifact created from the ML Application will be used in the compute cluster
● Defined as a runspec.json file, or passed as an argument to the krylovctl client program
● The runspec.json file also defines the compute resources, such as which NVIDIA GPUs to use, CPU, memory, and which Docker image provides the dependencies used by the ML Application programs
Krylov Run Specification
{
  "jobName": "job-sample",
  "artifact": "myartifact",
  "artifactTag": "latest",
  "mlApplication": "com.ebay.oss.krylov.workflow.app.GenericMLApplication",
  "applicationParameters": {
  },
  "tasks": {
    "prepare_data": {
      "taskParameters": {
        "prepare_data_parameter_key": "prepare_data_parameter_value"
      }
    }
  }
}
Krylov Run Specification Example
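The slides describe the Run Specification as adding overrides and parameters for each Task. The sketch below shows one way such an overlay could work. The merge semantics (run-spec `taskParameters` win over the Task's `parameters` on key conflicts) and the `apply_run_spec` helper are assumptions, since the slides do not spell out how Krylov combines the two files.

```python
import copy

# Trimmed copies of the two examples from the slides.
ML_APPLICATION = {
    "tasks": {
        "prepare_data": {
            "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
            "parameters": {"className": "com.ebay.krylov.helloai.HelloWorld"},
        }
    }
}

RUN_SPEC = {
    "jobName": "job-sample",
    "tasks": {
        "prepare_data": {
            "taskParameters": {
                "prepare_data_parameter_key": "prepare_data_parameter_value"
            }
        }
    },
}


def apply_run_spec(ml_application, run_spec):
    """Overlay run-spec taskParameters onto each task's parameters.

    Assumed semantics for illustration only: run-spec values take
    precedence, and the inputs are left unmodified.
    """
    merged = copy.deepcopy(ml_application["tasks"])
    for name, override in run_spec.get("tasks", {}).items():
        if name not in merged:
            raise KeyError(f"run spec references unknown task {name!r}")
        parameters = merged[name].setdefault("parameters", {})
        parameters.update(override.get("taskParameters", {}))
    return merged


if __name__ == "__main__":
    print(apply_run_spec(ML_APPLICATION, RUN_SPEC)["prepare_data"]["parameters"])
```

Copying before merging keeps the ML Application declaration reusable across several job submissions with different run specs.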
● The Krylov ML batch lifecycle pipeline is defined as a Krylov Workflow definition
○ Declarative
○ Default Generic Workflow
● Important concepts for Krylov Workflow:
○ Workflow - A single pipeline defined within Krylov and the unit of deployment for an ML Application
■ Each Workflow contains one or more Tasks
■ The Tasks are connected to each other in a Directed Acyclic Graph (DAG) structure
○ Task - The smallest unit of execution; it runs a developer's Program and is executed on a single machine
○ Flows - Contains one or more key-value pairs, each mapping a name to a declaration of a Task DAG
○ Flow - The key, chosen from the Flows definition, that will be run
Krylov ML Pipelines: Workflow
{
  "tasks": {
    ...
  },
  "flows": {
    "sample_flow": {
      "prepare_data":
        ["train_model"],
      "train_model":
        ["output"]
    }
  },
  "flow": "sample_flow"
}
Workflow Example in mlapplication.json
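To make the DAG idea concrete, here is a small sketch that picks the Flow named by `flow` out of `flows` and derives a valid execution order. It assumes each key lists the Tasks that run after it, which is one reading of the example; the slides do not state the edge direction. The `execution_order` helper is hypothetical and uses Python's standard `graphlib`, not Krylov's scheduler.

```python
from graphlib import CycleError, TopologicalSorter

# The flow section from the slide's mlapplication.json example.
WORKFLOW = {
    "flows": {
        "sample_flow": {
            "prepare_data": ["train_model"],
            "train_model": ["output"],
        }
    },
    "flow": "sample_flow",
}


def execution_order(workflow):
    """Return the Tasks of the selected Flow in a dependency-respecting order.

    Assumption: "a": ["b"] means b runs after a.
    """
    flow = workflow["flows"][workflow["flow"]]
    sorter = TopologicalSorter()
    for task, successors in flow.items():
        sorter.add(task)
        for successor in successors:
            # The successor depends on the task that lists it.
            sorter.add(successor, task)
    try:
        return list(sorter.static_order())
    except CycleError as err:
        raise ValueError(f"flow {workflow['flow']!r} is not a DAG") from err


if __name__ == "__main__":
    print(" -> ".join(execution_order(WORKFLOW)))
```

Rejecting cycles up front mirrors the requirement that Tasks form a Directed Acyclic Graph.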
Workflow Runs Flow
● A Workspace is an interactive web application that lets developers use a web browser for ML model prototyping, data preparation, and exploration
● The Workspace runs as Jupyter Notebook servers launched on high CPU/memory or NVIDIA GPU instances
● Enhances the JupyterHub project to allow distributed launching of multi-tenant Jupyter Notebook servers in the Krylov compute cluster using Kubernetes
● A Krylov Workspace uses a configuration file at creation time to override and customize default parameters
Krylov ML Pipelines: Workspace
Workspace Deployment Flow
Krylov Compute Cluster
Krylov Cluster Infrastructure
Krylov Compute Cluster Deployment
● Metrics - Grafana, InfluxDB, and Telegraf for GPU monitoring
Krylov Cluster Monitoring
Krylov Metrics Management Flow
Krylov Compute Resources Management
Quickstart Example
1. Download the krylovctl program from the Krylov release repository
2. Run `krylovctl project create` to create a new project on the local machine
3. Update or add code to the Krylov project for the machine learning programs
4. Register each of them as a Program within a Task in mlapplication.json
5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG)
6. Run `krylovctl project build` to build the project
7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file
8. Run `krylovctl artifact upload` to upload the artifact file for remote execution
9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the compute cluster
Steps to Submit Krylov Workflow Job with CLI
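Steps 6 to 9 above can be scripted. The sketch below only assembles the subcommands named on the slide; any arguments or flags they need are not shown there and are deliberately left out. The `build_and_run_commands` and `run_steps` helpers are hypothetical, and the script falls back to a dry run when `krylovctl` is not on the PATH.

```python
import shutil
import subprocess


def build_and_run_commands(remote=False):
    """Return the krylovctl invocations for steps 6-9, in the slide's order."""
    last = ["krylovctl", "job", "submit"] if remote else ["krylovctl", "job", "run"]
    return [
        ["krylovctl", "project", "build"],
        ["krylovctl", "artifact", "create"],
        ["krylovctl", "artifact", "upload"],
        last,
    ]


def run_steps(remote=False, dry_run=True):
    """Run each step, stopping at the first failure; return the command lines.

    With dry_run=True, or when krylovctl is not installed, nothing is executed.
    """
    commands = build_and_run_commands(remote)
    if not dry_run and shutil.which("krylovctl") is not None:
        for command in commands:
            # check=True raises CalledProcessError if a step fails.
            subprocess.run(command, check=True)
    return [" ".join(command) for command in commands]


if __name__ == "__main__":
    for line in run_steps(remote=True, dry_run=True):
        print(line)
```

Keeping the command list separate from the runner makes it easy to print the plan before executing anything against the cluster.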
● Here we go ...
Demo Time
Future Roadmap
1. Inferencing Platform
2. Exploration and documentation of RESTful APIs for job management
3. Data Source and Dataset abstraction via Krylov SDKs
4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation
5. Distributed Deep Learning
6. AutoML - Hyperparameter Tuning
7. AI Hub to share ML Applications and Datasets
Future Roadmap
Questions?

S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and Engineering Teams

  • 1. Introducing Krylov eBay AI Platform - Machine Learning Made Easy GPU Technology Conference, 2018 Henry Saputra Technical Lead for Krylov - eBay Unified AI Platform
  • 2. Agenda: 1. Data Science and Machine Learning at eBay 2. Introducing Krylov 3. Compute Cluster and Accelerator Support with Nvidia GPU 4. Quickstart Example 5. Future Roadmap 6. Q & A
  • 3. Data Science and Machine Learning at eBay
  • 4. eBay Patterns - Tools and Frameworks Tools • Languages: R, Python, Scala, C++ • IDE-like: RStudio, Notebooks (Jupyter), Python IDE • Frameworks: NumPy, SciPy, matplotlib, Scikit-learn, Spark MLlib, H2O, Weka, XGBoost, Moses • Pipelines: Cron, Luigi, Apache Airflow, Apache Oozie Patterns for ML Training • Single node • Distributed training • Deep learning (GPUs) Key takeaway = CHOICE 1. Flexibility of software 2. Flexibility of hardware configuration
  • 5. Problems and Challenges: 1. 50%-70% is plumbing work a. Accessing and moving secured data b. Environment and tools setup c. Sub-optimal compute instances - NVIDIA GPUs and high-memory/CPU instances d. Long wait times for platform and infrastructure 2. Loss of productivity and opportunities a. ML lifecycle management of models and features b. Building robust model training pipelines: data preparation, algorithm, hyperparameter tuning, cross validation 3. Collaboration is almost impossible 4. Research vs. applied ML
  • 6. Introducing Krylov: Unified eBay AI Platform
  • 7. Overview ● Krylov is the core project of the eBay unified AI Platform initiative to enable an easy-to-use, powerful cloud-based data science and machine learning platform. ● The objective of the project is to enable machine learning jobs with easy access to secured data and eBay cloud computing resources. ● The main goals for the Krylov initiative are: ○ Easy and secure access to training datasets ○ Access to compute on high-performance machines, such as GPUs, or clusters of machines ○ Familiar tools and flexible software to run machine learning model training jobs ○ Interactive data analysis and visualization, with multi-tenancy support to allow quick prototyping of algorithms and data access ○ Sharing and collaboration of ML work between teams at eBay
  • 8. ML Lifecycle Management (diagram): Model Building (interactive, iterative); Model Training (automatable, repeatable, scalable); Model Inferencing (deployable, scalable); Model Re-fitting (interactive, iterative); Model Re-training (interactive, iterative); Data + Lifecycle Management
  • 9. Krylov Staircase Design for AI Platform
  • 10. eBay AI Platform Components (diagram): Infrastructure - Krylov; AI Engine - Krylov; Learning Pipelines; Model Experimentation; Data Scientist Workspaces; Model Lifecycle Management; GPU; Tall instances; Fast Storage; Data: Preparation, Movement, Discovery, Access; AI Hub (Shared Repository); AI Modules: Speech Recognition, Machine Translation, Computer Vision, Information Retrieval, Natural Language Understanding, …; Inferencing
  • 11. Krylov High Level Architecture
  • 12. Krylov Main Features and Concepts: 1. Client Command Line Interface (CLI) via krylovctl program 2. ML Application and Run Specification 3. ML Pipelines: Workflow and Workspace 4. Namespaces - For quota and data isolation 5. Jobs and Runs - Managed by Krylov Tools and Minions 6. Secure Data Access - HDFS, NFS, OpenStack Swift, Custom
  • 13. Krylov CLI - krylovctl
  • 14. Krylov ML Application ● A Krylov ML Application is a versioned unit of deployment that contains the declaration of the developers’ programs ● Implemented as a client project used as the source to build a deployment artifact ● Three main parts: ○ mlapplication.json and artifact.json configuration files ○ Source code of the programs ○ Dependency management via Dockerfile ● Supported types of programs: JVM languages (Java, Scala), Python, Shell script ● Using the ML Application as source, developers can build a deployment artifact that the Run Specification file can use to deploy it onto one of the nodes in the cluster
  • 15. Krylov ML Application Example: { "tasks": { "prepare_data": { "program": "com.ebay.oss.krylov.workflow.JvmMainProgram", "parameters": { "className": "com.ebay.krylov.helloai.HelloWorld" } }, "train_model": { "program": "com.ebay.oss.krylov.workflow.PythonProgram", "parameters": { "file": "helloai-python/helloai/helloworld.py", "args": [] } }, ...
  • 16. Krylov Run Specification ● The Krylov Run Specification is a runtime configuration that adds configuration overrides and parameter passing for each Task in ML Application job submissions ● It tells the Krylov master API server which artifact, created from the ML Application, will be used in the compute cluster ● Defined as a runspec.json file, or passed as an argument to the krylovctl client program ● The runspec.json file also defines the compute resources, such as which NVIDIA GPUs to use, CPU, memory, and which Docker image provides the dependencies used by the ML Application programs
  • 17. Krylov Run Specification Example: { "jobName": "job-sample", "artifact": "myartifact", "artifactTag": "latest", "mlApplication": "com.ebay.oss.krylov.workflow.app.GenericMLApplication", "applicationParameters": { }, "tasks": { "prepare_data": { "taskParameters": { "prepare_data_parameter_key": "prepare_data_parameter_value" } } } }
  • 18. Krylov ML Pipelines: Workflow ● The Krylov ML batch lifecycle pipeline is defined as a Krylov Workflow definition ○ Declarative ○ Default Generic Workflow ● Important concepts for Krylov Workflow: ○ Workflow - A single pipeline defined within Krylov and the unit of deployment for an ML Application ■ Each Workflow contains one or more Tasks ■ The Tasks are connected to each other in a Directed Acyclic Graph (DAG) structure ○ Task - The smallest unit of execution; it runs a developer's Program and is executed on a single machine ○ Flows - Contains one or more key-value pairs, each mapping a name to a declaration of a DAG of Tasks ○ Flow - The key chosen to run from the possible selections in the Flows definition
  • 19. Workflow Example in mlapplication.json: { "tasks": { ... }, "flows": { "sample_flow": { "prepare_data": ["train_model"], "train_model": ["output"] } }, "flow": "sample_flow" }
  • 21. Krylov ML Pipelines: Workspace ● A Workspace is an interactive web application that lets developers use a web browser for ML model prototyping, data preparation, and exploration ● The Workspace runs as Jupyter Notebook servers launched on high-CPU/memory or NVIDIA GPU instances ● Krylov enhances the JupyterHub project to allow distributed launching of multi-tenant Jupyter Notebook servers in the Krylov compute cluster using Kubernetes ● A Krylov Workspace uses a configuration file at creation time to override and customize default parameters
  • 26. Krylov Cluster Monitoring ● Metrics - Grafana, InfluxDB, and Telegraf for GPU monitoring
  • 30. Steps to Submit Krylov Workflow Job with CLI: 1. Download the krylovctl program from the Krylov release repository 2. Run `krylovctl project create` to create a new project on the local machine 3. Update or add code to the Krylov project for the machine learning programs 4. Register them as a Program within a Task in mlapplication.json 5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG) 6. Run `krylovctl project build` to build the project 7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file 8. Run `krylovctl artifact upload` to upload the artifact file for remote execution 9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the computing cluster
  • 31. ● Here we go ... Demo Time
  • 33. Future Roadmap: 1. Inferencing Platform 2. Exploration and documentation of RESTful APIs for job management 3. Data Source and Dataset abstraction via Krylov SDKs 4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation 5. Distributed Deep Learning 6. AutoML - Hyperparameter Tuning 7. AI Hub to share ML Applications and Datasets
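
The Workflow example on slide 19 can be read as a small DAG. The sketch below is an illustration only, not Krylov code. It assumes that each key in a Flow names a Task and that each value lists the Tasks that run after it; the slides do not state this direction explicitly. The helper `execution_order` is hypothetical, and the `tasks` section is left out because the slide elides it. It uses Python's standard `graphlib` module (Python 3.9+) to derive one valid run order for the selected Flow.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The "flows" and "flow" entries copied from the slide 19 example.
# The "tasks" section is omitted here because the slide elides it.
MLAPPLICATION = """
{
  "flows": {
    "sample_flow": {
      "prepare_data": ["train_model"],
      "train_model": ["output"]
    }
  },
  "flow": "sample_flow"
}
"""


def execution_order(app: dict) -> list:
    """Return one valid Task order for the Flow selected by app["flow"].

    Assumption: each key maps a Task to the Tasks that run after it.
    """
    flow = app["flows"][app["flow"]]
    sorter = TopologicalSorter()
    for task, downstream in flow.items():
        sorter.add(task)  # register the Task even if nothing follows it
        for next_task in downstream:
            # TopologicalSorter.add(node, *predecessors):
            # next_task may only run once task has finished.
            sorter.add(next_task, task)
    # static_order() raises graphlib.CycleError if the Flow is not a DAG.
    return list(sorter.static_order())


order = execution_order(json.loads(MLAPPLICATION))
print(order)
```

For a linear chain like `sample_flow` there is only one valid order. With branching Flows, `static_order()` returns just one of several valid orders.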