The Krylov project is the key component of eBay's AI Platform initiative: an easy-to-use, open, and fast AI orchestration engine deployed as a managed service in the eBay cloud.
Using Krylov, AI scientists can access eBay's massive datasets; build and train AI models; spin up powerful compute (high-memory or GPU instances) on the Krylov compute cluster; and set up machine learning pipelines using declarative constructs that stitch together the pipeline lifecycle.
SEBA: SDN Enabled Broadband Access - Transporting SDN principles to PON Networks (Liz Warner)
SEBA is both a Reference Design and an exemplar implementation based on that design. This talk focuses on the exemplar implementation developed by ONF, AT&T's Atlanta Foundry, and the SEBA and VOLTHA community, with origins in R-CORD and composed of VOLTHA, ONOS apps, and other components. We will talk about how they all fit together in a modular way, and a quick demo will show current and future developments in SEBA.
Cloud native architecture is emerging for Telecom workloads. To support these emerging trends, Intel is targeting enhancements to the Dataplane Development Kit (DPDK). The enhancements would target network service mesh with dedicated sidecar accelerators and the mechanism to build the mesh dynamically.
Speaker: Gerald Rogers. Gerald Rogers is a Principal Engineer in the Network Products Group focused on virtual switching, network function virtualization and Data Plane Development Kit (DPDK). After joining Intel in 2005, Gerald has worked as a software engineer and architect in the embedded and networking groups. For the past 7 years Gerald has led the network virtual switching software and hardware acceleration effort to drive Intel architecture into the networking and telecommunications industry. Gerald holds a Bachelor’s degree in Electrical Engineering and a Master’s degree in Computer Science, and has 20 years of experience in the networking and telecommunications industry.
Slides presented by Jeff Squyres at the 2015 OpenFabrics Software Developers' Workshop. This talk discusses Cisco's experiences implementing an ultra-low-latency Ethernet plugin/provider for the Linux Verbs API and for the Libfabric API.
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow (IT Arena)
Kostiantyn Bokhan, a technical lead at N-iX, focuses on data science projects. He leads data science projects in several areas: computer vision, NLP, and signal processing, and consults clients on digital transformation with AI. In his free time, he conducts research in deep learning. Kostiantyn has been an associate professor and faculty member at several universities since 2002. His research focuses on machine learning, deep learning, and signal and image processing. He received a PhD in network and telecommunications systems, with research in digital signal processing, in 2013. He has served on the scientific committees and review boards of several conferences.
Talk Overview:
Applying machine learning to make business applications and services intelligent is more than just training models and serving them. It requires implementing end-to-end, continuously repeatable cycles of training, testing, deploying, monitoring, and operating the models. Continuous delivery for machine learning (CD4ML) is a technique that enables reliable end-to-end cycles of developing, deploying, and monitoring machine learning models. Many tools and frameworks can be used to implement CD4ML; one of them is Kubeflow. This talk describes our experience using Kubeflow to implement CD4ML for the manufacturing domain on Azure Kubernetes Service.
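The train/test/deploy cycle described above can be sketched in a few lines of plain Python. Everything here (function names, the accuracy metric, the promotion threshold) is illustrative, not a Kubeflow or Azure API; the point is the automated quality gate between training and deployment:

```python
# Minimal sketch of a CD4ML-style quality gate: a newly trained model is
# promoted to deployment only if it beats the current production metric.
# All names and thresholds are illustrative.

def train_model(data):
    # Stand-in for a real training step; returns a "model" and its accuracy.
    accuracy = sum(data) / len(data)
    return {"weights": data}, accuracy

def quality_gate(candidate_acc, production_acc, min_improvement=0.01):
    # Promote only when the candidate clearly improves on production.
    return candidate_acc >= production_acc + min_improvement

def cd4ml_cycle(data, production_acc):
    model, acc = train_model(data)
    if quality_gate(acc, production_acc):
        return "deployed", acc
    return "rejected", acc

status, acc = cd4ml_cycle([0.9, 0.8, 1.0], production_acc=0.85)
print(status)  # candidate beats production by at least min_improvement
```

In a real CD4ML setup, each of these functions would be a pipeline step (e.g. a Kubeflow Pipelines component) triggered on every change, so the gate runs continuously rather than manually.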
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX] (Animesh Singh)
Kubeflow Pipelines and TensorFlow Extended (TFX) together form an end-to-end platform for deploying production ML pipelines. It provides a configuration framework and shared libraries to integrate the common components needed to define, launch, and monitor your machine learning system. In this talk we describe how to run TFX in hybrid cloud environments.
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da... (Abhinav Joshi)
This deck provides an overview of containers and Kubernetes and how these technologies can help solve the challenges faced by data scientists, ML engineers, and application developers. Next, it showcases the key capabilities required in a container and Kubernetes platform to help data scientists easily use technologies like Jupyter Notebooks, ML frameworks, and programming languages to innovate faster. Finally, it discusses the available platform options (e.g., Kubeflow, Open Data Hub) and some examples of how data scientists are accelerating their ML initiatives with a container and Kubernetes platform.
Scaling AI/ML with Containers and Kubernetes (Tushar Katarki)
AI is popular yet faces several challenges in the industry: 1) self-service and automation, 2) deployment into production, and 3) access to data. These challenges can be addressed with containers and Kubernetes, which help you build AI-as-a-service with open source tools. Data scientists can use the service for data, for experimentation, and to deliver models into production iteratively with self-service and automation. Using Kubernetes, one can run massive machine learning pipelines in an automated, repeatable fashion.
Day 13 - Creating Data Processing Services | Train the Trainers Program (FIWARE)
This technical session for Local Experts in Data Sharing (LEBDs) explains how to create data processing services that are key to i4Trust.
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli... (ScyllaDB)
SmartDeployAI builds data workflow pipelines for running large-scale Industrial IoT applications. Their software platform is a shared multi-tenant Kubernetes cluster environment where multiple workflow pipelines can be bootstrapped and scheduled to run concurrently. Learn how IoT sensors and devices are provisioned on their platform. This process requires them to track markers and parameters in their metadata store in order to run the various pipeline models. They need to persist this data and make it available throughout the entire data workflow pipeline life-cycle.
Learn how their journey led to Scylla, and how they minimized latencies, maintained data storage isolation for each workflow pipeline in a shared Kubernetes cluster, bootstrapped pipeline artifacts and resources on demand, and reduced their resource consumption footprint.
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure (Fei Chen)
ML platform meetups are quarterly meetups, where we discuss and share advanced technology on machine learning infrastructure. Companies involved include Airbnb, Databricks, Facebook, Google, LinkedIn, Netflix, Pinterest, Twitter, and Uber.
Revolutionary container-based hybrid cloud solution for ML Platform
Ness' data science platform, NextGenML, puts the entire machine learning process (modelling, execution, and deployment) in the hands of data science teams.
The entire approach centers on collaboration around AI/ML, implemented with full respect for best practices and a commitment to innovation.
Infrastructure: Kubernetes (on-prem) + Docker, Azure Kubernetes Service (AKS), Nexus, Azure Container Registry (ACR), GlusterFS
Workflow: Argo -> Kubeflow
DevOps: Helm, ksonnet, Kustomize, Azure DevOps
Code Management & CI/CD: Git, TeamCity, SonarQube, Jenkins
Security: MS Active Directory, Azure VPN, Dex (K8s) integrated with GitLab
Machine Learning: TensorFlow (model training, TensorBoard, serving), Keras, Seldon
Storage (Azure): Storage Gen1 & Gen2, Data Lake, File Storage
ETL (Azure): Databricks, Spark on K8s, Data Factory (ADF), HDInsight (Kafka and Spark), Service Bus (ASB), Lambda functions & VMs, Cache for Redis
Monitoring and Logging: Grafana, Prometheus, GrayLog
When it comes to large-scale data processing and machine learning, Apache Spark is no doubt one of the top battle-tested frameworks out there for handling batched or streaming workloads. The ease of use, built-in machine learning modules, and multi-language support make it a very attractive choice for data wonks. However, bootstrapping and getting off the ground can be difficult for most teams without leveraging a Spark cluster that is pre-provisioned and provided as a managed service in the cloud. While this is a very attractive way to get going, in the long run it can be a very expensive option if it is not well managed.
As an alternative to this approach, our team has been exploring and working a lot with running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This provides an infrastructure-agnostic abstraction layer for us, and as a result, it improves our operational efficiency and reduces our overall compute cost. Most importantly, we can easily target our Spark workload deployment to run on any major Cloud or On-prem infrastructure (with Kubernetes as the common denominator) by just modifying a few configurations.
In this talk, we will walk you through the process our team follows to run a production deployment of our machine learning workloads and pipelines on Kubernetes, which seamlessly allows us to port our implementation from a local Kubernetes setup on a laptop during development to either an on-prem or cloud Kubernetes environment.
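The "modify a few configurations" idea can be illustrated with a small helper that assembles a `spark-submit` command targeting Kubernetes. The `--master k8s://…`, `--deploy-mode`, and `spark.kubernetes.container.image` options are real Spark flags; the cluster URLs and image names below are placeholders, not real endpoints:

```python
# Sketch of retargeting the same Spark job at a local, on-prem, or cloud
# Kubernetes cluster by changing only configuration values.

def spark_submit_cmd(master_url, image, app_jar, extra_conf=None):
    cmd = [
        "spark-submit",
        "--master", f"k8s://{master_url}",
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
    ]
    for key, value in (extra_conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd

# Same job, different targets: only the config changes.
local = spark_submit_cmd("https://localhost:6443", "ml-spark:dev", "job.jar")
cloud = spark_submit_cmd("https://prod.example:443", "ml-spark:1.0", "job.jar",
                         {"spark.executor.instances": "10"})
```

In practice the returned list would be handed to a process runner (or the equivalent values placed in a CI/CD pipeline's variables), which is what makes the deployment target a configuration detail rather than a code change.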
Session 8 - Creating Data Processing Services | Train the Trainers Program (FIWARE)
This technical session for Local Experts in Data Sharing (LEBDs) explains how to create data processing services that are key to i4Trust.
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage (Animesh Singh)
With the sheer breadth of functionality that needs to be addressed in the machine learning world around building, training, serving, and managing models, getting it done in a consistent, composable, portable, and scalable manner is hard. The Kubernetes framework is well suited to address these issues, which is why it's a great foundation for deploying ML workloads. Kubeflow is designed to take advantage of these benefits. In this talk, we address how to make it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere, and to support the full machine learning lifecycle using open source technologies like Kubeflow, TensorFlow, PyTorch, Tekton, Knative, Istio, and others. We discuss how to enable distributed training of models, model serving, canary rollouts, drift detection, model explainability, metadata management, pipelines, and more. Additionally, we will discuss Watson productization in progress based on Kubeflow Pipelines and Tekton, and point to Kubeflow Dojo materials and follow-on workshops.
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne... (Akash Tandon)
ML solutions in production start from data ingestion and extend up to the actual deployment step. We want this workflow to be scalable, portable, and simple. Containers and Kubernetes are great at the former two, but not the latter if you aren't a DevOps practitioner. We'll explore how you can leverage the Kubeflow project to deploy best-of-breed open-source systems for ML to diverse infrastructures.
Hydrosphere.io for ODSC: Webinar on Kubeflow (Rustem Zakiev)
Webinar video: https://www.youtube.com/watch?v=Y3_fcJBgpMw
Kubeflow and Beyond: Automation of Model Training, Deployment, Testing, Monitoring, and Retraining
Speakers:
Stepan Pushkarev, CTO, Hydrosphere.io and Ilnur Garifullin is an ML Engineer, Hydrosphere.io
Abstract: Very often, the workflow of training models and delivering them to the production environment contains loads of manual work. That could be building a Docker image and deploying it to a Kubernetes cluster, packing the model into a Python package and installing it in your Python application, or even changing your Java classes with the defined weights and re-compiling the whole project. Not to mention that all of this should be followed by testing your model's performance. It could hardly be called "continuous delivery" if you do it all manually. Imagine you could run the whole process of assembling/training/deploying/testing/running a model via a single command in your terminal. In this webinar, we will present a way to build the whole workflow of data gathering/model training/model deployment/model testing into a single flow and run it with a single command.
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google... (James Anderson)
Do you know The Cloud Girl? She makes the cloud come alive with pictures and storytelling.
The Cloud Girl, Priyanka Vergadia, Chief Content Officer @Google, joins us to tell us about Scalable Data Analytics in Google Cloud.
Maybe, with her explanation, we'll finally understand it!
Priyanka is a technical storyteller and content creator who has created over 300 videos, articles, podcasts, courses, and tutorials that help developers learn Google Cloud fundamentals, solve their business challenges, and pass certifications! Check out her content on the Google Cloud Tech YouTube channel.
Priyanka enjoys drawing and painting which she tries to bring to her advocacy.
Check out her website The Cloud Girl: https://thecloudgirl.dev/ and her new book: https://www.amazon.com/Visualizing-Google-Cloud-Illustrated-References/dp/1119816327
Similar to S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and Engineering Teams
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
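As a rough illustration of the idea (not DIAR's actual algorithm, which the paper describes), a greedy trimmer can drop any byte whose removal leaves observed coverage unchanged. The coverage function below is a toy stand-in; a real fuzzer would execute the target and collect edge coverage:

```python
# Toy sketch of seed trimming: remove bytes that do not contribute to
# coverage, so mutation effort is spent only on the interesting bytes.

def coverage(seed: bytes) -> frozenset:
    # Pretend only the presence of '<' and '>' drives new program paths.
    return frozenset(b for b in seed if b in (ord("<"), ord(">")))

def trim_seed(seed: bytes) -> bytes:
    baseline = coverage(seed)
    trimmed = bytearray(seed)
    i = 0
    while i < len(trimmed):
        candidate = trimmed[:i] + trimmed[i + 1:]
        if coverage(bytes(candidate)) == baseline:
            trimmed = candidate  # byte i was uninteresting: drop it
        else:
            i += 1               # byte i matters: keep it, move on
    return bytes(trimmed)

print(trim_seed(b"<aaaa>bbbb"))  # only the coverage-relevant bytes remain
```

AFL ships a similar standalone minimizer (afl-tmin); the value of starting a campaign with lean seeds is the same in either case.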
- These are slides from the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerabilities and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GridMate - End to end testing is a critical piece to ensure quality and avoid... (ThomasParaiso2)
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and Engineering Teams
1. Introducing Krylov
eBay AI Platform - Machine Learning Made Easy
GPU Technology Conference, 2018
Henry Saputra
Technical Lead for Krylov - eBay Unified AI Platform
2. 1. Data Science and Machine Learning at eBay
2. Introducing Krylov
3. Compute Cluster and Accelerator Support with Nvidia GPU
4. Quickstart Example
5. Future Roadmap
6. Q & A
Agenda
4. eBay Patterns - Tools and Frameworks
Tools
• Languages: R, Python, Scala, C++
• IDE-like: RStudio, Notebooks (Jupyter), Python IDE
• Frameworks: NumPy, SciPy, matplotlib, Scikit-learn, Spark MLlib, H2O, Weka, XGBoost, Moses
• Pipelines: Cron, Luigi, Apache Airflow, Apache Oozie
Patterns for ML Training
• Single node
• Distributed training
• Deep learning (GPUs)
Key takeaway = CHOICE
1. Flexibility of software
2. Flexibility of hardware configuration
5. 1. 50%-70% is plumbing work
a. Accessing and moving secured data
b. Environment and tools setup
c. Sub-optimal compute instances - NVIDIA GPUs and high-memory/CPU instances
d. Long wait time from platform and infrastructure
2. Loss of productivity and opportunities
a. ML lifecycle management of models and features
b. Building robust training model pipelines: prepare data, algorithm, hyperparameter tuning, cross-validation
3. Collaborations almost impossible
4. Research vs Applied ML
Problems and Challenges
7. ● Krylov is the core project of the eBay unified AI Platform initiative, enabling an easy-to-use and powerful cloud-based data science and machine learning platform.
● The objective of the project is to enable machine learning jobs with easy access to secured data and eBay cloud computing resources.
● The main goals for the Krylov initiative are:
○ Easy and secure access to training datasets
○ Access to compute on high-performance machines, such as GPUs, or clusters of machines
○ Familiar tools and flexible software to run machine learning model training jobs
○ Interactive data analysis and visualization, with multi-tenancy support to allow quick prototyping of algorithms and data access
○ Sharing and collaboration of ML work between teams in eBay
Overview
8. ML Lifecycle Management
Lifecycle:
MODEL BUILDING - interactive, iterative
MODEL TRAINING - automatable, repeatable, scalable
MODEL INFERENCING - deployable, scalable
MODEL RE-FITTING - interactive, iterative
MODEL RE-TRAINING - interactive, iterative
Data + Lifecycle Management
10. eBay AI Platform Components
Infrastructure (Krylov): GPU tall instances, fast storage
AI Engine (Krylov): Learning Pipelines, Model Experimentation, Data Scientist Workspaces, Model Lifecycle Management, Inferencing
Data: Preparation, Movement, Discovery, Access
AI Hub (Shared Repository)
AI Modules: Speech Recognition, Machine Translation, Computer Vision, Information Retrieval, Natural Language Understanding, …
12. 1. Client Command Line Interface (CLI) via krylovctl program
2. ML Application and Run Specification
3. ML Pipelines: Workflow and Workspace
4. Namespaces - For quota and data isolation
5. Jobs and Runs - Managed by Krylov Tools and Minions
6. Secure Data Access - HDFS, NFS, OpenStack Swift, Custom
Krylov Main Features and Concepts
14. ● A Krylov ML Application is a versioned unit of deployment that contains the declaration of the developers’ programs
● Implemented as a client project used as the source to build a deployment artifact
● Three main parts:
○ mlapplication.json and artifact.json configuration files
○ Source code of the programs
○ Dependency management via Dockerfile
● Supported types of programs: JVM languages (Java, Scala), Python, shell script
● Using the ML Application as source, developers can build a deployment artifact that can be used by the Run Specification file to deploy it onto one of the nodes in the cluster
Krylov ML Application
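The deck does not show the actual mlapplication.json schema, so every field name below is a hypothetical sketch of how such a file might declare programs and dependencies, built here as a Python dict and serialized to JSON:

```python
# Hypothetical mlapplication.json layout; field names are assumptions for
# illustration only, not the real Krylov schema.
import json

mlapplication = {
    "name": "demo-ml-app",
    "version": "1.0.0",
    "programs": [
        {"name": "train", "type": "python", "entrypoint": "train.py"},
        {"name": "evaluate", "type": "python", "entrypoint": "evaluate.py"},
    ],
    # Per the slide, dependencies are managed via a Dockerfile.
    "dependencies": {"dockerfile": "Dockerfile"},
}

print(json.dumps(mlapplication, indent=2))
```

The point is the shape, not the field names: a versioned application declares its programs and their dependency image, and the build step turns that declaration plus source code into a deployable artifact.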
16. ● The Krylov Run Specification is a runtime configuration that adds override configuration and parameter passing for each Task in ML Application job submissions
● It tells the Krylov master API server which artifact created from the ML Application will be used in the compute cluster
● Defined as a runspec.json file, or can be passed as an argument to the krylovctl client program
● The runspec.json file also defines the compute resources, such as which NVIDIA GPUs to use, CPU, memory, and which Docker image to use for the dependencies of the ML Application programs
Krylov Run Specification
18. ● A Krylov ML batch lifecycle pipeline is defined as a Krylov Workflow definition
○ Declarative
○ Default Generic Workflow
● Important concepts for a Krylov Workflow:
○ Workflow - A single pipeline defined within Krylov and the unit of deployment for an ML Application
■ Each Workflow contains one or more Tasks
■ The Tasks are connected to each other in a Directed Acyclic Graph (DAG) structure
○ Task - The smallest unit of execution; runs a developer’s Program and executes on a single machine
○ Flows - Contains one or more key-value pairs of names and declarations of Task DAGs
○ Flow - The chosen key that will be run from the possible selections in the Flows definition
Krylov ML Pipelines: Workflow
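The Flow/Task relationship above amounts to scheduling a DAG: a task may run only once all of its upstream tasks have finished. A minimal sketch in plain Python (task names illustrative, not the Krylov scheduler):

```python
# Minimal DAG execution sketch: a flow maps each task to the tasks it
# depends on; tasks run in waves once their dependencies are done.

def run_flow(flow: dict) -> list:
    """flow: {task: [upstream tasks]}; returns an execution order."""
    done, order = set(), []
    while len(done) < len(flow):
        ready = [t for t, deps in flow.items()
                 if t not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        for task in sorted(ready):  # deterministic order within a wave
            order.append(task)
            done.add(task)
    return order

# prepare-data feeds train, which feeds evaluate.
flow = {"prepare-data": [], "train": ["prepare-data"], "evaluate": ["train"]}
print(run_flow(flow))  # ['prepare-data', 'train', 'evaluate']
```

Raising on an unsatisfiable wave is what enforces the "acyclic" part of the DAG requirement; a real scheduler would also run independent tasks in a wave concurrently rather than sequentially.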
21. ● A Workspace is an interactive web application that allows developers to use a web browser for ML model prototyping, data preparation, and exploration
● The Workspace runs as Jupyter Notebook servers launched on high-CPU/memory or NVIDIA GPU instances
● Enhances the JupyterHub project to allow distributed launching of multi-tenant Jupyter Notebook servers in the Krylov compute cluster using Kubernetes
● A Krylov Workspace uses a configuration file at creation time to override and customize default parameters
Krylov ML Pipelines: Workspace
30. 1. Download the krylovctl program from the Krylov release repository
2. Run `krylovctl project create` to create a new project on the local machine
3. Update or add code to the Krylov project for the machine learning programs
4. Register them as Programs within a Task in mlapplication.json
5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG)
6. Run `krylovctl project build` to build the project
7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file
8. Run `krylovctl artifact upload` to upload the artifact file for remote execution
9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the computing cluster
Steps to Submit a Krylov Workflow Job with the CLI
33. 1. Inferencing Platform
2. Exploration and documentation of RESTful APIs for job management
3. Data Source and Dataset abstraction via Krylov SDKs
4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation
5. Distributed Deep Learning
6. AutoML - Hyperparameter Tuning
7. AI Hub to share ML Applications and Datasets
Future Roadmap