Hopsworks is a platform for designing and operating End to End Machine Learning using PySpark and TensorFlow/PyTorch. Early access is now available on GCP. Hopsworks includes the industry's first Feature Store. Hopsworks is open-source.
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelized linear regression parameter optimization on the next-gen YARN framework Iterative Reduce.
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDatabricks
Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with Tensorflow, from Horovod to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
Autodeploy a complete end-to-end machine learning pipeline on Kubernetes using tools like Spark, TensorFlow, HDFS, etc. - it requires a running Kubernetes (K8s) cluster in the cloud or on-premise.
Managed Feature Store for Machine LearningLogical Clocks
All hyperscale AI companies build their machine learning platforms around a Feature Store.
A feature is a measurable property of some data-sample. It could be for example an image-pixel, a word from a piece of text, the age of a person, a coordinate emitted from a sensor, or an aggregate value like the average number of purchases within the last hour. A Feature Store is a central place to store curated features within an organization.
Feature Stores are a fuel for AI systems as we use them to train machine learning models so that we can make predictions for feature values that we have never seen before.
During this presentation you learn:
- About the concept of a Feature Store and how it can help manage feature data for Enterprises and ease the path of data from backend systems and data-lakes to Data Scientists.
- Our take on Feature Stores, including best practices and use cases and:
- How to ensure Consistent Features in both Training and Serving
Governance, Access-Control, and Versioning
- To create Training Data in the File Format of your Choice
Eliminate Inconsistency between Features in Training and Inferencing
Watch the webinar with a demo: https://www.logicalclocks.com/webinars
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
Unstructured data is everywhere - in the form of posts, status updates, bloglets or news feeds in social media or in the form of customer interactions Call Center CRM. While many organizations study and monitor social media for tracking brand value and targeting specific customer segments, in our experience blending the unstructured data with the structured data in supplementing data science models has been far more effective than working with it independently.
In this talk we will show case an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models to predict commodity futures from tweets and in enhancing churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Hopsworks - The Platform for Data-Intensive AIQAware GmbH
Cloud Native Night July 2019, Munich: Talk by Steffen Srohsschmiedt (@grohsschmiedt, Head of Cloud at LogicalClocks)
=== Please download slides if blurred! ===
Abstract: Machine Learning (ML) pipelines are the fundamental building block for productionizing ML code. Building such pipelines with Big Data is a complex process. The different stages in ML pipelines also need to be orchestrated, from data ingestion and data transformation, to feature engineering, to model training, serving and monitoring.
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning (ML) pipelines. A key part of our pipelines is the world's first open-source Feature Store, that acts as a data warehouse for features, providing a natural API between data engineers - who write feature engineering code - and Data Scientists, who select features from the feature store to generate training/test data for models.
Join us next time: https://www.meetup.com/Cloud-Native-muc/events
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelized linear regression parameter optimization on the next-gen YARN framework Iterative Reduce.
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDatabricks
Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with Tensorflow, from Horovod to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
Autodeploy a complete end-to-end machine learning pipeline on Kubernetes using tools like Spark, TensorFlow, HDFS, etc. - it requires a running Kubernetes (K8s) cluster in the cloud or on-premise.
Managed Feature Store for Machine LearningLogical Clocks
All hyperscale AI companies build their machine learning platforms around a Feature Store.
A feature is a measurable property of some data-sample. It could be for example an image-pixel, a word from a piece of text, the age of a person, a coordinate emitted from a sensor, or an aggregate value like the average number of purchases within the last hour. A Feature Store is a central place to store curated features within an organization.
Feature Stores are a fuel for AI systems as we use them to train machine learning models so that we can make predictions for feature values that we have never seen before.
During this presentation you learn:
- About the concept of a Feature Store and how it can help manage feature data for Enterprises and ease the path of data from backend systems and data-lakes to Data Scientists.
- Our take on Feature Stores, including best practices and use cases and:
- How to ensure Consistent Features in both Training and Serving
Governance, Access-Control, and Versioning
- To create Training Data in the File Format of your Choice
Eliminate Inconsistency between Features in Training and Inferencing
Watch the webinar with a demo: https://www.logicalclocks.com/webinars
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
Unstructured data is everywhere - in the form of posts, status updates, bloglets or news feeds in social media or in the form of customer interactions Call Center CRM. While many organizations study and monitor social media for tracking brand value and targeting specific customer segments, in our experience blending the unstructured data with the structured data in supplementing data science models has been far more effective than working with it independently.
In this talk we will show case an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models to predict commodity futures from tweets and in enhancing churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Hopsworks - The Platform for Data-Intensive AIQAware GmbH
Cloud Native Night July 2019, Munich: Talk by Steffen Srohsschmiedt (@grohsschmiedt, Head of Cloud at LogicalClocks)
=== Please download slides if blurred! ===
Abstract: Machine Learning (ML) pipelines are the fundamental building block for productionizing ML code. Building such pipelines with Big Data is a complex process. The different stages in ML pipelines also need to be orchestrated, from data ingestion and data transformation, to feature engineering, to model training, serving and monitoring.
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning (ML) pipelines. A key part of our pipelines is the world's first open-source Feature Store, that acts as a data warehouse for features, providing a natural API between data engineers - who write feature engineering code - and Data Scientists, who select features from the feature store to generate training/test data for models.
Join us next time: https://www.meetup.com/Cloud-Native-muc/events
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsStijn Decubber
Slides from the TensorFlow meetup hosted on October 9th at the ML6 offices in Ghent. Join our Meetup group for updates and future sessions: https://www.meetup.com/TensorFlow-Belgium/
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
The main goal of the session is to showcase approaches that greatly simplify the work of a data analyst when performing data analytics, or when employing machine learning algorithms, over Big Data. The session will include presentations on
(a) How data analytics workflows can be easily and graphically composed, and then optimized for execution,
(b) How raw data with great variety can be easily queried using SQL interfaces, and
(c) How complex machine learning operations can be performed efficiently in distributed settings.
After these presentations, the speakers will participate in a discussion with the audience, in order to discuss further tools that could make the work of a data analyst more simple.
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with Hopsworks Feature Store, NDB, and ePipe . The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
Building an MLOps Stack for Companies at Reasonable ScaleMerelda
A practical talk on showing the following:
1. Challenges of Deploying ML today
2. How to do MLOps:
- Principles over Technology
- Convention over Configuration
3. What's a reasonable MLOps Stack
4. Demo on Google Collab to Deployed Endpoint
ApacheCon 2021 Apache Deep Learning 302Timothy Spann
ApacheCon 2021 Apache Deep Learning 302
Tuesday 18:00 UTC
Apache Deep Learning 302
Timothy Spann
This talk will discuss and show examples of using Apache Hadoop, Apache Kudu, Apache Flink, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. This is the follow up to previous talks on Apache Deep Learning 101 and 201 and 301 at ApacheCon, Dataworks Summit, Strata and other events. As part of this talk, the presenter will walk through using Apache MXNet Pre-Built Models, integrating new open source Deep Learning libraries with Python and Java, as well as running real-time AI streams from edge devices to servers utilizing Apache NiFi and Apache NiFi - MiNiFi. This talk is geared towards Data Engineers interested in the basics of architecting Deep Learning pipelines with open source Apache tools in a Big Data environment. The presenter will also walk through source code examples available in github and run the code live on Apache NiFi and Apache Flink clusters.
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache NiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
* https://github.com/tspannhw/ApacheDeepLearning302/
* https://github.com/tspannhw/nifi-djl-processor
* https://github.com/tspannhw/nifi-djlsentimentanalysis-processor
* https://github.com/tspannhw/nifi-djlqa-processor
* https://www.linkedin.com/pulse/2021-schedule-tim-spann/
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Hail hydrate! from stream to lake using open sourceTimothy Spann
(VIRTUAL) Hail Hydrate! From Stream to Lake Using Open Source - Timothy J Spann, StreamNative
https://osselc21.sched.com/event/lAPi?iframe=no
A cloud data lake that is empty is not useful to anyone. How can you quickly, scalably and reliably fill your cloud data lake with diverse sources of data you already have and new ones you never imagined you needed. Utilizing open source tools from Apache, the FLiP stack enables any data engineer, programmer or analyst to build reusable modules with low or no code. FLiP utilizes Apache NiFi, Apache Pulsar, Apache Flink and MiNiFi agents to load CDC, Logs, REST, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach you how to fish in the deep end of the lake and return a data engineering hero. Let's hope everyone is ready to go from 0 to Petabyte hero.
https://osselc21.sched.com/event/lAPi/virtual-hail-hydrate-from-stream-to-lake-using-open-source-timothy-j-spann-streamnative
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters is fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and Tensorflow for hyperparameter tuning
* Leveraging Spark and Tensorflow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and Deep Learning and Spark with and without GPUs for Deep Learning
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
Metaflow: The ML Infrastructure at NetflixBill Liu
Metaflow was started at Netflix to answer a pressing business need: How to enable an organization of data scientists, who are not software engineers by training, build and deploy end-to-end machine learning workflows and applications independently. We wanted to provide the best possible user experience for data scientists, allowing them to focus on parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure: data, compute, orchestration, and versioning.
Today, the open-source Metaflow powers hundreds of business-critical ML projects at Netflix and other companies from bioinformatics to real estate.
In this talk, you will learn about:
- What to expect from a modern ML infrastructure stack.
- Using Metaflow to boost the productivity of your data science organization, based on lessons learned from Netflix.
- Deployment strategies for a full stack of ML infrastructure that plays nicely with your existing systems and policies.
https://www.aicamp.ai/event/eventdetails/W2021080510
Building and deploying LLM applications with Apache AirflowKaxil Naik
Behind the growing interest in Generate AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Akash Tandon
ML solutions in production start from data ingestion and extend upto the actual deployment step. We want this workflow to be scalable, portable and simple. Containers and kubernetes are great at the former two but not the latter if you aren't a devops practitioner. We'll explore how you can leverage the Kubeflow project to deploy best-of-breed open-source systems for ML to diverse infrastructures.
Data Science und Machine Learning im Kubernetes-Ökosysteminovex GmbH
Wer professionell Datenprodukte bereitstellen möchte, benötigt Plattformen für Data Science, Machine Learning und AI, die Nachvollziehbarkeit, Reproduzierbarkeit, Flexibilität und Skalierbarkeit gewährleisten. Dazu werden skalierbare Storage- und Compute-Kapazitäten genauso benötigt wie eine Workflowsteuerung für ML-Pipelines, ein Analysefrontend oder ein Metadatenmanagement.
Wir zeigen, wie auf Basis von Kubernetes eine solche Plattformen aufgebaut werden kann, die sich durch CI/CD-Pipelines flexibel konfigurieren lässt und verschiedene ML-Frameworks und Programmiersprachen unterstützt.
Vorkenntnisse
Grundlegendes Verständnis von Daten, Data-Science-Prozessen und Container-Plattformen ist hilfreich.
Lernziele
Die Zuhörer haben einen Überblick über die Anforderungen an Data-Science-Plattformen und wie eine Architektur mit Kubernetes aussehen kann.
Event: data2day, 26.09.2018
Speaker: Dr. Stefan Igel, Hans-Peter Zorn
Mehr Tech-Vorträge: https://www.inovex.de/de/content-pool/vortraege/
Mehr Tech-Artikel: https://www.inovex.de/blog/
Similar to Hopsworks at Google AI Huddle, Sunnyvale (20)
PyData Berlin 2023 - Mythical ML Pipeline.pdfJim Dowling
This talk is a mental map for building ML systems as ML Pipelines that are factored into Feature Pipelines, Training Pipelines, and Inference Pipelines.
Building Hopsworks, a cloud-native managed feature store for machine learning Jim Dowling
Cloud Native London talk about the control layer of Hopsworks.ai and our choice of cloud native services. We built our own multi-tenant services as cloud native services, for the most part.
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsJim Dowling
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs, including AllReduce, Horovod, and how commodity GPU servers, such as DeepLearning11, will gain adoption.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Welocme to ViralQR, your best QR code generator.ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since its inception, we have successfully served many clients by offering QR codes in their marketing, service delivery, and collection of feedback across various industries. Our platform has been recognized for its ease of use and amazing features, which helped a business to make QR codes.
Our Services
At ViralQR, here is a comprehensive suite of services that caters to your very needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, there is a 14-day free offer to ViralQR, which is an exceptional opportunity for new users to take a feel of this platform. One can easily subscribe from there and experience the full dynamic of using QR codes. The subscription plans are not only meant for business; they are priced very flexibly so that literally every business could afford to benefit from our service.
Why choose us?
ViralQR will provide services for marketing, advertising, catering, retail, and the like. The QR codes can be posted on fliers, packaging, merchandise, and banners, as well as to substitute for cash and cards in a restaurant or coffee shop. With QR codes integrated into your business, improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers of ViralQR receive detailed analytics and tracking tools in light of having a view of the core values of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
So, thank you for choosing ViralQR; we have an offer of nothing but the best in terms of QR code services to meet business diversity!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
2. Hopsworks Technical Milestones
“If you’re working with big data and Hadoop, this one paper could repay your investment in
the Morning Paper many times over.... HopsFS is a huge win.”
- Adrian Colyer, The Morning Paper
World’s first Hadoop
platform to support
GPUs-as-a-Resource
World’s fastest
HDFS Published at
USENIX FAST with
Oracle and Spotify
World’s First
Open Source Feature
Store for Machine
Learning
World’s First
Distributed Filesystem to
store small files in
metadata on NVMe disks
Winner of IEEE
Scale
Challenge 2017
with HopsFS -
1.2m ops/sec
2017
World’s most scalable
POSIX-like Hierarchical
Filesystem with
Multi Data Center Availability
with 1.6m ops/sec on GCP
2018 2019
First non-Google ML
Platform with
TensorFlow Extended
(TFX) support through
Beam/Flink
World’s First
Unified Hyperparam
and Ablation Study
Parallel Prog.`
Framework
3. Evolution of Distributed Filesystems
POSIXPast
Present
NFS, HDFS S3 GCS HopsFS
Single DC,
Strongly
Consistent
Metadata
Multi-DC,
Eventually
Consistent
Metadata
Multi-DC,
Strongly
Consistent
Metadata
Object
Store
POSIX-like
Filesystem
4. Why HopsFS?
● Distributed FS
○ Needed for Parallel ML Experiments / Dist Training / FeatureStore
● Provenance/Free-text-search
○ Change Data Capture API
● Performance
○ > 1.6m ops/sec over 3 Azes on GCP using Spotify’s Hadoop workload
○ NVMe for small files stored in metadata
● HDFS API
○ TensorFlow/Keras/PySpark/Beam/Flink/PyTorch (Petastorm)
11. What is Hopsworks?
Elasticity & Performance Governance & ComplianceDevelopment & Operations
Secure Multi-Tenancy
Project-based restricted access
Encryption At-Rest, In-Motion
TLS/SSL everywhere
AI-Asset Governance
Models, experiments, data, GPUs
Data/Model/Feature Lineage
Discover/track dependencies
Notebooks for Development
First-class Python Support
Version Everything
Code, Infrastructure, Data
Model Serving on Kubernetes
TF Serving, MLeap, SkLearn
End-to-End ML Pipelines
Orchestrated by Airflow
Feature Store
Data warehouse for ML
Distributed Deep Learning
Faster with more GPUs
HopsFS
NVMe speed with Big Data
Horizontally Scalable
Ingestion, DataPrep,
Training, Serving
FS
12. Which services require Distributed Metadata (HopsFS)?
Elasticity & Performance Governance & ComplianceDevelopment & Operations
Secure Multi-Tenancy
Project-based restricted access
Encryption At-Rest, In-Motion
TLS/SSL everywhere
AI-Asset Governance
Models, experiments, data, GPUs
Data/Model/Feature Lineage
Discover/track dependencies
Notebooks for Development
First-class Python Support
Version Everything
Code, Infrastructure, Data
Model Serving on Kubernetes
TF Serving, MLeap, SkLearn
End-to-End ML Pipelines
Orchestrated by Airflow
Feature Store
Data warehouse for ML
Distributed Deep Learning
Faster with more GPUs
HopsFS
NVMe speed with Big Data
Horizontally Scalable
Ingestion, DataPrep,
Training, Serving
FS
21. Hopsworks Feature Store
Feature Mgmt Storage Access
Statistics
Online
Features
Discovery
Offline
Features
Data Scientist
Online Apps
Data Engineer
MySQL Cluster
(Metadata,
Online Features)
Apache Hive
Columnar DB
(Offline Features)
Feature Data
Ingestion
Hopsworks Feature Store
Training Data
(S3, HDFS)
Batch Apps
Discover features,
create training data,
save models,
read online/offline/on-
demand features,
historical feature values.
Models
HopsFS
JDBC
(SAS, R, etc)
Feature
CRUD
Add/remove features,
access control,
feature data validation.
Access
Control
Time Travel
Data
Validation
Pandas or
PySpark
DataFrame
External DB
Feature Defn
select ..
23. Register a Feature Group with the Feature Store
titanic_df = # Spark or Pandas Dataframe
# Do feature engineering on ‘titanic_data’
# Register Dataframe as FeatureGroup
featurestore.create_featuregroup(titanic_df,
"titanic_data_passengers“)
24. Create Training Datasets using the Feature Store
sample_data = featurestore.get_features([“name”, “Pclass”,
“Sex”, “balance”])
featurestore.create_training_dataset(sample_data,
“titanic_training_dataset", data_format="tfrecords“,
training_dataset_version=1)
# Use the training dataset
dataset_dir = featurestore.get_training_dataset_path(
"titanic_training_dataset")
s = featurestore.get_training_dataset_tf_record_schema(
“titanic_training_dataset”)
31. Explicit vs Implicit Provenance
● Explicit: TFX, MLFlow
○ Wrap existing code in components
that execute a stage in the pipeline
○ Interceptors in components inject
metadata to a metadata store as
data flows through the pipeline
○ Store metadata about artefacts and
executions
● Implicit: Hopsworks with ePipe*
DataPrep Train
HopsFS
Experiment
/Training_
Datasets
/Experiments /Models
Elasticsearch
ePipe: ChangeDataCapture API
*ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
32. Provenance in Hopsworks
● Implicit Provenance
○ Applications read/write files to HopsFS – infer artefacts from pathname conventions and xattrs
● Explicit Provenance in Hops API
○ Executions:
hops.experiment.grid_search(…), hops.experiment.collective_allreduce(…)
○ FeatureStore:
hops.featurestore.create_training_dataset(….)
○ Saving and deploying models:
hops.serving.export(export_path, “model_name", 1)
hops.serving.deploy(“…/model.pb”, destination)
36. /Models
● Named/versioned model
management for:
TensorFlow/Keras
Scikit Learn
● A Models dataset can be
securely shared with other
projects or the whole cluster
● The provenance API returns the
conda.yml and execution
used to train a given model
/Projects/MyProj
└ Models
└ <name>
└ <version>
├─ saved_model.pb
└─ variables/
...
37. Model Serving
● Model Serving on Google
Cloud AI Platform
○ Export models to Cloud
Storage buckets
● On-Premise
○ Serve models on
Kuberenetes
○ TensorFlow Serving,
Scikit-Learn
39. Principles of Development with Notebooks
● No throwaway code
● Code can be run either in Notebooks or as Jobs (in Pipelines)
● Notebooks/jobs should be parameterizable
● No external configuration for program logic
○ HPARAMs are part of the program
● Core training loop should be the same for all stages of development
○ Data exploration, hparam tuning, dist training,
40. Problem in ML Pipeline Development Process?
Explore Data,
Train model
Hyperparam
Opt.
Distributed
Training
Notebook Python + YML in Gitport/rewrite code
Iteration is hard/impossible/a-bad-idea
41. Notebooks offer more than Prototyping
● Familiar web-based development
environment
● Interactive development/debugging
● Reporting platform for ML
applications (Papermill by Netflix)
● Parameterizable as part of Pipelines
(Papermill by Netflix)
Disclaimer: Notebooks are not for everyone
42. ML Pipelines of Jupyter Notebooks with Airflow
Select
Features,
File Format
Feature
Engineering
Validate &
Deploy Model
Experiment,
Train Model
Airflow
Dataprep Pipeline Training and Deployment Pipeline
Feature
Store
46. GPU(s) in PySpark Executor, Driver coordinates
PySpark makes it easier to
write TensorFlow/Keras/
PyTorch code that can either
be run on a single GPU or
scale to run on lots of GPUS
for Parallel Experiments or
Distributed Training.
Executor Executor
Driver
47. Executor 1 Executor N Driver
HopsFS
• Training/Test Datasets
• Model checkpoints, Trained Models
• Experiment run data
• Provenance data
• Application logs
Need Distributed Filesystem for Coordination
Model
Serving
TensorBoard
50. Leave code unchanged, but configure 4 Executors
print(“Hello
from GPU”)
Driver
print(“Hello
from GPU”)
print(“Hello
from GPU”)
print(“Hello
from GPU”)
55. TensorFlow Distributed Training with PySpark
def train():
# Separate shard of dataset per worker
# create Estimator w/ DistribStrategy
# as CollectiveAllReduce
# train model, evaluate
return loss
# Driver code below here
# builds TF_CONFIG and shares to workers
from hops import experiment
experiment.collective_allreduce(train)
More details: http//github.com/logicalclocks/hops-examples
56. Undirected Hyperparam Search with PySpark
def train(dropout):
# Same dataset for all workers
# create model and optimizer
# add this worker’s value of dropout
# train model and evaluate
return loss
# Driver code below here
from hops import experiment
args={“dropout”:[0.1, 0.4, 0.8]}
experiment.grid_search(train,args)
More details: http//github.com/logicalclocks/hops-examples
57. Directed Hyperparam Search with PySpark
def train(dropout):
# Same dataset for all workers
# create model and optimizer
optimizer.apply(dropout)
# train model and evaluate
return loss
from hops import experiment
args={“dropout”: “0.1-0.8”}
experiment.diff_ev(train,args)
More details: http//github.com/logicalclocks/hops-examples
60. PClass Sexname survive
Iterative Model Development
Dataset
Machine
Learning
Model
Optimizer
Evaluate
Problem Definition
Data Preparation
Model Selection
Hyperparameters
Repeat if
needed
Model Training
61. Maggy: Unified Hparam Opt & Ablation Programming
Machine
Learning
System
Hyperparameter
Optimizer
New Hyperparameter Values
Evaluate
New Dataset/Model-Architecture
Ablation Study
Controller
Synchronous or
Asynchronous Trials
Directed or
Undirected Search
User-Defined
Search/Optimizers
62. Maggy – Async, Parallel ML experiments using PySpark
● Experiment-Driven, interactive
development of ML applications
● Parallel Experimentation
○ Hyperparameter Optimization
○ Ablation Studies
○ Leave-one-Feature-out
● Interactive Debugging
○ Experiment/executor logs shown both
in Jupyter notebooks and logging
63. Feature Ablation
● Uses the Feature Store to access the dataset metadata
● Generates Python callables that once called will create a dataset
○ Removes one-feature-at-a-time
Sexname survivePClass Sexname survive
64. Layer Ablation
● Uses a base model function
● Generates Python callables that once called with create a modified model
○ Uses the model configuration to find and remove layer(s)
○ Removes one-layer-at-a-time (or one-layer-group-at-a-time)
70. Hopsworks on GCP
● Hopsworks available as an Image for GCP
○ Blog post coming soon with details
● Next Steps
○ Cloud Native Features
■ Elastic add/remove resources (compute/GPU)
■ API integration with Google Model serving
71. That was Hopsworks
Efficiency & Performance Security & GovernanceDevelopment & Operations
Secure Multi-Tenancy
Project-based restricted access
Encryption At-Rest, In-Motion
TLS/SSL everywhere
AI-Asset Governance
Models, experiments, data, GPUs
Data/Model/Feature Lineage
Discover/track dependencies
Development Environment
First-class Python Support
Version Everything
Code, Infrastructure, Data
Model Serving on Kubernetes
TF Serving, SkLearn
End-to-End ML Pipelines
Orchestrated by Airflow
Feature Store
Data warehouse for ML
Distributed Deep Learning
Faster with more GPUs
HopsFS
NVMe speed with Big Data
Horizontally Scalable
Ingestion, DataPrep,
Training, Serving
FS
73. Acknowledgements and References
Slides and Diagrams from colleagues:
● Maggy: Moritz Meister and Sina Sheikholeslami
● Feature Store: Kim Hammar
● Beam/Flink on Hopsworks: Theofilos Kakantousis
References
● HopsFS: Scaling hierarchical file system metadata …, USENIX FAST 2017.
● Size matters: Improving the performance of small files …, ACM Middleware 2018.
● ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
● Hopsworks Demo, SysML 2019.
74. Thank you!
470 Ramona St
Palo Alto
https://www.logicalclocks.com
Register for a free account at
www.hops.site
Twitter
@logicalclocks
@hopsworks
GitHub
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops