StreamSQL Feature Store (Apache Pulsar Summit)
Simba Khadder
Input features are the building blocks for machine learning models. You cannot have a great model without great features. By building on top of Apache Pulsar's infinite retention of events, we built infrastructure to serve features in production and to generate training datasets. It allowed our machine learning teams to change, test, and deploy personalization features at an extraordinary rate to tens of millions of end users.
This talk will discuss:
- What event-sourcing is and why it's so powerful for machine learning infrastructure.
- How we built the StreamSQL feature store on top of Pulsar, Flink, and Cassandra.
- How a feature store accelerates ML development.
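To make the event-sourcing idea concrete, here is a minimal Python sketch. It does not use Pulsar, Flink, or Cassandra: the retained event log is a plain list, and the names `Event`, `EVENT_LOG`, and `purchase_count` are hypothetical. The point it illustrates is that, when every event is kept, one function can be replayed as of a past timestamp to build training data and as of now to serve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    kind: str
    timestamp: int  # seconds since an arbitrary epoch


# Stand-in for an append-only, fully retained event log (assumption).
EVENT_LOG = [
    Event("u1", "purchase", 10),
    Event("u1", "view", 20),
    Event("u1", "purchase", 30),
    Event("u2", "purchase", 40),
]


def purchase_count(log, user_id, as_of):
    """Replay the log and count a user's purchases up to and including `as_of`."""
    return sum(
        1
        for e in log
        if e.user_id == user_id and e.kind == "purchase" and e.timestamp <= as_of
    )


# Training: the feature value as it was at a historical label time.
training_value = purchase_count(EVENT_LOG, "u1", as_of=15)
# Serving: the feature value as of the latest event.
serving_value = purchase_count(EVENT_LOG, "u1", as_of=40)
```

Because both values come from the same replay, changing the feature definition only means changing one function and replaying the log again.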
Managed Feature Store for Machine Learning
Logical Clocks
All hyperscale AI companies build their machine learning platforms around a Feature Store.
A feature is a measurable property of some data sample. It could be, for example, an image pixel, a word from a piece of text, the age of a person, a coordinate emitted from a sensor, or an aggregate value like the average number of purchases within the last hour. A Feature Store is a central place to store curated features within an organization.
Feature Stores fuel AI systems: we use them to train machine learning models so that we can make predictions for feature values that we have never seen before.
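As a rough illustration of these two definitions, the sketch below computes one aggregate feature and keeps it in a central store next to a plain attribute. It is a toy under stated assumptions: `InMemoryFeatureStore` and `purchases_last_hour` are hypothetical names, not the API of any product mentioned here, and the aggregate is a count rather than the average mentioned above, to keep it short.

```python
class InMemoryFeatureStore:
    """Toy central store: feature values keyed by entity id and feature name."""

    def __init__(self):
        self._values = {}

    def put(self, entity_id, name, value):
        self._values.setdefault(entity_id, {})[name] = value

    def get_vector(self, entity_id, names):
        """Return the requested features for one entity, in the order asked."""
        row = self._values.get(entity_id, {})
        return [row.get(name) for name in names]


def purchases_last_hour(purchase_times, now, window_seconds=3600):
    """Aggregate feature: number of purchases in the trailing window."""
    return sum(1 for t in purchase_times if now - window_seconds < t <= now)


store = InMemoryFeatureStore()
store.put("user_42", "age", 31)
store.put(
    "user_42",
    "purchases_last_hour",
    purchases_last_hour([100, 3000, 4000], now=4000),
)
vector = store.get_vector("user_42", ["age", "purchases_last_hour"])
```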
During this presentation you will learn:
- About the concept of a Feature Store and how it can help manage feature data for enterprises and ease the path of data from backend systems and data lakes to Data Scientists.
- Our take on Feature Stores, including best practices and use cases, and how to:
  - Ensure consistent features in both training and serving
  - Apply governance, access control, and versioning
  - Create training data in the file format of your choice
  - Eliminate inconsistency between features in training and inferencing
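One common way to get consistent features in training and serving is to define the feature logic once and call it from both paths. The sketch below assumes that approach; the function and field names (`transform`, `amount`, `day_of_week`) are made up for illustration and do not come from the webinar.

```python
import math


def transform(raw):
    """Single definition of the feature logic, shared by both paths."""
    return {
        "log_amount": math.log1p(raw["amount"]),
        "is_weekend": 1 if raw["day_of_week"] in (5, 6) else 0,
    }


def build_training_rows(history):
    """Offline path: apply the shared transform to historical records."""
    return [transform(record) for record in history]


def build_serving_row(request):
    """Online path: apply the same transform to a live request."""
    return transform(request)


history = [
    {"amount": 0.0, "day_of_week": 5},
    {"amount": 9.0, "day_of_week": 1},
]
training_rows = build_training_rows(history)
serving_row = build_serving_row({"amount": 9.0, "day_of_week": 1})
```

The same raw input yields the same feature values in both paths, which is the property the list above calls consistency.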
Watch the webinar with a demo: https://www.logicalclocks.com/webinars
Metadata and Provenance for ML Pipelines with Hopsworks
Jim Dowling
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with Hopsworks Feature Store, NDB, and ePipe. The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
Hopsworks - The Platform for Data-Intensive AI
QAware GmbH
Cloud Native Night July 2019, Munich: Talk by Steffen Grohsschmiedt (@grohsschmiedt, Head of Cloud at LogicalClocks)
Abstract: Machine Learning (ML) pipelines are the fundamental building block for productionizing ML code. Building such pipelines with Big Data is a complex process. The different stages in ML pipelines also need to be orchestrated, from data ingestion and data transformation, to feature engineering, to model training, serving and monitoring.
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning (ML) pipelines. A key part of our pipelines is the world's first open-source Feature Store, which acts as a data warehouse for features, providing a natural API between data engineers, who write feature engineering code, and Data Scientists, who select features from the feature store to generate training/test data for models.
Join us next time: https://www.meetup.com/Cloud-Native-muc/events
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
Databricks
The ‘feature store’ is an emerging concept in data architecture that is motivated by the challenge of productionizing ML applications. The rapid iteration in experimental, data-driven research applications creates new challenges for data management and application deployment.
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
Data Science Milan
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain-specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general-purpose Feature Store needs a general-purpose feature engineering, feature selection, and feature transformation platform.
In this talk, we describe how we built a general-purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc.) on a file system of choice (S3, HDFS). Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that can each run at different cadences.
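The select-and-split step described in this abstract can be sketched without Spark. The example below is a plain-Python stand-in, not the PySpark workflow the talk demonstrates: feature groups are nested dicts, and `FEATURE_GROUPS`, `select_features`, and `train_test_split` are hypothetical names. Writing the result to a file format such as .tfrecords is left out.

```python
# Two hypothetical feature groups, each keyed by customer id.
FEATURE_GROUPS = {
    "customer_profile": {
        "c1": {"age": 30}, "c2": {"age": 45},
        "c3": {"age": 27}, "c4": {"age": 52},
    },
    "customer_activity": {
        "c1": {"orders_30d": 2}, "c2": {"orders_30d": 0},
        "c3": {"orders_30d": 5}, "c4": {"orders_30d": 1},
    },
}


def select_features(groups, wanted):
    """Join the named features across feature groups on the entity key."""
    rows = {}
    for group in groups.values():
        for entity, values in group.items():
            for name, value in values.items():
                if name in wanted:
                    rows.setdefault(entity, {})[name] = value
    # Inner join: keep only entities that have every requested feature.
    return {e: r for e, r in sorted(rows.items()) if len(r) == len(wanted)}


def train_test_split(rows, test_every=4):
    """Deterministic split: every `test_every`-th entity goes to the test set."""
    train, test = [], []
    for i, (entity, row) in enumerate(rows.items()):
        target = test if i % test_every == test_every - 1 else train
        target.append((entity, row))
    return train, test


dataset = select_features(FEATURE_GROUPS, {"age", "orders_30d"})
train, test = train_test_split(dataset)
```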
Bio:
Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master's degree in cloud computing and services with a focus on data intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
Topics: feature store, MLOps.
Modern machine learning systems can be very complex and can fall into many pitfalls. It's very easy to unintentionally introduce technical debt into such a complex structure. One approach to resolving some of these anti-patterns is a feature store. A feature store is the missing piece that fills the gap between raw data and machine learning models. Not only does it help you handle technical debt but, even more importantly, it shortens the time needed to develop a new model.
My talk at Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it brings multiple challenges, such as code duplication, limited extensibility, inconsistency, and extra moving parts. In this talk we discuss an alternative solution that does not introduce custom model formats or new standards, is not based on an export/import workflow, and shares the Apache Spark API.
When OLAP Meets Real-Time, What Happens in eBay?
DataWorks Summit
An OLAP cube is about pre-aggregation: it reduces query latency by spending more time and resources on data preparation. But for real-time analytics, data preparation and visibility latency are critical. What happens when an OLAP cube meets real-time use cases?
Can we pre-build the cubes in real time in a quick and more cost-effective way? This is hard, but still doable.
At eBay, we built our own real-time OLAP solution based on Apache Kylin and Apache Kafka. We read unbounded events from a Kafka cluster, then divide the streaming data into three stages: an In-Memory Stage (continuous in-memory aggregation), an On-Disk Stage (flush to disk, with columnar storage and indexes), and a Full Cubing Stage (built with MR or Spark and saved to HBase). Data is aggregated into different layers in the different stages, but all of it is queryable. Data moves from one stage to the next automatically and transparently to the user.
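The three-stage flow can be sketched in a few lines of Python. This is only a toy model of the idea, with no Kylin, Kafka, or HBase involved: the class `StagedCube`, its `flush_threshold` parameter, and the single-dimension sum are all assumptions made for illustration.

```python
from collections import defaultdict


class StagedCube:
    """Toy three-stage store: in-memory, on-disk segments, full cube."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.in_memory = defaultdict(int)   # stage 1: live aggregates
        self.events_in_memory = 0
        self.segments = []                  # stage 2: flushed, immutable segments
        self.full_cube = defaultdict(int)   # stage 3: batch-built cube

    def ingest(self, dimension, measure):
        """Aggregate one event in memory; flush when the threshold is reached."""
        self.in_memory[dimension] += measure
        self.events_in_memory += 1
        if self.events_in_memory >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the in-memory aggregates into a new segment."""
        if self.in_memory:
            self.segments.append(dict(self.in_memory))
        self.in_memory.clear()
        self.events_in_memory = 0

    def build_full_cube(self):
        """Merge all segments into the full cube (stands in for the MR/Spark job)."""
        for segment in self.segments:
            for dimension, total in segment.items():
                self.full_cube[dimension] += total
        self.segments.clear()

    def query(self, dimension):
        """Answer from all three stages, so data is queryable wherever it lives."""
        return (
            self.in_memory.get(dimension, 0)
            + sum(s.get(dimension, 0) for s in self.segments)
            + self.full_cube.get(dimension, 0)
        )
```

A query sums over every stage, which is why moving data from one stage to the next does not change the answer a user sees.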
This solution was built to support quite a few real-time analytics use cases at eBay. In this session we will also share some of them, such as site speed monitoring and eBay site deal performance.
Speaker:
Qiaoneng Qian, Senior Product Manager, eBay
Matthieu Blanc will present spark.ml. Version 1.2 of Spark introduced this new package, which provides a high-level API for building machine learning pipelines. Together we will go through the basic concepts of this API by way of an example.
http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Databricks
This talk will walk you through the typical workflow of a data scientist or data analyst at Uber: how they get access to Uber's big data and fast data sources for ad hoc and experimental analysis, and how the data platforms make it easy to discover datasets, run interactive queries against our petabyte-scale data lake to identify the features you're interested in, and wrangle and prepare data for advanced analytics and machine learning. Our platforms also provide capabilities to do iterative machine learning and deep learning training seamlessly, on single nodes or distributed across our big data and GPU clusters; to analyze, visualize, and share experiment results with colleagues and peers for feedback; and even to productionize data analytics jobs and ML models, all without a degree in CS. Interested? Come learn how Uber's big data platforms and Data Science Workbench put the power of Spark in the hands of our data scientists and data analysts for advanced analytics and ML/DL use cases.
Streaming Inference with Apache Beam and TFX
Databricks
In this session we will be using an LSTM Encoder-Decoder Anomaly Detection model as an example, to show the building and retraining of a model which uses the tfx-bsl package to run continuous inference. We will also emphasize the importance of the hermetic seal between training and inference paths.
Apache® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications?
In this webinar, we will:
- discuss best practices from Databricks on how our customers productionize machine learning models,
- do a deep dive with actual customer case studies,
- show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Powering Custom Apps at Facebook using Spark Script Transformation
Databricks
Script Transformation is an important and growing use case for Apache Spark at Facebook. Spark’s script transforms allow users to run custom scripts and binaries directly from SQL and serve as an important means of stitching Facebook’s custom business logic together with existing data pipelines.
Along with Spark SQL + UDFs, a growing number of our custom pipelines leverage Spark’s script transform operator to run user-provided binaries for applications such as indexing, parallel training and inference at scale. Spawning custom processes from the Spark executors introduces new challenges in production, ranging from external resource allocation and management to structured data serialization and external process monitoring.
In this session, we will talk about the improvements to Spark SQL (and the resource manager) to support running reliable and performant script transformation pipelines. This includes:
1) cgroup v2 containers for CPU, Memory and IO enforcement,
2) Transform jail for processes namespace management,
3) Support for complex types in Row format delimited SerDe,
4) Protocol Buffers for fast and efficient structured data serialization.
Finally, we will conclude by sharing our results, lessons learned and future directions (e.g., transform pipelines resource over-subscription).
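The core pattern behind a script transform, serializing rows, piping them through an external process, and parsing what comes back, can be sketched with Python's standard library. This is not Spark's operator and it omits everything the talk is about (cgroups, process jails, complex types, Protocol Buffers); `script_transform` and `USER_SCRIPT` are hypothetical names, and the tab-delimited row format is a simplifying assumption.

```python
import subprocess
import sys

# Hypothetical user script: upper-cases the second tab-separated column.
USER_SCRIPT = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    key, value = line.rstrip('\\n').split('\\t')\n"
    "    print(key + '\\t' + value.upper())\n"
)


def script_transform(rows, command):
    """Serialize rows as tab-delimited lines, pipe them through `command`,
    and parse the lines it writes back."""
    payload = "".join("\t".join(row) + "\n" for row in rows)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return [tuple(line.split("\t")) for line in result.stdout.splitlines()]


rows = [("1", "alpha"), ("2", "beta")]
transformed = script_transform(rows, [sys.executable, "-c", USER_SCRIPT])
```

Even in this toy form, the child process is outside the caller's control once spawned, which hints at why resource enforcement and process monitoring become production concerns.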
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Databricks
Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.
At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this gives developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can potentially provide significant performance gains.
In this talk, we discuss how our machine learning pipeline based on Spark has been improved over the years. Techniques such as predicate pushdown and minimizing wide transformations have led to significant run-time improvements and resource savings.
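Predicate pushdown, one of the techniques named above, means applying a filter before an expensive operation rather than after it. The plain-Python sketch below shows the effect on a small join; it is not Netflix's pipeline or Spark's optimizer, and the data and the `hash_join` helper are made up for illustration.

```python
plays = [
    {"user": "a", "title": "t1", "country": "US"},
    {"user": "b", "title": "t2", "country": "FR"},
    {"user": "c", "title": "t1", "country": "US"},
]
titles = [
    {"title": "t1", "genre": "drama"},
    {"title": "t2", "genre": "comedy"},
]


def hash_join(left, right, key):
    """Naive hash join: index the right side, then probe it with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Filter after the join: every play is joined, then most rows are discarded.
late = [r for r in hash_join(plays, titles, "title") if r["country"] == "FR"]

# Predicate pushed down: filter first, so the join sees fewer rows.
pushed_input = [r for r in plays if r["country"] == "FR"]
early = hash_join(pushed_input, titles, "title")
```

Both orders return the same rows, but the pushed-down version joins one row instead of three.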
Hamburg Data Science Meetup - MLOps with a Feature Store
Moritz Meister
MLOps is a trend in machine learning (ML) engineering that unifies ML system development (Dev) and ML system operation (Ops). Some ML lifecycle frameworks, such as TensorFlow Extended, are based around end-to-end pipelines that start with raw data and end in production models. During this talk we will introduce the concept of a feature store as the missing piece of ML infrastructure that enables faster, lower-cost deployment of models. We will show how the Hopsworks Feature Store factors monolithic end-to-end ML pipelines into feature and model training pipelines that can each run at different cadences. We will show examples of ingestion and training pipelines including hyperparameter optimization and model deployment.
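The idea of factoring one pipeline into a feature pipeline and a training pipeline with different cadences can be sketched as two functions that only communicate through a shared store. Everything here is a simplification: the store is a dict, the schedule is a loop over simulated hours, and the names and the hourly/daily cadences are assumptions, not Hopsworks behavior.

```python
feature_store = {}   # shared interface between the two pipelines
trained_models = []  # record of each simulated training run


def feature_pipeline(hour):
    """Runs often: recompute features and write them to the store."""
    feature_store["clicks_total"] = feature_store.get("clicks_total", 0) + 10
    feature_store["updated_at_hour"] = hour


def training_pipeline(hour):
    """Runs rarely: read whatever features are current and 'train' on them."""
    snapshot = dict(feature_store)
    trained_models.append({"trained_at_hour": hour, "features": snapshot})


for hour in range(1, 49):
    feature_pipeline(hour)        # hourly cadence
    if hour % 24 == 0:
        training_pipeline(hour)   # daily cadence
```

Neither function calls the other, so either cadence can change without touching the other pipeline.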
Building a Feature Store around Dataframes and Apache Spark
Databricks
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform.
Large-Scale Data Science in Apache Spark 2.0
Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. He will also talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familiar programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
Hopsworks at Google AI Huddle, Sunnyvale
Jim Dowling
Hopsworks is a platform for designing and operating end-to-end machine learning using PySpark and TensorFlow/PyTorch. Early access is now available on GCP. Hopsworks includes the industry's first Feature Store. Hopsworks is open-source.
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
Piyush Kumar
- Why organization(s) need FeatureStore to remove complex pipeline jungles and to simplify Machine Learning workflows
- How we developed MetaConfig driven FeatureStore @MakeMyTrip, Architecture
- Prediction Serving infrastructure for online & batch
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA
MakeMyTrip, India's #1 online travel platform with more than 70% of its traffic coming from mobile apps, embarked on a journey to revolutionize its customer experience by building a scalable, personalized, machine learning based platform which powers onboarding, in-funnel and post-funnel engagement flows, such as ranking, dynamic pricing, persuasions, cross-sell and propensity models. For a company like MakeMyTrip, the next wave of consumer growth is driven and powered by data products for personalization and context-aware mobile experiences. A better data architecture for ingesting user activity streams (events), processing, and data APIs provides a foundation for real-time feature generation for machine learning models.
Topics include:
* Why common feature-store, removing dataset fragmentation caused by usecase-by-usecase approach!
* Productionizing ML via standardization : MetaConfigs & FeatureCatalog | Reducing Data-Tech Debt
* Developing Real-Time Serving store over Spark Streaming, Kafka, RocksDB, Akka HTTP Data APIs
* Lifecycle of feature generation | Online (Near Real-Time) & Historical (Batch) Compute
* Consistent Feature Engineering & Model Deployment for DSA: DataScience Automation
As technology we leverage Kafka, Spark (Streaming, SQL), Scala, Python, AWS (S3, EMR, Glue and other services), DRUID, Hive, Presto, Cassandra, RocksDB, Redis, and Akka HTTP.
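The abstract does not say what a MetaConfig looks like, so the sketch below only illustrates the general idea of config-driven feature computation: features are declared as data, and one generic engine computes them. The config keys (`feature`, `field`, `agg`) and every name in the code are hypothetical.

```python
# Supported aggregations, looked up by name from the config.
AGGREGATIONS = {"sum": sum, "max": max, "count": len}

# Hypothetical declarative config: one entry per feature.
META_CONFIG = [
    {"feature": "total_spend", "field": "amount", "agg": "sum"},
    {"feature": "max_spend", "field": "amount", "agg": "max"},
    {"feature": "booking_count", "field": "amount", "agg": "count"},
]


def compute_features(events, config):
    """Compute every feature declared in `config` from a non-empty event list."""
    features = {}
    for spec in config:
        values = [e[spec["field"]] for e in events if spec["field"] in e]
        features[spec["feature"]] = AGGREGATIONS[spec["agg"]](values)
    return features


events = [{"amount": 120}, {"amount": 80}, {"amount": 200}]
features = compute_features(events, META_CONFIG)
```

Adding a feature then means adding a config entry rather than writing a new pipeline, which is one plausible reading of how standardization reduces per-use-case fragmentation.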
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through the Apache Apex applications? Together, we will review Apache Apex, and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex application metrics and data, including open-source options like REST and PubSub mechanisms in StrAM, as well as features available in the RTS Console like real-time Dashboards and Widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL
Enterprise guide to building a Data Mesh
Sion Smith
Making Data Mesh simple, open source, and available to all: no vendor lock-in, no complex tooling, and an approach centered on ‘specifications’, existing tools, and a built-in ‘domain’ model.
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
Amazon EMR is a managed Hadoop service that makes it easy for customers to use big data frameworks and applications like Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon’s highly scalable object storage service. In this webinar, we will introduce the latest release of Amazon EMR. With Amazon EMR release 5.0, customers can now launch the latest versions of popular open source frameworks including Apache Spark 2.0, Hive 2.1, Presto 0.151, Tez 0.8.4, and Apache Hadoop 2.7.2. We will walk through a demo to show you how to deploy a Hadoop environment within minutes. We will cover common use cases and best practices to lower costs using Amazon S3 as your data store and Amazon EC2 Spot Instances, which allow you to bid on spare Amazon computing capacity.
Learning Objectives:
• Describe the new features and updated frameworks in Amazon EMR 5.0
• Learn best practices and real-world applications for Amazon EMR
• Understand how to use EC2 Spot pricing to save costs
• Explain the advantages of decoupling storage and compute with Amazon S3 as storage layer for EMR workloads
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.
We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
PyData Berlin 2023 - Mythical ML Pipeline.pdfJim Dowling
This talk is a mental map for building ML systems as ML Pipelines that are factored into Feature Pipelines, Training Pipelines, and Inference Pipelines.
Building Hopsworks, a cloud-native managed feature store for machine learning Jim Dowling
Cloud Native London talk about the control layer of Hopsworks.ai and our choice of cloud native services. We built our own multi-tenant services as cloud native services, for the most part.
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsJim Dowling
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs, including AllReduce, Horovod, and how commodity GPU servers, such as DeepLearning11, will gain adoption.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I asked myself, as an “infrastructure container Kubernetes guy”, how this fancy AI technology can be managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss which cloud/on-premise strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I already have working for real.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Hopsworks Feature Store 2.0 a new paradigm
1. Hopsworks
Feature Store 2.0,
a new paradigm
Jim Dowling
Logical Clocks
2020-12-14
1st Global Feature Stores
for ML Meetup
2. Growing Consensus on how to manage complexity of AI
Diagram labels: Feature Store Online; Distributed Training; Model Serving; A/B Testing; Monitoring; Pipeline Management; HyperParameter Tuning; Feature Store Offline; Feature Engineering; Connectors to External Data Sources; Data, Model, Prediction; φ(x)
3. Growing Consensus on how to manage complexity of AI
Diagram labels: Data validation; Distributed; ENGINEER; Model Serving; A/B Testing; Monitoring; Pipeline Management; HyperParameter Tuning; Feature Engineering; Data Collection; Hardware Management; Data, Model, Prediction; φ(x); ML PLATFORM; TRAIN and SERVE; FEATURE STORE
4. End-to-End ML Pipelines and the Feature Store
Diagram labels: Data Lake, Warehouse, Kafka; Feature Store; Model registry; Feature Engineering; Model Serving; Model Training; Model Deploy; Features; Validate; Retrieve Feature Values
5. End-to-End ML Pipelines and the Feature Store with CI/CD
Diagram labels: Code and configuration; Data Lake, Warehouse, Kafka; Feature Store; Model registry; Feature Engineering; Model Serving; Model Training; Model Deploy; Model Monitoring; Experiments/Development; Features; Validate; Retrieve Feature Values; Log Predictions, Retrieve Feature Statistics for Data Drift Detection
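The "Retrieve Feature Statistics for Data Drift Detection" step can be pictured with a small sketch. This is an illustration in plain Python under stated assumptions, not the Hopsworks monitoring API: the helper names `feature_statistics` and `has_drifted`, the sample values, and the three-standard-error threshold are all hypothetical.

```python
from statistics import mean, stdev


def feature_statistics(values):
    """Summary statistics recorded for one feature of a training dataset."""
    return {"mean": mean(values), "stdev": stdev(values)}


def has_drifted(training_stats, serving_values, threshold=3.0):
    """Flag drift when the mean of the logged serving values lies more than
    `threshold` standard errors away from the training mean."""
    n = len(serving_values)
    standard_error = training_stats["stdev"] / n ** 0.5
    shift = abs(mean(serving_values) - training_stats["mean"])
    return shift > threshold * standard_error


# Hypothetical values of one feature at training time.
training_balance = [100.0, 120.0, 90.0, 110.0, 105.0, 95.0]
stats = feature_statistics(training_balance)

# Hypothetical values logged at prediction time.
similar_window = [102.0, 98.0, 111.0, 104.0]
shifted_window = [180.0, 175.0, 190.0, 185.0]

print(has_drifted(stats, similar_window))
print(has_drifted(stats, shifted_window))
```

A production monitor would typically compare full distributions rather than means alone; the sketch only shows why statistics stored alongside the features are useful at serving time.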
6. End-to-End ML Pipelines and the Feature Store with CI/CD and Provenance
Diagram labels: Code and configuration; Data Lake, Warehouse, Kafka; Feature Store; Model registry; Feature Engineering; Model Serving; Model Training; Model Deploy; Model Monitoring; Experiments/Development; Scaleout Metadata; Features; Validate; Retrieve Feature Values; Log Predictions, Retrieve Feature Statistics for Data Drift Detection; Elasticsearch; Sync
7. Hopsworks Feature Store Concepts: Features, Feature Groups, and Training Datasets
Features: name, Pclass, Sex, Survive, Name, Balance
Feature Groups: Titanic Passenger List; Passenger Bank Account
8. Hopsworks Feature Store Concepts: Features, Feature Groups, and Training Datasets
Features: name, Pclass, Sex, Survive, Name, Balance
Feature Groups: Titanic Passenger List; Passenger Bank Account
Training Datasets (Join): Survive, name, PClass, Sex, Balance
9. Hopsworks Feature Store Concepts: Features, Feature Groups, and Training Datasets
Features: name, Pclass, Sex, Survive, Name, Balance
Feature Groups: Titanic Passenger List; Passenger Bank Account
Training Datasets (Join): Survive, name, PClass, Sex, Balance
File format: .tfrecord, .npy, .csv, .hdf5, .petastorm, etc
Storage: Azure, S3, HopsFS
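The relationship between features, feature groups, and training datasets can be sketched in plain Python. This is a minimal illustration, not the Hopsworks API: the rows and the helpers `join_feature_groups` and `select_features` are hypothetical, and the join key is assumed to be the passenger name.

```python
# Two feature groups, each a list of rows keyed by passenger name.
titanic_passenger_list = [
    {"name": "Alice", "pclass": 1, "sex": "female", "survive": 1},
    {"name": "Bob", "pclass": 3, "sex": "male", "survive": 0},
]
passenger_bank_account = [
    {"name": "Alice", "balance": 2500.0},
    {"name": "Bob", "balance": 40.0},
    {"name": "Carol", "balance": 900.0},
]


def join_feature_groups(left, right, on):
    """Inner-join two lists of feature rows on a shared key column."""
    right_by_key = {row[on]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[on])
        if match is not None:
            joined.append({**row, **match})
    return joined


def select_features(rows, features):
    """Keep only the chosen features, in the given order."""
    return [{name: row[name] for name in features} for row in rows]


# A training dataset is a selection of features joined across feature groups.
training_dataset = select_features(
    join_feature_groups(titanic_passenger_list, passenger_bank_account, on="name"),
    ["survive", "name", "pclass", "sex", "balance"],
)

for row in training_dataset:
    print(row)
```

Carol has no row in the passenger list, so the inner join leaves her out of the training dataset. The result could then be written out in any of the listed file formats.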
10. Features are created/updated at different cadences
Update cadences: Click features every 10 secs; CDC data every 30 secs; User profile updates every hour; Featurized weblogs data every day
Diagram labels: Online Feature Store; Offline Feature Store; SQL DW; S3, HDFS; SQL; Event Data; Real-Time Data; User-Entered Features (<2 secs); Online App; Low Latency Features; High Latency Features; Train, Batch App; Feature Store; <10ms; TBs/PBs
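The split between an online store (latest values, low latency) and an offline store (full history, large scale) can be illustrated with a small in-memory sketch. The class `DualFeatureStore` and its methods are hypothetical and say nothing about how Hopsworks implements either store.

```python
class DualFeatureStore:
    """Toy store: every write goes to both an offline log and an online map."""

    def __init__(self, primary_key):
        self.primary_key = primary_key
        self.offline = []  # append-only history, used for training and batch apps
        self.online = {}   # latest row per key, used by the online app

    def insert(self, row):
        self.offline.append(dict(row))
        key = row[self.primary_key]
        current = self.online.get(key)
        # Only replace the online value if this row is at least as recent.
        if current is None or row["event_time"] >= current["event_time"]:
            self.online[key] = dict(row)

    def get_online(self, key):
        """Single-key lookup, the access pattern of a low-latency app."""
        return self.online.get(key)

    def read_offline(self):
        """Full scan, the access pattern of training and batch apps."""
        return list(self.offline)


store = DualFeatureStore(primary_key="user_id")
store.insert({"user_id": 7, "event_time": 10, "clicks_last_10s": 2})
store.insert({"user_id": 7, "event_time": 20, "clicks_last_10s": 5})
store.insert({"user_id": 9, "event_time": 15, "clicks_last_10s": 1})

print(store.get_online(7))
print(len(store.read_offline()))
```

Features that update every few seconds and features that update daily can share this shape; only the write frequency differs.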
11. FeatureGroup Ingestion in Hopsworks
Diagram labels: Feature Store; ClickFeatureGroup; TableFeatureGroup; UserFeatureGroup; LogsFeatureGroup; RTFeatureGroup; Event Data; SQL DW; S3, HDFS; SQL; DataFrameAPI; Kafka Input; Kafka Output; Online App; Train, Batch App; User Clicks; DB Updates; User Profile Updates; Weblogs; Hof: Real-time feature Engineering
12. Hopsworks Feature Store V1 API
First Feature Store with a General Purpose DataFrame API
Feature Store is a cache for materialized features, not a library.
Online and Offline Feature Stores to support low latency and scale, respectively
Reuse of Features means JOINS – Spark as a join engine
13. Hopsworks Feature Store V2 API
Enforce feature-group scope and schema+data versioning as best practice
Better support for multiple feature stores - join features from development and production feature stores
Better support for complex joins of features
First class API support for time-travel
Support any Python or Spark client with a single library
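To make the time-travel item concrete, here is a minimal sketch of an "as of" read over a commit log. It is an illustration in plain Python, not the V2 API: the commit layout, the `read_as_of` helper, and the sample rows are assumptions.

```python
def read_as_of(commits, as_of, primary_key):
    """Rebuild a feature group's state from all commits made at or before
    `as_of`. Later commits overwrite earlier rows with the same key."""
    state = {}
    for commit in sorted(commits, key=lambda c: c["commit_time"]):
        if commit["commit_time"] > as_of:
            break
        for row in commit["upserts"]:
            state[row[primary_key]] = row
    return state


# Hypothetical commit log; ISO dates compare correctly as strings.
commits = [
    {"commit_time": "2020-12-01", "upserts": [
        {"location_id": 1, "precipitation": 0.10},
        {"location_id": 2, "precipitation": 0.40},
    ]},
    {"commit_time": "2020-12-10", "upserts": [
        {"location_id": 1, "precipitation": 0.75},
    ]},
]

before_update = read_as_of(commits, "2020-12-05", primary_key="location_id")
after_update = read_as_of(commits, "2020-12-14", primary_key="location_id")

print(before_update[1]["precipitation"])
print(after_update[1]["precipitation"])
```

Reading with an earlier timestamp reproduces the data a model was trained on, which is what makes experiments repeatable after the features have been updated.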
14. Example Ingestion of data into a FeatureGroup
https://docs.hopsworks.ai/
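# Assumes an existing SparkSession `spark` and a feature store handle `fs`
from pyspark.sql import functions as F
dataframe = spark.read.json("s3://dataset/rain.json")
# do feature engineering on your dataframe: min-max scale 'val' into 'precipitation'
# (the slide left min and max as placeholders; here they are computed from the data)
min_val, max_val = dataframe.agg(F.min("val"), F.max("val")).first()
dataframe = dataframe.withColumn('precipitation', (F.col("val") - min_val) / (max_val - min_val))
fg = fs.create_feature_group("rain",
    version=1,
    description="Rain features",
    primary_key=['date', 'location_id'],
    online_enabled=True)
fg.save(dataframe)
fg.add_tag(name="ingestion", value="Databricks:jim; Pii;notebook.ipynb")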
15. Example Creation of Train/Test Data from a Feature Store
https://docs.hopsworks.ai/
# Join features across FeatureGroups. Use "on=[..]" to explicitly enter the JOIN key.
feature_join = (rain_fg.select_all()
    .join(temperature_fg.select_all(), on=["date", "location_id"])
    .join(location_fg.select_all()))
sc = fs.get_storage_connector("myBucket", "S3")
td = fs.create_training_dataset("training_dataset", version=1,
    storage_connector=sc,
    data_format="tfrecords",
    description="Training dataset, TfRecords format",
    splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
td.save(feature_join)
# When training a model, read the training data (use "test" to read test data):
ds = td.read(split="train")