Formats for Exchanging Archival Data: An Introduction to EAD, EAC-CPF, and Ar...Michael Rush
Presented July 1, 2011, as part of the session "Standards, Information, and Data Exchange" at the 7th International Seminar of Iberian Tradition Archives, Rio de Janeiro, Brazil.
Real-Time Machine Learning with Redis-ML
Shay Nativ from Redis Labs presented on using Redis and Redis-ML for real-time machine learning model serving. Redis-ML allows training models with tools like Spark and then deploying them to Redis for low-latency serving. This simplifies the ML lifecycle and improves performance and scalability compared to custom model serving. Shay demonstrated building a movie recommendation system using Spark for training random forests on the MovieLens dataset and deploying the models to Redis-ML for real-time recommendations with 60x faster performance than Spark alone.
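The serving side described above can be sketched in plain Python: a trained random forest is just a set of decision trees plus a majority vote over their predictions. This is an illustration of the computation Redis-ML performs server-side, not the Redis-ML API itself; the tree shapes and the "rating" feature are invented for the example.

```python
# Minimal sketch of serving a trained random forest: each tree is a
# nested dict, and classification is a majority vote across trees.
from collections import Counter

def run_tree(node, features):
    """Walk one decision tree until a leaf value is reached."""
    while "leaf" not in node:
        branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

def run_forest(trees, features):
    """Classify by majority vote across all trees."""
    votes = Counter(run_tree(t, features) for t in trees)
    return votes.most_common(1)[0][0]

# Two toy trees over a single, hypothetical "rating" feature.
trees = [
    {"feature": "rating", "threshold": 3.0,
     "left": {"leaf": "skip"}, "right": {"leaf": "recommend"}},
    {"feature": "rating", "threshold": 4.0,
     "left": {"leaf": "skip"}, "right": {"leaf": "recommend"}},
]
print(run_forest(trees, {"rating": 4.5}))  # both trees vote "recommend"
```

The speedup claimed in the talk comes from keeping exactly this traversal resident in memory next to the data, rather than spinning up a Spark job per request.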
Building a Large Scale Recommendation Engine with Spark and Redis-ML with Sha...Databricks
Redis-ML is a Redis module for high performance, real-time serving of Spark-ML models. It allows users to train large complex models in Spark, and then store and query the models directly on Redis clusters. The high throughput and low latency of Redis-ML allows users to perform heavy classification operations in real time while using a minimal number of servers. This unique architecture enables significant savings in resources compared to current commonly used methods, without loss in precision or server performance.
This session will demonstrate how to build a production-level recommendation system from the ground up using Spark-ML and Redis-ML. It will also describe performance and accuracy benchmarks, comparing the results with current standard methods.
Deploying Real-Time Decision Services Using Redis with Tague GriffithDatabricks
Most of the energy and attention in machine learning has focused on the model-training side of the problem. Multiple frameworks, in every language, give developers access to a host of data-manipulation and training algorithms, but until recently developers had virtually no frameworks for building predictive engines from trained ML models. Most developers resorted to building custom applications, but building highly available, highly performant applications is difficult.
Redis in conjunction with the Redis-ML module provides a server framework for developers to build predictive engines with familiar, off-the-shelf components. Developers can take advantage of all the features of Redis to deliver faster and more reliable prediction engines with less custom development.
This talk is a technical session which examines how Redis can be used in conjunction with a Spark based training platform to deliver real-time predictive and decision making features as part of a larger system. To set the context for the session, we start with an introduction to the Redis data model and how features of Redis (namespace, replication) can be used to build fast predictive engines (at scale), that are more reliable, more feature rich and easier to manage than custom applications. From there, we look at the model serving capabilities of Redis-ML and how they can be integrated with a Spark-based ML pipeline to automate the entire model development process from training to deployment.
The session ends with a demonstration of a simple machine learning pipeline. Using Spark we train several example models, load them directly into Redis and demonstrate Redis as the predictive engine for making real-time recommendations. At the end of the session, developers should feel confident that they could use Redis as a server framework to build a predictive serving engine for a Spark-based ML pipeline.
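Automating the hand-off from training to deployment, as described above, amounts to flattening each trained tree into a load command. The sketch below builds an `ML.FOREST.ADD` command string from a tree; the command shape (root path `.`, `l`/`r` path suffixes, `NUMERIC`/`LEAF` node types) follows the redis-ml project documentation and should be verified against your module version, and the tree itself is a stand-in for what a Spark ML pipeline would produce. No Redis connection is made here.

```python
# Sketch: walk a trained decision tree and emit the redis-ml
# ML.FOREST.ADD command that would load it into a Redis forest key.

def tree_to_command(key, tree_id, node):
    """Flatten one tree into a single ML.FOREST.ADD command string."""
    args = ["ML.FOREST.ADD", key, str(tree_id)]

    def walk(node, path):
        if "leaf" in node:
            args.extend([path, "LEAF", str(node["leaf"])])
        else:
            args.extend([path, "NUMERIC", node["feature"], str(node["threshold"])])
            walk(node["left"], path + "l")   # left child path
            walk(node["right"], path + "r")  # right child path

    walk(node, ".")  # "." denotes the root node
    return " ".join(args)

tree = {"feature": "age", "threshold": 30.0,
        "left": {"leaf": 0}, "right": {"leaf": 1}}
cmd = tree_to_command("myforest", 0, tree)
print(cmd)
```

In a real pipeline this string would be sent once per tree via a Redis client after each Spark training run, which is the automation step the session describes.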
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides. We cover the basic data types provided by Redis and the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
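Beyond trees, the regression models mentioned above are even simpler to serve: scoring a linear model is a dot product, and logistic regression adds a sigmoid. This pure-Python sketch shows the math a serving layer evaluates per request; the weights are invented for illustration.

```python
# Scoring linear and logistic regression models, as a serving layer
# (such as redis-ml) would evaluate them per request.
import math

def linreg(weights, bias, features):
    """Linear regression score: bias + w . x"""
    return bias + sum(w * x for w, x in zip(weights, features))

def logreg(weights, bias, features):
    """Logistic regression: squash the linear score through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-linreg(weights, bias, features)))

weights, bias = [0.8, -0.4], 0.1        # hypothetical trained parameters
score = logreg(weights, bias, [1.0, 2.0])  # probability of the positive class
print(round(score, 3))
```

Because each request touches only a handful of floats, this kind of scoring is dominated by network and framework overhead, which is exactly what an in-memory server minimizes.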
50 Shades of Data - how, when and why Big, Fast, Relational, NoSQL, Elastic, ...Lucas Jellema
Data has been and will remain the key ingredient of enterprise IT. What is changing is the nature, scope, and volume of data and its place in the IT architecture. Big data, unstructured data, and non-relational data stored on Hadoop, in NoSQL databases, and in Elasticsearch, caches, and message queues complement data in the enterprise RDBMS. Trends such as microservices that contain their own data, BASE, CQRS, and event sourcing have changed the way we store, share, and govern data. This session introduces patterns, technologies, and hypes around storing, processing, and retrieving data using products such as Oracle Database, Cassandra, MySQL, Neo4J, Kafka, Redis, Elasticsearch, and Hadoop/Spark, locally, in containers, and in the cloud. Key takeaway: what an application architect and a developer should know about the various types of data in enterprise IT and how to store, manage, query, and manipulate them; what products and technologies are at your disposal; and how you can make these work together for a consistent (enough) overall data presentation.
This document provides an agenda for a Microsoft Azure Virtual Training Day on data fundamentals. The training will cover core data concepts, relational and non-relational data services in Azure, and data analytics. It will include modules on relational data with SQL, non-relational data with Azure Storage and Cosmos DB, large-scale data warehousing, streaming analytics, and data visualization with Power BI. Demos will illustrate how to provision Azure database and storage services and visualize data. The goal is to describe fundamental data services and concepts for working with structured, semi-structured and unstructured data on Azure.
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C...Imply
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
Memory Analysis of the Dalvik (Android) Virtual MachineAndrew Case
The document summarizes research on analyzing the memory of the Dalvik virtual machine used in Android. It describes acquiring memory from Android devices, locating key data structures in memory like loaded classes and their fields, and analyzing specific Android applications to recover data like call histories, text messages, and location information. The goal is to develop forensics capabilities for investigating Android devices through memory analysis.
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Maurice Nsabimana
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, and Jiao Wang, a software engineer on the Big Data Technology team at Intel, demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premises or in the cloud.
The document provides an overview of the Informatica PowerCenter 7.1 product, describing its major components for ETL development, how to build basic mappings and workflows, and available options for loading target data. It also outlines the course objectives to understand PowerCenter architecture and components, build mappings and workflows, and troubleshoot common problems. Resources available from Informatica like documentation, support, and certification programs are also summarized.
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Databricks
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, will demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premises or in the cloud.
Attendees will also learn how to write a deep learning application that leverages Spark to train image recognition models at scale.
Gain Complete Visibility and Find Hidden Security IssuesElasticsearch
Even basic threats can be numerous and complex, and limited visibility into your security data simply is not enough. Whether you are running investigations or hunting threats, you need all the security-relevant context. Learn key practices for collecting and normalizing data, and see how you can use Elastic Security to triage, verify, and address issues quickly and accurately.
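The data-normalization practice mentioned above boils down to mapping each vendor's log fields onto one common schema before indexing, so that triage queries work across sources. In this sketch the target field names loosely follow the Elastic Common Schema (ECS); the input record and mapping table are invented for illustration.

```python
# Normalize a vendor-specific log record onto common-schema keys,
# dropping fields with no mapping.

def normalize(raw):
    """Rename vendor fields to common-schema keys; drop unknowns."""
    mapping = {
        "src": "source.ip",
        "dst": "destination.ip",
        "ts": "@timestamp",
        "act": "event.action",
    }
    return {mapping[k]: v for k, v in raw.items() if k in mapping}

event = normalize({"src": "10.0.0.5", "dst": "10.0.0.9",
                   "ts": "2021-06-01T12:00:00Z", "act": "denied",
                   "vendor_flag": "x"})  # vendor_flag has no mapping
print(event)
```

Once every source emits the same keys, a single query such as `source.ip: 10.0.0.5` spans firewalls, endpoints, and cloud logs alike.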
Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.
In this talk, we will discuss the challenges behind enforcing compliance at scale as well as LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features for users to define custom filtering rules using SQL, assigning such customizations to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects like Hadoop, Hive, Gobblin, and Wherehows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used for enforcing compliance on other stores like Pinot, Salesforce, and Espresso.
While there is no one-size fits all solution to guaranteeing user data privacy, this talk will provide a blueprint and concrete example of how to enforce compliance at scale, which we hope proves useful to organizations working to improve their privacy commitments. ISSAC BUENROSTRO, Staff Software Engineer, LinkedIn and ANTHONY HSU, Staff Software Engineer, LinkedIn
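The record-level purging with user-defined SQL filtering rules described above reduces, at its core, to applying a rule to a dataset and dropping the matching rows. The sketch below uses Python's built-in sqlite3 as a stand-in for data on HDFS; the table, rows, and purge rule are hypothetical, and LinkedIn's actual framework (Gobblin, Dali, etc.) operates at far larger scale.

```python
# Apply a user-defined SQL filtering rule to purge records, the core
# operation of a record-level compliance framework.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (member_id INTEGER, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "view"), (2, "click"), (1, "click")])

# A custom filtering rule assigned to this dataset: purge all records
# for members who revoked consent (here, member 1).
purge_rule = "member_id IN (1)"
conn.execute(f"DELETE FROM events WHERE {purge_rule}")

remaining = conn.execute("SELECT member_id, action FROM events").fetchall()
print(remaining)  # only member 2's record survives
```

The framework's value is in tracking which rule applies to which dataset and proving the physical deletion happened, not in the deletion statement itself.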
Toward Easy Export of Imagery Products and Feature Classes as Training Data f...Dawn Wright
American Association of Geographers (AAG) 2018 Symposium on Artificial Intelligence and Deep Learning in Geospatial Research
Whether to train a Deep Learning (DL) model to find objects of interest such as cars or solar panels in satellite or aerial images, or to classify such images into different categories of land use, or other such tasks, a common starting point is always labeled ground truth or training data. From an industry perspective, an organization such as ESRI has a large user base of roughly 350,000 agencies, universities, non-profits, and other partners, with most of them maintaining and continually updating their own GIS data. But how can this treasure trove of data be used effectively and appropriately for training new DL models? This talk will provide an overview of new tools to export GIS data from multiple sources into popular DL formats such as KITTI or PASCAL_VOC. These can then be directly used as input to DL frameworks such as Microsoft CNTK or Google TensorFlow in order to train DL models. For example, NAIP images and building footprints of an entire county can be exported as a sequence of equally sized image chips plus one metadata file per image chip containing the bounding boxes around all buildings in KITTI format. From this data a DL model can be trained that detects buildings. The hope is that this new suite of tools will make it easier for DL researchers and students at all levels (from undergraduate to doctoral and beyond) to access existing GIS data and to use them for training new DL models.
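The export step described above comes down to writing, per image chip, one label line per object. A KITTI label line has 15 space-separated fields: class, truncation, occlusion, alpha, the 2D bounding box (left, top, right, bottom, in pixels), then 3D dimensions, location, and rotation, which 2D-only exports commonly fill with placeholder values. The bounding box below is invented for illustration.

```python
# Write one KITTI-format label line for a 2D bounding box.

def kitti_line(cls, left, top, right, bottom):
    """Build a 15-field KITTI label line for a 2D-only annotation."""
    fields = [cls, "0.00", "0", "0.00",          # class, truncation, occlusion, alpha
              f"{left:.2f}", f"{top:.2f}", f"{right:.2f}", f"{bottom:.2f}",
              # 3D height/width/length, x/y/z, rotation_y: placeholders
              "-1", "-1", "-1", "-1000", "-1000", "-1000", "-10"]
    return " ".join(fields)

line = kitti_line("Building", 34.0, 120.5, 88.0, 190.0)
print(line)
```

An exporter would emit one such file per chip, with one line per building footprint that intersects the chip, alongside the chip image itself.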
50 Shades of Data – how, when and why Big,Relational,NoSQL,Elastic,Graph,Even...Lucas Jellema
Data has been and will be the key ingredient to enterprise IT. What is changing is the nature, scope, and volume of data and its place in the IT architecture. Big data, unstructured data, and nonrelational data stored on Hadoop; NoSQL databases; and in Elasticsearch, caches, and message queues complements data in the enterprise RDBMS. Trends such as microservices that contain their own data, BASE, CQRS, and event sourcing have changed the way we store, share, and govern data. This session introduces patterns, technologies, and hypes for storing, processing, and retrieving data with products such as Oracle Database, Cassandra, MySQL, Neo4J, Kafka, Redis, Elasticsearch, Blockchain (Hyperledger) and Hadoop/Spark—locally, in containers, and in the cloud.
La big datacamp-2014-aws-dynamodb-overview-michael_limcacoData Con LA
This document discusses Amazon DynamoDB, a fully managed NoSQL database service from AWS. It provides three key points:
1. DynamoDB offers fast and predictable performance with single-digit millisecond latency, automatic scaling of storage and throughput capacity, and built-in security, backup and disaster recovery capabilities.
2. DynamoDB uses a flexible data model with key-value and document data structures, includes rich query capabilities, and provides SDKs/APIs for developers.
3. The document provides an example of modeling user data and files for a media catalog application using DynamoDB tables and secondary indexes to support various access patterns like searching by
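Modeling a media catalog's access patterns, as the example above describes, typically means a table keyed by user plus a secondary index for an alternate lookup. The sketch below is only the parameter dictionary one would pass to boto3's `create_table`; no AWS call is made, and the table, attribute, and index names are hypothetical.

```python
# A DynamoDB table definition (boto3 create_table parameters) for a
# media-catalog: items keyed by user, with a GSI for lookup by type.
table_spec = {
    "TableName": "MediaFiles",
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "file_id", "KeyType": "RANGE"},  # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "file_id", "AttributeType": "S"},
        {"AttributeName": "file_type", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexes": [{
        "IndexName": "ByType",  # supports "all files of a given type"
        "KeySchema": [{"AttributeName": "file_type", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "ALL"},
    }],
    "BillingMode": "PAY_PER_REQUEST",
}
print(table_spec["TableName"])
```

Each access pattern ("files for a user", "files of a type") gets its own key path, which is the essence of DynamoDB modeling: design the keys around the queries, not the entities.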
During this session for Teams Day Online 2021 I explained the concepts of eDiscovery and showed how information from Microsoft Teams can be discovered using core and advanced eDiscovery.
How in memory technology will impact machine deep learning services (redis la...Avner Algom
This document discusses how in-memory technology can impact machine and deep learning services using Redis Labs as a case study. It describes how Redis can provide a simple, extensible, and high performance platform for serving machine learning models. Serving complex models at scale is challenging due to their size, lack of standardization, and high costs. Redis-ML module allows predictive models to be stored and evaluated directly in Redis, reducing infrastructure needs by 97% for an ad serving use case compared to a homegrown solution. Co-locating streams, data, and machine learning engines in an in-memory database like Redis can reduce data movement, messages, and latency compared to traditional machine learning pipelines.
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
Spark 2.0 provided strong performance enhancements to the Spark core while advancing Spark ML usability by moving to DataFrames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. How do new hardware advances affect the topology of high-performance Spark clusters? In this talk we will explore Spark 2.0 machine learning at scale and share our findings with the community.
As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM, latest-generation NVMe SSDs, and a 100GbE network, with a goal of more performance in a more space- and energy-efficient footprint.
This document provides an overview of deep learning and its applications. It discusses how deep learning can be used for image classification and how neural networks learn hierarchical representations from data. The document highlights some of the challenges of deep learning, such as the large amounts of data and computation required. It also covers how deep learning models can be deployed in production using services like Amazon Web Services to ensure low latency, high availability, and continuous learning.
LinkedIn is a large professional social network with 50 million users from around the world. It faces big data challenges at scale, such as caching a user's third-degree network of up to 20 million connections and performing searches across 50 million user profiles. LinkedIn uses Hadoop and other scalable architectures like distributed search engines and custom graph engines to solve these problems. Hadoop provides a scalable framework to process massive amounts of user data across thousands of nodes through its MapReduce programming model and HDFS distributed file system.
Domain Identification for Linked Open DataSarasi Sarangi
Linked Open Data (LOD) has emerged as one of the largest collections of interlinked structured datasets on the Web. Although the adoption of such datasets for applications is increasing, identifying relevant datasets for a specific task or topic is still challenging. As an initial step to make such identification easier, we provide an approach to automatically identify the topic domains of given datasets. Our method utilizes existing knowledge sources, more specifically Freebase, and we present an evaluation which validates the topic domains we can identify with our system. Furthermore, we evaluate the effectiveness of identified topic domains for the purpose of finding relevant datasets, thus showing that our approach improves reusability of LOD datasets.
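The core idea (inferring a dataset's topic domain from the domains of the knowledge-base entities it links to) can be sketched as a frequency count. The entity-to-domain lookup below stands in for Freebase and is invented; the actual approach involves entity resolution against the knowledge source rather than a literal dictionary.

```python
# Infer a dataset's topic domain by majority vote over the domains of
# the entities it references.
from collections import Counter

entity_domain = {          # hypothetical knowledge-base lookup
    "Aspirin": "medicine", "Ibuprofen": "medicine",
    "Berlin": "geography", "Insulin": "medicine",
}

def identify_domains(dataset_entities, top_n=1):
    """Return the top_n most frequent domains among known entities."""
    counts = Counter(entity_domain[e] for e in dataset_entities
                     if e in entity_domain)
    return [d for d, _ in counts.most_common(top_n)]

print(identify_domains(["Aspirin", "Insulin", "Berlin", "Ibuprofen"]))
```

A dataset dominated by drug entities would thus surface as a "medicine" dataset, which is exactly the signal used to match datasets to tasks.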
The document provides an overview of MongoDB administration including its data model, replication for high availability, sharding for scalability, deployment architectures, operations, security features, and resources for operations teams. The key topics covered are the flexible document data model, replication using replica sets for high availability, scaling out through sharding of data across multiple servers, and different deployment architectures including single/multi data center configurations.
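Sharding, as covered above, means the cluster routes each document to a shard based on its shard key; a hashed shard key spreads writes evenly across shards. This pure-Python sketch shows the routing idea only; MongoDB's actual hash function and chunk-migration mechanics differ, and the shard count and keys here are invented.

```python
# Illustrate hashed shard-key routing: hash the key, take it modulo
# the number of shards to pick a placement.
import hashlib

NUM_SHARDS = 3

def route(shard_key_value):
    """Deterministically map a shard-key value to a shard id."""
    digest = hashlib.md5(str(shard_key_value).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

docs = ["user:1", "user:2", "user:3", "user:4"]
placement = {d: route(d) for d in docs}
print(placement)
```

The determinism matters: any router hashing the same key reaches the same shard, so reads that include the shard key can be sent to exactly one server.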
Data Modeling for Security, Privacy and Data ProtectionKaren Lopez
Karen Lopez's (@datchick, InfoAdvisors) 90-minute presentation on data security, data privacy, and compliance, and on how data modelers should discover, assess, and monitor these important data-management responsibilities.
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C...Imply
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
Memory Analysis of the Dalvik (Android) Virtual MachineAndrew Case
The document summarizes research on analyzing the memory of the Dalvik virtual machine used in Android. It describes acquiring memory from Android devices, locating key data structures in memory like loaded classes and their fields, and analyzing specific Android applications to recover data like call histories, text messages, and location information. The goal is to develop forensics capabilities for investigating Android devices through memory analysis.
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Maurice Nsabimana
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, and Jiao Wang, a software engineer on the Big Data Technology team at Intel, demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premises or in the cloud.
The document provides an overview of the Informatica PowerCenter 7.1 product, describing its major components for ETL development, how to build basic mappings and workflows, and available options for loading target data. It also outlines the course objectives to understand PowerCenter architecture and components, build mappings and workflows, and troubleshoot common problems. Resources available from Informatica like documentation, support, and certification programs are also summarized.
Get complete visibility and find hidden security issuesElasticsearch
Even basic threats can be numerous and complex, and limited visibility into your security data simply is not enough. Whether you are running investigations or hunting threats, you need all the security-relevant context. Learn key practices for collecting and normalizing data, and see how you can use Elastic Security to triage, verify, and address issues quickly and accurately.
Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.
In this talk, we will discuss the challenges behind enforcing compliance at scale as well as LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features for users to define custom filtering rules using SQL, assigning such customizations to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects like Hadoop, Hive, Gobblin, and Wherehows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used for enforcing compliance on other stores like Pinot, Salesforce, and Espresso.
While there is no one-size-fits-all solution for guaranteeing user data privacy, this talk will provide a blueprint and a concrete example of how to enforce compliance at scale, which we hope proves useful to organizations working to improve their privacy commitments. ISSAC BUENROSTRO, Staff Software Engineer, LinkedIn, and ANTHONY HSU, Staff Software Engineer, LinkedIn
50 Shades of Data – how, when and why Big,Relational,NoSQL,Elastic,Graph,Even...Lucas Jellema
Data has been and will be the key ingredient to enterprise IT. What is changing is the nature, scope and volume of data and the place of data in the IT architecture. Big data, unstructured data and non-relational data stored on Hadoop, in NoSQL databases and in Elasticsearch, caches and message queues complement data in the enterprise RDBMS. Trends such as microservices that contain their own data, BASE, CQRS and event sourcing have changed the way we store, share and govern data. This session introduces patterns, technologies and hypes around storing, processing and retrieving data using products such as Oracle Database, Cassandra, MySQL, Neo4J, Kafka, Redis, Elasticsearch and Hadoop/Spark, locally, in containers and in the cloud. Key takeaway: what an application architect and a developer should know about the various types of data in enterprise IT and how to store, manage, query and manipulate them; what products and technologies are at your disposal; and how you can make these work together for a consistent (enough) overall data presentation.
Toward Easy Export of Imagery Products and Feature Classes as Training Data f...Dawn Wright
American Association of Geographers (AAG) 2018 Symposium on Artificial Intelligence and Deep Learning in Geospatial Research
Whether to train a Deep Learning (DL) model to find objects of interest such as cars or solar panels in satellite or aerial images, or to classify such images into different categories of land-use, or other such tasks, a common starting point is always labeled ground truth or training data. From an industry perspective, an organization such as ESRI has a large user base of roughly 350,000 agencies, universities, non-profits, and other partners, most of which maintain and continually update their own GIS data. But how can this treasure trove of data be effectively and appropriately used for training new DL models? This talk will provide an overview of new tools to export GIS data from multiple sources into popular DL formats such as KITTI or PASCAL_VOC. These can then be directly used as input to DL frameworks such as Microsoft CNTK or Google TensorFlow in order to train DL models. For example, NAIP images and building footprints of an entire county can be exported as a sequence of equally sized image chips plus one metadata file per image chip containing the bounding boxes around all buildings in KITTI format. From this data a DL model can be trained that detects buildings. The hope is that this new suite of tools will make it easier for DL researchers and students at all levels (from undergraduate to doctoral and beyond) to access existing GIS data and to use it for training new DL models.
Big Data Camp LA 2014: AWS DynamoDB Overview (Michael Limcaco)Data Con LA
This document discusses Amazon DynamoDB, a fully managed NoSQL database service from AWS. It provides three key points:
1. DynamoDB offers fast and predictable performance with single-digit millisecond latency, automatic scaling of storage and throughput capacity, and built-in security, backup and disaster recovery capabilities.
2. DynamoDB uses a flexible data model supporting both key-value and document data structures. It also includes rich query capabilities and SDKs/APIs for developers.
3. The document provides an example of modeling user data and files for a media catalog application using DynamoDB tables and secondary indexes to support various access patterns like searching by
During this session for Teams Day Online 2021 I explained the concepts of eDiscovery and showed how information from Microsoft Teams can be discovered using core and advanced eDiscovery.
How in memory technology will impact machine deep learning services (redis la...Avner Algom
This document discusses how in-memory technology can impact machine and deep learning services using Redis Labs as a case study. It describes how Redis can provide a simple, extensible, and high performance platform for serving machine learning models. Serving complex models at scale is challenging due to their size, lack of standardization, and high costs. Redis-ML module allows predictive models to be stored and evaluated directly in Redis, reducing infrastructure needs by 97% for an ad serving use case compared to a homegrown solution. Co-locating streams, data, and machine learning engines in an in-memory database like Redis can reduce data movement, messages, and latency compared to traditional machine learning pipelines.
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
Spark 2.0 provided strong performance enhancements to the Spark core while advancing Spark ML usability to use DataFrames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. How do new hardware advances affect the topology of high performance Spark clusters? In this talk we will explore Spark 2.0 Machine Learning at scale and share our findings with the community.
As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM and latest generation NVMe SSD’s and a 100GbE network with a goal of more performance, in a more space and energy efficient footprint.
This document provides an overview of deep learning and its applications. It discusses how deep learning can be used for image classification and how neural networks learn hierarchical representations from data. The document highlights some of the challenges of deep learning, such as the large amounts of data and computation required. It also covers how deep learning models can be deployed in production using services like Amazon Web Services to ensure low latency, high availability, and continuous learning.
LinkedIn is a large professional social network with 50 million users from around the world. It faces big data challenges at scale, such as caching a user's third degree network of up to 20 million connections and performing searches across 50 million user profiles. LinkedIn uses Hadoop and other scalable architectures like distributed search engines and custom graph engines to solve these problems. Hadoop provides a scalable framework to process massive amounts of user data across thousands of nodes through its MapReduce programming model and HDFS distributed file system.
Domain Identification for Linked Open DataSarasi Sarangi
Linked Open Data (LOD) has emerged as one of the largest collections of interlinked structured datasets on the Web. Although the adoption of such datasets for applications is increasing, identifying relevant datasets for a specific task or topic is still challenging. As an initial step to make such identification easier, we provide an approach to automatically identify the topic domains of given datasets. Our method utilizes existing knowledge sources, more specifically Freebase, and we present an evaluation which validates the topic domains we can identify with our system. Furthermore, we evaluate the effectiveness of identified topic domains for the purpose of finding relevant datasets, thus showing that our approach improves reusability of LOD datasets.
The document provides an overview of MongoDB administration including its data model, replication for high availability, sharding for scalability, deployment architectures, operations, security features, and resources for operations teams. The key topics covered are the flexible document data model, replication using replica sets for high availability, scaling out through sharding of data across multiple servers, and different deployment architectures including single/multi data center configurations.
Data Modeling for Security, Privacy and Data ProtectionKaren Lopez
Karen Lopez (@datchick/InfoAdvisors) 90-minute presentation on Data Security, Data Privacy, Compliance and how data modelers should discover, assess, and monitor these important data management responsibilities.
4. 4
Who I am
• Head of Developer Advocacy for Redis Labs
• Developer and architect turned Evangelist
• Infrastructure and Distributed Systems
• Large Scale Redis Systems
• Former: Apple, Netscape, Yahoo/Flickr, GoPro
• Focus on the Open Source Community
• Education and Support
• Nurture and grow the entire community
5. 5
Redis Labs – Home of Redis
Founded in 2011
HQ in Mountain View CA, R&D center in Tel-Aviv IL
The commercial company behind Open Source Redis
Provider of the Redis Enterprise (Redise) technology, platform and products
6. 6
Redis Labs Products
SERVICES
• Redise Cloud: fully managed Redise service on hosted servers within AWS, MS Azure, GCP, IBM Softlayer, Heroku, CF & OpenShift
• Redise Cloud Private: fully managed Redise service in VPCs within AWS, MS Azure, GCP & IBM Softlayer
SOFTWARE
• Redise Pack: downloadable Redise software for any enterprise datacenter or cloud environment
• Redise Pack Managed: fully managed Redise Pack in private data centers
16. 16
Typical Spark Application Structure
[Diagram] Training: data is loaded into Spark and the trained model is saved as files on a file system. Serving: a custom server loads the model files and serves requests from the client app.
19. 19
Redis Modules
• Any C/C++ program can now run on Redis
• Use existing or add new data-structures
• Enjoy simplicity, infinite scalability and high availability while keeping the native speed of Redis
• Can be created by anyone
[Diagram labels] New Capabilities • New Commands • New Data Types
20. 20
Redis-ML: Predictive Model Serving Engine
• Predictive models as native Redis types
• Perform evaluation directly in Redis
• Store training output as “hot model”
[Diagram] Training: data is loaded into Spark (or any training platform) and the trained model is saved into Redis-ML. Serving: client apps run evaluations directly against Redis-ML.
21. 21
Redis ML Module
• Tree ensembles
• Linear regression
• Logistic regression
• Matrix + vector operations
• More to come...
22. 22
Random Forest Model
• A collection of decision trees
• Supports classification & regression
• Splitter Node can be:
◦ Categorical (e.g. day == “Sunday”)
◦ Numerical (e.g. age < 43)
• The final decision is taken by a majority vote of the decision trees
23. 23
Classic Tree Problem: Titanic Survival
[Decision tree] Sex = Male? No → Survived. Yes → Age < 9.5? No → Died. Yes → Sibps > 2.5? Yes → Died; No → Survived.
• Passenger Data encoded as feature vectors
• ML Algorithm learns the tree rules
• ID3, CART (RPART), etc.
• Tree rules used to infer results
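The tree pictured above can be sketched as a plain function. This is a toy reconstruction of the slide's rules (the thresholds follow the well-known Titanic decision-tree example; the exact leaf outcomes are read off the garbled diagram, so treat them as illustrative):

```python
def titanic_tree(sex, age, sibsp):
    """Toy decision tree with rules like those learned by ID3/CART."""
    if sex != "male":
        return "Survived"   # most female passengers survived
    if age >= 9.5:
        return "Died"       # most adult male passengers died
    if sibsp > 2.5:
        return "Died"       # young boys travelling with many siblings
    return "Survived"       # young boys from small families

print(titanic_tree("female", 30, 0))  # Survived
print(titanic_tree("male", 6, 3))     # Died
```

Inference walks from the root splitter down to a leaf, answering one question per level.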
24. 24
Titanic Survival: Random Forest
[Diagram] Three trees, each voting Survived or Died:
• Tree #1: splits on Sex = Male?, Age < 9.5?, Sibps > 2.5?
• Tree #2: splits on Country = US?, State = CA?, Height > 1.60m?
• Tree #3: splits on Weight < 80kg?, I.Q < 100?, Eye color = blue?
25. 25
Who Would Survive the Titanic
John:
• Male, 34
• Married w/ 2 kids (Sibps=3)
• New York, USA
• 1.78m, 78kg
• IQ 110
• Blue eyes
Mathew:
• Male, 6
• 3 sisters (Sibps=3)
• New York, USA
• 1.06m, 22.7kg
• IQ 100
• Brown eyes
Let's use our forest to find out
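A toy sketch of the forest vote. Tree #1 follows the classic Titanic rules; the branch outcomes of trees #2 and #3 are illustrative guesses (the slide diagram does not survive transcription), so only the voting mechanism, not the specific leaves, should be taken literally:

```python
from collections import Counter

def tree1(p):  # classic Titanic rules
    if p["sex"] != "male":
        return "Survived"
    if p["age"] >= 9.5:
        return "Died"
    return "Died" if p["sibsp"] > 2.5 else "Survived"

def tree2(p):  # country / state / height splits; leaves are illustrative
    if p["country"] != "US":
        return "Died"
    if p["state"] == "CA":
        return "Survived"
    return "Survived" if p["height"] > 1.60 else "Died"

def tree3(p):  # weight / IQ / eye-color splits; leaves are illustrative
    if p["weight"] < 80:
        return "Survived"
    if p["iq"] < 100:
        return "Died"
    return "Survived" if p["eyes"] == "blue" else "Died"

def forest(p):
    # Random forest decision: majority vote over the trees.
    votes = Counter(t(p) for t in (tree1, tree2, tree3))
    return votes.most_common(1)[0][0]

john = dict(sex="male", age=34, sibsp=3, country="US", state="NY",
            height=1.78, weight=78, iq=110, eyes="blue")
mathew = dict(sex="male", age=6, sibsp=3, country="US", state="NY",
              height=1.06, weight=22.7, iq=100, eyes="brown")
print(forest(john), forest(mathew))
```

With these toy trees John gets two Survived votes against one Died, while Mathew gets two Died votes; the point is that no single tree decides alone.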
26. 26
Redis: Forest Data Type
Add nodes to a tree in a forest:
ML.FOREST.ADD <forestId> <treeId> <path>
  [[NUMERIC|CATEGORIC] <splitterAttr> <splitterVal>] | [LEAF] <predVal>
Perform classification/regression of a feature vector:
ML.FOREST.RUN <forestId> <features> [CLASSIFICATION|REGRESSION]
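A minimal in-memory sketch of what these two commands maintain. The real module is a native C data type inside Redis; the path encoding ("." for the root, then "l"/"r" per level) and the dict-based feature vector here are assumptions for illustration only:

```python
# Toy model of ML.FOREST.ADD / ML.FOREST.RUN semantics.
forests = {}

def forest_add(forest_id, tree_id, path, node):
    """Add a splitter or leaf node at the given path of a tree."""
    forests.setdefault(forest_id, {}).setdefault(tree_id, {})[path] = node

def run_tree(tree, features):
    """Walk from the root to a leaf, branching left/right at each splitter."""
    path = "."
    while True:
        node = tree[path]
        if node[0] == "LEAF":
            return node[1]
        kind, attr, val = node
        if kind == "NUMERIC":
            branch = "l" if features[attr] < val else "r"
        else:  # CATEGORIC
            branch = "l" if features[attr] == val else "r"
        path += branch

def forest_run(forest_id, features):
    """Classification: majority vote over all trees in the forest."""
    votes = [run_tree(t, features) for t in forests[forest_id].values()]
    return max(set(votes), key=votes.count)

# Single-tree Titanic example.
forest_add("titanic", 0, ".",   ("CATEGORIC", "sex", "male"))
forest_add("titanic", 0, ".r",  ("LEAF", "Survived"))   # not male
forest_add("titanic", 0, ".l",  ("NUMERIC", "age", 9.5))
forest_add("titanic", 0, ".ll", ("LEAF", "Survived"))   # male, age < 9.5
forest_add("titanic", 0, ".lr", ("LEAF", "Died"))       # male, age >= 9.5
print(forest_run("titanic", {"sex": "male", "age": 6}))  # Survived
```

Storing the model this way is what lets Redis evaluate it in place: the client sends one ML.FOREST.RUN command instead of pulling the whole model over the network.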
27. 27
Real World Challenge
• Ad serving company
• Need to serve 20,000 ads/sec @ 50msec data-center latency
• Runs 1K campaigns → 1K random forests
• Each forest has 15K trees
• On average each tree has 7 levels (depth)
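Back-of-the-envelope load implied by those numbers (my arithmetic, not from the slide):

```python
ads_per_sec = 20_000        # required serving rate
trees_per_forest = 15_000   # trees in each campaign's forest
avg_depth = 7               # average splitter nodes walked per tree

# Each ad classification walks every tree of one forest from root to leaf.
node_visits_per_ad = trees_per_forest * avg_depth        # 105,000
node_visits_per_sec = ads_per_sec * node_visits_per_ad
print(node_visits_per_sec)  # 2100000000
```

Roughly 2.1 billion node evaluations per second, which is why keeping the models next to the data in memory matters.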
28. 28
Ad Serving costs: Homegrown v. Redis
• Homegrown: 1,247 x c4.8xlarge
• Redis: 35 x c4.8xlarge
Cut computing infrastructure by 97%
29. 29
Redis ML with Spark ML
Random Forest; 1,000 forests @ 15,000 trees
Classification time vs. Spark: 13x faster
33. 33
Step 1: Get The Data
Download and extract the MovieLens 100K Dataset
The data is organized in separate files:
• Ratings: user id | item id | rating (1-5) | timestamp
• Item (movie) info: movie id | genre info fields (1/0)
• User info: user id | age | gender | occupation
Our classifier should return the expected rating (from 1 to 5) a user would give the movie in question
34. 34
Step 2: Transform
The training data for each movie should contain 1 line per user:
• class (rating from 1 to 5 the user gave to this movie)
• user info (age, gender, occupation)
• user ratings of other movies (movie_id:rating ...)
• user genre rating averages (genre:avg_score ...)
Run gen_data.py to transform the files to the desired format
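A hedged sketch of that transform (gen_data.py itself is not shown in the deck; the function name, arguments, and exact output layout here are assumptions):

```python
def training_line(target_movie, user, ratings, movie_genres):
    """Build one training line for one user and one target movie:
    class label, user info, other-movie ratings, per-genre averages."""
    label = ratings[target_movie]                 # rating 1-5 -> class
    info = [user["age"], user["gender"], user["occupation"]]
    others = [f"{m}:{r}" for m, r in sorted(ratings.items())
              if m != target_movie]
    # Average the user's ratings per genre over the other movies.
    sums, counts = {}, {}
    for m, r in ratings.items():
        if m == target_movie:
            continue
        for g in movie_genres[m]:
            sums[g] = sums.get(g, 0) + r
            counts[g] = counts.get(g, 0) + 1
    genres = [f"{g}:{sums[g] / counts[g]:.2f}" for g in sorted(sums)]
    return " ".join(map(str, [label] + info + others + genres))

user = {"age": 34, "gender": "M", "occupation": "engineer"}
ratings = {10: 4, 20: 5, 30: 3}
movie_genres = {10: ["drama"], 20: ["drama", "comedy"], 30: ["comedy"]}
print(training_line(10, user, ratings, movie_genres))
# 4 34 M engineer 20:5 30:3 comedy:4.00 drama:5.00
```

One such line per user, per movie, is what the random forest trains on.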
35. 35
Step 3: Train and Load to Redis
// Create a new forest instance
val rf = new RandomForestClassifier()
  .setFeatureSubsetStrategy("auto")
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(500)
...
// Train model
val model = pipeline.fit(trainingData)
...
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
// Load the model to Redis
val f = new Forest(rfModel.trees)
f.loadToRedis("movie-10", "127.0.0.1")
36. 36
Step 4: Execute inference in Redis
[Diagram] Spark handles training; the trained model is loaded into Redis-ML, and the client app runs inference against Redis.
37. 37
Summary
• Train with Spark, serve with Redis
• 97% lower serving infrastructure cost
• Simplified ML lifecycle
• Redise (Cloud or Pack):
‒ Scaling, HA, performance
‒ PAYG: cost optimized
‒ Ease of use
‒ Supported by the teams who created Spark and Redis
38. 38
Where to Find Me
@tague
https://github.com/tague
tague@redislabs.com