Users are constantly searching for new content, and to stay competitive, organizations must act immediately on up-to-date data. Outdated recommendations decrease the likelihood of presenting the right offer and make it harder to maintain customer loyalty. To provide the most relevant recommendations and increase engagement, organizations must track customer interactions and re-score recommendations on the fly.
Data sources have expanded dramatically to include a wealth of historical data and a constant influx of behavior data. The key to moving from predictive models applied in batch to models that respond in real time is to focus on the efficiency of model application. The speed at which recommendations can be served is influenced by:
Architecture of the recommendation serving platform
Choice of recommendation algorithm
Datastore access patterns
In this presentation, we'll discuss how developers can use open-source components like HBase and Kiji to develop low-latency recommendation models that e-commerce companies can easily deploy. We will give practical advice on how to choose models and design data stores that take advantage of the architecture to serve new recommendations quickly.
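The pattern above can be sketched in miniature. In this illustrative snippet, plain Python dicts stand in for the HBase/Kiji tables the talk describes, and all names and scores are invented; it shows only the general idea of blending precomputed batch scores with fresh behavior data at serving time:

```python
# Sketch of low-latency recommendation serving (hypothetical schema; plain
# dicts stand in for a wide-column store like HBase).
from collections import defaultdict

# Precomputed batch scores, keyed the way a wide-column store might be:
# row = user id, column = item id, value = model score.
batch_scores = {
    "user-1": {"item-a": 0.61, "item-b": 0.55, "item-c": 0.40},
}

# Recent interaction counts, updated as events arrive.
recent_clicks = defaultdict(lambda: defaultdict(int))

def record_click(user, item):
    recent_clicks[user][item] += 1

def recommend(user, k=2, boost=0.1):
    """Re-score on the fly: blend batch scores with fresh behavior."""
    scores = dict(batch_scores.get(user, {}))
    for item, clicks in recent_clicks[user].items():
        scores[item] = scores.get(item, 0.0) + boost * clicks
    return sorted(scores, key=scores.get, reverse=True)[:k]

record_click("user-1", "item-c")
record_click("user-1", "item-c")
print(recommend("user-1"))  # ['item-a', 'item-c']: clicks boosted item-c past item-b
```

In a real deployment the batch scores would live in the datastore and the blend would happen inside the serving layer, which is where the access-pattern and algorithm choices listed above dominate latency.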
Machine Learning system architecture – Microsoft Translator, a Case Study – Vishal Chowdhary
Microsoft Translator currently supports 100+ languages. We constantly improve translation quality and add new scenarios, all with a constant team size. This session describes a production-scale machine learning architecture using MS Translator as a case study. You will learn a mental model for approaching your ML problem and concrete do's and don'ts for the various components of an ML system architecture.
The Machine Learning Workflow with Azure – Ivo Andreev
Machine learning is not black magic but a discipline that involves data analysis, data science, and, of course, hard work. From finding patterns in data and applying algorithms to producing usable predictions, you need background knowledge and appropriate tools. In this session, we will go through major approaches to preparing data and building and deploying ML models in Azure (ML Studio, Data Science VM, Jupyter Notebook). Most importantly, based on some examples from the real world, we will provide you with a workflow of best practices.
Improving Search in Workday Products using Natural Language Processing – DataWorks Summit
Workday is a leading provider of cloud-based enterprise software products such as Human Capital Management, Talent, Finance, Student, and Planning. These products produce a wealth of natural language data. However, this data is unstructured and denormalized, and retrieving relevant information from it is a challenging task. Simple index-based search methods can only take us so far. The Data Science team at Workday is determined to apply machine learning and AI to make search better across Workday's products.
In this session, we present how we use word embeddings to normalize the data and add structure to it. We will also talk about using word representations to make search intelligent. The specific use cases we will discuss are synonym detection and entity recommendation.
In this talk, we will focus on the word-embeddings techniques explored, metrics used to evaluate Natural Language Processing Models, tools built, and future work as a part of improving search.
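To make the synonym-detection use case concrete, here is a toy sketch of how embedding similarity surfaces synonyms. The vectors and threshold below are invented for illustration; they do not come from any Workday model, where embeddings would be learned from the product corpus:

```python
# Toy embedding-based synonym detection: words whose vectors point in
# nearly the same direction (high cosine similarity) are flagged as synonyms.
import math

embeddings = {
    "salary":       [0.9, 0.1, 0.0],
    "compensation": [0.85, 0.15, 0.05],
    "student":      [0.0, 0.2, 0.95],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def synonyms(word, threshold=0.9):
    vec = embeddings[word]
    return [w for w, v in embeddings.items()
            if w != word and cosine(vec, v) >= threshold]

print(synonyms("salary"))  # ['compensation']
```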
Speaker
Namrata Ghadi, Workday Inc, Software Development Engineer (Data Science)
Adam Baker, Workday Inc, Sr Software Engineer
Predicting Patient Outcomes in Real-Time at HCA – Sri Ambati
Data Scientist Allison Baker and Development Manager of Data Products Cody Hall work with a talented team of data scientists, software engineers, and web developers to build the framework and infrastructure for a real-time prediction application with the ability to scale across the entire company. Paramount to these efforts has been the capability of integrating the architecture for software production with the predictive models generated by H2O. This talk will review the processes by which HCA is building a pipeline to predict patient outcomes in real time, relying heavily on H2O's POJO scoring API and implemented in Clojure for data processing. #h2ony
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
How to design your ML application to be production-ready from day one
How to switch from notebooks to deployable and maintainable software
How to deploy, serve and monitor prediction pipelines
How to re-train models in production
How to move from the machine learning experimentation phase to production
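The "monitor, then re-train" loop from the bullets above can be sketched as follows. Everything here is a hypothetical stand-in (the class name, the threshold, the toy model): the point is only that a served model records outcomes as ground truth arrives and raises a flag once its recent error rate degrades:

```python
# Minimal monitoring loop: track recent prediction errors and flag the
# model for re-training once the rolling error rate crosses a threshold.
class ServedModel:
    def __init__(self, predict_fn, error_threshold=0.3, window=100):
        self.predict_fn = predict_fn
        self.error_threshold = error_threshold
        self.window = window
        self.errors = []            # rolling record of recent mistakes
        self.needs_retraining = False

    def predict(self, x):
        return self.predict_fn(x)

    def record_outcome(self, prediction, actual):
        self.errors.append(prediction != actual)
        self.errors = self.errors[-self.window:]
        rate = sum(self.errors) / len(self.errors)
        if rate > self.error_threshold:
            self.needs_retraining = True

model = ServedModel(lambda x: x >= 0)   # stand-in model: sign classifier
for x, y in [(1, True), (2, True), (-1, True), (-2, True), (-3, True)]:
    model.record_outcome(model.predict(x), y)
print(model.needs_retraining)  # True: 3 of the last 5 predictions were wrong
```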
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We... – Sri Ambati
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/xc3j20Om3UM
Description:
Data science is indeed one of the sexy jobs of the 21st century. But it is also a lot of hard work. And the hard work is seldom about the math or the algorithms. It is about building relevant machine learning products for the real world. We will go over some of the must-haves as you take your machine learning model out of the sandbox and make it work in the big, bad world outside.
Speaker's Bio:
Krish Swamy is an experienced professional with deep skills in applying analytics and BigData capabilities to challenging business problems and driving customer insights. Krish's analytic experience includes marketing and pricing, credit risk, digital analytics and most recently, big data analytics and data transformation. His key experiences lie in banking and financial services, the digital customer experience domain, with a background in management consulting. Other key skills include influencing organizational change towards a data and analytics driven culture, and building teams of analysts, statisticians and data scientists.
FrugalML: Using ML APIs More Accurately and Cheaply – Databricks
Offering prediction APIs for a fee is a fast-growing industry and an important aspect of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API, or combination of APIs, to use for their own data and budget. We take a first step towards addressing this challenge by proposing FrugalML, a principled framework that jointly learns the strengths and weaknesses of each API on different data and performs an efficient optimization to automatically identify the best sequential strategy for adaptively using the available APIs within a budget constraint. Our theoretical analysis shows that natural sparsity in the formulation can be leveraged to make FrugalML efficient. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Baidu, and other providers for tasks including facial emotion recognition, sentiment analysis, and speech recognition. Across various tasks, FrugalML can achieve up to 90% cost reduction while matching the accuracy of the best single API, or up to 5% better accuracy while matching the best API's cost.
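The heart of the sequential strategy described above is a cascade: call a cheap API first and escalate to an expensive one only when the cheap call is unsure. This is a drastic simplification of FrugalML (which learns the cascade from data); the APIs, prices, and confidence values below are all invented:

```python
# Cascade sketch: cheap API first, expensive fallback on low confidence.
def cheap_api(text):
    # pretend low-cost sentiment API: confident only on obvious inputs
    if "great" in text:
        return ("positive", 0.95)
    return ("negative", 0.55)       # low confidence

def expensive_api(text):
    label = "positive" if ("good" in text or "great" in text) else "negative"
    return (label, 0.99)

def frugal_predict(text, confidence_threshold=0.8):
    label, conf = cheap_api(text)
    cost = 1                        # pretend cheap call costs 1 unit
    if conf < confidence_threshold:
        label, conf = expensive_api(text)   # escalate
        cost += 10                  # pretend expensive call costs 10 units
    return label, cost

print(frugal_predict("a great film"))   # ('positive', 1): cheap call suffices
print(frugal_predict("a good film"))    # ('positive', 11): escalated
```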
Machine learning is approaching a peak of inflated expectations, even as we see AI daily and in all contexts. Media pressure is high, governments are overly optimistic, plenty of ventures are putting money into unviable ideas, and some brilliant engineers fail to reach business users.
But Microsoft brings all of this under the same roof and unleashes the power of AI by integrating the Power BI ecosystem with Azure ML and Cognitive Services. The result is as simple and effective as great technology in the end user's hands.
This session is not about learning how to do AI but about how to make AI usable and add value. Integrating ML models and sophisticated cognitive services into reports, understanding concealed relations, and bringing in automated ML empowers any business user to exploit AI for better decisions, regardless of their technical skills.
Guiding through a typical Machine Learning Pipeline – Michael Gerke
Many people are talking about AI and machine learning. Here's a quick guideline on how to manage ML projects and what to consider when implementing machine learning use cases.
Why APM Is Not the Same As ML Monitoring – Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering, allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications built using machine learning, traditional APM quickly becomes insufficient to identify and remedy the production issues encountered in these modern applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
Feature drift monitoring as a service for machine learning models at scale – Noriaki Tatsumi
In this talk, you'll learn about techniques used to build a feature-drift-detection-as-a-service capability for your enterprise and beyond. Feature drift monitoring is a way to check the volatility of machine learning model inputs. It can trigger investigations into potential model degradation as well as explain why models have shifted.
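One common statistic for this kind of check (not necessarily the one the talk uses) is the Population Stability Index, which compares a feature's current distribution against a training-time baseline. The bin proportions and alert thresholds below are illustrative:

```python
# Population Stability Index (PSI): a drift score over two histograms.
# Rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate.
import math

def psi(expected, actual, eps=1e-6):
    """PSI over two histograms given as lists of bin proportions."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        score += (a - e) * math.log(a / e)
    return score

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time distribution
stable   = [0.24, 0.26, 0.25, 0.25]
shifted  = [0.10, 0.10, 0.30, 0.50]

print(psi(baseline, stable) < 0.1)    # True: no action needed
print(psi(baseline, shifted) > 0.25)  # True: investigate the model
```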
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ... – Flavio Clesio
Our presentation at Spark Summit EU 2017 - spark-summit.org/eu-2017/events/preventing-revenue-leakage-and-monitoring-distributed-systems-with-machine-learning/
Recent Gartner and Capgemini studies predict only around 25% of data science projects are successful and only around 15% make it to full-scale production. Of these, many degrade in performance and produce disappointing results within months of implementation. How can focusing on the desired business outcomes and business use cases throughout a data science project help overcome the odds?
Data Science as a Service: Intersection of Cloud Computing and Data Science – Pouria Amirian
Dr. Pouria Amirian explains data science and the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in data science projects and solutions to them.
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Megan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYC – Sri Ambati
This talk was recorded in NYC on October 22nd, 2019 and can be viewed here: https://youtu.be/aJJsrQHqsGg
AutoDoc with H2O Driverless AI
Driverless AI with AutoDoc takes the next logical step in the data science workflow: automatically documenting and explaining the processes used by the platform. AutoDoc frees the user from the time-consuming task of documenting and summarizing their workflow while building machine learning models. The resulting documentation gives users insight into the machine learning workflow created by Driverless AI, including details about the data used, the validation schema selected, model and feature tuning, and the final model created. With this capability in Driverless AI, users can focus on model insights and results.
Bio: Megan is a Customer Data Scientist at H2O. Prior to working at H2O, she worked as a Data Scientist building products driven by machine learning for B2B customers. She has experience working with customers across multiple industries, identifying common problems, and designing robust and automated solutions.
Scaling AutoML-Driven Anomaly Detection With Luminaire – Databricks
Organizations rely heavily on time series metrics to measure and model key aspects of operational and business performance. The ability to reliably detect issues with these metrics is imperative to identifying early indicators of major problems before they become pervasive. This is a difficult machine learning and systems problem because temporal patterns are complex, ever changing, and often very noisy, traditionally requiring significant manual configuration and model maintenance.
At Zillow, we have built an orchestration framework around Luminaire, our open-source Python library for hands-off time-series anomaly detection. Luminaire provides a suite of models and built-in AutoML capabilities, which we process with Spark for distributed training and scoring of thousands of metrics. In this talk, we will cover the architecture of this framework and the performance of the Luminaire package in detection and prediction accuracy as well as runtime efficiency.
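Luminaire's models are not reproduced here, but the underlying task can be shown in its simplest form: flag time-series points that sit far from the recent history. The window, threshold, and metric values below are invented for illustration:

```python
# Simplest-possible anomaly detector: z-score against a rolling window.
import statistics

def detect_anomalies(series, window=5, z_threshold=3.0):
    """Return indices whose value lies more than z_threshold standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies

metric = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]
print(detect_anomalies(metric))  # [7]: the spike to 50
```

Real metrics are noisy, seasonal, and ever-changing, which is exactly why hands-off libraries with AutoML-driven model selection exist: a fixed window and threshold like the above would need constant manual retuning.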
Towards Personalization in Global Digital Health – Databricks
The rapid expansion of mobile phone usage in low-income and middle-income countries has created unprecedented opportunities for applying AI to improve individual and population health.
At benshi.ai, a non-profit funded by the Bill and Melinda Gates Foundation, the goal is to transform health outcomes in resource-poor countries through advanced AI applications. We aim to do so by providing personalized predictions and recommendations to support diagnosis for medical care teams and frontline workers, as well as by nudging patients, through personalized incentives, towards improved disease treatment management and general wellness.
To this end, we have built an operational machine learning platform that provides personalized content and interventions in real time. Multiple engineering and machine learning decisions have been made to overcome different challenges and to build an experimentation engine and a centralized data and model management system for global health. Databricks served as a cornerstone upon which all our data/ML services were built. In particular, MLflow and dbx (an open-source tool from Databricks) have been crucial for the training, tracking, and management of our end-to-end model pipelines. From the data science perspective, our challenges involved causal inference analysis, behavioral time series forecasting, micro-randomized trials, and contextual-bandit-based experimentation at the individual level.
This talk will focus on how we overcome the technical challenges to build a state-of-the-art machine learning platform that serves to improve global health outcomes.
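One of the techniques named above, contextual bandits, can be sketched in its simplest (epsilon-greedy) form. The contexts, interventions, and rewards below are invented stand-ins, and real systems use far more sophisticated policies:

```python
# Epsilon-greedy contextual bandit: per-context arm selection that
# balances exploring interventions with exploiting observed rewards.
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # running reward statistics per (context, arm) pair
        self.counts = defaultdict(int)
        self.totals = defaultdict(float)

    def choose(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)    # explore
        # exploit: arm with the best observed mean reward in this context
        def mean(arm):
            n = self.counts[(context, arm)]
            return self.totals[(context, arm)] / n if n else 0.0
        return max(self.arms, key=mean)

    def update(self, context, arm, reward):
        self.counts[(context, arm)] += 1
        self.totals[(context, arm)] += reward

bandit = EpsilonGreedyBandit(["sms_nudge", "call_reminder"], epsilon=0.0)
bandit.update("rural", "call_reminder", 1.0)
bandit.update("rural", "sms_nudge", 0.0)
print(bandit.choose("rural"))  # call_reminder: best mean reward so far
```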
Abstract
Concurrency is everywhere. Prior to Java 5, concurrency was difficult
and error prone. Since Java 5, it's far more prevalent in our
application code, and through time it's been lurking in open-source
frameworks and containers. Concurrency is also a fundamental part of
Shopzilla's web-site and services ecosystem.
Introduction
Rod Barlow from Shopzilla will explore a brief history of concurrency, and the key
concurrency features and techniques provided by the Java API since
Java 5. Topics covered include Immutability, Atomic References, Blocking
Queues, Locks and Deadlocks. Also covered is Concurrency in
Frameworks, and Shopzilla's Website Concurrency Framework, including
Thread Pools, Executors and Futures.
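The Java 5 Executor/Future pattern the talk covers has a close analogue in Python's concurrent.futures, used here only as a language-neutral illustration of handing tasks to a thread pool and collecting results via futures (the task function and data are invented):

```python
# Thread pool + futures: submit tasks, then collect results asynchronously,
# mirroring Java's ExecutorService.submit() and Future.get().
from concurrent.futures import ThreadPoolExecutor

def fetch_price(item):
    # stand-in for a blocking call (e.g. a remote service lookup)
    return len(item) * 10

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {item: pool.submit(fetch_price, item)
               for item in ["tv", "camera"]}
    prices = {item: f.result() for item, f in futures.items()}

print(prices)  # {'tv': 20, 'camera': 60}
```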
Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh... – Joshua Long
Internally, some projects at Shopzilla have recently started to leverage the HornetQ messaging system to meet performance and scalability requirements. In this talk, Mark Lui and Josh Long review the basic principles of messaging and distributed communication. They demonstrate how loosely coupled, asynchronous communication can improve performance, scalability and reliability and finally touch on Shopzilla-specific use cases for messaging.
Retail Reference Architecture Part 3: Scalable Insight Component Providing Us... – MongoDB
During this session we will cover the best practices for implementing the insight component with MongoDB. This includes efficiently ingesting and managing a large volume of user activity logs, such as clickstreams, views, likes, and sales. We'll dive into how you can derive user statistics, product maps, and trends using different analytics tools like the aggregation framework, map/reduce, or the Hadoop connector. We will also cover operational considerations, including low-latency data ingestion and seamless aggregation queries.
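The aggregation step described above, reduced to its essence, is a group-and-count over raw activity events. This pure-Python sketch (invented event schema; a stand-in for MongoDB's aggregation framework or map/reduce) shows the shape of the computation:

```python
# Roll raw activity events up into per-product, per-action counts.
from collections import Counter

events = [
    {"user": "u1", "product": "p1", "action": "view"},
    {"user": "u2", "product": "p1", "action": "view"},
    {"user": "u1", "product": "p1", "action": "sale"},
    {"user": "u2", "product": "p2", "action": "like"},
]

def aggregate(events):
    """Group-and-count, the core of a $group aggregation stage."""
    stats = Counter()
    for e in events:
        stats[(e["product"], e["action"])] += 1
    return stats

stats = aggregate(events)
print(stats[("p1", "view")])  # 2
```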
An Architecture for Agile Machine Learning in Real-Time ApplicationsJohann Schleier-Smith
Presented at KDD, August 11, 2015.
Abstract of the paper:
Machine learning techniques have proved effective in recommender systems and other applications, yet teams working to deploy them lack many of the advantages that those in more established software disciplines today take for granted. The well-known Agile methodology advances projects in a chain of rapid development cycles, with subsequent steps often informed by production experiments. Support for such workflow in machine learning applications remains primitive.
The platform developed at if(we) embodies a specific machine learning approach and a rigorous data architecture constraint, so allowing teams to work in rapid iterative cycles. We require models to consume data from a time-ordered event history, and we focus on facilitating creative feature engineering. We make it practical for data scientists to use the same model code in development and in production deployment, and make it practical for them to collaborate on complex models.
We deliver real-time recommendations at scale, returning top results from among 10,000,000 candidates with sub-second response times and incorporating new updates in just a few seconds. Using the approach and architecture described here, our team can routinely go from ideas for new models to production-validated results within two weeks.
The slides from the Machine Learning Summers School 2015 in Sydney on Machine Learning for Recommender Systems. Collaborative filtering algorithms, Context-aware methods, Restricted Boltzmann Machines, Recurrent Neural Networks, Tensor Factorization, etc.
Retail Reference Architecture Part 2: Real-Time, Geo Distributed InventoryMongoDB
During this session we will cover the best practices for implementing a real-time inventory with MongoDB. This includes how to properly model quantities and stores to avoid indexing large numbers of documents, how to efficiently use geo-indexing to find the closest store with a specific item available, and how to run aggregations to gather interesting inventory stats. We will also cover operational considerations, like how to make inventory queries and updates from anywhere low-latency and resilient to network partitions via tag-aware sharding.
How Lazada ranks products to improve customer experience and conversionEugene Yan Ziyou
Slides from sharing at Strata + Hadoop Singapore 2016 (http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54542)
Ecommerce has enabled retailers to make all of their products available to consumers and consumers to access niche products not found in brick-and-mortar stores. This growth provides consumers with unparalleled choice. Nonetheless, the sheer number of products brings with it the challenge of helping users find relevant products with ease.
Lazada has tens of millions of products on its platform, and this number grows by approximately one million monthly. Lazada’s challenge: How can we help users easily discover good quality products they will like? How can we ensure product selection remains fresh and constantly updated?
One way to do this is through the ranking of products. Via ranking, Lazada helps customers easily find products that will delight them by ensuring these products appear in the first few pages. I’ll share how Lazada ranks products on our website. (Note: Google “how amazon ranks products” for some industry background)
Topics include how we:
* Develop methodology (and tricks) to solve not-so-well-defined problems
* Collect and store user-behavior data from our website and app
* Clean and prepare the data (e.g., handling outliers)
* Discover and create useful features
* Build models to improve customer experience and meet business objectives
* Measure and test outcomes on our website
* Build this end-to-end on our Hadoop infrastructure, with tools including Kafka and Spark
Building a Machine Learning App with AWS LambdaSri Ambati
Ludi Rehaks' meetup on 03.17.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)Amazon Web Services
In this session, we provide programmatic guidance on building tools and applications to detect and manage fraud and unusual activity specific to financial services institutions. Payment fraud is an ongoing concern for merchants and credit card issuers alike and these activities impact all industries, but are specifically detrimental to Financial Services. We provide a step-by-step walkthrough of a reference solution to detect and address credit card fraud in real time by using Apache Apex and Amazon Machine Learning capabilities. We also outline different resource and performance optimization options and how to work data security into the fraud detection workflow.
Cheap data storage and high-performance analytics are going to change the face of retail sector. And big data is going to play pivotal role in this technological revolution. You can find other reports related to Big data at http://www.marketresearchreports.com/big-data
Every year, new hypes are presented as the new Holy Grail for retail. The same happened with big data. But big data is past the hype stage and has become too important to ignore: it turns out to be a valuable source of customer knowledge that online players are already profiting from, and that many traditional retailers struggle with.
And practically: what is big data? Where does the power of big data lie, and how do you apply it? What are good examples? That is what this presentation is about.
Continuous Performance Testing and Monitoring in Agile DevelopmentDynatrace
Continuous Performance Testing and Monitoring in Agile Development
Continuous Performance testing and monitoring is the best way to ensure application performance with quicker development cycles. Balancing agile and DevOps velocity with the need for ongoing performance testing and monitoring is essential. We call it Continuous Performance Validation.
In this webinar, we will show how you can get performance guidance and metrics throughout development, making sure apps perform well from inception to production and beyond.
In this webinar you will learn:
• How to automate performance testing and which tools you need to be successful
• How to use APM during load and performance testing
• How to create a continuous performance validation strategy from Dev to QA and Ops
• Ways teams can collaborate to ensure top application performance
Please use the below URL to view recording of this webinar:
http://wso2.com/library/webinars/2015/02/connected-retail-reference-architecture/
The key focus areas of this session are
An overview of the retail IT landscape
What is a connected retail IT architecture
How the WSO2 middleware platform enables a connected retail business
Connected retail L0 architecture
Connected retail L1 architecture with WSO2
Shant Hovsepian, CTO of Arcadia Data and a panel of experts details the trade-offs between a number of architectures that provide self-service access to data, and industry researcher Mark Madsen discusses the pros and cons of architectures, deployment strategies, and customer examples of BI on big data.
Topics include:
- Traditional BI platforms based on semantic layers and SQL/MDX generation
- Server and desktop BI tools based on direct mapping of data
- Distributed BI platforms (e.g., MPP and data native)
- OLAP- and SQL-on-Hadoop engines
Presents the foundational aspects of web analytics and some specifics, such as the hotel problem. Discusses trace data, behaviorism, and other cool web analytics stuff.
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Amazon Web Services
AWS has a large and growing portfolio of big data management and analytics services, designed to be integrated into solution architectures that meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, and we explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.
This presentation describes what Search Analytics is, what value it brings to the table, how it can be used, what additional functionality and value can be built with search data, etc.
Creating a Single Source of Truth: Leverage all of your data with powerful an...Looker
With a centralized data store, the entire spectrum of analytics is at your fingertips. Using Looker & Segment, you can collect, store and analyze everything from click-stream and event data to transactional and behavioral data in your data warehouse.
Some of the topics this webinar will include:
-The advantages of a centralized data warehouse with Segment Warehouses
-Creating a data model to get your company on the same page with Looker Blocks
-Putting it all together: Best practices for making your data accessible to your end users
Project Explanation: Book Recommendation System
The goal of this project was to develop a book recommendation system that provides personalized recommendations to users based on their preferences and past reading behavior. The project involved the following key steps:
1. Data Collection: I gathered a comprehensive dataset of books, including information such as titles, authors, genres, and user ratings. This data was obtained from various reliable sources, such as online bookstores or publicly available book datasets.
2. Data Preprocessing: The collected data required cleaning and preprocessing to ensure its quality and consistency. I handled missing values, resolved inconsistencies in book titles or authors, and standardized the data format for further analysis.
3. Exploratory Data Analysis: I performed exploratory data analysis to gain insights into the dataset. This included analyzing book genres, distribution of user ratings, and identifying popular authors or books.
4. Feature Engineering: To capture the preferences and interests of users, I created relevant features from the available data. These features could include book genres, authors, user demographics, or historical reading behavior.
5. Recommendation Model Development: I developed a recommendation model using collaborative filtering techniques or content-based filtering methods. Collaborative filtering utilizes the preferences of similar users to make recommendations, while content-based filtering suggests books based on their attributes and user preferences. I employed popular machine learning algorithms, such as matrix factorization or k-nearest neighbors, to build the recommendation model.
6. Model Evaluation: I evaluated the performance of the recommendation system using metrics such as precision, recall, or mean average precision. I also conducted A/B testing or cross-validation to assess the system's effectiveness and optimize its performance.
7. User Interface Development: I created a user-friendly interface where users could input their preferences and receive personalized book recommendations. The interface provided an intuitive and interactive experience, allowing users to explore recommended books and provide feedback.
8. Deployment and Feedback Loop: The recommendation system was deployed in a production environment, where users could access it and provide feedback on the recommended books. This feedback was incorporated into the system to continually improve its accuracy and relevance over time.
By completing this project, I gained hands-on experience in data collection, preprocessing, exploratory data analysis, and recommendation system development. I demonstrated my ability to leverage machine learning algorithms and user data to build a personalized book recommendation system that enhances user engagement and satisfaction.
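The matrix-factorization approach mentioned in step 5 can be sketched in plain Python: learn small latent vectors for users and books so that their dot product approximates the observed ratings. This is a toy illustration under stated assumptions, not the project's code; the ratings, hyperparameters, and names are all made up.

```python
import random

random.seed(0)  # deterministic toy run

ratings = {("alice", "dune"): 5, ("alice", "emma"): 1,
           ("bob", "dune"): 4, ("bob", "emma"): 2}
users = {u for u, _ in ratings}
books = {b for _, b in ratings}
K = 2  # number of latent dimensions

# Randomly initialized latent factors for users (P) and books (Q).
P = {u: [random.uniform(-0.1, 0.1) for _ in range(K)] for u in users}
Q = {b: [random.uniform(-0.1, 0.1) for _ in range(K)] for b in books}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

lr, reg = 0.05, 0.01
for _ in range(2000):  # stochastic gradient descent over observed ratings
    for (u, b), r in ratings.items():
        err = r - dot(P[u], Q[b])
        for k in range(K):
            pu, qb = P[u][k], Q[b][k]
            P[u][k] += lr * (err * qb - reg * pu)
            Q[b][k] += lr * (err * pu - reg * qb)

print(round(dot(P["alice"], Q["dune"]), 2))  # should land near the observed 5
```

Unobserved (user, book) pairs then get a predicted rating from the same dot product, which is what makes the factorization usable for recommendation.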
Olist Store Analysis
According to the data, Olist E-commerce has about 99,440 orders. With about 89,940 orders delivered, the company has a 90% delivery success rate.
✔ Their average product rating is 4.09 stars, with product categories going as high as 4.67 stars and as low as 2.5 stars. One-star reviews rank third in the review score distribution, which likely indicates problems with product quality in some product categories.
✔ It helps in understanding the spending patterns of customers in São Paulo. It also helps Olist identify high-value customers and create targeted marketing campaigns.
Afternoons with Azure - Azure Machine Learning CCG
Journey through programming languages such as R and Python that can be used for machine learning. Next, explore Azure Machine Learning Studio to see the interconnectivity.
For more information about Microsoft Azure, call (813) 265-3239 or visit www.ccganalytics.com/solutions
Similar to Real-time Recommendations for Retail: Architecture, Algorithms, and Design (20)
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in the various parts of the DevOps infinity loop.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for making things work and a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
1. Juliet Hougland and Jonathan Natkins
REAL-TIME RECOMMENDATIONS FOR RETAIL:
ARCHITECTURE, ALGORITHMS, AND DESIGN
2. Who Are We?
Jonathan Natkins
Field Engineer at WibiData
Before that, Cloudera Software Engineer
Before that, Vertica Software/Field Engineer
Juliet Hougland
Data Scientist, previously at WibiData
MS in Applied Math
BA in Math-Physics
6. Recommender Contexts
Taste History
Based on everything you know about a user
Interests over months/years
Current Taste
Based on a user’s immediate history
Interests over minutes/hours
Ephemeral
Extreme version of current taste
For example, location
Demographic*
Similar to taste history, but less subjective
Geographic region, age bracket, etc.
9. Requirements for a Real-Time System
General System Requirements
Handle millions of customers/users
Support collection and storage of complex data
Static and event-series
Real-Time System Requirements
Quickly retrieve subsets of data for a single user
Aggregate/derive new, first-class data per user
13. How Can We Make Real-Time Models?
Population interests change slowly, so models don’t need to be retrained frequently
Individual interests change quickly, so application of a model should be fast
16. A Common Workflow
Train a model over the entire dataset
Save fitted model parameters to a file or another table
Access the model parameters when generating new recommendations based on new data
This is EXPENSIVE
17. Developing Models
KijiExpress
Scala interface for interacting with Kiji data
Uses Scalding for designing complex dataflows
Model Lifecycle
Allows analysts and data scientists to break apart a model into phases
19. Scoring Models in Real-Time
Batch isn’t real-time
(Chart: Number of Users vs. Number of Interactions; a few users with many interactions, a lot of users with few interactions)
26. Fresheners Compute Lazily
(Diagram: the Client reads a column from the KijiScoring Server, which gets the stored value from HBase; the Freshness Policy decides whether it is fresh; if not, the Scorer recomputes the value, returns it to the client, and writes it back for next time)
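The lazy-freshening flow can be sketched as follows. This is an illustrative mock, not the KijiScoring API: the class and function names (`ShelfLifePolicy`, `read_with_freshening`) are made up, and a plain dict stands in for HBase.

```python
import time

class ShelfLifePolicy:
    """Freshness policy: a value is fresh if it was scored within max_age seconds."""
    def __init__(self, max_age):
        self.max_age = max_age

    def is_fresh(self, record, now):
        return record is not None and now - record["scored_at"] <= self.max_age

def read_with_freshening(store, key, policy, scorer, now=None):
    """Read a value; re-score it only when the freshness policy says it is stale."""
    now = time.time() if now is None else now
    record = store.get(key)                  # get from the backing store
    if policy.is_fresh(record, now):
        return record["value"]               # fresh: return the cached score
    value = scorer(key)                      # stale: recompute with the scorer
    store[key] = {"value": value, "scored_at": now}  # write back for next time
    return value

store = {}
calls = []
def scorer(key):
    calls.append(key)
    return len(key)  # toy "model": the score is just the key length

policy = ShelfLifePolicy(max_age=60)
read_with_freshening(store, "user-1", policy, scorer, now=1000)  # recomputes
read_with_freshening(store, "user-1", policy, scorer, now=1030)  # served from cache
read_with_freshening(store, "user-1", policy, scorer, now=1100)  # stale, recomputes
print(len(calls))  # the scorer only ran on the stale reads
```

The point of the pattern: scoring cost is paid lazily, only when a stale value is actually requested, rather than eagerly for every user in batch.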
29. Kiji Model Repository
Link between application and models
Stores Freshener metadata: FreshnessPolicy, Scorer, attached column
Location of trained model
Stores Scorer code
Code repository makes model scoring code available to the application from a central location
New models can be deployed to the Model Repository and made immediately available to the application
33. Content-Based Recommenders
Build models around entities using features that we think reflect inherent characteristics
(Example features: Orange-Nosed, Lab Assistant, Meeps a lot)
40. Similar Entities
What do we mean by similar?
Jaccard Index: a measure of set similarity
Cosine Similarity: the angle between two vectors
Pearson Correlation: statistical measure, similar to cosine
Naively, we could compare every entity to each other…
…But that would not scale well with increasing numbers of entities
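The three similarity measures named on this slide can be written out in a few lines each. A minimal sketch over small in-memory rating vectors; the example values are made up.

```python
import math

def jaccard(a, b):
    """Set similarity: |A ∩ B| / |A ∪ B| over the items each user touched."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """Pearson correlation: cosine of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([x - mu for x in u], [y - mv for y in v])

print(jaccard({"book", "mug"}, {"book", "lamp"}))  # 1 shared item of 3 total
print(cosine([4, 2, 5], [4, 2, 5]))                # identical vectors score highest
```

All three are pairwise, which is exactly why the naive all-pairs comparison is quadratic in the number of entities and doesn't scale.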
42. Collaborative Filtering: Is This Useful?
Problem: Too much data! Tracking user preferences and all their events generates huge amounts of data
Problem: Too little data! Dimensions of user-space and item-space are usually very large, and more variables make it more difficult to generate user preferences
Problem: Cold start. If you don’t know anything about a user, what should you recommend?
Problem: More ratings means slower computations. Identifying neighborhoods of entities is expensive
43. Collaborative Filtering: Why Is It Useful?
Because it works
Content-agnostic
All that matters is co-occurrence of events
44. Amazon: Item-Item Collaborative Filtering
Used for personalized recommendations
Fill screen real estate with related items
Produces specific, but non-creepy recommendations
Linden, G.; Smith, B.; York, J., "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb 2003
45. Item-Item Collaborative Filtering
Beaker buys a banana slicer
Then:
Generate list of candidate items to predict ratings for
Predict ratings for candidate items
Select Top-N items
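The three steps on this slide can be sketched directly. This is a toy illustration, not the deck's actual implementation: the item-item similarities are hypothetical hard-coded numbers, where in a real system they would come from the trained model.

```python
# similarity[(rated_item, candidate)] — illustrative, precomputed offline
SIMILARITY = {
    "banana slicer": {"butter knife": 0.8, "fruit bowl": 0.6},
    "instant coffee": {"butter knife": 0.1, "espresso maker": 0.9},
}

def recommend(purchases, ratings, top_n=2):
    # 1. Generate candidate items: anything similar to something the user bought.
    candidates = set()
    for item in purchases:
        candidates |= set(SIMILARITY.get(item, {}))
    candidates -= set(purchases)
    # 2. Predict a rating for each candidate: similarity-weighted average
    #    of the user's known ratings.
    predicted = {}
    for cand in candidates:
        num = den = 0.0
        for item, rating in ratings.items():
            sim = SIMILARITY.get(item, {}).get(cand, 0.0)
            num += sim * rating
            den += abs(sim)
        if den:
            predicted[cand] = num / den
    # 3. Select the Top-N candidates by predicted rating.
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

print(recommend(["banana slicer", "instant coffee"],
                {"banana slicer": 5, "instant coffee": 2}))
```

Note that the expensive part (computing `SIMILARITY`) happens offline; the per-request work is only a small weighted average over the candidate set.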
46. Accessing External Data
KeyValueStore API enables external data access when applying a model
External data might be…
Trained model parameters
Hierarchical/Taxonomic data
Geo-lookup
Store external data flexibly
Text files, sequence files, Kiji tables, etc.
Data access is decoupled from use during execution
If the data doesn’t fit in memory, put it in a table
47. How Much Less Work Can We Do?
We can choose a predictor that allows us to truncate a sum
There are two ways terms in the sum of our predictor can be small:
No rating: ignore unrated items
Small similarity: ignore dissimilar items
51. How Much Less Work Can We Do?
If we only present a few recommendations, we don’t need to predict ratings for all items
Choose your candidate set to estimate ratings wisely, or infer from nearest neighbors
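The truncation idea from the preceding slides can be made concrete: in a similarity-weighted average, a term vanishes when the user has no rating for the item or when the similarity is near zero, so both can be skipped. A minimal sketch; the similarity table and the `min_sim` threshold are illustrative assumptions.

```python
def predict(candidate, ratings, similarity, min_sim=0.2):
    """Predict a rating for `candidate` from the user's rated items,
    truncating the sum to items that are both rated and similar enough."""
    num = den = 0.0
    terms = 0
    for item, rating in ratings.items():      # unrated items never enter the loop
        sim = similarity.get((item, candidate), 0.0)
        if sim < min_sim:                     # drop near-zero terms of the sum
            continue
        num += sim * rating
        den += sim
        terms += 1
    return (num / den if den else None), terms

similarity = {("a", "x"): 0.9, ("b", "x"): 0.05, ("c", "x"): 0.4}
ratings = {"a": 5, "b": 1, "c": 3}  # item "b" would contribute a negligible term
score, terms_used = predict("x", ratings, similarity)
print(score, terms_used)  # only 2 of the 3 rated items were actually summed
```

The prediction barely changes, but the work per candidate shrinks with the similarity threshold, which is what makes real-time scoring over millions of items feasible.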
54. Want to Know More?
The Kiji Project
kiji.org
github.com/kijiproject
Questions about this presentation?
Twitter: @JulietHougland or @nattyice
Email: natty@wibidata.com
Editor's Notes
Natty, thanks for that great description of the infrastructure that Kiji provides. Now we want to take the next step: go from infrastructure to actually showing our customers items they may be interested in.
1. Recommending a group of items. If you are taking the time to make predictions, you may as well present people with many options. This means recommendations aren’t about finding the single best item to recommend; it is about predicting the best group of items. This gives us some leeway in terms of exact values.
2. Users often will not be logged in at first. If we want to present good recs, we need to be able to do it based on their current session’s browsing history. You can’t have personalized recs for non-logged-in users without real-time recs.
3. Online retailers want to present vast catalogs to their large user bases. For most online retail sites, the point of recommendations is organizing information in a way that is relevant and useful to their customers.
I want to give us a broad overview of the types of approaches to recommendations available. We have a finite amount of time here, so I will focus a lot of what I say on a simple implementation of collaborative filtering. I don’t want to give the impression that it is the only, or absolute best, solution to the recommendations problem. As with any prediction problem, there are many ways to tackle retail recommendation.
There are two main types of recommendation algorithms. In a realistic system, they will always be used together.
Use item descriptions, user-generated tags, or expert-generated tags (Pandora) in order to build representations.
A major pro of content-based models is that they better handle unrated items. They are a good way to get around the cold start problem for a rec system, and a good way to bootstrap your way to getting user ratings or to augment other methods of recommendation.
The downside is that processing and building models around textual information can be very challenging.
Just look at it.
In a content-based system, the hope is that the content you are basing your recs on is a good indicator of other items that are related in a relevant way.
So, if we had a good content-based recommender, after observing an interest in a banana slicer, it would recommend that you try using a butter knife quickly.
Pandora is the first recommendation system I remember consciously interacting with.
Pandora: expert tagging. They have a team of musicians (domain expertise is invaluable) who listen to music and apply tags to songs.
From a seeded station, it begins to present variations on the original attributes of the song you started with. Your likes and dislikes, as expressed to their system, help it learn which attributes you like and dislike.
Expert tagging is expensive and is a bottleneck in introducing new items.
In collaborative filtering recommendation algorithms, we base our predictions purely on expressed preferences.
We think of storing user-item ratings as a matrix where the rows correspond to users and the columns correspond to items. We collect explicit ratings and record them. Unfortunately, people don’t provide many ratings. Also, people lie.
Gather feedback as explicit ratings, or implicitly through user behavior (page views, items put in the shopping cart, starred/saved for later, bought). Lots of work in rec systems has been done around explicitly rated items. People lie about their preferences; they are aspirational. I put Ken Burns in my queue, but I watch a lot of The Deadliest Catch. New data can be added incrementally to the model.
We can rely on implicit affinities for items instead of explicit ratings. We can track viewed or bought items and use a unary representation. Meaning, they happen, or… nothing, null, the void.
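A minimal sketch of this unary representation, built from a stream of interaction events (user names, item names, and event types here are hypothetical illustrations):

```python
# Sketch: unary (implicit) user-item "ratings" built from interaction events.
# An interaction either happened (the item is present) or we know nothing
# (the item is simply absent -- there is no explicit negative signal).
events = [
    ("alice", "banana_slicer", "viewed"),
    ("alice", "butter_knife", "bought"),
    ("bob",   "banana_slicer", "bought"),
]

ratings = {}
for user, item, _action in events:
    # Every interaction type collapses to the same unary signal.
    ratings.setdefault(user, set()).add(item)

print(sorted(ratings["alice"]))  # ['banana_slicer', 'butter_knife']
```

Because a new event only appends to one user's set, this representation also illustrates why implicit data can be added to the model incrementally.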
Users that have expressed similar taste in the past should express similar taste in the future. (We represent users as a vector of ratings for items.) Items that have had similar profiles of user interest should continue to appeal to a similar collection of users. (We represent items as a vector of ratings by user.) For a target entity, we predict the unknown rating using information from other similar entities. A simple and common approach is to take the weighted average of ratings, where the weights are some function of the similarity between entities: r_i = (Σ_j w_ij · r_j) / (Σ_j |w_ij|)
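The weighted-average prediction can be sketched in a few lines (the entity IDs and numbers are hypothetical; the similarity weights w_ij are assumed to have been computed elsewhere):

```python
# Sketch: predict a target entity's unknown rating as a similarity-weighted
# average over its neighbors:  r_i = sum_j(w_ij * r_j) / sum_j(|w_ij|)
def predict(neighbor_ratings, similarities):
    """neighbor_ratings: {entity_id: rating}; similarities: {entity_id: w_ij}."""
    num = sum(similarities[j] * r for j, r in neighbor_ratings.items())
    den = sum(abs(similarities[j]) for j in neighbor_ratings)
    return num / den if den else 0.0

# Two neighbors rated 4.0 and 2.0, with similarities 0.9 and 0.3:
print(predict({"a": 4.0, "b": 2.0}, {"a": 0.9, "b": 0.3}))  # 3.5
```

Note the normalization by the summed weights: without it, entities with many neighbors would systematically receive inflated scores.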
1. Identify candidate items to select recommendations from.
2. Generate predicted scores, often through weighted averages in neighborhoods.
3. Return a list of the top-rated items.
- We had the cold start problem => content-based recs.
- Where do we keep our data? What do we need access to when we generate recs?
- Design tables.
- Train the model.
Too much data! Too little data! Data sparsity: the large number of users and items makes it hard to get a good sample of the potential "taste space." Did you see Gravity? Did the emptiness of space strike you? This is like that, but much, much emptier. Cold start: troublesome because this approach requires ratings. If your system has no recorded interactions or explicit ratings, you can't do this. Usually such systems are bootstrapped from recommenders that don't require recorded user interactions, such as content-based recommenders: use item descriptions or tags to infer similarity between items, or to generate profiles for users. Use data volunteered by users during registration, or pulled in from Facebook profiles, to begin recommending items. Finally: the more ratings data you have, the slower your computation goes.
Useful because it is content agnostic. It can be used for any variety of content; you just need items, users, and ratings. It can be used across languages. (If you are Google, this is very important.) And it performs well: it is used in many successful commercial applications, such as Amazon, Netflix, and Google. It just works well.
Conceptual reasons Amazon uses item-item CF: Amazon has more users than items, so it is computationally cheaper to focus model building around item relationships, since there are fewer items. The relationships between items are also often simpler. It can still be used for personalized recs; it is especially useful when the only information you have about your user is the few items they have viewed in their current session. Item-based CF is specific in the types of items it recommends; user-based CF is more serendipitous in the types of items it recommends. It is less creepy to be recommended items very similar to the one you are currently looking at than to receive an accurate prediction when people don't expect one. You can fill screen real estate with similar items easily: "Customers who bought X also bought…" And you are already doing the needed computation during the model training phase.
Use banana slicer + banana slicer pile pic and eq here. Steps in generating a rec: 1. We don't need to estimate a rating for every item in the catalog if we will only present a few recs. Choose the candidate set of items to estimate ratings for wisely, or infer it from nearest neighbors. 2. Precompute item-item similarities: an O(N²M) operation, though in practice closer to O(NM) since ratings are sparse. 3. At scoring time, aggregate ratings and similarities to predict ratings for unrated items. How can you organize the model data in such a way that what you need to generate predictions is accessible at scoring time?
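The precomputation step (2) can be sketched as follows, using cosine similarity as one reasonable choice of similarity function (the toy ratings data is hypothetical):

```python
import math
from collections import defaultdict

# Sketch: precompute item-item cosine similarities from a sparse
# {user: {item: rating}} matrix.
def item_similarities(ratings):
    # Invert the matrix to get item -> {user: rating} vectors.
    item_vecs = defaultdict(dict)
    for user, items in ratings.items():
        for item, r in items.items():
            item_vecs[item][user] = r

    sims = defaultdict(dict)
    items = list(item_vecs)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = item_vecs[items[i]], item_vecs[items[j]]
            # Sparsity helps: only users who rated both items contribute
            # to the dot product.
            shared = set(a) & set(b)
            dot = sum(a[u] * b[u] for u in shared)
            norm = (math.sqrt(sum(v * v for v in a.values()))
                    * math.sqrt(sum(v * v for v in b.values())))
            sim = dot / norm if norm else 0.0
            sims[items[i]][items[j]] = sims[items[j]][items[i]] = sim
    return sims

ratings = {"u1": {"slicer": 5, "knife": 4}, "u2": {"slicer": 4, "knife": 5}}
print(item_similarities(ratings)["slicer"]["knife"])
```

At scoring time, the `sims` table is exactly what gets combined with a user's known ratings in the weighted average; storing it keyed by item is what makes per-item lookup fast.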
We need to be able to quickly access item-item similarities on a per-item basis. How can we do this quickly? Kiji provides the ability to access an outside data source while freshening, running a MapReduce job, or testing, through KeyValueStores. The KeyValueStore interface has many existing useful implementations you may use, or you may define your own custom one. Depending on your access needs and the total size of the data you need access to, you might use a file-backed KeyValueStore, or another KijiTable itself. Since item-item CF as we have stated it requires that we be able to access all item-item similarity pairs, our best choice is to use another Kiji table to store this information. Accessing this information quickly then becomes an issue of table layout design.
No rating => we should be able to query item-item similarities on a per-item basis.
The organization of data in your tables depends on your prediction function. We can see that in standard neighborhood-based interpolation in CF, we need to be able to access all of a user's ratings. Two tables: the users table contains user info, product ratings, views, purchases, etc.; the products table contains product info, and will be augmented with related/similar products.
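As an illustration only, the two-table design can be mocked up as plain Python dicts (the row keys, column names, and values below are hypothetical, not real Kiji DDL):

```python
# Illustrative sketch of the two-table design, mocked as Python dicts.
users_table = {
    "user_123": {
        "info": {"name": "Alice"},
        "ratings": {"banana_slicer": 4.0},   # explicit ratings
        "views": ["butter_knife"],           # implicit signals
        "purchases": ["banana_slicer"],
    },
}

products_table = {
    "banana_slicer": {
        "info": {"name": "Banana Slicer", "price": 3.99},
        # Augmented by the model-training job with (item, similarity) pairs:
        "similar": [("butter_knife", 0.97)],
    },
}

# Scoring a user touches only a handful of rows: the user's own ratings row,
# plus the precomputed "similar" column of each candidate item.
print(products_table["banana_slicer"]["similar"][0][0])  # butter_knife
```

The point of the layout is that every read at scoring time is a direct lookup by row key, which is the access pattern HBase-backed tables serve fastest.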
The organization of data in your tables depends on your prediction function. We can see that in standard neighborhood-based interpolation in CF, we need to be able to access all of a user's ratings.