The document discusses improving Reddit's search capabilities. It describes Reddit's search architecture, including how they improved relevance through signals like click data and comments. It also discusses scaling the infrastructure through techniques like Terraform, autoscaling replicas across availability zones, and building a faster data ingestion pipeline.
Extending Spark Graph for the Enterprise with Morpheus and Neo4j - Databricks
Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection and research.
Morpheus is an open-source library that is API compatible with Spark Graph and extends its functionality with:
A Property Graph catalog to manage multiple Property Graphs and Views
Property Graph Data Sources that connect Spark Graph to Neo4j and SQL databases
Extended Cypher capabilities including multiple graph support and graph construction
Built-in support for the Neo4j Graph Algorithms library
In this talk, we will walk you through the new Spark Graph module and demonstrate how we extend it with Morpheus to help enterprise users integrate Spark Graph into their existing Spark and Neo4j installations.
We will demonstrate how to explore data in Spark, use Morpheus to transform data into a Property Graph, and then build a Graph Solution in Neo4j.
GraphQL as an alternative approach to REST (as presented at Java2Days/CodeMon... - luisw19
Originally designed by Facebook to allow its mobile clients to define exactly what data should be sent back by an API and therefore avoid unnecessary roundtrips and data usage, GraphQL is a JSON-based query language for Web APIs. Since it was open sourced by Facebook in 2015, it has undergone very rapid adoption and many companies have already switched to the GraphQL way of building APIs – see http://GraphQL.org/users.
However, with so many hundreds of thousands of REST APIs publicly available today (and many thousands of others available internally), what are the implications of moving to GraphQL? Is it really worth the effort of replacing REST APIs, especially if they’re successful and performing well in production? What are the pros/cons of using GraphQL? What tools / languages can be used for GraphQL? What about API Gateways? What about API design?
With a combination of rich content and hands-on demonstrations, attend this session for a point of view on how to address these and many other questions, and most importantly get a better understanding of when/where/why/if GraphQL applies for your organisation or specific use case.
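The idea of a client defining exactly what data should be sent back can be illustrated with a small sketch in plain Python. This is not a GraphQL implementation: the helper `select_fields`, the nested-dict selection format, and the sample `user` record are all hypothetical and only mimic how a selection set trims a response.

```python
from typing import Any


def select_fields(record: dict[str, Any], selection: dict[str, Any]) -> dict[str, Any]:
    """Return only the fields named in `selection` (hypothetical helper).

    `selection` maps a field name either to True (take the value as-is)
    or to a nested selection dict (recurse into an object or list of objects).
    """
    result: dict[str, Any] = {}
    for field, sub in selection.items():
        value = record[field]
        if sub is True:
            result[field] = value
        elif isinstance(value, list):
            result[field] = [select_fields(item, sub) for item in value]
        else:
            result[field] = select_fields(value, sub)
    return result


# Made-up server-side record; a fixed REST endpoint would typically return all of it.
user = {
    "id": 7,
    "name": "Ada",
    "email": "ada@example.com",
    "posts": [
        {"id": 1, "title": "Hello", "body": "first post"},
        {"id": 2, "title": "World", "body": "second post"},
    ],
}

# Roughly what the GraphQL query `{ name posts { title } }` asks for.
selection = {"name": True, "posts": {"title": True}}
response = select_fields(user, selection)
```

Here `response` contains only the name and the post titles; the `email`, `id`, and `body` fields never leave the server, which is the bandwidth saving the abstract describes.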
Building Intelligent Applications, Experimental ML with Uber’s Data Science W... - Databricks
In this talk, we will explore how Uber enables rapid experimentation of machine learning models and optimization algorithms through Uber’s Data Science Workbench (DSW). DSW covers a series of stages in data scientists’ workflow including data exploration, feature engineering, machine learning model training, testing and production deployment. DSW provides interactive notebooks for multiple languages with on-demand resource allocation and lets users share their work through community features.
It also has support for notebooks and intelligent applications backed by Spark job servers. Deep learning applications based on TensorFlow and Torch can be brought into DSW smoothly, with resource management taken care of by the system. The environment in DSW is customizable, so users can bring their own libraries and frameworks. Moreover, DSW provides support for Shiny and Python dashboards as well as many other in-house visualization and mapping tools.
In the second part of this talk, we will explore the use cases where custom machine learning models developed in DSW are productionized within the platform. Uber applies machine learning extensively to solve some hard problems. Some use cases include calculating the right prices for rides in over 600 cities and applying NLP technologies to customer feedback to offer safe rides and reduce support costs. We will look at various options evaluated for productionizing custom models (server based and serverless). We will also look at how DSW integrates into Uber’s larger ML ecosystem, e.g. model/feature stores and other ML tools, to realize the vision of a complete ML platform for Uber.
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma... - Databricks
From training on billions of ad impressions to scaling gradient boosted trees with more than three million nodes, Ad Targeting at Yelp uses Apache Spark in many stages of its large-scale machine learning pipeline.
This session will explore examples of how Yelp employed and tweaked Spark to support big data feature engineering, visualizations and machine learning model training, evaluation and diagnostics. You’ll also hear about the challenges in building and deploying such a large-scale intelligent system in a production environment.
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ... - Databricks
Tuning a Spark ML model with cross-validation can be an extremely computationally expensive process. As the number of hyperparameter combinations increases, so does the number of models being evaluated. The default configuration in Spark is to evaluate each of these models one by one to select the best-performing one. When running this process with a large number of models, if the training and evaluation of a model does not fully utilize the available cluster resources, then that waste will be compounded for each model and lead to long run times.
Enabling model parallelism in Spark cross-validation, from Spark 2.3, will allow for more than one model to be trained and evaluated at the same time and make better use of cluster resources. We will go over how to enable this setting in Spark, what effect this will have on an example ML pipeline and best practices to keep in mind when using this feature.
Additionally, we will discuss ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines.
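The effect of evaluating several candidate models at once can be sketched without a cluster. In Spark itself, the relevant setting is the `parallelism` parameter of `CrossValidator`; the code below does not use Spark and only imitates the idea with a thread pool from the Python standard library. The function `train_and_score`, its toy scoring formula, and the parameter grid are hypothetical stand-ins for real model fitting.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params: dict) -> float:
    """Hypothetical stand-in for fitting one model and returning a validation score."""
    # Toy score that is highest at reg=0.1, depth=5; a real job would train a model here.
    return -abs(params["reg"] - 0.1) - abs(params["depth"] - 5) / 10


def grid_search(grid: dict, parallelism: int = 1):
    """Score every hyperparameter combination, `parallelism` candidates at a time."""
    keys = sorted(grid)
    candidates = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
    # parallelism=1 mirrors one-by-one evaluation; higher values overlap candidates.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(train_and_score, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


grid = {"reg": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
best_params, best_score = grid_search(grid, parallelism=4)
```

The selected model is the same for any `parallelism` value; only the wall-clock time changes, and only when a single candidate leaves resources idle, which is the situation the abstract describes.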
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac... - Databricks
Graph data and graph analytics are increasingly important in data science and engineering. Cypher is an open language used for querying and updating graph databases and analytics platforms, which is now available in the Apache Spark environment. Neo4j Morpheus leverages the open source graph language project to integrate data from Neo4j operational graph databases with Hive and JDBC SQL data sources, using new Cypher features like the Property Graph Catalog, named graphs, graph projection, parameterized graph view functions, and graph/table views. Input and output graphs can be loaded and stored as structured collections of DataFrames with strong graph schemas to ensure data consistency and graph query optimization. Property graphs can also be analyzed and transformed using graph algorithms such as those in the GraphFrames project. Besides describing and demonstrating these capabilities, this talk also discusses the Spark Project Improvement Proposal to bring Cypher into Spark 3.0, and outlines current work to unify Cypher with other graph query languages to form a new ISO standard Graph Query Language.
Speakers: Alastair Green, Martin Junghanns
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L... - Databricks
GE Aviation has hundreds of data scientists and engineers developing algorithms. The majority of these people do not have the time to learn Apache Spark and continue to develop on local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and software engineer will co-present to share how we approached the problem of building, unifying and scaling these algorithms.
It is a basic presentation which can help you understand the basic concepts of GraphQL and how it can be used to resolve the frontend integration of projects and help reduce data fetching time.
This presentation also explains the core features of GraphQL and why it is a great alternative to REST APIs, along with the procedure by which we can integrate it into our projects.
Flink has been used by many users in their ML use cases, such as real-time feature engineering and near-line inference. For the other ML use cases that are more batch-oriented, such as model training and validation, other systems are usually used. This talk, given at Flink Forward 2019, shows the efforts in the Flink community to let Flink cover all the ML use cases.
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu... - Databricks
Rather than running pre-defined queries embedded in dashboards, business users and data scientists want to explore data in more intuitive ways. Natural language interfaces for data exploration have gained considerable traction in industry. Their success is triggered by advancements in machine learning and by novel big data technologies that enable processing large amounts of data in real-time. However, even though these systems show significant progress, they have not yet reached the maturity level to support real users in data exploration scenarios either due to the lack of supported functionality or the narrow application scope, remaining one of the ‘holy grails’ of the data analytics community.
In this talk, we will present a Spark-based architecture of an intelligent data assistant, a system that combines real-time data processing and analytics over large amounts of data with user interaction in natural language, and we will argue why Spark is the right platform for next-gen intelligent data assistants.
Our intelligent data assistant
(a) enables a more natural interaction with the user through natural language;
(b) offers active guidance through explanations and suggestions;
(c) constantly learns and improves its performance.
Building an intelligent data assistant poses several challenges. Unlike with search engines, users tend to express sophisticated query logic and expect perfect results. The inherent complexity of natural languages complicates things in several ways. The intricacies of the data domain require that the system constantly expands its domain knowledge and its ability to interpret new data and user queries by constantly analyzing data and queries.
Our intelligent data assistant brings together several components, including natural language processing for understanding user queries and generating answers in natural language, automatic knowledge base construction techniques for learning about data sources and how to find the information requested, as well as deep learning methods for query disambiguation and domain understanding.
When Apache Spark Meets TiDB with Xiaoyu Ma - Databricks
During the past 10 years, big-data storage layers have mainly focused on analytical use cases. When it comes to analytical cases, users usually offload data onto a Hadoop cluster and perform queries on HDFS files. People struggle to deal with modifications on append-only storage and to maintain fragile ETL pipelines.
On the other hand, although Spark SQL has proven to be an effective parallel query processing engine, some tricks common in traditional databases are not available due to the characteristics of the underlying storage. TiSpark sits directly on top of the storage engine of a distributed database (TiDB), expands Spark SQL’s planning with its own extensions, and utilizes unique features of the database storage engine to achieve functions not possible for Spark SQL on HDFS. With TiSpark, users are able to perform queries directly on changing / fresh data in real time.
The takeaways from this talk are twofold:
— How to integrate Spark SQL with a distributed database engine and the benefit of it
— How to leverage Spark SQL’s experimental methods to extend its capacity.
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15 - MLconf
Spark DataFrames and ML Pipelines: In this talk, we will discuss two recent efforts in Spark to scale up data science: distributed DataFrames and Machine Learning Pipelines. These components allow users to manipulate distributed datasets and handle complex ML workflows, using intuitive APIs in Python, Java, and Scala (and R in development).
Data frames in R and Python have become standards for data science, yet they do not work well with Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.
On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.
Validating credit cards on mobile using deep learning - DataWorks Summit
The ability to validate credit cards using a mobile device is valuable for many e-commerce platforms including Uber. Not only does this provide a seamless experience to users, but it also enables the company to verify that a user has physical possession of the credit card. In this talk we will discuss our new application that uses object detection neural networks to scan credit cards.
Traditionally, machine learning models are hosted server side, but with challenges including high bandwidth inputs, low network speeds, and a greater focus on user privacy, hosting these models server side is not always feasible. Recent advancements bring up the possibility of deploying these models directly on mobile devices. We will discuss the challenges we faced in designing the vision model to run on a mobile device. These include reducing the model’s size footprint and optimizing the model to run on various types of mobile hardware.
Speakers
Richard Ash, Mobile Software Engineer, Uber
Lenny Evans, Data Scientist, Uber
Continuous Evaluation of Deployed Models in Production - Databricks
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn’t it be great if the model’s output were automatically monitored? If it could be visualized and sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience from building such a core machine-learning service: Model Evaluation.
Our service provides automated, continuous evaluation of the performance of a deployed model over commonly used metrics like the area under the curve (AUC) and root-mean-square error (RMSE). In addition, summary statistics about the model’s output and their distributions are also computed. The service also provides a dashboard to visualize the performance metrics, summary statistics and distributions of a model over time, along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g. Geography, Product type) to provide insights into model performance over different segments. The talk will describe various components that are required in building such a service and metrics of interest. Our system has a backend component built with Spark on Azure Databricks. The backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLlib for computing AUC sliced by different dimensions and other optimizations in Spark to improve compute and performance. Our front end and middle tier, built with Docker and Azure Webapp, provide visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying and using the above system.
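As a rough illustration of what "AUC sliced by different dimensions" means, the sketch below computes AUC separately per value of one input feature in plain Python. It is not the speakers' modified MLlib code: the function names, the row layout with `geo`, `label` and `score` keys, and the sample data are assumptions, and the pairwise AUC used here is quadratic, so it only suits small examples.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a slice contains only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_slice(rows, slice_key):
    """Group rows by the value of `slice_key` and compute AUC within each group."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        labels, scores = groups[row[slice_key]]
        labels.append(row["label"])
        scores.append(row["score"])
    return {value: auc(labels, scores) for value, (labels, scores) in groups.items()}


# Made-up scored predictions with a hypothetical "geo" feature.
rows = [
    {"geo": "US", "label": 1, "score": 0.9},
    {"geo": "US", "label": 0, "score": 0.2},
    {"geo": "US", "label": 1, "score": 0.6},
    {"geo": "US", "label": 0, "score": 0.7},
    {"geo": "EU", "label": 1, "score": 0.4},
    {"geo": "EU", "label": 0, "score": 0.8},
    {"geo": "EU", "label": 1, "score": 0.9},
]
sliced = auc_by_slice(rows, "geo")
```

In this toy data the model ranks the US slice better than the EU slice, the kind of per-segment difference that an overall metric would hide.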
AWS Mobile Week at the San Francisco Loft
Introduction to GraphQL
GraphQL is a query language for APIs and a runtime for fulfilling those queries. It gives clients the power to ask for exactly what they need, which makes it a great fit for modern web and mobile apps. In this talk, we explain why GraphQL was created, introduce you to the syntax and behavior, and then show how to use it to build powerful APIs for your data. We will also introduce you to AWS AppSync, a GraphQL-powered serverless backend for apps, which you can use to host GraphQL APIs and also add real-time and offline capabilities to your web and mobile apps. You can follow along if you have an AWS account – no GraphQL experience required!
Level: Beginner
Speaker: Rohan Deshpande - Sr. Software Developer Engineer, AWS Mobile Applications
How web works and browser works? (behind the scenes) - Vibhor Grover
How the web and browsers work: this presentation can help you understand what happens when you enter a URL in your browser, how the page is displayed by the browser, and how we can improve the performance of our applications.
No REST till Production – Building and Deploying 9 Models to Production in 3 ... - Databricks
The state of the art in productionizing machine learning models today primarily addresses building RESTful APIs. In the Digital Ecosystem, RESTful APIs are a necessary, but not sufficient, part of the complete solution for productionizing ML models. And according to recent research by the McKinsey Global Institute, applying AI in marketing and sales has the most potential value.
In the digital ecosystem, productionizing ML models at an accelerated pace becomes easy with:
A Feature Store with commonly used features that is available to all data scientists
Feature Stores that distill visitor behavior into ready-to-use feature vectors in a semi-supervised manner
A data pipeline that can support the challenging demands of the digital ecosystem to feed the Feature Store on an ongoing basis
Pipeline templates that support the challenging demands of the digital ecosystem and that feed the Feature Store, predict, and distribute predictions on an ongoing basis
With these, a major electronics manufacturer was able to build and productionize a new model in 3 weeks.
The use case for the model is retargeting advertising; it analyzes the behavior of website visitors and builds customized audiences of the visitors that are most likely to purchase 9 different products. Using the model, this manufacturer was able to maintain the same level of purchases with half of the retargeting media spend, increasing the efficiency of their marketing spend by 100%.
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N... - Databricks
Building flexible machine learning libraries adapted for Netflix’s use cases is paramount in our continued efforts to better model our users’ behaviors and provide them great personalized video recommendations.
This talk introduces one such Spark-based stratification library developed at Netflix to aid “Training Set Stratification” in offline machine learning workflows. Originally created to implement user selection algorithms in our data snapshotting infrastructure, the library has evolved to cater to general-purpose stratification use cases in ML pipelines. We will talk about how, using the stratification library’s DSL (domain specific language) and its underlying Spark-based implementation, one can easily express complex sampling rules and dynamically carve out matching portions of a Spark DataFrame.
For example, arbitrary rules governing the distributions of user attributes (and combinations thereof) such as origin country, video play frequency, tenure, etc., can be easily enforced when constructing an ML training data set. The demo section of the talk will showcase example usages of the stratification library in a Jupyter notebook.
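The general idea of enforcing a distribution over combinations of user attributes can be sketched in a few lines of plain Python. This is not the Netflix library or its DSL, which the abstract does not show; the function `stratified_sample`, the quota format, and the synthetic `users` list are all hypothetical, and a real pipeline would apply comparable rules to a Spark DataFrame.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, quotas, seed=0):
    """Draw up to quotas[stratum] rows from each stratum, where a stratum is key(row)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for stratum, quota in quotas.items():
        members = strata.get(stratum, [])
        # Never ask for more rows than the stratum actually holds.
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Synthetic users: two thirds "US", one third "BR", alternating tenure.
users = [
    {"id": i, "country": "US" if i % 3 else "BR", "tenure": "new" if i % 2 else "old"}
    for i in range(60)
]

# Rule: equal representation of every (country, tenure) combination.
quotas = {("US", "new"): 5, ("US", "old"): 5, ("BR", "new"): 5, ("BR", "old"): 5}
training_set = stratified_sample(
    users, key=lambda u: (u["country"], u["tenure"]), quotas=quotas
)
```

Although the source population is skewed toward one country, the resulting training set holds the same number of rows for every attribute combination named in the quotas.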
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ... - Databricks
We all know what they say – the bigger the data, the better. But when the data gets really big, how do you use it? This talk will cover three of the most popular deep learning frameworks: TensorFlow, Keras, and Deep Learning Pipelines, and when, where, and how to use them.
We’ll also discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data), as well as help you answer questions such as:
– As a developer how do I pick the right deep learning framework for me?
– Do I want to develop my own model or should I employ an existing one?
– How do I strike a trade-off between productivity and control through low-level APIs?
In this session, we will show you how easy it is to build an image classifier with TensorFlow, Keras, and Deep Learning Pipelines in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you, and perhaps with a better sense for how to fool an image classifier!
My talk at Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it has multiple challenges, such as code duplication, limited extensibility, inconsistency, and extra moving parts. In this talk we discussed an alternative solution that does not introduce custom model formats or new standards, is not based on an export/import workflow, and shares the Apache Spark API.
How to Build a Recommendation Engine on Spark - Caserta
How to Build a Recommendation Engine on Spark was a presentation given by Joe Caserta, CEO and founder of Caserta Concepts, at @AnalyticsWeek in Boston.
Boston's Data AnalyticsStreet Conference is a packed 2-day event with thought-provoking keynotes, knowledge-filled sessions, intense workshops, insightful panels, and real-world case studies, engaging the analytics community with the latest methodologies and trends. The conference offers the largest Speaker-to-Attendee ratio for unmatched networking and learning opportunities.
For more information on the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/.
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Databricks
Graph data and graph analytics are increasingly important in data science and engineering. Cypher is an open language used for querying and updating graph databases and analytics platforms, which is now available in the Apache Spark environment. Neo4j Morpheus leverages the open source graph language project to integrate data from Neo4j operational graph databases with Hive and JDBC SQL data sources, using new Cypher features like the Property Graph Catalog, named graphs, graph projection, parameterized graph view functions, and graph/table views. Input and output graphs can be loaded and stored as structured collections of DataFrames with strong graph schemas to ensure data consistency and graph query optimization. Property graphs can also be analyzed and transformed using graph algorithms such as those in the GraphFrames project. Besides describing and demonstrating these capabilities, this talk also discusses the Spark Project Improvement Proposal to bring Cypher into Spark 3.0, and outlines current work to unify Cypher with other graph query languages to form a new ISO standard Graph Query Language.
Speakers: Alastair Green, Martin Junghanns
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
GE Aviation has hundreds of data scientists and engineers developing algorithms. The majority of these people do not have the time to learn Apache Spark and continue to develop on local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and software engineer will co-present to share how we approached the problem of building, unifying and scaling these algorithms.
It is a basic presentation which can help you understand the basic concepts about Graphql and how it can be used to resolve the frontend integration of projects and help in reducing the data fetching time
This presentation also explains the core features of Graphql and why It is a great alternative for REST APIs along with the procedure with which we can integrate it into our projects
Flink has been used by many users in their ML use cases, such as real-time feature engineering and near-line inference. For the other ML use cases that are more batch-oriented, such as model training, validation, usually other systems are used. This talk we give in Flink Forward 2019 show the efforts in Flink community to let Flink cover all the ML use cases.
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Langu...Databricks
Rather than running pre-defined queries embedded in dashboards, business users and data scientists want to explore data in more intuitive ways. Natural language interfaces for data exploration have gained considerable traction in industry. Their success is triggered by advancements in machine learning and by novel big data technologies that enable processing large amounts of data in real-time. However, even though these systems show significant progress, they have not yet reached the maturity level to support real users in data exploration scenarios either due to the lack of supported functionality or the narrow application scope, remaining one of the ‘holy grails’ of the data analytics community.
In this talk, we will present a Spark-based architecture of an intelligent data assistant, a system that combines real-time data processing and analytics over large amounts of data with user interaction in natural language, and we will argue why Spark is the right platform for next-gen intelligent data assistants.
Our intelligent data assistant
(a) enables a more natural interaction with the user through natural language;
(b) offers active guidance through explanations and suggestions;
(c) constantly learns and improves its performance.
Building an intelligent data assistant poses several challenges. Unlike search engine users, users of a data assistant tend to express sophisticated query logic and expect perfect results. The inherent complexity of natural languages complicates things in several ways. The intricacies of the data domain require that the system constantly expands its domain knowledge and its ability to interpret new data and user queries by constantly analyzing data and queries.
Our intelligent data assistant brings together several components, including natural language processing for understanding user queries and generating answers in natural language, automatic knowledge base construction techniques for learning about data sources and how to find the information requested, as well as deep learning methods for query disambiguation and domain understanding.
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
Over the past 10 years, big-data storage layers have mainly focused on analytical use cases. For analytical workloads, users usually offload data onto a Hadoop cluster and run queries on HDFS files. People struggle to handle modifications on append-only storage and to maintain fragile ETL pipelines.
On the other hand, although Spark SQL has proven to be an effective parallel query processing engine, some tricks common in traditional databases are unavailable because of the characteristics of the underlying storage. TiSpark sits directly on top of the storage engine of a distributed database (TiDB), extends Spark SQL’s planning with its own extensions, and uses unique features of the database storage engine to achieve functions not possible for Spark SQL on HDFS. With TiSpark, users can query changing, fresh data directly in real time.
The takeaways from this talk are twofold:
— How to integrate Spark SQL with a distributed database engine and the benefit of it
— How to leverage Spark SQL’s experimental methods to extend its capacity.
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15MLconf
Spark DataFrames and ML Pipelines: In this talk, we will discuss two recent efforts in Spark to scale up data science: distributed DataFrames and Machine Learning Pipelines. These components allow users to manipulate distributed datasets and handle complex ML workflows, using intuitive APIs in Python, Java, and Scala (and R in development).
Data frames in R and Python have become standards for data science, yet they do not work well with Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.
On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.
Validating credit cards on mobile using deep learningDataWorks Summit
The ability to validate credit cards using a mobile device is fruitful for many e-commerce platforms including Uber. Not only does this provide a seamless experience to users, but it also enables the company to verify that a user has physical possession of the credit card. In this talk we will discuss our new application that uses object detection neural networks to scan credit cards.
Traditionally, machine learning models are hosted server side, but with challenges including high bandwidth inputs, low network speeds, and a greater focus on user privacy, hosting these models server side is not always feasible. Recent advancements bring up the possibility of deploying these models directly on the mobile devices. We will discuss the challenges we faced in designing the vision model to run on a mobile device. These include reducing the model’s size footprint and optimizing the model to run on various different types of mobile hardware.
Speakers
Richard Ash, Mobile Software Engineer, Uber
Lenny Evans, Data Scientist, Uber
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn’t it be great if the model’s outputs were automatically monitored? If they could be visualized and sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience building one such core machine-learning service: Model Evaluation.
Our service provides automated, continuous evaluation of a deployed model’s performance over commonly used metrics such as area under the curve (AUC) and root-mean-square error (RMSE). In addition, it computes summary statistics about the model’s outputs and their distributions. The service also provides a dashboard to visualize the performance metrics, summary statistics and distributions of a model over time, along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g. geography, product type) to provide insights into model performance over different segments. The talk will describe the components required to build such a service and the metrics of interest. Our system has a backend component built with Spark on Azure Databricks. The backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLlib to compute AUC sliced by different dimensions, and about other optimizations in Spark that improve compute and performance. Our front end and middle tier, built with Docker and Azure Webapp, provide visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying and using the above system.
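The idea of an AUC sliced by an input feature can be sketched in plain Python. This is a minimal single-machine illustration, not the modified Spark MLlib code the talk describes; the function names `auc` and `sliced_auc`, the row fields `geo`, `label` and `score`, and the sample data are all assumptions made up for the example.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Returns None when only one class is present, since AUC is undefined there.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None

    # Give tied scores their average rank so each tie counts as half a win.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def sliced_auc(rows, slice_key):
    """Group rows by one feature and compute the AUC within each group."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        labels, scores = groups[row[slice_key]]
        labels.append(row["label"])
        scores.append(row["score"])
    return {k: auc(labels, scores) for k, (labels, scores) in groups.items()}


# Hypothetical scored predictions with one slicing dimension.
rows = [
    {"geo": "US", "label": 1, "score": 0.9},
    {"geo": "US", "label": 0, "score": 0.2},
    {"geo": "US", "label": 1, "score": 0.6},
    {"geo": "US", "label": 0, "score": 0.7},
    {"geo": "EU", "label": 1, "score": 0.8},
    {"geo": "EU", "label": 0, "score": 0.3},
]
per_geo = sliced_auc(rows, "geo")
```

In a distributed setting the same grouping step would presumably become a group-by on the slicing column, with the per-group ranking done inside each partition.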
AWS Mobile Week at the San Francisco Loft
Introduction to GraphQL
GraphQL is a query language for APIs and a runtime for fulfilling those queries. It gives clients the power to ask for exactly what they need, which makes it a great fit for modern web and mobile apps. In this talk, we explain why GraphQL was created, introduce you to the syntax and behavior, and then show how to use it to build powerful APIs for your data. We will also introduce you to AWS AppSync, a GraphQL-powered serverless backend for apps, which you can use to host GraphQL APIs and also add real-time and offline capabilities to your web and mobile apps. You can follow along if you have an AWS account – no GraphQL experience required!
Level: Beginner
Speaker: Rohan Deshpande - Sr. Software Developer Engineer, AWS Mobile Applications
How web works and browser works ? (behind the scenes)Vibhor Grover
This presentation explains how the web and browsers work: what happens when you enter a URL in your browser, how the browser displays the page, and how we can improve the performance of our applications.
No REST till Production – Building and Deploying 9 Models to Production in 3 ...Databricks
The state of the art in productionizing machine learning models today primarily addresses building RESTful APIs. In the digital ecosystem, RESTful APIs are a necessary, but not sufficient, part of the complete solution for productionizing ML models. And according to recent research by the McKinsey Global Institute, applying AI in marketing and sales has the most potential value.
In the digital ecosystem, productionizing ML models at an accelerated pace becomes easy with:
Feature Store with commonly used features that is available for all data scientists
Feature Stores that distill visitor behavior into ready-to-use feature vectors in a semi-supervised manner
Data pipeline that can support the challenging demands of the digital ecosystem to feed the Feature Store on an ongoing basis
Pipeline templates that support the challenging demands of the digital ecosystem to feed the Feature Store, predict, and distribute predictions on an ongoing basis
With these, a major electronics manufacturer was able to build and productionize a new model in 3 weeks.
The use case for the model is retargeting advertising; it analyzes the behavior of website visitors and builds customized audiences of the visitors that are most likely to purchase 9 different products. Using the model, this manufacturer was able to maintain the same level of purchases with half of the retargeting media spend, increasing the efficiency of their marketing spend by 100%.
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Databricks
Building flexible machine learning libraries adapted for Netflix’s use cases is paramount in our continued efforts to better model our users’ behaviors and provide them great personalized video recommendations.
This talk introduces one such Spark-based stratification library developed at Netflix to aid “Training Set Stratification” in offline machine learning workflows. Originally created to implement user selection algorithms in our data snapshotting infrastructure, the library has evolved to cater to general-purpose stratification use cases in ML pipelines. We will talk about how, using the stratification library’s DSL (domain-specific language) and its underlying Spark-based implementation, one can easily express complex sampling rules and dynamically carve out matching portions of a Spark DataFrame.
For example, arbitrary rules governing the distributions of user attributes (and combinations thereof), such as origin country, video play frequency and tenure, can be easily enforced when constructing an ML training data set. The demo section of the talk will showcase example usages of the stratification library in a Jupyter notebook.
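To make "enforcing a distribution over a user attribute" concrete, here is a minimal sketch in plain Python. It does not use the Netflix library or its DSL, neither of which is shown in the abstract; `stratified_sample`, the `country` attribute, the target fractions and the synthetic users are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, target_fractions, total, seed=0):
    """Draw about `total` rows so each value of `key` gets its target share.

    Raises ValueError when a stratum has too few rows to meet its quota.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)

    sample = []
    for value, fraction in target_fractions.items():
        quota = round(total * fraction)
        pool = by_stratum.get(value, [])
        if len(pool) < quota:
            raise ValueError(f"stratum {value!r} has {len(pool)} rows, need {quota}")
        # Sample without replacement inside each stratum.
        sample.extend(rng.sample(pool, quota))
    return sample


# Synthetic population that is heavily skewed toward one country.
users = (
    [{"user_id": i, "country": "US"} for i in range(80)]
    + [{"user_id": 100 + i, "country": "BR"} for i in range(15)]
    + [{"user_id": 200 + i, "country": "JP"} for i in range(5)]
)

# Rule: the training set should be 50% US, 30% BR, 20% JP.
training_users = stratified_sample(
    users,
    key="country",
    target_fractions={"US": 0.5, "BR": 0.3, "JP": 0.2},
    total=20,
)
```

The sample follows the requested 50/30/20 split even though the source population is 80/15/5, which is the kind of reshaping a stratification rule is meant to express.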
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
We all know what they say – the bigger the data, the better. But when the data gets really big, how do you use it? This talk will cover three of the most popular deep learning frameworks: TensorFlow, Keras, and Deep Learning Pipelines, and when, where, and how to use them.
We’ll also discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data), as well as help you answer questions such as:
– As a developer how do I pick the right deep learning framework for me?
– Do I want to develop my own model or should I employ an existing one?
– How do I strike a trade-off between productivity and control through low-level APIs?
In this session, we will show you how easy it is to build an image classifier with TensorFlow, Keras, and Deep Learning Pipelines in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you, and perhaps with a better sense for how to fool an image classifier!
My talk at Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
One option is to export the model to PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it brings multiple challenges, such as code duplication, limited extensibility, inconsistency, and extra moving parts. In this talk we discussed an alternative solution that does not introduce custom model formats or new standards, is not based on an export/import workflow, and shares the Apache Spark API.
How to Build a Recommendation Engine on SparkCaserta
How to Build a Recommendation Engine on Spark was a presentation given by Joe Caserta, CEO and founder of Caserta Concepts, at @AnalyticsWeek in Boston.
Boston's Data AnalyticsStreet Conference is a packed 2-day event with thought-provoking keynotes, knowledge-filled sessions, intense workshops, insightful panels, and real-world case studies, engaging the analytics community with the latest methodologies and trends. The conference offers the largest speaker-to-attendee ratio for unmatched networking and learning opportunities.
For more information on the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/.
Conversion Models: A Systematic Method of Building Learning to Rank Training ...Lucidworks
When using user signals to improve relevance, what should you use? Clicks are more frequent, but really only correspond to a search result looking attractive. A conversion is a powerful signal of true relevance but occurs less frequently. Can we combine shallow "this looks interesting" click events along with strong, but rare conversion signals in a robust fashion to generate learning to rank training data? In this talk, we introduce click models, an industry-proven way of measuring search result attractiveness from clicks, and propose a systematic way of incorporating conversion data into click models. Whether your industry is conversion heavy (like e-commerce), or lacking in any clear conversion signal (like publishing) you'll take away from this talk a system for turning any search analytics into robust judgments and training data. Because, after all, there is no AI-based Search without good training data!
Doug Turnbull, OpenSource Connections
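As a rough illustration of turning click and conversion logs into graded judgments, the sketch below corrects clicks for position bias with clicks over expected clicks (COEC) and then adds a conversion-per-click term. This is a deliberately simple stand-in, not the click models or the conversion method presented in the talk; the log fields, the `conversion_weight` parameter and the additive combination are assumptions chosen for the example.

```python
from collections import defaultdict


def position_ctr(sessions):
    """Average click-through rate at each rank, across all queries."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for s in sessions:
        shown[s["rank"]] += 1
        clicked[s["rank"]] += s["clicked"]
    return {rank: clicked[rank] / shown[rank] for rank in shown}


def judgments(sessions, conversion_weight=1.0):
    """Score each (query, doc) pair as COEC plus a weighted conversion rate."""
    baseline = position_ctr(sessions)
    clicks = defaultdict(int)
    expected = defaultdict(float)
    conversions = defaultdict(int)
    for s in sessions:
        pair = (s["query"], s["doc"])
        clicks[pair] += s["clicked"]
        # Expected clicks: what an average result at this rank would get.
        expected[pair] += baseline[s["rank"]]
        conversions[pair] += s["converted"]

    scores = {}
    for pair in expected:
        coec = clicks[pair] / expected[pair] if expected[pair] > 0 else 0.0
        conv_rate = conversions[pair] / clicks[pair] if clicks[pair] else 0.0
        scores[pair] = coec + conversion_weight * conv_rate
    return scores


# Hypothetical impression log: one row per result shown to a user.
sessions = [
    {"query": "tv", "doc": "A", "rank": 1, "clicked": 1, "converted": 0},
    {"query": "tv", "doc": "B", "rank": 2, "clicked": 1, "converted": 1},
    {"query": "tv", "doc": "A", "rank": 1, "clicked": 0, "converted": 0},
    {"query": "tv", "doc": "B", "rank": 2, "clicked": 0, "converted": 0},
]
scores = judgments(sessions)
```

Here both documents attract clicks at the rate expected for their rank, so the conversion on document B is what separates them. The resulting scores could then be bucketed into relevance grades for learning-to-rank training data.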
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your QueriesSanjay Willie
A comprehensive exploration of artificial intelligence, particularly focusing on its historical development, notable milestones, and various applications. It begins with a brief history of AI, tracing its ancient philosophical roots through to contemporary advancements like quantum computing and advanced robotics. Key historical highlights include the development of "Shakey," the first mobile robot capable of reasoning about its environment, and ELIZA, the first chatbot.
The presentation also covers the evolution of self-driving technology, starting with Ernst Dickmanns' pioneering work in the 1980s. It delves into the profound impact of AI in games, exemplified by AlphaGo's victory over a human Go champion.
Furthermore, it details the types of AI and machine learning, emphasizing the revolutionary role of ChatGPT. Introduced by OpenAI, ChatGPT quickly became the fastest app to reach 100 million users due to its versatile capabilities in language processing and interaction.
Lastly, the slides provide practical insights on effectively utilizing ChatGPT, such as optimizing input to enhance outcomes and integrating ChatGPT's API into various applications. The presentation is aimed at both educating on AI's capabilities and demonstrating its practical applications in modern technology scenarios.
Digital marketing is a leading industry, and its most popular field is SEO, so why wait? Learn SEO and prepare for SEO interview questions. Due to the pandemic, the digital advertising industry grew by about 15% in 2020.
Become an artisan web analytics practitioner by building your own analytics QA tool. For Adobe Analytics but you could do the same with Google Analytics, A/B testing, tag management, VOC tools and many other analytics tools
Uncovering 'not provided' keyword data Clayton Wood
The platform can set up all of the filters and views in Google Analytics automatically. There's a free version so you can build one on your own as well.
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...Daniel Zivkovic
Serverless Toronto's 6th-anniversary event helps IT pros understand and prepare for the #GenAI tsunami ahead. You'll gain situational awareness of the LLM Landscape, receive condensed insights, and actionable advice about RAG in 2024 from Google AI Lead Mark Ryan and LlamaIndex creator Jerry Liu. We chose #RAG (Retrieval-Augmented Generation) because it is the predominant paradigm for building #LLM (Large Language Model) applications in enterprises today - and that's where the jobs will be shifting. Here is the recording: https://youtu.be/P5xd1ZjD-Os?si=iq8xibj5pJsJ62oW
Crawlable Spatial Data - #Geo4Web research topic #3Dimitri van Hees
Outcomes of topic 3 of the "Spatial Data on the Web" testbed, covering best practices for publishing crawlable, developer-friendly and machine-friendly geospatial data on the web.
How to unlock the secrets of effortless keyword research with ChatGPT.pptxDaniel Smullen
A guide on how to do keyword research using ChatGPT. It compares ChatGPT keyword research with standard keyword research, covers the pros and cons, and offers some really great keyword research prompts to try within ChatGPT.
The right path to making search relevant - Taxonomy Bootcamp London 2019OpenSource Connections
Three aspects of search quality; focusing on relevance; why this is not just a technology problem; measuring search maturity & relevance; open source tools and techniques; Solr and Elasticsearch
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
With ecommerce experiencing explosive growth, it seems intuitive that the B2B segment of that ecosystem is mirroring the same trajectory. That said, B2B has very different needs when it comes to transacting with the same style of experiences that we see in B2C. For instance, B2B ecommerce is about precision findability, whereas B2C customers can convert at higher rates when they’re just browsing online. In order for the B2B buying experience to be successful, search needs to be tuned to meet the unique needs of the segment.
In this webinar with Forrester senior analyst Joe Cicman, you’ll learn:
-Which verticals in B2B will drive the most growth, and how machine-learning powered personalization tactics can be deployed to support those specific verticals
-Why an omnichannel selling approach must be deployed in order to see success in B2B
-How deploying content search capabilities will support a longer sales cycle at scale
-What the next steps are to support a robust B2B commerce strategy supported by new technology
Speakers
Joe Cicman, Senior Analyst, Forrester
Jenny Gomez, VP of Marketing, Lucidworks
Customer loyalty starts with quickly responding to your customer’s needs. When it comes to resolving open support cases, time is of the essence. Time spent searching for answers adds up and creates inefficiencies in resolving cases at scale. Relevant answers need to be a few clicks away and easily accessible for agents directly from their service console.
We will explore how Lucidworks’ Agent Insights application automatically connects agents with the correct answers and resources. You’ll learn how to:
-Configure a proactive widget in an agent’s case view page to access resources across third-party systems (such as Sharepoint, Confluence, JIRA, Zendesk, and ServiceNow).
-Easily set up query pipelines to autonomously route assets and resources that are relevant to the case-at-hand—directly to the right agent.
-Identify subject matter experts within your support data and access tribal knowledge with lightning-fast speed.
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
Lunch and Learn during Retail TouchPoints #RIC21 virtual event.
***
Crate & Barrel’s previous search solution couldn’t provide its shoppers with an online search and browse experience consistent with the customer-centric Crate & Barrel brand. Meanwhile, Crate & Barrel merchandisers spent the bulk of their time manually creating and maintaining search rules. The search experience impacted customer retention, loyalty, and revenue growth.
Join this lunch & learn for an interactive chat on how Crate & Barrel partnered with Lucidworks to:
-Improve search and browse by modernizing the technology stack with ML-based personalization and merchandising solutions
-Enhance the experience for both shoppers and merchandisers
-Explore signals to transform the omnichannel shopping experience
Questions? Visit https://lucidworks.com/contact/
Learn how to guide customers to relevant products using eCommerce search, hyper-personalisation, and recommendations in our ‘Best-In-Class Retail Product Discovery’ webinar.
Nowadays, shoppers want their online experience to be engaging, inspirational and fulfilling. They want to find what they’re looking for quickly and easily. If the sought after item isn’t available, they want the next best product or content surfaced to them. They want a website to understand their goals as though they were talking to a sales assistant in person, in-store.
In this webinar, we explore IMRG industry data insights and a best-in-class example of retail product discovery. You’ll learn:
- How AI can drive increased revenue through hyper-personalised experiences
- How user intent can be easily understood and results displayed immediately
- How merchandisers can be empowered to curate results and product placement – all without having to rely on IT.
Presented by:
Dave Hawkins, Principal Sales Engineer - Lucidworks
Matthew Walsh, Director of Data & Retail - IMRG
Connected Experiences Are Personalized ExperiencesLucidworks
Many companies claim personalization and omnichannel capabilities are top priorities. Few are able to deliver on those experiences.
For a recent Lucidworks-commissioned study, Forrester Consulting surveyed 350+ global business decision-makers to see what gets in the way of achieving these goals. They discovered that inefficient technology, lack of behavioral insights, and failure to tie initiatives to enterprise-wide goals are some of the most frequent blockers to personalization success.
Join guest speaker, Forrester VP and Principal Analyst, Brendan Witcher, and Lucidworks CEO, Will Hayes, to hear the results of the Forrester Consulting study, how to avoid “digital blindness,” and how to apply VoC data in real-time to delight customers with personalized experiences connected across every touchpoint.
In this webinar, you’ll learn:
- Why companies who utilize real-time customer signals report more effective personalization
- How to connect employees and customers in a shared experience through search and browse
- How Lucidworks clients Lenovo, Morgan Stanley and Red Hat fast-tracked improvements in conversion, engagement and customer satisfaction
Featuring
- Will Hayes, CEO, Lucidworks
- Brendan Witcher, VP, Principal Analyst, Forrester
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
Intelligent Policing. Leveraging Data to more effectively Serve Communities.
Policing in the next decade is anticipated to be very different from historical methods: more data-driven, more focused on the intricacies of the communities it serves, and more open and collaborative, so that informed recommendations become a reality. Whether it’s social populations, NIBRS or organizational improvement that’s the driver, the IT requirement is largely the same: provide 360-degree access to large volumes of siloed data to gain a full understanding of existing connections and patterns for improved insight and recommendations.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
-The technology needs of an intelligent police force.
-How a Global Search improves an officer's interaction with existing data.
Featuring:
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
Policing in the next decade is anticipated to be very different from historical methods: more data-driven, more focused on the intricacies of the communities it serves, and more open and collaborative, so that informed recommendations become a reality. Whether it’s social populations, NIBRS or organizational improvement that’s the driver, the IT requirement is largely the same: provide 360-degree access to large volumes of siloed data to gain a full understanding of existing connections and patterns for improved insight and recommendations.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
The technology needs of an intelligent police force.
How a Global Search improves an officer's interaction with existing data.
Featuring
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
Wish your conversion rates were higher? Can’t figure out how to efficiently and effectively serve all the visitors on your site? Embarrassed by the quality of your product discovery experience? The bar is high and the influx of online shopping over recent months has reminded us that the opportunities are real. We’re all deep in holiday prep, but let’s take a few minutes to think about January 2021 and beyond. How can we position ourselves for success with our customers and against our competition?
Grab your lunch and let’s dive into three strategies that need to be part of your 2021 roadmap. You don’t need an army to get there. But you do need to take action and capitalize on the shoppers abandoning the product discovery journey on your site.
In this session, attendees will find out how to:
-Take control of merchandising at scale;
-Implement hands-free search relevancy; and
-Address personalization challenges.
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
For a personalized search experience, search curation requires robust text interpretation, data enrichment, relevancy tuning and recommendations. In order to achieve this, language and entity identification are crucial.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion, thanks to the technology partnership between Lucidworks and Basistech.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-Use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
Before COVID-19, almost 80% of the US workforce worked in service jobs that involve in-person interaction with strangers. Now, leaders of service organizations must reshape their offerings during the pandemic and prepare for whatever the new normal turns out to be. Our three panelists will share ideas for adapting their service businesses, now that closer-than-six-feet isn’t an option.
Join Lucidworks as we talk shop with 3 service business leaders, covering:
-Common impacts of the pandemic on service businesses (and what to do about them),
-How service teams can maintain a human touch across virtual channels, and
-Plans for the future, before and after the pandemic subsides.
Featuring
-Sara Nathan, President & CEO, AMIGOS
-Anthony Carruesco, Founder, AC Fly Fishing
-Sara Bradley, Chef and Proprietor, Freight House
-Justin Sears, VP Product Marketing, Lucidworks
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
The COVID-19 pandemic has forced companies to support far more customers and employees through digital channels than ever before. Many are turning to chatbots to help meet increasing demand, but traditional rules-based approaches can’t keep up. Our new Smart Answers add-on to Lucidworks Fusion makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
Watch our on-demand webinar showcasing Smart Answers on Lucidworks Fusion. This technology makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
In this webinar, we’ll cover off:
-How search and deep learning extend conversational frameworks for improved experiences
-How Smart Answers improves customer care, call deflection, and employee self-service
Making Reddit Search Relevant and Scalable - Anupama Joshi & Jerry Bao, Reddit
1. Making Reddit Search
Relevant and Scalable
Anupama Joshi
Senior Engineering Manager, Search
Jerry Bao
Senior Software Engineer, Search
2. Agenda
• What is Reddit?
• Search Architecture
• Improving our Relevance
• The History of Search @ Reddit
• Scaling our Infrastructure
• Q&A
3. What is Reddit?
Reddit is a network of communities where
individuals can find experiences built
around their interests, hobbies and
passions
It’s where people converse about the
things that are most important to them
5. Reddit by the numbers
Alexa Rank (US/World): 5th/18th
MAU: 400M+
Communities: 1M+
Posts per day: 440K+
Comments per day: 3.5M+
Votes per day: 82M+
Searches per day: 68M+
18. Show and Tell: A better subreddit search
Challenge: Redditors are very creative in their subreddit naming (e.g. r/superbowl
is about superb owl pictures), which, while fun, poses a challenge for discovery.
Answer: faceted search on posts!
21. Show and Tell: Better Post Search
● Post search with phrase matching of selftext
The challenge: What about images and link posts?
Answer - Comments
● Comments are important but which comments are most relevant to the post?
● How do we separate the signal from the noise?
Answer - HVTs (high value tokens)
● HVTs are the highest scoring tf-idf terms from comment sections.
● Index and match on these HVTs along with post selftexts and titles.
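The HVT idea above can be sketched as follows. This is a minimal illustration, not Reddit's implementation: the tokenizer, the tf-idf variant (term frequency times log(N/df)), and the names `tokenize` and `high_value_tokens` are all assumptions made for the example. Each post's whole comment section is treated as one document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9']+", text.lower())


def high_value_tokens(comments_by_post, top_k=3):
    """Return the top_k highest-scoring tf-idf terms per post's comment section."""
    # One term-count "document" per post, built from all of its comments.
    docs = {
        post_id: Counter(tok for comment in comments for tok in tokenize(comment))
        for post_id, comments in comments_by_post.items()
    }
    n_docs = len(docs)

    # Document frequency: in how many comment sections each term appears.
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    result = {}
    for post_id, counts in docs.items():
        total = sum(counts.values()) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score descending, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[post_id] = ranked[:top_k]
    return result


comments_by_post = {
    "post_a": ["shabooya roll call", "shabooya sha sha shabooya"],
    "post_b": ["great roll", "roll on"],
}
hvts = high_value_tokens(comments_by_post, top_k=2)
```

The resulting tokens would then be indexed in their own field next to the title and selftext, so that an image or link post can match on what commenters said about it.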
22. Result: Better Post Search
Qualitatively, we saw some users notice almost immediately when we first introduced HVTs.
For some queries, the difference is
quite stark. The following are
search results for the query
‘shabooya’. Note how ‘shabooya’
doesn’t appear anywhere in the title
or the body of the first three post
results, but you can see the phrase
show up in the comments.
23. Result: Better Post Search
● Post click-through rate (CTR) (+3.15%)
● Relevancy ranking for navigational searches (MRR) (+4.01%)
● Search experience improvements for navigational searches due to increased
recall on posts with poor title or body text
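For readers unfamiliar with MRR, here is a small sketch of the standard mean reciprocal rank calculation. The slides do not say how Reddit computes it, so treating "the rank of the first clicked result per search" as the input is an assumption, as is the function name.

```python
def mean_reciprocal_rank(first_click_ranks):
    """Average of 1/rank over searches.

    first_click_ranks holds the 1-based rank of the first clicked result
    for each search, or None when nothing was clicked (contributes 0).
    """
    if not first_click_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_click_ranks if rank)
    return total / len(first_click_ranks)


mrr = mean_reciprocal_rank([1, 2, None, 4])
```

A higher value means the result users wanted tended to sit closer to the top, which is why the metric suits navigational searches.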
24. Take It to the Next Level: Improve Search Relevance
● Learn from users' click statistics to automatically generate a relevancy
model
● Rerank search results using aggregated click signal weights, boosting the
results that users click most for a given query
○ Stream user events into the Solr/Fusion cluster
○ Spark Jobs to aggregate click data
○ Use output from the aggregated signal to boost the search results
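The signal pipeline above can be illustrated in miniature. This sketch stands in for the Spark aggregation job and the Fusion boosting stage with plain Python; the log-based boost formula, the `weight` parameter, and the function names are assumptions chosen for the example, not what Fusion does internally.

```python
import math
from collections import Counter


def aggregate_clicks(events):
    """Count clicks per (normalized query, doc_id) pair."""
    return Counter((query.strip().lower(), doc_id) for query, doc_id in events)


def rerank(query, results, click_counts, weight=1.0):
    """Reorder (doc_id, base_score) results with a click-based boost.

    The boost is weight * log(1 + clicks), so heavily clicked documents
    rise without completely drowning out the base relevance score.
    """
    normalized = query.strip().lower()
    boosted = [
        (doc_id, score + weight * math.log1p(click_counts[(normalized, doc_id)]))
        for doc_id, score in results
    ]
    boosted.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in boosted]


events = [("cats", "p2"), ("cats", "p2"), ("Cats ", "p2"), ("cats", "p1")]
clicks = aggregate_clicks(events)
order = rerank("cats", [("p1", 1.0), ("p2", 0.5), ("p3", 0.9)], clicks)
```

Here `p2` starts with the lowest base score but moves to the top because it collected the most clicks for the query.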
25. Result: Post search relevance using signals
7.5% increase in CTR
12.5% increase in MRR
28. Head-Tail Analysis
A tail query like “lot of credit card debit” would be rewritten to produce more relevant results.
29. Trending Searches
● Reddit can attribute week-over-week DAU
growth to external events, like game
releases, movie releases, and cultural
events (reference).
● We see similar upticks in searches based on
these events (reference).
● We believe that we can increase search
engagement and time on site by leveraging
these signals to highlight trending queries
to users when they search on Reddit.
30. NSFW Categorization
● Develop NSFW classification criteria
● Query-time, classification-based content filtering
● Results boosting/reordering based on classification (boost or filter results
based on knowing whether the query has NSFW intent)
● Look at the NSFW results in recall
● Look at the NSFW results people clicked
● Try open source TensorFlow libraries to auto-detect NSFW content that is not
marked NSFW
31. Related Searches
● Train a collaborative filtering matrix decomposition recommender using
SparkML's Alternating Least Squares (ALS) to batch compute query-query
similarities
● Related Searches backend based on Collaborative Filtering & Co Occurrence
Counting Algorithm via Temporal Proximity
● Collaborative filtering based recommender systems are a popular technique
applied for movie recommendations at Netflix, or product recommendations in
e-commerce sites like Amazon
32. Related Searches
● Dynamic temporal buckets as source of data.
● All pairs irrespective of number of distinct queries in Session
● Length & temporal distance metrics to help with boosting recommendation.
● Intuitive & easily explainable.
● Scales extremely well for building pluggable logic & adding more dimensions.
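A rough sketch of co-occurrence counting via temporal proximity follows. It pairs every two queries in a session and weights each pair by how close together they were issued. The decay formula, the 300-second cutoff, and the function names are assumptions for illustration, and the length metric mentioned above is left out.

```python
from collections import defaultdict
from itertools import combinations


def related_queries(sessions, max_gap_seconds=300):
    """Score query pairs by co-occurrence, weighted by temporal distance.

    sessions is a list of sessions, each a list of (timestamp_seconds, query).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        events = sorted(session)
        # All pairs, regardless of how many distinct queries the session has.
        for (t1, q1), (t2, q2) in combinations(events, 2):
            gap = t2 - t1
            if q1 == q2 or gap > max_gap_seconds:
                continue
            # Closer in time means a stronger relationship.
            weight = 1.0 / (1.0 + gap / 60.0)
            scores[q1][q2] += weight
            scores[q2][q1] += weight
    return scores


def top_related(scores, query, k=3):
    ranked = sorted(scores.get(query, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [related for related, _ in ranked[:k]]


sessions = [
    [(0, "ps5"), (30, "ps5 restock"), (600, "pizza")],
    [(0, "ps5"), (60, "ps5 price")],
]
scores = related_queries(sessions)
suggestions = top_related(scores, "ps5")
```

Because each score is just a sum of pair weights, it is easy to explain why a suggestion appears, and further dimensions can be added as extra weighting factors.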
35. What’s next
● Contextual Query Understanding
○ how context informs query understanding
● Understanding User Intent
○ classifying the query by its interpretation. The interpretation of the query can then be used to
define intent
● Query rewriting and scoping
○ query rewriting technique that improves precision by matching each query segment to the right
attribute
○ query tagging (special case of named-entity recognition (NER))
37. Reddit Search has an
interesting history...
History of Reddit Search
38. History of Reddit Search
● 2005 - Steve Huffman, cofounder and now CEO, implements postgres tsearch.
● 2006 - Chris Slowe, founding engineer and now CTO, implements pylucene.
○ “we fixed a bug in the search results ordering” - Steve Huffman ‘06
○ “I made a quick fix to search that I hope helps until we get a chance to really fix it.” - Steve ‘07
● 2008 - David King, first employee and former search engineer, implements Solr.
○ “[David]’s been fixing search and hacking mystery projects in Erlang.” - Alexis Ohanian ‘08
○ “I’ve totally replaced the reddit search function.” - David King ‘08
● 2010 - David King replaces Solr with IndexTank.
○ “We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt
before.” - David King ‘10
● 2012 - u/kemitche implements CloudSearch after LinkedIn shut down IndexTank
“Q: Where do you see reddit in 10 years? A: Reddit search might work by then.” - Steve AMA ‘16
39. Redditors told us how
much they loved
Search...
“Reddit Search is great!” - said no redditor ever
40. “This image should honestly replace the 503 error (all servers busy) page.” - u/seven0feleven
41. “Ever since they moved away from scotch tape, I've been able to get irrelevant results in record time.” - u/El_Bandito_Blanquito
42. In 2017, we set out to
rebuild search from the
ground up!
Rebuilding Search
43. Our First Cluster
● Create an AMI with Solr and Fusion packages installed
● Spin up servers with custom AMI
● SSH into each server
○ Install Fusion and Solr
○ Edit configuration files
○ Increase file descriptor limit
● Configured in AWS US West
44. Our First Cluster
Our new cluster was up
and running well! We
immediately started work
on ingesting data and
relevance tuning.
45. But we ran into a
couple of key issues
when trying to scale
up...
Challenge #1
46. Issues with Scaling our Solr Cluster
● Adding capacity to our cluster or changing instance types took a lot of
effort
● Adding capacity to our cluster meant that we needed to rebalance our
cluster so that our replicas were equally distributed across machines
○ Solr 7+ introduced some basic autoscaling features but lacked
policies to ensure a cluster was properly balanced
○ Rebalancing process was 100% manual
● Cross-region requests added unnecessary latency
● As a result, our team was very cautious in scaling our cluster until it
was absolutely needed, to reduce the number of times we scaled up
48. Terraform + Puppet
● Together they allow us to programmatically make changes to
infrastructure and server configuration quickly
● We can describe how we want servers to be set up
○ Install Java and Solr
○ Mount drives and add user groups/permissions
○ Set up Solr configuration files
● Modifications to servers and infra are reviewable and revertible
● Roll out changes across our fleet with ease
● “Can you add more servers Jerry??”
○ No problem! One line code change.
59. Solr Rebalancing Tool
● Applied balancing rules in order
○ Check each shard’s availability zone distribution and replica
distribution
○ Move replicas so that each collection’s replicas are spread across as
many machines as possible
○ Move replicas so that each machine holds as few replicas as
possible
● Outputs list of operations to be performed and confirms with user each
replica to move
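The planning half of such a tool might look roughly like this. It is a greedy sketch under simplifying assumptions: it only evens out replica counts per machine and avoids placing two copies of the same shard on one machine, while availability-zone checks and the per-move user confirmation are omitted. The function name and data layout are hypothetical.

```python
def plan_moves(placement):
    """Return a list of (replica, source_node, target_node) moves.

    placement maps node name -> list of (collection, shard) replicas.
    The input is not modified; the plan is computed on a copy.
    """
    nodes = {name: list(replicas) for name, replicas in placement.items()}
    moves = []
    while True:
        # Most and least loaded nodes; names break ties deterministically.
        src = max(nodes, key=lambda n: (len(nodes[n]), n))
        dst = min(nodes, key=lambda n: (len(nodes[n]), n))
        if len(nodes[src]) - len(nodes[dst]) <= 1:
            break  # Already as balanced as replica counts allow.
        # Never put two replicas of the same shard on one machine.
        candidate = next((r for r in nodes[src] if r not in nodes[dst]), None)
        if candidate is None:
            break
        nodes[src].remove(candidate)
        nodes[dst].append(candidate)
        moves.append((candidate, src, dst))
    return moves


placement = {
    "node-a": [("posts", "s1"), ("posts", "s2"), ("posts", "s3"), ("posts", "s4")],
    "node-b": [("posts", "s1")],
    "node-c": [],
}
for replica, src, dst in plan_moves(placement):
    print(f"move {replica} from {src} to {dst}")
```

Emitting a plan rather than acting immediately mirrors the behavior described above: the operator can review each proposed move before anything changes in the cluster.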
63. Our cluster was now
scaling easily, but
reindexing all of our
data took many
weeks...
Challenge #2
64. Indexing Data for Search
● Backfills
○ Pulls data from our datasource
○ Transforms it into the schema we need for indexing
○ Used to add/remove/change field indexing
● Streaming
○ Captures real-time updates so up-to-date information can be
reflected in our indices
○ Transforms data the same way as backfills
65. Why are fast backfills important?
● Quickly iterate on document schemas
● Test new ways to analyze document fields
● Create multiple clusters of the same data for testing
● Fix data issues rapidly
67. Hive
● Pulled data from postgres with sqoop into Hive
● A series of transformations to
○ Join thing and data tables
○ Rotate the keys into columns
○ Store the final result as Parquet in S3
● Fusion/Spark fetched S3 files and indexed data into Solr
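The "join thing and data tables, then rotate the keys into columns" step is essentially a pivot of key/value rows. A toy version in plain Python is shown below; the column names (`thing_id`, `key`, `value`, `ups`) are assumptions for illustration, and the real pipeline ran as Hive transformations rather than in-memory code.

```python
def pivot_things(thing_rows, data_rows):
    """Join thing rows with their key/value data rows into flat documents."""
    docs = {row["thing_id"]: dict(row) for row in thing_rows}
    for row in data_rows:
        doc = docs.get(row["thing_id"])
        if doc is not None:
            # Each key becomes a column on the flattened document.
            doc[row["key"]] = row["value"]
    return list(docs.values())


thing_rows = [{"thing_id": 1, "ups": 10}]
data_rows = [
    {"thing_id": 1, "key": "title", "value": "Hello"},
    {"thing_id": 1, "key": "selftext", "value": "World"},
]
documents = pivot_things(thing_rows, data_rows)
```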
69. Issues with v1
● Several weeks to transform data
○ Afraid of changing the schema
● Many stages of transformation, making it hard to debug and to pinpoint
how far upstream a data transformation issue originated
○ Hard to ensure the end result was correct
70. Thing Service
● Search Service as the transformer and indexer of data
○ Fetches the latest data from the Thing Service
● Special logic in Thing Service made it easier to handle postgres data
○ Score of links, comments
○ Converting to actual data types (booleans, fullnames)
● Cut backfill time from multiple weeks to a single week with
parallelization
72. Issues with v2
● Reliant upon a shared production service for what should be an offline
job
○ Our backfills pushed the Thing Service too hard,
affecting other services that rely upon it
● Other initiatives highlighted how slow our ingestion could get
○ HVTs (augmenting links with high value tokens from comments)
○ Attempts to index comment data
73. Spark
● Running our own postgres replicas from wal-e backups in S3
● Spark pulls data directly from postgres and transforms the data
● Can horizontally scale ingestion to be faster
○ Postgres replicas to speed up ingestion of data into Spark
○ Spark to speed up transformation and joining of data
● We can adjust ingestion parallelism by repartitioning at the end
● Cut backfill time significantly from multiple weeks to days
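One way to picture how reads from Postgres can be parallelized is to split an ID range into contiguous slices, one per reader. The helper below is a hypothetical sketch of that idea, not the pipeline described in the talk; the column name and bounds are made up. PySpark's `DataFrameReader.jdbc` accepts a `predicates` list of this shape, with each predicate becoming one partition.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into contiguous WHERE predicates for parallel reads."""
    # Ceiling division so the slices cover the whole range.
    stride = max(1, -(-(upper - lower) // num_partitions))
    predicates = []
    start = lower
    while start < upper:
        end = min(start + stride, upper)
        predicates.append(f"{column} >= {start} AND {column} < {end}")
        start = end
    return predicates


predicates = partition_predicates("thing_id", 0, 10, 4)
```

More slices and more replicas mean more concurrent readers, which is where the horizontal scaling of ingestion comes from.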
76. Redditors Issue Expensive Queries
● High Recall Queries
○ the, would, you, ifs, news, games
● Crazy Queries
○ (AFD+OR+CDU+OR+CSU+OR+FDP+OR+Grünen+OR+SPD+OR+"
Die+Linke"+OR+Energiepolitik+OR+Gesetze~+OR+Kabinetts~+O
R+Regierungs~+OR+Referentenentwurf)+(Energiehandel~+OR+E
nergiemanagement~+OR+Energiepreis~+OR+Energiesteuer~)
● These queries would take multiple seconds to complete, blocking a
significant number of CPU cores in the cluster
77. Cutting Queries Off
● Utilize timeAllowed in solrconfig.xml to prevent expensive queries
from taking up all of your cluster’s resources
○ NOTE: timeAllowed is not a hard cutoff. From the Solr docs:
○ As this check is periodically performed, the actual time for which a
request can be processed before it is aborted would be marginally
greater than or equal to the value of timeAllowed. If the request
consumes more time in other stages, e.g., custom components,
etc., this parameter is not expected to abort the request.
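Besides being set as a default in solrconfig.xml, timeAllowed (in milliseconds) can be passed per request. The sketch below builds such a request URL and checks the response header for the `partialResults` flag that Solr sets when the limit was hit. The host, collection name, and helper names are assumptions, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, time_allowed_ms=500):
    """Build a Solr /select URL with a timeAllowed budget in milliseconds."""
    params = {"q": query, "timeAllowed": time_allowed_ms, "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def is_partial(response_body):
    """True if Solr reported that the query was cut off early."""
    header = json.loads(response_body).get("responseHeader", {})
    return bool(header.get("partialResults", False))


url = build_select_url("http://localhost:8983/solr", "posts", "news", 250)
sample = '{"responseHeader": {"status": 0, "partialResults": true}, "response": {"docs": []}}'
cut_off = is_partial(sample)
```

Checking the flag lets a caller decide whether to show the truncated results or fall back to a cheaper query.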
80. Multi-Cluster Solr Environment
● One cluster per collection
● Hardware Isolation: one collection’s issues won’t affect other
collections
● Scale each collection independently
● Balancing becomes really simple
○ Each machine has an equal number of replicas
○ Ensure AZ and shard awareness
81. Solr 7.5 Autoscaling
● Solr 7.5 includes new policies that allow us to equally distribute
replicas by
○ Arbitrary properties
○ Collection
○ Cluster
● Turns Solr scaling into a one-step process