Workday is a leading provider of cloud-based enterprise software products such as Human Capital Management, Talent, Finance, Student, and Planning. These products produce a wealth of natural language data. However, this data is unstructured and denormalized, and retrieving relevant information from it is challenging; simple index-based search methods can only take us so far. The Data Science team at Workday is determined to apply Machine Learning and AI to make search better across Workday's products.
In this session, we present how we use word embeddings to normalize the data and add structure to it. We will also talk about using word representations to make search intelligent. The specific use cases we will discuss are synonym detection and entity recommendation.
In this talk, we will focus on the word-embedding techniques explored, the metrics used to evaluate Natural Language Processing models, the tools built, and future work, all as part of improving search.
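To make the embedding idea concrete, here is a minimal synonym-detection sketch with gensim's word2vec; the toy corpus and all names are invented for illustration and are not Workday's pipeline:

```python
# A minimal sketch of embedding-based synonym detection with gensim.
# The toy corpus below is invented; it is not Workday's data or pipeline.
from gensim.models import Word2Vec

sentences = [
    ["employee", "salary", "compensation", "review"],
    ["worker", "pay", "compensation", "bonus"],
    ["student", "enrollment", "course", "grade"],
] * 200  # repeat so the toy corpus has enough co-occurrence signal

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=42)

# Nearest neighbours in the embedding space become synonym candidates:
print(model.wv.most_similar("compensation", topn=3))
```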
Speakers
Namrata Ghadi, Workday Inc., Software Development Engineer (Data Science)
Adam Baker, Workday Inc., Senior Software Engineer
During the OpenStack Tokyo Summit we provided an overview of how Workday started the production deployment with a very robust and efficient CI/CD process, which is explained here.
Workday has built one of the largest OpenStack-based private clouds in the world, hosting a workload of over a million physical cores on over 16,000 compute nodes in 5 data centers for over ten years. However, there was a growing need for a newer, more maintainable deployment model that would closely follow the upstream community. We would like to share our new architecture and deployment approach as well as lessons learned from our experience.
We've converted many of our technologies in the process:
Migrating from Mitaka to Victoria
Converting from OpenContrail to pure L3 Calico with BGP on the host
Deploying with Chef to deploying with Ansible
Building home-grown container images to Kolla
Monitoring with Sensu and Wavefront to Prometheus and Grafana
CI/CD in Jenkins to Zuul
CentOS 7 to CentOS 8 Stream
We'll also talk about some internal tools we wrote that, while Workday-specific, may inspire you to see what value you can add for your own customers.
Computing the Square Roots of Unity to break RSA using Quantum Algorithms - Dharmalingam Ganesan
We study the problem of finding the square roots of unity in a finite group in order to factor the composite numbers used in RSA. We implemented Peter Shor's algorithm to find the square roots of unity. Experimental results showed that finding the square roots of unity in a finite multiplicative group is "hard".
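For context on why these roots matter for factoring (standard number theory, not the authors' code), a nontrivial square root of unity modulo N yields a factor of N directly:

```python
from math import gcd

# If x is a nontrivial square root of unity mod N (x^2 ≡ 1, x ≢ ±1),
# then N divides (x - 1)(x + 1) without dividing either factor alone,
# so gcd(x - 1, N) is a proper factor of N.
def factor_from_root_of_unity(x, N):
    if x % N in (1, N - 1):
        raise ValueError("trivial root of unity; no factor recovered")
    f = gcd(x - 1, N)
    return f, N // f

# Example: 4^2 = 16 ≡ 1 (mod 15), and gcd(3, 15) = 3 factors N = 15.
print(factor_from_root_of_unity(4, 15))  # -> (3, 5)
```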
Dok Talks #111 - Scheduled Scaling with Dask and Argo Workflows - DoKC
https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
Complex computational workloads in Python are a common sight these days, especially in the context of processing large and complex datasets. Battle-hardened modules such as Numpy, Pandas, and Scikit-Learn can perform low-level tasks, while tools like Dask make it easy to parallelize these workloads across distributed computational environments. Meanwhile, Argo Workflows offers a Kubernetes-native solution for provisioning cloud resources and triggering workflows on a regular schedule. Being Kubernetes-native, Argo Workflows also meshes nicely with other Kubernetes tools. This talk discusses the combination of these two worlds by showcasing a set-up for Argo-managed workflows that schedule and automatically scale out Dask-powered data pipelines in Python.
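As a rough sketch of the Dask half of this pairing (assuming the dask-kubernetes operator is installed; the cluster name, image, and data path are invented, and this is not the speaker's actual set-up):

```python
# A minimal sketch of an adaptively scaling Dask cluster on Kubernetes;
# cluster name, image, and the parquet path are illustrative assumptions.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client
import dask.dataframe as dd

cluster = KubeCluster(name="pipeline", image="ghcr.io/dask/dask:latest")
cluster.adapt(minimum=1, maximum=20)  # scale workers out and in with load
client = Client(cluster)

df = dd.read_parquet("s3://example-bucket/measurements/")  # hypothetical path
print(df.groupby("device_id").value.mean().compute())
```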
BIO
A former academic in the field of renewable energy simulation and energy systems analysis, currently responsible for architecting and maintaining the cloud and data strategy at ACCURE Battery Intelligence.
KEY TAKE-AWAYS FROM THE TALK
Argo Workflows + Dask is a nice combination for data-processing pipelines. There are a few "gotchas" to be on the lookout for, but this is nevertheless a generally applicable and powerful combination.
https://github.com/sevberg
The top 3 challenges running multi-tenant Flink at scale - Flink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Video and slides synchronized, mp3 and slide download available at http://bit.ly/28XnVtb.
Felix Klock describes the core concepts of the Rust language (ownership, borrowing, and lifetimes), as well as the tools beyond the compiler for open source software component distribution (cargo, crates.io). Filmed at qconlondon.com.
Felix Klock is a research engineer at Mozilla, where he works on the Rust compiler, runtime libraries, and language design. He previously worked on the ActionScript Virtual Machine for the Adobe Flash runtime. Klock is one of the developers of the Larceny Scheme language runtime.
SyScan Singapore 2010 - Returning Into The PHP-Interpreter - Stefan Esser
Among web application security experts there is a popular belief that low-level vulnerabilities like buffer overflows and other kinds of memory corruption do not matter for web application security. In addition, the increasing use of exploit mitigation techniques on modern web servers makes many believe that exploiting remote memory corruption in webserver software is a thing of the past. But is it really?
This talk will introduce the idea of returning into the PHP interpreter from memory corruption vulnerabilities and discuss the requirements and feasibility of different ways to do that. This idea will then be applied to a yet undisclosed PHP vulnerability, which is exposed to remote attackers in several widespread PHP applications. Different aspects of this vulnerability will be analyzed and it will be explained how they can be abused in remote information leak and memory corruption exploits. The creation of such a remote code execution exploit will then be detailed step by step.
Alphorm.com Training - Elastic: Mastering the Fundamentals - Alphorm
Searching for information in logs has always been time-consuming, both for people and for machines: connecting to the server, locating the file, choosing the right tool, recalling the syntax, running the command, and so on.
Elastic, the company behind the Elasticsearch search engine, now publishes a product stack dedicated to log processing, summed up by the motto "All the answers to your questions are in your logs!".
This introductory course aims to teach you how to deploy the Elastic monitoring stack and how to understand and configure its components (Beats, Logstash, and Kibana).
The Elastic Stack, which today consists of Elasticsearch, Logstash, Kibana, APM, and Beats, is mainly used to build search engines, but also to aggregate and manipulate log data.
In this Elastic Stack course, we will cover all the features needed to put a complete monitoring solution in place.
Strengths of the course
- 80% hands-on training.
- Practical training that gives you skills you can apply in the field.
- Training designed around market needs.
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh and Thamme Gowda - Spark Summit
A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. In this talk, Karanjeet Singh and Thamme Gowda will describe a new crawler called Sparkler (a contraction of Spark-Crawler) that makes use of recent advancements in the distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, and high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster.
https://github.com/USCDataScience/sparkler
Change data capture with MongoDB and Kafka - Dan Harvey
In any modern web platform you end up with a need to store different views of your data in many different datastores. I will cover how we have coped with doing this in a reliable way at State.com across a range of different languages, tools and datastores.
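As a rough sketch of the change-data-capture pattern in play here (not State.com's implementation; the hosts, database, and topic names are invented), MongoDB change streams can be forwarded to Kafka like this:

```python
# A rough CDC sketch: tail a MongoDB change stream and forward to Kafka.
# Hosts, database, collection, and topic names are illustrative.
import json
from pymongo import MongoClient
from kafka import KafkaProducer

client = MongoClient("mongodb://localhost:27017")  # requires a replica set
producer = KafkaProducer(
    value_serializer=lambda v: json.dumps(v, default=str).encode())

with client.appdb.users.watch() as stream:  # change streams: MongoDB 3.6+
    for change in stream:
        producer.send("users-changes", {
            "op": change["operationType"],
            "key": str(change["documentKey"]["_id"]),
            "doc": change.get("fullDocument"),
        })
```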
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ... - Spark Summit
Apache Spark 2.1.0 boosted the performance of Apache Spark SQL thanks to Project Tungsten software improvements. A further 16x speedup has been achieved by using Oracle's innovations for Apache Spark SQL, made possible by Oracle's Software in Silicon accelerator offload technologies.
Apache Spark SQL in-memory performance is becoming more important for many reasons. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition, on-prem and cloud servers are getting larger physical memory, enabling these huge workloads to be stored in memory. In this talk we will look at using Spark SQL for feature creation and feature generation within pipelines for Spark ML.
This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestions to support these kinds of workloads on real applications in cloud deployments. Ideas for the next-generation Tungsten project will also be discussed.
Geospatial data appears to be simple right up until the point when it becomes intractable. There are many gotcha moments with geospatial data in Spark, and we will break those down in our talk. Users who are new to geospatial analysis in Spark will find this portion useful, as projections, geometry types, indices, and geometry storage can all cause issues.
Tree-like data relationships are common, but working with trees in SQL usually requires awkward recursive queries. This talk describes alternative solutions in SQL, including:
- Adjacency List
- Path Enumeration
- Nested Sets
- Closure Table
Code examples will show using these designs in PHP, and offer guidelines for choosing one design over another.
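As a small illustration of one of these designs (in Python with sqlite3 rather than the talk's PHP; the schema and rows are invented), a Closure Table stores every ancestor/descendant pair so a subtree can be fetched without recursion:

```python
# Minimal Closure Table sketch; table and column names are invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
-- one row per ancestor/descendant pair, including each node's self-pair
CREATE TABLE category_paths (ancestor INT, descendant INT, depth INT);
INSERT INTO category VALUES (1, 'root'), (2, 'child'), (3, 'grandchild');
INSERT INTO category_paths VALUES
  (1, 1, 0), (2, 2, 0), (3, 3, 0),  -- self paths
  (1, 2, 1), (2, 3, 1), (1, 3, 2);  -- ancestor links
""")

# Fetch the whole subtree under 'root' without any recursive query:
rows = db.execute("""
  SELECT c.name, p.depth FROM category c
  JOIN category_paths p ON p.descendant = c.id
  WHERE p.ancestor = 1 ORDER BY p.depth
""").fetchall()
print(rows)  # [('root', 0), ('child', 1), ('grandchild', 2)]
```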
Neha Narkhede talks about the experience at LinkedIn moving from batch-oriented ETL to real-time streams using Apache Kafka, and how the design and implementation of Kafka was driven by this goal of acting as a real-time platform for event data. She covers some of the challenges of scaling Kafka to hundreds of billions of events per day at LinkedIn, supporting thousands of engineers, etc.
Nowadays REST APIs are behind every mobile app and nearly all web applications. As such, they bring a wide range of possibilities for communication and integration with a given system. But with great power comes great responsibility. This talk aims to provide general guidance related to API security assessment and covers common API vulnerabilities. We will look at an API interface from the perspective of a potential attacker.
I will show:
how to find hidden API interfaces
ways to detect available methods and parameters
fuzzing and pentesting techniques for API calls
typical problems
I will share several interesting cases from public bug bounty reports and personal experience, for example:
* how I got various credentials with one API call
* how to cause a DoS by running the Garbage Collector from the API
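As a flavour of the fuzzing techniques mentioned above, a naive parameter-fuzzing sketch (the URL and payloads are invented, and real testing requires the owner's authorization):

```python
# Naive API parameter fuzzing; endpoint and payloads are illustrative.
import requests

payloads = ["'", "<script>alert(1)</script>", "../../etc/passwd", "A" * 10000]
for p in payloads:
    r = requests.get("https://api.example.com/v1/users",
                     params={"q": p}, timeout=5)
    if r.status_code >= 500:  # server errors often hint at unhandled input
        print(f"potential issue: payload={p!r} -> HTTP {r.status_code}")
```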
Querying Elasticsearch with Deep Learning to Answer Natural Language Questions - Sebastian Blank
Natural language is gaining more and more relevance as an interface between man and machine. Already today, we are able to carry out simple tasks by talking to our smartphone or a smart speaker like Google Home or Alexa. An important challenge for any kind of dialog agent or chatbot is to include external knowledge in the conversation with the user. Therefore, such systems need to be able to interact with resources like relational databases or unstructured resources like search engines. However, the complexity of natural language makes it hard to capture diverse utterances with a set of pre-defined rules. Instead, we present an approach that leverages Deep Learning to learn how to query Elasticsearch given natural language questions. As our model learns to follow the inherent logic of querying, it is even possible to switch to other systems and query languages. This carries great potential for future applications of Elasticsearch and related NoSQL solutions.
[This is work presented at SIGMOD'13.]
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.
PHP is the most commonly used server-side programming language, deployed on more than 80% of web servers all over the world. However, PHP is a 'grown' language rather than a deliberately engineered one, making writing insecure PHP applications far too easy and common. If you want to use PHP securely, then you should be aware of all its pitfalls.
Dice.com Bay Area Search - Beyond Learning to Rank Talk - Simon Hughes
This talk describes how to implement conceptual search (semantic search) within a modern search engine using the word2vec algorithm to learn concepts. We also cover how to auto-tune the search engine parameters using black box optimization techniques, and the problems of feedback loops encountered when building machine learning systems that modify the user behavior used to train the system.
Vectors in Search - Towards More Semantic Matching - Simon Hughes
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation of a document that can be used to implement semantic / conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, such as learning sparse representations of vectors, clustering, and learning binary vectors. Finally, I will discuss some of the pitfalls of vector-based search, and how to get the best of both worlds by combining vector-based scoring with traditional relevancy metrics such as BM25.
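To make one of these indexing ideas concrete, here is a generic sketch of random-hyperplane hashing (not necessarily the exact variant covered in the talk): dense vectors become short binary codes that an inverted index can treat as tokens:

```python
# Random-hyperplane LSH sketch: dense vectors -> token-friendly binary codes.
import numpy as np

rng = np.random.default_rng(0)
planes = rng.normal(size=(16, 300))  # 16 bits for 300-dimensional vectors

def lsh_code(v):
    bits = (planes @ v > 0).astype(int)  # sign pattern across hyperplanes
    return "".join(map(str, bits))       # indexable binary signature

doc_vec = rng.normal(size=300)
near_dup = doc_vec + rng.normal(scale=0.01, size=300)
# Near-duplicate vectors usually share most (often all) bits:
print(lsh_code(doc_vec), lsh_code(near_dup))
```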
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w... - Dr. Haxel Consult
In 2013 we witnessed an evolutionary change in the NLP field thanks to the introduction of vector-space embeddings, which, combined with deep learning architectures, achieved human-level performance on many NLP tasks. With the introduction of the Attention mechanism in 2017 the results were further improved, and as a result embeddings are quickly becoming the de facto standard for solving many NLP problems. In this presentation, you will learn how to generate and use embeddings for search purposes, with comparison metrics against more traditional relevance-based search engines. Moreover, I will provide some initial results from a paper currently under review that offers insight into hyperparameter tuning during the generation of embeddings.
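As a minimal sketch of the embedding-based retrieval being compared here (the model name and corpus are illustrative, not the presenter's setup):

```python
# Embedding-based retrieval sketch; model choice and corpus are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["patent landscaping report", "battery degradation study",
        "clinical trial registry entry"]
doc_emb = model.encode(docs, convert_to_tensor=True)

query_emb = model.encode("prior-art search", convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]  # cosine similarity per doc
best = int(scores.argmax())
print(docs[best], float(scores[best]))
```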
Relations play a vital role in knowledge construction and its maintenance. They connect domain-type entities to range-type entities; for example, the relation "born in" connects Persons to Places. Over any dataset, domain-range information is used to maintain data consistency. We therefore see knowledge construction frameworks sometimes engage costly Knowledge Engineers to define domain-range information in the form of a schema or an ontology. We also see that frameworks that hold such defined domain-range information often do not follow it strictly; in the worst case, some frameworks do not even allow a domain-range to be defined and simply gather the knowledge entries. One reason for not defining domain-range information is that it is costly. The reason for not enforcing domain-range constraints, on the other hand, is that most approaches are either manual or semi-automatic and therefore hard to adapt. In this research, we propose a relation-wise machine learning model that can define and validate domain-range information automatically. Initial experiments show that the proposed framework performs promisingly.
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
In the talk I describe two approaches for improving the recall and precision of an enterprise search engine using machine learning techniques. The main focus is improving relevancy with ML while using your existing search stack, be that Lucene, Solr, Elasticsearch, Endeca or something else.
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender or other prediction-based systems. Typically, most of them architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limits during the query, as well as the presence of some inefficiencies in the system due to the decoupling of retrieval and ranking.
To address this issue the authors created ML-Scoring, an open source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information retrieval ranking function with a custom supervised model that is trained through Spark, Weka, or R that is loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but it will also walk through practical examples from loading a dataset into Elasticsearch to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
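A stripped-down sketch of the two-phase pattern described above (the index, field names, and stand-in model are invented; ML-Scoring itself moves Phase II inside the engine as a plugin):

```python
# Two-phase retrieve-then-rerank sketch; index/fields/model are invented.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Phase I: the search engine returns the top-k candidates.
hits = es.search(index="items", size=100,
                 query={"match": {"title": "data engineer"}})["hits"]["hits"]

# Phase II: re-rank candidates with a trained model (stubbed out here).
def model_score(doc):
    return len(doc.get("skills", []))  # stand-in for a supervised model

reranked = sorted(hits, key=lambda h: model_score(h["_source"]), reverse=True)
print([h["_id"] for h in reranked[:10]])
```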
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... - S. Diana Hu
Search engines have focused on solving the document retrieval problem, so their scoring functions do not naturally handle non-traditional IR data types such as numerical or categorical values. Therefore, in domains beyond traditional search, scores representing strengths of associations or matches may vary widely. As such, the original model doesn't suffice, so relevance ranking is performed as a two-phase approach: 1) regular search, 2) an external model to re-rank the filtered items. Metrics such as click-through and conversion rates are associated with the users' response to the items served. The predicted selection rates that arise in real time can be critical for optimal matching. For example, in recommender systems, the predicted performance of a recommended item in a given context, also called response prediction, is often used in determining a set of recommendations to serve for a given serving opportunity. Similar techniques are used in the advertising domain. To address this issue the authors have created ML-Scoring, an open source framework that tightly integrates machine learning models into a popular search engine (Solr/Elasticsearch), replacing the default IR-based ranking function. A custom model is trained through either Weka or Spark and loaded as a plugin used at query time to compute custom scores.
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016)
Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco... - Aman Grover
Modern-day social media search and recommender systems require complex query formulation that incorporates both user context and explicit search queries. Users expect these systems to be fast and to provide relevant results for their query and context. With millions of documents to choose from, these systems utilize a multi-pass scoring function to narrow the results and provide the most relevant ones to users. Candidate selection is required to sift through all the documents in the index and select a relevant few to be ranked by subsequent scoring functions. It becomes crucial to narrow down the document set while keeping relevant documents in the resulting set. In this tutorial we survey various candidate selection techniques and deep dive into case studies on a large-scale social media platform. In the latter half we provide a hands-on tutorial where we explore building these candidate selection models on a real-world dataset and see how to balance the trade-off between relevance and latency.
GITHUB : https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial
An overview of some core concepts in natural language processing, some example (experimental for now!) use cases, and a brief survey of some tools I have explored.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. The labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
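A sketch of the kind of scikit-learn exercise the workshop walks through (the dataset and model choice here are illustrative, not necessarily the workshop's exact labs):

```python
# Train a classifier on a popular dataset and evaluate it with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```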
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
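For a feel of the Phoenix side, here is a rough sketch using the phoenixdb driver against the Phoenix Query Server; the table and column names are invented stand-ins for the crime tables described above:

```python
# Upsert into and query a Phoenix table via the Phoenix Query Server;
# table and column names are illustrative, not the talk's actual schema.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()
cur.execute("""UPSERT INTO CRIME (EVENT_ID, OFFENSE, TS)
               VALUES (?, ?, CURRENT_TIME())""", ("e-1001", "theft"))
cur.execute("SELECT OFFENSE, COUNT(*) FROM CRIME GROUP BY OFFENSE")
print(cur.fetchall())
```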
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
While HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Since it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of newer data. Organizing data for efficient reading involves factoring in query patterns to partition the data so that read amplification stays low. Organizing data for efficient writing involves factoring in the nature of the input data: whether it is append-only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, which duplicates data and breaks data correctness and user queries. This component is key to scaling our jobs: we now handle more than 500 billion writes a day in our current ingestion systems. It needs strong consistency and must provide high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, and it is critical in allowing us to scale our jobs to the more than 500 billion writes a day handled in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how it helps scale our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system into production at the scale of data Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix - DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi - DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... - DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
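The "few lines of code" tracking pattern looks roughly like this (the parameter and metric names are invented for illustration):

```python
# Minimal MLflow tracking sketch; names and values are illustrative.
import mlflow

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.93)
    mlflow.log_artifact("confusion_matrix.png")  # any existing local file
```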
Extending Twitter's Data Platform to Google Cloud - DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support Data Analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi - DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies face is securing data across hybrid environments while having an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud, and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, discuss its implications for the broader Consumer Goods industry, and share the business drivers, use cases and benefits that are unfolding as an integral component of the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years, producing human-level or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, and how to collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways a retail store of the near future could operate: a deep learning system attached to a camera stream can identify various storefront situations, such as item stocks on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
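As a rough illustration of "a deep learning system attached to a camera stream," here is a hedged Python sketch pairing OpenCV with an off-the-shelf torchvision detector; the pretrained COCO model and camera index stand in for a purpose-built retail model and feed.

```python
# Sketch only: a generic detector on a camera stream (assumes torchvision >= 0.13).
import cv2
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
cap = cv2.VideoCapture(0)  # placeholder for an in-store camera feed

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        det = model([to_tensor(rgb)])[0]
    # Downstream logic would map boxes/labels to shelf state, e.g. flag
    # low stock, a disorganized shelf, or a customer dwelling in an aisle.
    for box, score in zip(det["boxes"], det["scores"]):
        if score > 0.8:
            x1, y1, x2, y2 = box.int().tolist()
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) == ord("q"):
        break
cap.release()
```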
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the full inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, along with the key use cases, techniques and considerations that leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1,000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
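The following is not SpaRC itself, but a toy PySpark sketch of the core idea: index reads by the k-mers they contain, so reads sharing k-mers (and hence likely sharing a molecule of origin) can be grouped for independent downstream assembly. The reads and the k-mer length are illustrative.

```python
# Toy sketch of k-mer-based read grouping; not the SpaRC implementation.
from pyspark.sql import SparkSession

K = 21  # typical k-mer length for short reads (assumption)

spark = SparkSession.builder.appName("read-clustering-sketch").getOrCreate()
reads = spark.sparkContext.parallelize([
    ("read1", "ACGTACGTACGTACGTACGTACGTACGT"),
    ("read2", "CGTACGTACGTACGTACGTACGTACGTA"),
    ("read3", "TTTTGGGGCCCCAAAATTTTGGGGCCCC"),
])

def kmers(read):
    rid, seq = read
    return [(seq[i:i + K], rid) for i in range(len(seq) - K + 1)]

# Invert the index: k-mer -> reads containing it. Reads co-occurring under
# a k-mer are cluster candidates; a real pipeline would then run connected
# components over the resulting read-read graph.
kmer_index = reads.flatMap(kmers).groupByKey().mapValues(set)
print(kmer_index.values().filter(lambda rids: len(rids) > 1).collect())
```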
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy," how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we will discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, the aspects they look for in a new TV, and their TV buying preferences.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at every stage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
3. 1. User data is very noisy
• Abbreviations and misspellings are common
• Synonyms are common, e.g. “word vectors” vs. “word embeddings”
2. Classification
• A search for “data science” should also return docs mentioning “deep learning” or R
3. Recommendation
• “deep” → “deep learning”
• “nlp” → “machine learning”
Problems looking for NLP
4. • Why Word Vectors
• Word Vectors Explained
• Word Representation Use Cases
• Model Evaluation
Agenda
5. • Captures semantics: Athens − Greece ≈ Oslo − Norway
• Captures syntax: dollars − dollar ≈ mice − mouse
• e.g. dollars − dollar + mouse ≈ mice
Why Might Word Vectors Help?
6. • Similar terms point in similar direction: cosine similarity
• Rank suggestions, given existing profile
• Find similar terms in taxonomy
• Cluster for folksonomy, synonyms
• Cheap to train
• No labelled data needed
• Very efficient training algorithms
Why Might Word Vectors Help?
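A minimal sketch of the cosine-similarity ranking described on this slide, with random vectors standing in for trained embeddings:

```python
# Rank vocabulary terms against a query term by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data science", "machine learning", "deep learning", "payroll"]
vectors = {w: rng.normal(size=100) for w in vocab}  # stand-ins for trained vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vectors["data science"]
print(sorted(vocab, key=lambda w: cosine(query, vectors[w]), reverse=True))
```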
7. Steps for using fastText:
1. Get a corpus (news feeds, web scrapes, social media posts, Wikipedia)
2. Minimize the difference between predictions and corpus
3. Tune hyperparameters
Where Do Word Vectors Come From?
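A hedged sketch of these steps using gensim's FastText implementation; the corpus file name and hyperparameter values are placeholders that step 3 would tune.

```python
# Train skip-gram-with-negative-sampling fastText vectors (gensim flavor).
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

corpus = LineSentence("corpus.txt")  # placeholder: one preprocessed doc per line
model = FastText(
    corpus,
    vector_size=100,  # dimensionality
    window=5,         # context window
    min_count=5,
    sg=1,             # skip-gram
    negative=5,       # negative samples (SGNS)
    epochs=5,
)
model.save("fasttext.model")
print(model.wv.most_similar("data"))
```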
9. SGNS in fastText in Detail
• Let v_w be the canonical vector for word w and c_w be the context vector for word w
• σ(x) = 1 / (1 + e^(−x))
• p(w | f) = σ(v_f · c_w)
• Maximize p(c | f) for c in the context of focus word f
• Minimize p(n | f) for n randomly sampled from the vocabulary [1, 2]
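A toy numpy rendering of one SGNS update under the definitions above; real fastText additionally sums character n-gram vectors into v_f, which this sketch omits.

```python
# One skip-gram-negative-sampling step: pull contexts toward the focus
# word, push randomly sampled negatives away.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size, lr = 50, 1000, 0.05
V = rng.normal(scale=0.1, size=(vocab_size, dim))  # canonical vectors v_w
C = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors  c_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(f, context, negatives):
    for w, label in [(c, 1.0) for c in context] + [(n, 0.0) for n in negatives]:
        p = sigmoid(V[f] @ C[w])  # p(w | f) = sigma(v_f . c_w)
        g = lr * (label - p)      # gradient of the log-likelihood
        vf_old = V[f].copy()
        V[f] += g * C[w]
        C[w] += g * vf_old

sgns_step(f=3, context=[7, 12], negatives=rng.integers(0, vocab_size, size=5))
```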
10. • “You shall know a word by the company it keeps” - J.R. Firth
• Context window
• Narrow window → functional, syntactic vectors
• Wide window → topical, semantic vectors
• Dimensionality
• Character n-gram model for OOV terms
• Phrases
• “data science” → “data_science”
• Compositional vectors (e.g. [2, 3])
The Art of Training Word Vectors
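A small sketch of the phrase step using gensim's Phrases, so that "data science" becomes the single token "data_science" before training; the toy sentences and thresholds are illustrative.

```python
# Join frequently co-occurring word pairs into single phrase tokens.
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["we", "apply", "data", "science", "to", "search"],
    ["data", "science", "and", "machine", "learning"],
    ["machine", "learning", "improves", "ranking"],
]
bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
print(bigram[sentences[0]])  # ['we', 'apply', 'data_science', 'to', 'search']
```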
11. • Search at the core
• Talent: Candidate search and assignment
• HCM: Job Title, Job Qualification
• Having clean entities is paramount to the success of Workday’s products
Why improve search?
14. • 7 different data sources
• Rechunking
• Bag-of-sentences
• Task-specific “intrinsic evaluation”
Word representations
15. • Usage
• Search on broad term
• How?
• Hierarchy
• How are word vectors useful?
• Add new entity
• Clustering
• Implementation details
• Cosine similarity
Word Representations - Use Case 1: Broad Search
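A hypothetical sketch of how these pieces could combine: recommend where a new entity sits in the hierarchy by letting its nearest neighbors in embedding space vote for their parent category. The toy taxonomy and the k-nearest-neighbor vote are assumptions for illustration, not Workday's implementation.

```python
# Place a new entity by a cosine-similarity nearest-neighbor vote.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_category(new_vec, entity_vecs, parent_of, k=3):
    nearest = sorted(entity_vecs,
                     key=lambda e: cosine(new_vec, entity_vecs[e]),
                     reverse=True)[:k]
    votes = {}
    for e in nearest:
        votes[parent_of[e]] = votes.get(parent_of[e], 0) + 1
    return max(votes, key=votes.get)

rng = np.random.default_rng(1)
entity_vecs = {e: rng.normal(size=50) for e in ["Pixlr", "SQL", "Excel"]}
parent_of = {"Pixlr": "Graphics Software", "SQL": "Databases", "Excel": "Spreadsheets"}
print(recommend_category(rng.normal(size=50), entity_vecs, parent_of))
```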
16. [Diagram: new entities such as Adobe Photoshop, Affinity Photo, Pixlr, SQL, Data Modeling, and Relational Databases being placed into a category taxonomy whose nodes include Productivity Software, Graphics Software, Office Software, Illustration Software, Photo Editing Software, Photo Management Software, Presentation Software, and Spreadsheet Software]
Word Representations - Use Case 1: New Entity Ingestion
17. • Quality of category recommendation
• Parent, grandparent or sibling
• 48.5% increase in successfully ingesting new entities into the hierarchy
Word Vector Evaluation: New Entity Ingestion Score
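A loosely hedged sketch of how a "parent, grandparent or sibling" check could be scored; the partial-credit weights are assumptions, since the slide does not specify them.

```python
# Score a category recommendation for a newly ingested entity.
def ingestion_score(recommended, true_parent, parent_of):
    if recommended == true_parent:
        return 1.0  # exact parent
    if recommended == parent_of.get(true_parent):
        return 0.5  # grandparent (assumed partial credit)
    if parent_of.get(recommended) == parent_of.get(true_parent):
        return 0.5  # sibling of the true parent (assumed partial credit)
    return 0.0

# Toy hierarchy: Photo Editing / Illustration Software -> Graphics Software
parent_of = {"Photo Editing Software": "Graphics Software",
             "Illustration Software": "Graphics Software",
             "Graphics Software": "Software"}
print(ingestion_score("Illustration Software", "Photo Editing Software", parent_of))  # 0.5
```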
18. • Usage
• Recommend related entities
• Search results: Exact Match + Related Skills
• Siblings vs Related entities
• How are word vectors used?
• Cosine similarity
Word Representations - Use Case 2: Related Entities Recommendation
19. [image-only slide]
20. • Quality of related entity recommendations
• Skill co-occurrences in resumes
• 49.5% additive increase in the recommendations
Word Vectors Evaluation: Co-occurrence Score
21. • Usage
• Disambiguation
• Query Expansion
• How are word vectors used?
• Subword model
• Cosine similarity
• Additionally
• Similar meaning
• e.g. Software Developer -> Software Engineer
• Abbreviation
• e.g. MS Excel -> Microsoft Excel
• Partial matches
• e.g. Microsoft Excel 2008 -> Microsoft Excel
• Spelling Errors
Word Representations - Use Case 3 - Synonyms
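A brief sketch of why the subword model helps with spelling errors and partial matches: fastText can embed out-of-vocabulary strings from their character n-grams, so a misspelled query still lands near the intended term. This assumes the gensim model trained in the earlier sketch; the example words are illustrative.

```python
# Subword (character n-gram) vectors tolerate misspellings and OOV terms.
from gensim.models import FastText

model = FastText.load("fasttext.model")  # from the earlier training sketch

# "micorsoft" never appeared in the corpus, but its n-grams overlap heavily
# with "microsoft", so their vectors are close.
print(model.wv.similarity("micorsoft", "microsoft"))

# Nearest neighbours double as synonym candidates for human review.
print(model.wv.most_similar("developer", topn=5))
```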
22. • Synonyms
• Polysemy
• Compositional phrase vectors
• Document vectors
Ongoing and Future Work
23. 1. P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
2. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 3111–3119, 2013.
3. Minh-Thang Luong, Richard Socher, and Christopher D. Manning. Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, 2013.
References
24. • Chief Data Scientists
• Joseph Turian
• Parag Namjoshi
• Engineering Manager
• Harikrishna Naraynan
• Tech Lead
• Saumil Shah
• Dev Team
• Adam Baker
• Namrata Ghadi
• Sergei Wintzki
• Rohit Kumar
Team