This document discusses incorporating probabilistic retrieval knowledge into TFIDF-based search engines. It provides an overview of different retrieval models such as Boolean, vector space, probabilistic, and language models. It then describes using a probabilistic model that estimates the probability of a document being relevant or non-relevant given its terms. This model can be combined with the BM25 ranking algorithm. The document proposes applying probabilistic knowledge to different document fields during ranking to improve relevance.
2. Overview of Retrieval Models
Boolean Retrieval
Vector Space Model
Probabilistic Model
Language Model
3. Boolean Retrieval
lincoln AND NOT (car AND automobile)
The earliest model and still in use today
The result is very easy to explain to users
Highly efficient computationally
The major drawback: it lacks a sophisticated ranking algorithm.
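The example query can be evaluated with plain set operations over an inverted index. The sketch below is only an illustration: the `index` contents and the `postings` helper are hypothetical, not taken from the slides.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
index = {
    "lincoln": {1, 2, 3, 4},
    "car": {2, 3, 5},
    "automobile": {3, 5, 6},
}


def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())


def lincoln_query():
    """Evaluate: lincoln AND NOT (car AND automobile)."""
    excluded = postings("car") & postings("automobile")  # AND = intersection
    return postings("lincoln") - excluded                # AND NOT = difference


result = lincoln_query()
```

The result is an unordered set: every matching document is equally "relevant", which is exactly the missing-ranking drawback noted on the slide.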
4. Vector Space Model
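[Figure: Doc1, Doc2 and the Query drawn as vectors in a term space with axes labeled Term2 and Term3]
$$\cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \cdot \sum_{j=1}^{t} q_j^2}}$$
Major flaw: it lacks guidance on the details of how weighting and ranking algorithms are related to relevance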
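A minimal sketch of the cosine measure, assuming documents and queries are given as equal-length lists of term weights. The function name and the zero-vector convention (return 0.0) are choices made for this example.

```python
import math


def cosine(doc, query):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc) * sum(q * q for q in query))
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0
```

The formula fixes how the weights are compared, but not how they should be chosen, which is the flaw the slide points out.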
6. Probabilistic Retrieval Model
$$P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)} \qquad P(NR \mid D) = \frac{P(D \mid NR)\,P(NR)}{P(D)}$$
If $P(D \mid R)\,P(R) > P(D \mid NR)\,P(NR)$, then classify D as relevant
7. Estimate P(D|R) and P(D|NR)
Define $D = (d_1, d_2, \ldots, d_t)$, then
$$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R) \qquad P(D \mid NR) = \prod_{i=1}^{t} P(d_i \mid NR)$$
Binary Independence Model
term independence + binary features in documents
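Under the Binary Independence Model, the decision rule "classify D as relevant if P(D|R)P(R) > P(D|NR)P(NR)" can be sketched as below. The function names, the example probabilities and the default prior are illustrative assumptions; log-probabilities are used only to avoid underflow on long vectors, and all term probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def log_likelihood(doc_bits, term_probs):
    """log P(D | class) for binary features under term independence.

    doc_bits[i] is 1 if term i occurs in the document, else 0.
    term_probs[i] is the probability that term i occurs in the class.
    """
    total = 0.0
    for d, p in zip(doc_bits, term_probs):
        total += math.log(p if d == 1 else 1.0 - p)
    return total


def is_relevant(doc_bits, p, s, prior_r=0.5):
    """Classify as relevant when P(D|R)P(R) > P(D|NR)P(NR).

    p: per-term occurrence probabilities in the relevant set.
    s: per-term occurrence probabilities in the non-relevant set.
    prior_r: assumed prior P(R); P(NR) is taken as 1 - prior_r.
    """
    log_r = log_likelihood(doc_bits, p) + math.log(prior_r)
    log_nr = log_likelihood(doc_bits, s) + math.log(1.0 - prior_r)
    return log_r > log_nr
```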
8. Likelihood Ratio
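Likelihood ratio:
$$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$
$s_i$: in non-relevant set, the probability of term i occurring
$p_i$: in relevant set, the probability of term i occurring
$$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}$$
Taking the logarithm and dropping the factor that is the same for every document gives the rank-equivalent score
$$\sum_{i:d_i=1} \log \frac{p_i (1-s_i)}{s_i (1-p_i)}$$
Estimating $p_i = (r_i + 0.5)/(R + 1)$ and $s_i = (n_i - r_i + 0.5)/(N - R + 1)$ and restricting the sum to query terms:
$$\sum_{i:d_i=q_i=1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}$$
N: total number of documents in the collection
$n_i$: number of documents that contain term i
$r_i$: number of relevant documents that contain term i
R: total number of relevant documents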
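The per-term weight inside the final sum can be computed directly from the four counts. This sketch assumes N counts all documents in the collection and n_i all documents containing term i, which is what the differences n_i − r_i and N − n_i − R + r_i in the formula imply. The function name is made up for the example.

```python
import math


def rsj_weight(r_i, R, n_i, N):
    """Smoothed relevance weight of one term.

    r_i: relevant documents containing the term
    R:   relevant documents in total
    n_i: documents containing the term (assumed: relevant or not)
    N:   documents in the collection (assumed: relevant or not)
    """
    relevant_odds = (r_i + 0.5) / (R - r_i + 0.5)
    non_relevant_odds = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(relevant_odds / non_relevant_odds)
```

With no relevance information (r_i = R = 0), the weight reduces to log((N − n_i + 0.5)/(n_i + 0.5)), an IDF-like quantity.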
9. Combine with BM25 Ranking Algorithm
BM25 extends the scoring function for the binary independence model to include document and query term weights.
It performs very well in TREC experiments
$$R(q,D) = \sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}$$
$$K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$
$k_1$, $k_2$, b: tuning parameters
$f_i$: frequency of term i in the document
dl: document length
avgdl: average document length in data set
$qf_i$: frequency of term i in the query
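A rough sketch of the scoring function above for a single document. The data layout (dicts of term counts) and the function name are assumptions made for this example. The defaults k1 = 1.2, k2 = 100 and b = 0.75 are commonly quoted starting values, not values given in the slides. N and df are assumed to count all documents, and R and rel_df default to "no relevance information".

```python
import math


def bm25_score(query_tf, doc_tf, dl, avgdl, N, df,
               k1=1.2, k2=100.0, b=0.75, R=0, rel_df=None):
    """Score one document against a query.

    query_tf: term -> frequency in the query (qf_i)
    doc_tf:   term -> frequency in the document (f_i)
    df:       term -> number of documents containing the term (n_i)
    rel_df:   term -> number of relevant documents containing it (r_i)
    """
    rel_df = rel_df or {}
    K = k1 * ((1 - b) + b * dl / avgdl)  # document length normalization
    score = 0.0
    for term, qf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # the term-frequency factor is zero
        n = df.get(term, 0)
        r = rel_df.get(term, 0)
        weight = math.log(((r + 0.5) / (R - r + 0.5))
                          / ((n - r + 0.5) / (N - n - R + r + 0.5)))
        score += weight * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))
    return score


# Hypothetical usage: the same term counts in a short and a long document.
query = {"buzz": 1}
df = {"buzz": 10}
short_doc = bm25_score(query, {"buzz": 2}, dl=100, avgdl=100, N=1000, df=df)
long_doc = bm25_score(query, {"buzz": 2}, dl=200, avgdl=100, N=1000, df=df)
no_match = bm25_score(query, {"toy": 3}, dl=100, avgdl=100, N=1000, df=df)
```

With b > 0, the longer document receives the lower score for the same term counts, which is the effect of K.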
10. Weighted Fields Boolean Search
Table layout: doc-id | field0 | field1 | … | text, one row per document (1, 2, 3, …, n)
$$R(q,D) = \sum_{i \in q} \sum_{f \in \text{fields}} w_f\, m_i$$
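One way to read the double sum is sketched below. It assumes m_i is a 0/1 indicator that query term i matches in field f, an interpretation the slide does not spell out. The field names and weights are hypothetical.

```python
def weighted_field_score(query_terms, doc_fields, field_weights):
    """Sum the field weight w_f for every (query term, field) match.

    doc_fields:    field name -> field text
    field_weights: field name -> weight w_f
    Assumption: m_i is 1 when the term occurs in the field, else 0.
    """
    tokens = {name: set(text.lower().split())
              for name, text in doc_fields.items()}
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            if term.lower() in tokens.get(field, set()):
                score += weight
    return score


# Hypothetical document and weights.
doc = {"title": "Buzz Lightyear", "text": "a toy named buzz"}
weights = {"title": 3.0, "text": 1.0}
score = weighted_field_score(["buzz", "lightyear"], doc, weights)
```

Here "buzz" matches both fields (3.0 + 1.0) and "lightyear" matches only the title (3.0).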
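11. Apply Probabilistic Knowledge into Fields
[Figure: the doc-id | field0 | field1 | … | text table with a "Higher … Lower" gradient across the fields and the example terms "Lightyear" and "Buzz" in row 1; each document is labeled Relevant, P(R|D), or Non-Relevant, P(NR|D)]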
12. Use the Knowledge during Ranking
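[Figure: the doc-id | field0 | field1 | … | text table, with the example terms "Lightyear" and "Buzz" in row 1]
The goal is:
$$\log P(D \mid R) = \log \prod_{i=1}^{t} P(d_i \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_i$$
The weighted field sum on the right is the learnable part.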
13. Comparison of Approaches
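$$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$
$$R_{bm25}(q,D) = \frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}, \qquad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$
$$R(q,D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$
$$R(q,D) = \sum_{i \in q} \sum_{f \in F} \underbrace{w_f\, m_i}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$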
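For reference, the TF-IDF baseline in this comparison can be sketched as follows. It assumes term frequency is normalized by the total number of term occurrences in the document, and that N is the collection size and df[term] the number of documents containing the term. The function name and data layout are illustrative.

```python
import math


def tf_idf(term, doc_counts, N, df):
    """TF-IDF weight of one term in one document.

    doc_counts: term -> raw count in this document
    N:          number of documents in the collection
    df:         term -> number of documents containing the term
    """
    total = sum(doc_counts.values())
    if total == 0 or df.get(term, 0) == 0:
        return 0.0  # empty document or unseen term
    tf = doc_counts.get(term, 0) / total
    idf = math.log(N / df[term])
    return tf * idf
```

Unlike the BM25 variants above, this weight has no saturation on term frequency and no tunable length normalization.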
14. Other Considerations
This is not a formal model
Requires user relevance feedback (search log)
Harder to handle real-time search queries
How to prevent love/hate attacks