This document provides an overview of key concepts in information retrieval. It discusses issues with crawlers getting stuck in loops and storage issues. It also covers the basics of indexing documents, including creating an inverted index to speed up retrieval. Common techniques for detecting duplicate and near-duplicate documents are also summarized.
Modified version of Chapter 18 of the book Fundamentals_of_Database_Systems,_6th_Edition with review questions
as part of database management system course
All information in a file is always in binary form or a series of ones and zeros. A document includes any file you have created. It can be a true text document, sound file, graphics, images, or any other type of information the computer can create, store, or size from the internet.
Modified version of Chapter 18 of the book Fundamentals_of_Database_Systems,_6th_Edition with review questions
as part of database management system course
All information in a file is always in binary form or a series of ones and zeros. A document includes any file you have created. It can be a true text document, sound file, graphics, images, or any other type of information the computer can create, store, or size from the internet.
This presentation gives a basic introduction to files as a Data Structure. Physical Files and Logical Files are covered. Files as a collection of records and as a stream of bytes are talked about. Basic operations in files are explained. C syntax is given. Types of files are briefly talked about.
h2kinfosys is offering the IT Online Courses with Certificates ,H2kinfosys is the best place to learn online coding classes as we offer the most job oriented training led by experienced instructors through live classroom sessions. It courses online from h2kinfosys . top trending courses like learn tableau online, hadoop certification Training, python certification online and more courses register for free demo class .
https://www.h2kinfosys.com/
File is the basic unit of information storage on a secondary storage device. Therefore, almost every form of data and information reside on these devices in form of file – whether audio data or video, whether text or binary.
Files may be classified on different bases as follows:
1. On the basis of content:
Text files: Files containing data/information in textual form. It is merely a collection of characters. Document files etc.
Binary files: Files containing machine code. The contents are non-recognizable and can be interpreted only in a specified way using the same application that created it. E.g. executable program files, audio files, video files etc.
Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter TreesDavid Lillis
Presentation by David Lillis at ICDF2C 2017 conference. Full paper at: http://lill.is/pubs/Lillis2017.pdf
Collaboration between UCD Forensics and Security Research Group (https://forensicsandsecurity.com/) and UNH Cyber Forensics Research and Education Group (http://unhcfreg.com/).
A basic course on Research data management, part 3: sharing your dataLeon Osinski
A basic course on research data management for PhD students. The course consists of 4 parts. The course was given at Eindhoven University of Technology (TUe), 24-01-2017
This presentation gives a basic introduction to files as a Data Structure. Physical Files and Logical Files are covered. Files as a collection of records and as a stream of bytes are talked about. Basic operations in files are explained. C syntax is given. Types of files are briefly talked about.
h2kinfosys is offering the IT Online Courses with Certificates ,H2kinfosys is the best place to learn online coding classes as we offer the most job oriented training led by experienced instructors through live classroom sessions. It courses online from h2kinfosys . top trending courses like learn tableau online, hadoop certification Training, python certification online and more courses register for free demo class .
https://www.h2kinfosys.com/
File is the basic unit of information storage on a secondary storage device. Therefore, almost every form of data and information reside on these devices in form of file – whether audio data or video, whether text or binary.
Files may be classified on different bases as follows:
1. On the basis of content:
Text files: Files containing data/information in textual form. It is merely a collection of characters. Document files etc.
Binary files: Files containing machine code. The contents are non-recognizable and can be interpreted only in a specified way using the same application that created it. E.g. executable program files, audio files, video files etc.
Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter TreesDavid Lillis
Presentation by David Lillis at ICDF2C 2017 conference. Full paper at: http://lill.is/pubs/Lillis2017.pdf
Collaboration between UCD Forensics and Security Research Group (https://forensicsandsecurity.com/) and UNH Cyber Forensics Research and Education Group (http://unhcfreg.com/).
A basic course on Research data management, part 3: sharing your dataLeon Osinski
A basic course on research data management for PhD students. The course consists of 4 parts. The course was given at Eindhoven University of Technology (TUe), 24-01-2017
The Why and How of Knowledge Management: Some Applications in Teaching and Le...Olivier Serrat
Knowledge management—the process of identifying, creating, storing, sharing, and using organizational knowledge—aims to provide support for improved decision making. Its higher objective is to advance organizational performance. It is best exercised if the motive behind knowledge management initiatives is clear, with sundry possible areas of activity and associated perspectives.
Information Storage and Retrieval : A Case StudyBhojaraju Gunjal
Bhojaraju.G, M.S.Banerji and Muttayya Koganurmath (2004). Information Storage and Retrieval: A Case Study, In Proceedings of International Conference on Digital Libraries (ICDL 2004), New Delhi, Feb 24-27, 2004.
(Best Poster Presentation Award)
Information Management in a Web 2.0 World May 2009Collabor8now Ltd
The current information management polices, standards, processes and systems are not fit for purpose for Web 2.0 working practices, and specifically for the emerging Cloud Computing models.
Our regular Introduction to Data Management (DM) workshop (90-minutes). Covers very basic DM topics and concepts. Audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
Modern data lakes are now built on cloud storage, helping organizations leverage the scale and economics of object storage while simplifying overall data storage and analysis flow
This is a presentation I delivered at CodeMash 2.0.1.0 dealing with lessons learned while building an application for handling the post-processing of scientific data using the Windows Azure platform.
Introduction into Search Engines and Information RetrievalA. LE
Gives a brief introduction into search engines and information retrieval. Covers basics about Google and Yahoo, fundamental terms in the area of information retrieval and an introduction into the famous page rank algorithm
Recently a new breed of "multi-model" databases has emerged. They are a document store, a graph database and a key/value store combined in one program. Therefore they are able to cover a lot of use cases which otherwise would need multiple different database systems.
This approach promises a boost to the idea of "polyglot persistence", which has become very popular in recent years although it creates some friction in the form of data conversion and synchronisation between different systems. This is, because with a multi-model database one can enjoy the benefits of polyglot persistence without the disadvantages.
In this talk I will explain the motivation behind the multi-model approach, discuss its advantages and limitations, and will then risk to make some predictions about the NoSQL database market in five years time, which I shall only reveal during the talk.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
2. - Some crawlers were stuck in loops.
- Issues with storage
- MSU’s site has about 60,00+ URLs
including subdomains.
3. - 37.5% of grade based on Crawler
- 37.5% on presentation
- 25% Attendance
- Paper topics handed out next week
4. - Information retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers)
5. Three reasons to use multiple computers for
crawling
Helps to put the crawler closer to the sites it
crawls
Reduces the number of sites the crawler has to
remember
Reduces computing resources required
Distributed crawler uses a hash function to
assign URLs to crawling computers
hash function should be computed on the host
part of each URL
6. Linear Scan of Documents is called
“grepping”’
In order to perform “ranked retrieval” we
need the best answer among many
documents (billions).
Need to Index documents in advance.
Ex: Boolean Retrieval
7. Ex: The works of Shakespeare.
Record whether each book contains a word
out of all the words (~32,000)
Create binary-term document. Incident
Matrix.
8. Suppose we want to answer the Brutus AND
Ceaser and NOT Calpurnia. We operate on
the row vectors (NOT(Calpurnia))
110100 AND 110111 AND 101111 =
100100
=> Ans: “Antony and Cleopatra” and “Hamlet”
9. Collection of Documents: Corpus
To asses the effectiveness of an IR system
Precision: What fraction returned documents are
relevant to the information need?
Recall: What fraction of the relevant documents in
a the collection were returned by the system
10. Last example, incidence matrix, too large. A
500k x 1M matrix has ½ trillion 0’s and 1’s.
And too sparse. 99.8% of matrix is zero.
Inverted Index: Record items that do appear.
To gain speed at retrieval time we build an
index in advance.
Two steps:
Collect documents
Tokenize the text (break into terms)
11.
12. Text is stored in hundreds of incompatible file
formats.. Bytes on a web server.
e.g., raw text, RTF, HTML, XML, Microsoft
Word, ODF, PDF
Other types of files also important
e.g., PowerPoint, Excel
Typically use a conversion tool
converts the document content into a tagged text
format such as HTML or XML
retains some of the important formatting
information
13. A character encoding is a mapping
between bits and glyphs
i.e., getting from bits in a file to characters on
a screen
Can be a major source of incompatibility
ASCII (1963) is basic character encoding
scheme for English
encodes 128 letters, numbers, special
characters, and control characters in 7
bits, extended with an extra bit for storage in
bytes
14. Other languages can have many more
glyphs
e.g., Chinese has more than 40,000
characters, with over 3,000 in common use
Many languages have multiple encoding
schemes
e.g., CJK (Chinese-Japanese-Korean) family of
East Asian languages, Hindi, Arabic
must specify encoding
can’t have multiple languages in one file
15. ASCII: Was able to rep. all of English
characters use numbers 32-127. (Unix and C
being invented)
IBM started using the other bits for special
characters.
Then came the ANSI Standard
The WWW caused all this to
“crash”….UNICODE (1991)
16. A different way of thinking about character
representation
Question is A different than A? In some
languages even the shape matters.
Every latter is assigned a number (code
point) by Unicode consortium. Ex: A = U +
0041 (Hex)
17. UTF-8 -> Identical to ASCII codes. Coding
bits to Glyphs.
You have to know what Encoding is being
used!
18. Single mapping from numbers to glyphs
that attempts to include all glyphs in
common use in all known languages
Unicode is a mapping between numbers
and glyphs
does not uniquely specify bits to glyph
mapping!
e.g., UTF-8 (1-8 bytes), UTF-16, UTF-32
19. e.g., Greek letter pi (π) is Unicode symbol
number 960
In binary, 00000011 11000000 (3C0 in
hexadecimal)
Final encoding is (110)01111 (10)000000 (CF80
in hexadecimal)
20.
21. Web Site use:
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8" />
22. Requirements for document storage system:
Random access
○ request the content of a document based on its
URL
○ hash function based on URL is typical
Compression and large files
○ reducing storage requirements and efficient access
Update
○ handling large volumes of new and modified
documents
○ adding new anchor text
23. Text is highly redundant (or predictable)
Compression techniques exploit this
redundancy to make files smaller without
losing any of the content
Popular algorithms can compress HTML and
XML text by 80%
e.g., DEFLATE (zip, gzip) and LZW (UNIX
compress, PDF)
24. Google’s document storage system (Chang
et al., 2006)
Customized for storing, finding, and updating web
pages. Tablets served by 1000’s of machines.
Handles large collection sizes using inexpensive
computers
25. No query language, no complex queries to
optimize
Only row-level transactions
Tablets are stored in a replicated file system
that is accessible by all BigTable servers
Any changes to a BigTable tablet are
recorded to a transaction log, which is also
stored in a shared file system
If any tablet server crashes, another server
can immediately read the tablet data and
transaction log from the file system and take
over
26. Logically organized into rows
A row stores data for a single web page
Combination of a row key, a column key, and
a timestamp point to a single cell in the row
http://video.google.com/videoplay?docid=727
8544055668715642
27. BigTable can have a huge number of
columns per row
all rows have the same column groups
not all rows have the same columns
important for reducing disk reads to access
document data
Rows are partitioned into tablets based
on their row keys
simplifies determining which server is
appropriate
28. Duplicate and near-duplicate documents
occur in many situations
Copies, versions, plagiarism, spam, mirror
sites
30% of the web pages in a large crawl are
exact or near duplicates of pages in the other
70%
Duplicates consume significant resources
during crawling, indexing, and search
Little value to most users
29. Exact duplicate detection is relatively easy
Checksum techniques
A checksum is a value that is computed based on the
content of the document
○ e.g., sum of the bytes in the document file
Possible for files with different text to have same
checksum
Functions such as a cyclic redundancy check
(CRC), have been developed that consider the
positions of the bytes
30. More challenging task
Are web pages with same text context but
different advertising or format near-duplicates?
A near-duplicate document is defined using a
threshold value for some similarity measure
between pairs of documents
e.g., document D1 is a near-duplicate of
document D2 if more than 90% of the words in
the documents are the same
31. Many web pages contain text, links, and
pictures that are not directly related to the
main content of the page
This additional material is mostly noise that
could negatively affect the ranking of the
page
Techniques have been developed to detect
the content blocks in a web page
Non-content material is either ignored or reduced
in importance in the indexing process
32.
33. Many web pages contain text, links, and
pictures that are not directly related to the
main content of the page
This additional material is mostly noise that
could negatively affect the ranking of the
page
Techniques have been developed to detect
the content blocks in a web page
Non-content material is either ignored or reduced
in importance in the indexing process
34. Other approaches
use DOM
structure and
visual (layout)
features
35. Cumulative distribution of tags in the example
web page
Main text content of the page corresponds to the
“plateau” in the middle of the distribution
36. R. Song et al., 2004. “Learning Block
Importance Models for Web Pages”
Editor's Notes
Grab all terms and store docID. 2nd column organize alphabetical. 3rd column doc. Freq is number of documents it appears in (good for speed later). Last column, postings lists, is which documents it appears in. Good for ranking algorithms later.
Comp
2^8 = 256
2^8 = 256
2^8 = 256
960 = 1111000000 so 10 bits. Which means we need to use 128-2047. The first numbers are FIXED in the coding scheme.
Distributed database p6. 57 IR book. In use at Google. Petabyte 10^15.n Can’t use SQL it is a large relational database. Table
Notes: Not all rows have to have the same columns. Big difference between relational databases. Site is example.com…. Other.com links to example .com (since that is the row we are talking about) with anchor text example. Annchor is the the anchor text. Nulll.com links to example.com with anchor text click here.