IMPLEMENTATION OF
INFORMATION RETRIEVAL
  SYSTEMS VIA RDBMS
Relational Database: Definitions

 Relational database: a set of relations
 Relation: made up of 2 parts:
     Instance : a table, with rows and columns.
      #Rows = cardinality, #fields = degree / arity.
     Schema : specifies name of relation, plus name and type of
      each column.
        E.G. Students(sid: string, name: string, login: string,
              age: integer, gpa: real).
 Can think of a relation as a set of rows or tuples (i.e.,
 all rows are distinct).
Example Instance of Students Relation


        sid     name      login            age   gpa
       53666    Jones jones@cs             18    3.4
       53688    Smith smith@eecs           18    3.2
       53650    Smith smith@math           19    3.8

Cardinality = 3, degree = 5, all rows distinct
Relational Query Languages

 A major strength of the relational model: supports
  simple, powerful querying of data.
 Queries can be written intuitively, and the DBMS is
  responsible for efficient evaluation.
The SQL Query Language

 Developed by IBM (System R) in the 1970s
 Need for a standard since it is used by many vendors
 Standards:
    SQL-86
    SQL-89 (minor revision)
    SQL-92 (major revision, current standard)
    SQL-99 (major extensions)
The SQL Query Language

 To find all 18 year old students, we can write:

  SELECT *
  FROM Students S
  WHERE S.age = 18

 Result:

   sid     name    login        age   gpa
  53666    Jones   jones@cs      18   3.4
  53688    Smith   smith@eecs    18   3.2

 To find just names and logins, replace the first line:
   SELECT S.name, S.login
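
The two queries above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in DBMS (the slides do not name one) and loads the three example Students rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@eecs", 18, 3.2),
        ("53650", "Smith", "smith@math", 19, 3.8),
    ],
)

# All columns for the 18-year-old students (ORDER BY added for stable output).
all_cols = conn.execute(
    "SELECT * FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

# Same selection, but only names and logins.
names_logins = conn.execute(
    "SELECT S.name, S.login FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

print(all_cols)
print(names_logins)
```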
Querying Multiple Relations
 Given the following instance of Enrolled:

     sid          cid   grade
    53831   Carnatic101  C
    53831   Reggae203    B
    53650   Topology112  A
    53666   History105   B

    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'A'

 Result:

    S.name E.cid
    Smith  Topology112
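
A runnable sketch of the join above, again assuming sqlite3 as the DBMS. It loads the Students and Enrolled instances from the slides and should return the single (Smith, Topology112) row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
INSERT INTO Students VALUES
  ('53666', 'Jones', 'jones@cs', 18, 3.4),
  ('53688', 'Smith', 'smith@eecs', 18, 3.2),
  ('53650', 'Smith', 'smith@math', 19, 3.8);
INSERT INTO Enrolled VALUES
  ('53831', 'Carnatic101', 'C'),
  ('53831', 'Reggae203', 'B'),
  ('53650', 'Topology112', 'A'),
  ('53666', 'History105', 'B');
""")

# Join the two relations on sid and keep only the 'A' grades.
rows = conn.execute("""
    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'A'
""").fetchall()
print(rows)
```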
Creating Relations in SQL
 Creates the Students relation. Observe that the type (domain) of each
  field is specified, and enforced by the DBMS whenever tuples are added
  or modified.

     CREATE TABLE Students
          (sid CHAR(20),
           name CHAR(20),
           login CHAR(10),
           age INTEGER,
           gpa REAL)

 As another example, the Enrolled table holds information about courses
  that students take.

     CREATE TABLE Enrolled
          (sid CHAR(20),
           cid CHAR(20),
           grade CHAR(2))
Combining Separate Systems

  Use an IR system and an RDBMS that are independent of each other.
  Divide the query into two parts:
      Structured part for the RDBMS
      Unstructured (text) part for the IR system
  Combine the results from the IR system and the RDBMS
  Good for letting each vendor develop its own system
  Bad for data integrity, recovery, portability, and
  performance
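
A minimal sketch of the split-and-combine idea, under stated assumptions: the "IR system" is a toy in-memory posting map with a hypothetical ir_search function, the RDBMS is sqlite3, and the doc table and its rows are made up for illustration.

```python
import sqlite3

# Toy stand-in for an independent IR engine: term -> set of document ids.
ir_postings = {"sabanci": {1, 2}, "university": {1, 3}}

def ir_search(terms):
    """Return ids of documents containing every term (hypothetical IR API)."""
    sets = [ir_postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# Independent RDBMS holding only the structured fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, pub_date TEXT)")
conn.executemany(
    "INSERT INTO doc VALUES (?, ?)",
    [(1, "2005-03-03"), (2, "2005-03-03"), (3, "2005-03-04")],
)

# Structured part goes to the RDBMS, text part goes to the IR engine.
structured = {row[0] for row in conn.execute(
    "SELECT doc_id FROM doc WHERE pub_date = ?", ("2005-03-03",))}
unstructured = ir_search(["sabanci", "university"])

# The application, not either system, combines the two result sets.
combined = sorted(structured & unstructured)
print(combined)
```

Because the final intersection happens outside both systems, neither one can guarantee that the two result sets describe the same state of the data, which is the integrity and recovery concern listed above.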
User Defined Operators

  Allow users to extend SQL by adding their own functions
  Some vendors used this approach (such as the IBM DB2 Text
  Extender)
  Lynch and Stonebraker defined “user defined operators” to
  implement information retrieval in 1988
      -- Retrieves documents that contain Term1, Term2, Term3
      SELECT Doc_Id
      FROM Doc
      WHERE SEARCH-TERM(Text, Term1, Term2, Term3)

      -- Retrieves documents that contain Term1, Term2, Term3
      -- within a window of 5 terms
      SELECT Doc_Id
      FROM Doc
      WHERE PROXIMITY(Text, 5, Term1, Term2, Term3)
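
One way to experiment with user-defined operators is sqlite3's create_function, which registers a Python callable as an SQL function. The sketch below is an illustration only, not the Lynch and Stonebraker design: the names SEARCH_TERM and PROXIMITY (with an underscore, since a hyphen is not valid in an SQL identifier), the whitespace tokenization, and the sample rows are all assumptions.

```python
import sqlite3

def search_term(text, *terms):
    """Return 1 if every term occurs as a whole word in text, else 0."""
    words = set(text.lower().split())
    return int(all(t.lower() in words for t in terms))

def proximity(text, width, *terms):
    """Return 1 if some window of `width` consecutive words holds all terms."""
    words = text.lower().split()
    wanted = {t.lower() for t in terms}
    return int(any(wanted <= set(words[i:i + width])
                   for i in range(len(words))))

conn = sqlite3.connect(":memory:")
# narg = -1 lets each SQL function accept any number of arguments.
conn.create_function("SEARCH_TERM", -1, search_term)
conn.create_function("PROXIMITY", -1, proximity)

conn.execute("CREATE TABLE Doc (Doc_Id INTEGER, Text TEXT)")
conn.executemany("INSERT INTO Doc VALUES (?, ?)", [
    (1, "the vice president spoke today"),
    (2, "the president met the new vice chancellor"),
])

hits = sorted(r[0] for r in conn.execute(
    "SELECT Doc_Id FROM Doc WHERE SEARCH_TERM(Text, 'vice', 'president')"))
near = sorted(r[0] for r in conn.execute(
    "SELECT Doc_Id FROM Doc WHERE PROXIMITY(Text, 2, 'vice', 'president')"))
print(hits, near)
```

Both documents contain the two terms, but only the first has them inside a two-word window.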
Non-First Normal Form Approaches

  Capture the many-to-many relationships into sets via nested
  relations
  Hard to implement ad-hoc queries
  No standard yet
Using RDBMS for IR

  Benefits:
      Recovery
      Performance
      Data migration
      Concurrency Control
      Access control mechanism
      Logical and physical data independence
Using RDBMS for IR


  Example: A bibliography that includes both structured and
  unstructured information
      DIRECTORY (name, institution) : affiliation of the author
      AUTHOR (name, DocId) : authorship information
      INDEX (term, DocId) : terms that are used to index a document
Using RDBMS for IR

   Preprocessing
       SGML, a standard for defining the parts of a document, can be
        used as a starting point

 <DOC>
 <DOCNO> WSJ834234234 </DOCNO>
  <HL> How to make students suffer in IR Course </HL>
 <DD> 03/23/87</DD>
 <DATELINE> Sabanci, Turkey </DATELINE>
 <TEXT>
 Crawler HW, Inverted Index, Querying
 </TEXT>
 </DOC>
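
A rough sketch of how the sample document above could be parsed into a DOC row and INDEX rows. The regular expressions, the tiny stop list, and the mapping of the HL tag to DocName are assumptions made for illustration; a real SGML parser would follow the document's DTD.

```python
import re
from collections import Counter

sgml = """<DOC>
<DOCNO> WSJ834234234 </DOCNO>
<HL> How to make students suffer in IR Course </HL>
<DD> 03/23/87</DD>
<DATELINE> Sabanci, Turkey </DATELINE>
<TEXT>
Crawler HW, Inverted Index, Querying
</TEXT>
</DOC>"""

STOP_TERMS = {"to", "in", "how"}  # hypothetical stop list

def field(tag, doc):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return match.group(1).strip() if match else None

# DOC (DocId, DocName, PubDate, DateLine)
doc_row = (field("DOCNO", sgml), field("HL", sgml),
           field("DD", sgml), field("DATELINE", sgml))

# INDEX (DocId, Term, TermFrequency); Counter is a hash table keyed by term.
tokens = re.findall(r"[a-z0-9]+", field("TEXT", sgml).lower())
counts = Counter(t for t in tokens if t not in STOP_TERMS)
index_rows = sorted((doc_row[0], term, freq) for term, freq in counts.items())

print(doc_row)
print(index_rows)
```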
Using RDBMS for IR
   Preprocessing (continued)
       Use a parser together with a hash function to identify terms
       Use a STOP_TERM table for referencing stop words
       Produce three output tables
          INDEX (DocId, Term, TermFrequency) : models the inverted index
          DOC (DocId, DocName, PubDate, DateLine) : document metadata
          TERM (Term, Idf) : stores the weight of each term

 -- Construct the TERM table; N is the total number of documents.
 -- Multiplying by 1.0 avoids integer division.
 INSERT INTO TERM
 SELECT Term, LOG(N * 1.0 / COUNT(*))
 FROM INDEX
 GROUP BY Term
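
The TERM construction can be sketched in sqlite3 under a few assumptions: the INDEX table is renamed inv_index because INDEX is a reserved word in SQLite, the logarithm is registered from Python under the hypothetical name NATLOG (not every SQLite build ships math functions), and the rows are invented sample data.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a natural-log SQL function backed by math.log.
conn.create_function("NATLOG", 1, math.log)

conn.executescript("""
CREATE TABLE inv_index (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
CREATE TABLE term (Term TEXT PRIMARY KEY, Idf REAL);
INSERT INTO inv_index VALUES
  (1, 'crawler', 2), (1, 'index', 1),
  (2, 'index', 3),
  (3, 'index', 1), (3, 'query', 1);
""")

# N: total number of documents that appear in the index.
n_docs = conn.execute(
    "SELECT COUNT(DISTINCT DocId) FROM inv_index").fetchone()[0]

# One inv_index row per (document, term), so COUNT(*) is the document frequency.
conn.execute("""
    INSERT INTO term
    SELECT Term, NATLOG(? * 1.0 / COUNT(*))
    FROM inv_index
    GROUP BY Term
""", (n_docs,))

idf = dict(conn.execute("SELECT Term, Idf FROM term"))
print(idf)
```

A term that occurs in every document ('index' here) gets an idf of 0, so it contributes nothing to a relevance score.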
Using RDBMS for IR
 An offset can be stored together with each term in order to answer proximity
    queries. For example, the terms of “Vice President” should occur next to
    each other in a relevant document.

 INDEX_PROX (DocId, Term, OffSet)

 -- Construct the INDEX table from INDEX_PROX
 INSERT INTO INDEX
 SELECT DocId, Term, COUNT(*)
 FROM INDEX_PROX
 GROUP BY DocId, Term
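
A small sqlite3 sketch of deriving term frequencies from the positional table. The table name inv_index (for INDEX, which is reserved in SQLite), the column name Pos (for OffSet), and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE index_prox (DocId INTEGER, Term TEXT, Pos INTEGER);
CREATE TABLE inv_index (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
INSERT INTO index_prox VALUES
  (1, 'vice', 0), (1, 'president', 1), (1, 'vice', 7),
  (2, 'president', 3);

-- Each (DocId, Term) group collapses to one row carrying its occurrence count.
INSERT INTO inv_index
  SELECT DocId, Term, COUNT(*) FROM index_prox GROUP BY DocId, Term;
""")

rows = conn.execute(
    "SELECT DocId, Term, TermFrequency FROM inv_index ORDER BY DocId, Term"
).fetchall()
print(rows)
```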
Using RDBMS for IR

   Query can be modeled as a relation as well when it is a long
   document
       QUERY(Term,TermFreq)


   Ex: “Find all news documents written on 03/03/2005 about
   Sabanci University”
       Data will be extracted from the structured fields
       Terms will be extracted using the inverted index


SELECT DISTINCT d.DocId
FROM DOC d, INDEX i
WHERE i.Term IN ('Sabanci', 'University') AND d.PubDate = '03/03/2005'
      AND d.DocId = i.DocId
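
The mixed structured and text query can be exercised with sqlite3. This is a sketch with invented rows; inv_index stands in for INDEX (a reserved word in SQLite). DISTINCT is used so that a document containing both terms is reported once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doc (DocId INTEGER PRIMARY KEY, DocName TEXT,
                  PubDate TEXT, DateLine TEXT);
CREATE TABLE inv_index (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
INSERT INTO doc VALUES
  (1, 'news-a', '03/03/2005', 'Istanbul'),
  (2, 'news-b', '03/03/2005', 'Ankara'),
  (3, 'news-c', '03/04/2005', 'Istanbul');
INSERT INTO inv_index VALUES
  (1, 'Sabanci', 1), (1, 'University', 2),
  (2, 'Crawler', 1),
  (3, 'Sabanci', 4);
""")

# Date filter comes from DOC, term filter from the inverted index.
matches = [row[0] for row in conn.execute("""
    SELECT DISTINCT d.DocId
    FROM doc d, inv_index i
    WHERE i.Term IN ('Sabanci', 'University')
      AND d.PubDate = '03/03/2005'
      AND d.DocId = i.DocId
    ORDER BY d.DocId
""")]
print(matches)
```

Document 2 has the right date but neither term, and document 3 has a term but the wrong date, so only document 1 qualifies.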
Using RDBMS for IR

    Boolean Queries: Consists of terms with boolean operators
    (AND, OR, and NOT)
    For a single inputTerm: retrieve the document texts that contain
    that term

SELECT d.Text
FROM DOC d
WHERE d.DocId IN
     (SELECT DISTINCT (i.DocId)
      FROM INDEX i
      WHERE i.Term = inputTerm)


Note that we can store the text part of a document using a BLOB or CLOB
(Binary or Character Large Object).
Using RDBMS for IR

   Boolean Queries that contain OR

SELECT DISTINCT (i.DocId)
FROM INDEX i
WHERE i.Term = inputTerm1 OR
      i.Term = inputTerm2 OR
      …
      i.Term = inputTermn
Using RDBMS for IR

     Boolean Queries that contain AND

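SELECT DISTINCT (i.DocId)
FROM INDEX i
WHERE i.Term = inputTerm1 AND
      i.Term = inputTerm2 AND
      …
      i.Term = inputTermn

?? This returns no rows: a single INDEX row holds one term, so it can never
equal two different input terms at once.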
Using RDBMS for IR

   Boolean Queries that contain AND (Previous Answer Was
   Wrong)

SELECT DISTINCT (i1.DocId)
FROM INDEX i1, INDEX i2, INDEX i3, … INDEX in
WHERE i1.Term = inputTerm1 AND
      i2.Term = inputTerm2 AND
      …
      in.Term = inputTermn AND
      i1.DocId = i2.DocId AND
      i2.DocId = i3.DocId AND
      …
      in-1.DocId = in.DocId

Alternatively, intersect the per-term results with INTERSECT.
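
The three formulations of a two-term AND query can be compared side by side. The sketch assumes sqlite3, the table name inv_index in place of INDEX (reserved in SQLite), and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inv_index (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
INSERT INTO inv_index VALUES
  (1, 'vice', 1), (1, 'president', 2),
  (2, 'vice', 1),
  (3, 'president', 1);
""")

def ids(sql):
    """Run a query and return the sorted first column."""
    return sorted(row[0] for row in conn.execute(sql))

# Naive form: one row cannot carry two different terms, so nothing matches.
naive = ids("""SELECT DISTINCT i.DocId FROM inv_index i
               WHERE i.Term = 'vice' AND i.Term = 'president'""")

# Self-join: one copy of the index per query term, tied together by DocId.
joined = ids("""SELECT DISTINCT i1.DocId FROM inv_index i1, inv_index i2
                WHERE i1.Term = 'vice' AND i2.Term = 'president'
                  AND i1.DocId = i2.DocId""")

# Set intersection of the per-term posting lists.
intersected = ids("""SELECT DocId FROM inv_index WHERE Term = 'vice'
                     INTERSECT
                     SELECT DocId FROM inv_index WHERE Term = 'president'""")

print(naive, joined, intersected)
```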
Using RDBMS for IR

  Boolean Queries that contain AND
  Commercial DBMSs are not able to process more than a fixed number
  of joins.
  Solution


   SELECT i.DocId
   FROM INDEX i, QUERY q
   WHERE i.Term = q.Term
   GROUP BY i.DocId
   HAVING COUNT(i.Term) = (SELECT COUNT(*) FROM QUERY)

   Works only when the INDEX contains a single row per (document, term)
   pair, holding the term together with its frequency, and no proximity
   information is recorded.
Using RDBMS for IR

  Boolean Queries that contain AND
  Commercial DBMSs are not able to process more than a fixed number
  of joins.
  Solution for terms appearing more than once in the INDEX


   SELECT i.DocId
   FROM INDEX i, Query q
   WHERE i.Term = q.term
   GROUP BY i.DocId
   HAVING COUNT(DISTINCT(i.Term)) = (SELECT COUNT(*) FROM QUERY)

   This is slower since DISTINCT typically requires a sort for duplicate elimination.
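
The difference between COUNT(i.Term) and COUNT(DISTINCT i.Term) shows up as soon as a term has more than one row per document. The sketch below assumes sqlite3, a positional table index_prox with a Pos column (for OffSet), a query_terms table standing in for QUERY, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE index_prox (DocId INTEGER, Term TEXT, Pos INTEGER);
CREATE TABLE query_terms (Term TEXT);
INSERT INTO index_prox VALUES
  (1, 'vice', 0), (1, 'president', 1),
  (2, 'vice', 0), (2, 'vice', 5);
INSERT INTO query_terms VALUES ('vice'), ('president');
""")

template = """
    SELECT i.DocId
    FROM index_prox i, query_terms q
    WHERE i.Term = q.Term
    GROUP BY i.DocId
    HAVING COUNT({counted}) = (SELECT COUNT(*) FROM query_terms)
    ORDER BY i.DocId
"""

# Counting rows: document 2 matches 'vice' twice and is wrongly accepted.
plain = [r[0] for r in conn.execute(template.format(counted="i.Term"))]

# Counting distinct terms: only document 1 contains both query terms.
distinct = [r[0] for r in conn.execute(
    template.format(counted="DISTINCT i.Term"))]

print(plain, distinct)
```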
Using RDBMS for IR

  Boolean Queries that contain AND
  Commercial DBMSs are not able to process more than a fixed number
  of joins.
  Implementation of TAND (Threshold AND), which matches documents that
  contain more than k of the query terms, is also simple


   SELECT i.DocId
   FROM INDEX i, Query q
   WHERE i.Term = q.term
   GROUP BY i.DocId
   HAVING COUNT(DISTINCT(i.Term)) > k
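
A sketch of the threshold query with k passed as a parameter. It assumes sqlite3, the table names inv_index and query_terms (standing in for INDEX and QUERY), a hypothetical helper tand, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inv_index (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
CREATE TABLE query_terms (Term TEXT);
INSERT INTO inv_index VALUES
  (1, 'crawler', 1), (1, 'index', 2), (1, 'query', 1),
  (2, 'crawler', 3), (2, 'index', 1),
  (3, 'crawler', 1);
INSERT INTO query_terms VALUES ('crawler'), ('index'), ('query');
""")

def tand(k):
    """Ids of documents that contain more than k of the query terms."""
    rows = conn.execute("""
        SELECT i.DocId
        FROM inv_index i, query_terms q
        WHERE i.Term = q.Term
        GROUP BY i.DocId
        HAVING COUNT(DISTINCT i.Term) > ?
    """, (k,))
    return sorted(row[0] for row in rows)

print(tand(0), tand(1), tand(2))
```

Raising k moves the query from OR-like behaviour (k = 0) towards a strict AND (k one less than the number of query terms).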
Using RDBMS for IR

  Proximity Queries for terms within a specific window width


 SELECT a.DocId
 FROM INDEX_PROX a, INDEX_PROX b
 WHERE a.Term IN (SELECT q.Term FROM QUERY q) AND
        b.Term IN (SELECT q.Term FROM QUERY q) AND
        a.DocId = b.DocId AND
        (a.offset - b.offset) BETWEEN 0 AND (width - 1)
 GROUP BY a.DocId, b.DocId, a.Term, a.offset
 HAVING COUNT(DISTINCT(b.Term)) = (SELECT COUNT(*) FROM QUERY)
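
The proximity query can be checked on a tiny positional index. In this sketch each occurrence a acts as the right edge of a window, and the group counts how many distinct query terms fall inside it. sqlite3, the column name Pos (for OffSet), the table query_terms (for QUERY), the helper within, and the rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE index_prox (DocId INTEGER, Term TEXT, Pos INTEGER);
CREATE TABLE query_terms (Term TEXT);
INSERT INTO index_prox VALUES
  (1, 'the', 2), (1, 'vice', 3), (1, 'president', 4),
  (2, 'vice', 0), (2, 'president', 9);
INSERT INTO query_terms VALUES ('vice'), ('president');
""")

def within(width):
    """Ids of documents where all query terms fit in a window of `width`."""
    rows = conn.execute("""
        SELECT DISTINCT a.DocId
        FROM index_prox a, index_prox b
        WHERE a.Term IN (SELECT Term FROM query_terms)
          AND b.Term IN (SELECT Term FROM query_terms)
          AND a.DocId = b.DocId
          AND (a.Pos - b.Pos) BETWEEN 0 AND (? - 1)
        GROUP BY a.DocId, a.Term, a.Pos
        HAVING COUNT(DISTINCT b.Term) = (SELECT COUNT(*) FROM query_terms)
    """, (width,))
    return sorted(row[0] for row in rows)

print(within(2), within(9), within(10))
```

In document 2 the two terms are nine positions apart, so it only qualifies once the window reaches ten terms.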
Using RDBMS for IR

  Calculating Relevance

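   SELECT i.DocId, SUM(q.tf * t.idf * i.tf * t.idf)
   FROM QUERY q, INDEX i, TERM t
   WHERE q.Term = t.Term AND i.Term = t.Term
   GROUP BY i.DocId
   ORDER BY 2 DESC

   Here q.tf and i.tf are the term frequencies in the query and in the
   document, and t.idf is the term weight stored in TERM.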
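
A worked sketch of the relevance ranking as a tf-idf dot product between query and document. It assumes sqlite3, the table names inv_index and query_terms (standing in for INDEX and QUERY), short column names tf and idf, and hand-picked weights so the arithmetic is easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inv_index (DocId INTEGER, Term TEXT, tf INTEGER);
CREATE TABLE term (Term TEXT PRIMARY KEY, idf REAL);
CREATE TABLE query_terms (Term TEXT, tf INTEGER);
INSERT INTO term VALUES ('vice', 1.0), ('president', 0.5);
INSERT INTO inv_index VALUES
  (1, 'vice', 2), (1, 'president', 1),
  (2, 'president', 4);
INSERT INTO query_terms VALUES ('vice', 1), ('president', 1);
""")

# Score = sum over shared terms of (query tf * idf) * (document tf * idf).
ranking = conn.execute("""
    SELECT i.DocId, SUM(q.tf * t.idf * i.tf * t.idf) AS score
    FROM query_terms q, inv_index i, term t
    WHERE q.Term = t.Term AND i.Term = t.Term
    GROUP BY i.DocId
    ORDER BY score DESC
""").fetchall()
print(ranking)
```

Document 1 scores 1*1.0*2*1.0 + 1*0.5*1*0.5 = 2.25, and document 2 scores 1*0.5*4*0.5 = 1.0, so document 1 ranks first.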

2005 fall cs523_lecture_4

  • 2. Relational Database: Definitions Relational database: a set of relations Relation: made up of 2 parts:  Instance : a table, with rows and columns. #Rows = cardinality, #fields = degree / arity.  Schema : specifies name of relation, plus name and type of each column.  E.G. Students(sid: string, name: string, login: string, age: integer, gpa: real). Can think of a relation as a set of rows or tuples (i.e., all rows are distinct).
  • 3. Example Instance of Students Relation
    sid    name   login       age  gpa
    53666  Jones  jones@cs    18   3.4
    53688  Smith  smith@eecs  18   3.2
    53650  Smith  smith@math  19   3.8
    Cardinality = 3, degree = 5, all rows distinct.
  • 4. Relational Query Languages
    A major strength of the relational model: it supports simple, powerful querying of data.
    Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.
  • 5. The SQL Query Language
    Developed by IBM (System R) in the 1970s.
    A standard is needed, since the language is used by many vendors.
    Standards:
      SQL-86
      SQL-89 (minor revision)
      SQL-92 (major revision, current standard)
      SQL-99 (major extensions)
  • 6. The SQL Query Language
    To find all 18-year-old students, we can write:
      SELECT *
      FROM Students S
      WHERE S.age = 18
    Result:
      sid    name   login       age  gpa
      53666  Jones  jones@cs    18   3.4
      53688  Smith  smith@eecs  18   3.2
    To find just names and logins, replace the first line:
      SELECT S.name, S.login
  • 7. Querying Multiple Relations
    Enrolled instance:
      sid    cid          grade
      53831  Carnatic101  C
      53831  Reggae203    B
      53650  Topology112  A
      53666  History105   B
    Query:
      SELECT S.name, E.cid
      FROM Students S, Enrolled E
      WHERE S.sid = E.sid AND E.grade = 'A'
    Result:
      S.name  E.cid
      Smith   Topology112
  • 8. Creating Relations in SQL
    The first statement creates the Students relation. Observe that the type (domain) of each field is specified, and is enforced by the DBMS whenever tuples are added or modified.
      CREATE TABLE Students
        (sid CHAR(20),
         name CHAR(20),
         login CHAR(10),
         age INTEGER,
         gpa REAL)
    As another example, the Enrolled table holds information about the courses that students take.
      CREATE TABLE Enrolled
        (sid CHAR(20),
         cid CHAR(20),
         grade CHAR(2))
  • 9. Combining Separate Systems
    Use an IR system and an RDBMS that are independent of each other.
    Divide the query into two parts:
      Structured part for the RDBMS
      Unstructured (text) part for the IR system
    Combine the results from the IR system and the RDBMS.
    Good for letting each vendor develop its own system.
    Bad for data integrity, recovery, portability, and performance.
  • 10. User Defined Operators
    Allow users to extend SQL by adding their own functions.
    Some vendors used this approach (such as the IBM DB2 Text Extender).
    Lynch and Stonebraker defined "user defined operators" to implement information retrieval in 1988.
      -- Retrieves documents that contain Term1, Term2, Term3
      SELECT Doc_Id
      FROM Doc
      WHERE SEARCH-TERM(Text, Term1, Term2, Term3)
      -- Retrieves documents that contain Term1, Term2, Term3
      -- within a window of 5 terms
      SELECT Doc_Id
      FROM Doc
      WHERE PROXIMITY(Text, 5, Term1, Term2, Term3)
  • 11. Non-First Normal Form Approaches
    Capture the many-to-many relationships as sets via nested relations.
    Hard to implement ad-hoc queries.
    No standard yet.
  • 12. Using RDBMS for IR
    Benefits:
      Recovery
      Performance
      Data migration
      Concurrency control
      Access control mechanism
      Logical and physical data independence
  • 13. Using RDBMS for IR
    Example: a bibliography that includes both structured and unstructured information.
      DIRECTORY (name, institution): affiliation of the author
      AUTHOR (name, DocId): authorship information
      INDEX (name, DocId): terms that are used to index a document
  • 14. Using RDBMS for IR
    Preprocessing
    SGML, a standard for defining the parts of documents, can be used as a starting point.
      <DOC>
      <DOCNO> WSJ834234234 </DOCNO>
      <HL> How to make students suffer in IR Course </HL>
      <DD> 03/23/87 </DD>
      <DATELINE> Sabanci, Turkey </DATELINE>
      <TEXT> Crawler HW, Inverted Index, Querying </TEXT>
      </DOC>
  • 15. Using RDBMS for IR
    Preprocessing
    SGML, a standard for defining the parts of documents, can be used as a starting point.
    Use a parser together with a hash function to identify terms.
    Use a STOP_TERM table to look up stop words.
    Produce three output tables:
      INDEX (DocId, Term, TermFrequency): models the inverted index
      DOC (DocId, DocName, PubDate, DateLine): document metadata
      TERM (Term, Idf): stores the weight of each term
      -- Construct the TERM table; N is the total number of documents
      INSERT INTO TERM
      SELECT Term, log(N / COUNT(*))
      FROM INDEX
      GROUP BY Term
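The TERM construction on this slide can be tried out with a small script. The following is only a sketch under several assumptions that are not in the slides: it uses Python's sqlite3 module, quotes "INDEX" because that name is a reserved word in SQLite, registers a hypothetical helper function nat_log for the natural logarithm, multiplies by 1.0 to avoid integer division, and uses made-up sample rows.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical helper: a natural-log SQL function, so the sketch does not
# depend on SQLite's optional built-in math functions.
conn.create_function("nat_log", 1, math.log)

conn.executescript("""
    CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    CREATE TABLE TERM (Term TEXT, Idf REAL);
    -- Made-up sample data: three documents, three distinct terms.
    INSERT INTO "INDEX" VALUES
        (1, 'index', 2), (1, 'query', 1),
        (2, 'index', 1),
        (3, 'crawler', 4), (3, 'index', 1), (3, 'query', 2);
""")

# N, the total number of documents, is taken from the index itself here.
n_docs = conn.execute('SELECT COUNT(DISTINCT DocId) FROM "INDEX"').fetchone()[0]

# Same shape as the slide's statement; COUNT(*) per term is the number of
# documents containing that term, because INDEX has one row per (doc, term).
conn.execute(
    """
    INSERT INTO TERM
    SELECT Term, nat_log(? * 1.0 / COUNT(*))
    FROM "INDEX"
    GROUP BY Term
    """,
    (n_docs,),
)

idf = dict(conn.execute("SELECT Term, Idf FROM TERM"))
```

A term that occurs in every document ('index' in this sample) gets an idf of 0, while the rarest term ('crawler') gets the largest weight.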
  • 16. Using RDBMS for IR
    An offset can be stored together with each term so that proximity queries can be answered. For example, in a relevant document the terms of "Vice President" should occur next to each other, not just somewhere in the same document.
      INDEX_PROX (DocId, Term, OffSet)
      -- Construct the INDEX table from INDEX_PROX
      INSERT INTO INDEX
      SELECT DocId, Term, COUNT(*)
      FROM INDEX_PROX
      GROUP BY DocId, Term
  • 17. Using RDBMS for IR
    A query can also be modeled as a relation, for example when the query is itself a long document:
      QUERY (Term, TermFreq)
    Ex: "Find all news documents written on 03/03/2005 about Sabanci University"
      Data will be extracted from the structured fields.
      Terms will be extracted using the inverted index.
      SELECT d.DocId
      FROM DOC d, INDEX i
      WHERE i.Term IN ('Sabanci', 'University')
        AND d.PubDate = '03/03/2005'
        AND d.DocId = i.DocId
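As a rough illustration of this mixed structured/text query, the sketch below runs it in SQLite through Python's sqlite3 module. The sample rows, the lowercase terms, and the added DISTINCT are assumptions for the example: without DISTINCT, a document matching both terms would be returned twice. "INDEX" is quoted because it is a reserved word in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC (DocId INTEGER, DocName TEXT, PubDate TEXT, DateLine TEXT);
    CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    -- Made-up sample data.
    INSERT INTO DOC VALUES
        (1, 'news-a', '03/03/2005', 'Istanbul'),
        (2, 'news-b', '03/04/2005', 'Istanbul'),
        (3, 'news-c', '03/03/2005', 'Ankara'),
        (4, 'news-d', '03/03/2005', 'Izmir');
    INSERT INTO "INDEX" VALUES
        (1, 'sabanci', 2), (1, 'university', 1),
        (2, 'sabanci', 1),
        (3, 'crawler', 3),
        (4, 'university', 1);
""")

# Structured condition (PubDate) and text condition (Term IN ...) in one query.
# IN gives OR semantics: a document qualifies if it contains either term.
matching_docs = [row[0] for row in conn.execute("""
    SELECT DISTINCT d.DocId
    FROM DOC d, "INDEX" i
    WHERE i.Term IN ('sabanci', 'university')
      AND d.PubDate = '03/03/2005'
      AND d.DocId = i.DocId
    ORDER BY d.DocId
""")]
```

Document 2 contains a query term but has the wrong date, and document 3 has the right date but no query term, so only documents 1 and 4 qualify.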
  • 18. Using RDBMS for IR
    Boolean queries consist of terms combined with Boolean operators (AND, OR, and NOT).
    For a single inputTerm, retrieve the texts of the documents that contain that term:
      SELECT d.Text
      FROM DOC d
      WHERE d.DocId IN (SELECT DISTINCT i.DocId
                        FROM INDEX i
                        WHERE i.Term = inputTerm)
    Note that the text of a document can be stored as a BLOB or CLOB (Binary or Character Large Object).
  • 19. Using RDBMS for IR
    Boolean queries that contain OR
      SELECT DISTINCT i.DocId
      FROM INDEX i
      WHERE i.Term = inputTerm1
         OR i.Term = inputTerm2
         OR ...
         OR i.Term = inputTermn
  • 20. Using RDBMS for IR
    Boolean queries that contain AND
      SELECT DISTINCT i.DocId
      FROM INDEX i
      WHERE i.Term = inputTerm1
        AND i.Term = inputTerm2
        AND ...
        AND i.Term = inputTermn
    ??
  • 21. Using RDBMS for IR
    Boolean queries that contain AND (the previous answer was wrong)
      SELECT DISTINCT i1.DocId
      FROM INDEX i1, INDEX i2, INDEX i3, ..., INDEX in
      WHERE i1.Term = inputTerm1
        AND i2.Term = inputTerm2
        AND ...
        AND in.Term = inputTermn
        AND i1.DocId = i2.DocId
        AND i2.DocId = i3.DocId
        AND ...
        AND in-1.DocId = in.DocId
    Alternatively, intersect the per-term result sets (INTERSECT).
  • 22. Using RDBMS for IR
    Boolean queries that contain AND
    Commercial DBMSs are not able to process more than a fixed number of joins.
    Solution:
      SELECT i.DocId
      FROM INDEX i, QUERY q
      WHERE i.Term = q.Term
      GROUP BY i.DocId
      HAVING COUNT(i.Term) = (SELECT COUNT(*) FROM QUERY)
    This works only when INDEX holds a single row per term and document, together with the term's frequency; no proximity information is recorded.
  • 23. Using RDBMS for IR
    Boolean queries that contain AND
    Commercial DBMSs are not able to process more than a fixed number of joins.
    Solution for terms that appear more than once per document in the INDEX:
      SELECT i.DocId
      FROM INDEX i, QUERY q
      WHERE i.Term = q.Term
      GROUP BY i.DocId
      HAVING COUNT(DISTINCT(i.Term)) = (SELECT COUNT(*) FROM QUERY)
    This is slower, since DISTINCT requires a sort for duplicate elimination.
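The GROUP BY / HAVING formulation of AND can be checked on a toy index. This is a minimal sketch, assuming SQLite via Python's sqlite3 module and made-up rows; "INDEX" and "QUERY" are quoted because INDEX is reserved in SQLite and quoting QUERY as well is harmless. An OR query over the same tables is included for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    CREATE TABLE "QUERY" (Term TEXT, TermFreq INTEGER);
    -- Made-up sample data.
    INSERT INTO "INDEX" VALUES
        (1, 'sabanci', 2), (1, 'university', 1),
        (2, 'sabanci', 1), (2, 'crawler', 3),
        (3, 'university', 1), (3, 'sabanci', 1), (3, 'index', 5);
    INSERT INTO "QUERY" VALUES ('sabanci', 1), ('university', 1);
""")

# AND: keep documents whose number of distinct matched terms equals the
# number of query terms. No self-joins are needed, whatever the query length.
and_docs = [row[0] for row in conn.execute("""
    SELECT i.DocId
    FROM "INDEX" i, "QUERY" q
    WHERE i.Term = q.Term
    GROUP BY i.DocId
    HAVING COUNT(DISTINCT i.Term) = (SELECT COUNT(*) FROM "QUERY")
    ORDER BY i.DocId
""")]

# OR: any document that matches at least one query term.
or_docs = [row[0] for row in conn.execute("""
    SELECT DISTINCT i.DocId
    FROM "INDEX" i, "QUERY" q
    WHERE i.Term = q.Term
    ORDER BY i.DocId
""")]
```

Document 2 contains only one of the two query terms, so it appears in the OR result but not in the AND result.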
  • 24. Using RDBMS for IR
    Boolean queries that contain AND
    Commercial DBMSs are not able to process more than a fixed number of joins.
    Implementing TAND (Threshold AND) is also simple:
      SELECT i.DocId
      FROM INDEX i, QUERY q
      WHERE i.Term = q.Term
      GROUP BY i.DocId
      HAVING COUNT(DISTINCT(i.Term)) > k
  • 25. Using RDBMS for IR
    Proximity queries for terms within a specific window width
      SELECT a.DocId
      FROM INDEX_PROX a, INDEX_PROX b
      WHERE a.Term IN (SELECT q.Term FROM QUERY q)
        AND b.Term IN (SELECT q.Term FROM QUERY q)
        AND a.DocId = b.DocId
        AND (a.offset - b.offset) BETWEEN 0 AND (width - 1)
      GROUP BY a.DocId, b.DocId, a.Term, a.offset
      HAVING COUNT(DISTINCT(b.Term)) = (SELECT COUNT(*) FROM QUERY)
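The window logic of the proximity query is easier to follow on concrete offsets. The sketch below is an illustration under assumptions, not the lecture's implementation: it uses SQLite via Python's sqlite3 module, quotes "QUERY" and "Offset" to stay clear of SQLite keywords, adds DISTINCT so each document is reported once, wraps the query in a hypothetical helper docs_within, and uses made-up documents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INDEX_PROX (DocId INTEGER, Term TEXT, "Offset" INTEGER);
    CREATE TABLE "QUERY" (Term TEXT, TermFreq INTEGER);
    -- Doc 1: "the vice president spoke"; Doc 2: "vice chair met the president"
    INSERT INTO INDEX_PROX VALUES
        (1, 'the', 0), (1, 'vice', 1), (1, 'president', 2), (1, 'spoke', 3),
        (2, 'vice', 0), (2, 'chair', 1), (2, 'met', 2), (2, 'the', 3),
        (2, 'president', 4);
    INSERT INTO "QUERY" VALUES ('vice', 1), ('president', 1);
""")


def docs_within(width):
    """Return DocIds in which all query terms occur inside a window of `width` terms."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.DocId
        FROM INDEX_PROX a, INDEX_PROX b
        WHERE a.Term IN (SELECT q.Term FROM "QUERY" q)
          AND b.Term IN (SELECT q.Term FROM "QUERY" q)
          AND a.DocId = b.DocId
          AND (a."Offset" - b."Offset") BETWEEN 0 AND (? - 1)
        GROUP BY a.DocId, a.Term, a."Offset"
        HAVING COUNT(DISTINCT b.Term) = (SELECT COUNT(*) FROM "QUERY")
        ORDER BY a.DocId
        """,
        (width,),
    )
    return [row[0] for row in rows]


adjacent = docs_within(2)   # terms must be next to each other
loose = docs_within(5)      # terms may be up to four positions apart
```

Each occurrence a acts as the right edge of a window, and b ranges over the query-term occurrences at most width - 1 positions before it; a document qualifies when some window contains every query term.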
  • 26. Using RDBMS for IR
    Calculating relevance
      SELECT i.DocId, SUM(q.tf * t.idf * i.tf * t.idf)
      FROM QUERY q, INDEX i, TERM t
      WHERE q.Term = t.Term
        AND i.Term = t.Term
      GROUP BY i.DocId
      ORDER BY 2 DESC
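The relevance query can be exercised end to end on a toy collection. This is a sketch under stated assumptions: SQLite via Python's sqlite3 module, the full column names from the earlier table definitions (TermFreq, TermFrequency, Idf) in place of the tf/idf shorthand, the document-side tf taken from the INDEX table, and hand-picked idf values chosen only to keep the arithmetic easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    CREATE TABLE TERM (Term TEXT, Idf REAL);
    CREATE TABLE "QUERY" (Term TEXT, TermFreq INTEGER);
    -- Made-up sample data; the Idf values are illustrative, not computed.
    INSERT INTO "INDEX" VALUES
        (1, 'sabanci', 2), (1, 'university', 1),
        (2, 'sabanci', 1),
        (3, 'university', 3), (3, 'index', 5);
    INSERT INTO TERM VALUES ('sabanci', 2.0), ('university', 1.0), ('index', 0.5);
    INSERT INTO "QUERY" VALUES ('sabanci', 1), ('university', 1);
""")

# Inner product of query weights (tf * idf) and document weights (tf * idf),
# summed over the terms that the query and the document share.
ranking = conn.execute("""
    SELECT i.DocId,
           SUM(q.TermFreq * t.Idf * i.TermFrequency * t.Idf) AS score
    FROM "QUERY" q, "INDEX" i, TERM t
    WHERE q.Term = t.Term
      AND i.Term = t.Term
    GROUP BY i.DocId
    ORDER BY score DESC
""").fetchall()
```

For document 1 the score is 1*2.0*2*2.0 + 1*1.0*1*1.0 = 9.0; the term 'index' contributes nothing to document 3 because it is not in the query.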