A machine learning approach to building domain specific search
An agent that crawls and indexes the web intelligently, helping to build a search engine that keeps learning contextually and semantically.

Published in: Technology, Education
Transcript

  • 1. A Machine Learning Approach to Building Domain-Specific Search Engines
    Presented By:
    Niharjyoti Sarangi
    Roll:06/232
    8th Semester, B.Tech, IT
    VSSUT, Burla
  • 2. Machine Learning
    • Machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases.
    • A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.
    • A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
  • Vertical Search
    • A vertical search engine, as distinct from a general Web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content.
    • General Web search engines: attempt to index large portions of the World Wide Web using a Web crawler.
    • Vertical search engines: typically use a focused crawler that attempts to index only Web pages that are relevant to a pre-defined topic or set of topics.
  • Domain-Specific Search
    • Domain-specific search solutions focus on one area of knowledge and create customized search experiences. Because the domain has a limited corpus and clear relationships between concepts, they provide extremely relevant results for searchers.
    • Potential benefits over general search engines:
      • Greater precision due to limited scope
      • Leverage of domain knowledge, including taxonomies and ontologies
      • Support for specific, unique user tasks
  • 8. Anatomy of a Search Engine
    • Crawling the web
    • Indexing the web
    • Searching the indices
    • Major data structures:
      • Big Files
      • Repositories
      • Document Index
      • Lexicon
      • Hit Lists
      • Forward Index
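As a rough illustration of how some of the data structures listed above relate to each other, here is a minimal Python sketch. It is a toy under stated assumptions, not the design of any particular engine: the function names (`tokenize`, `build_index`, `search`), the in-memory dictionaries, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs maps doc_id -> text. Returns (lexicon, forward_index, inverted_index)."""
    lexicon = {}                        # word -> word_id
    forward_index = {}                  # doc_id -> [(word_id, position), ...]
    inverted_index = defaultdict(dict)  # word_id -> {doc_id: [positions]} (hit lists)
    for doc_id, text in docs.items():
        hits = []
        for pos, word in enumerate(tokenize(text)):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, pos))
            inverted_index[word_id].setdefault(doc_id, []).append(pos)
        forward_index[doc_id] = hits
    return lexicon, forward_index, inverted_index

def search(query, lexicon, inverted_index):
    """Return the sorted ids of documents that contain every query word."""
    result = None
    for word in tokenize(query):
        word_id = lexicon.get(word)
        matching = set(inverted_index[word_id]) if word_id is not None else set()
        result = matching if result is None else result & matching
    return sorted(result or [])

# Hypothetical sample documents, for illustration only.
DOCS = {
    1: "Ice cream guru wanted",
    2: "Machine learning guru",
    3: "Ice hockey",
}
LEXICON, FORWARD, INVERTED = build_index(DOCS)
print(search("ice guru", LEXICON, INVERTED))
```

The forward index is keyed by document and the inverted index by word, which is why searching only needs the lexicon and the inverted index.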
  • 9. Web Crawling
    • A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.
    • Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, Web robots, or (especially in the FOAF community) Web scutters.
    • A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
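The seed and frontier loop described above can be sketched in a few lines of Python. This is only an illustration: the breadth-first policy, the `crawl` and `LinkParser` names, and the in-memory `TOY_WEB` dictionary (used instead of real network requests) are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seeds.

    fetch(url) returns the page HTML, or None if the page is unavailable.
    """
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical four-page site held in memory, so no network access is needed.
TOY_WEB = {
    "http://example.org/": '<a href="/jobs.html">Jobs</a> <a href="/about.html">About</a>',
    "http://example.org/jobs.html": '<a href="/">Home</a> <a href="/jobs/1.html">Job 1</a>',
    "http://example.org/about.html": "<p>No links here.</p>",
    "http://example.org/jobs/1.html": '<a href="../jobs.html">Back</a>',
}
print(crawl(["http://example.org/"], TOY_WEB.get))
```

A focused crawler of the kind used by vertical search engines would keep the same loop but change the policy, for example by scoring each page for topical relevance before adding its links to the frontier.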
  • Web Crawling (contd.)
  • 12. Information Extraction
    foodscience.com-Job2
    JobTitle: Ice Cream Guru
    Employer: foodscience.com
    JobCategory: Travel/Hospitality
    JobFunction: Food Services
    JobLocation: Upper Midwest
    Contact Phone: 800-488-2611
    DateExtracted: January 8, 2001
    Source: www.foodscience.com/jobs_midwest.html
    OtherCompanyJobs: foodscience.com-Job1
  • 13. Information Extraction (contd.)
    As a task:
    Filling slots in a database from sub-segments of text.
    October 14, 2002, 4:00 a.m. PT
    For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
    Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
    "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
    Richard Stallman, founder of the Free Software Foundation, countered saying…
    NAME TITLE ORGANIZATION
  • 14. Information Extraction (contd.)
    As a task:
    Filling slots in a database from sub-segments of text.
    October 14, 2002, 4:00 a.m. PT
    For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
    Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
    "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
    Richard Stallman, founder of the Free Software Foundation, countered saying…
    IE
    NAME              TITLE    ORGANIZATION
    Bill Gates        CEO      Microsoft
    Bill Veghte       VP       Microsoft
    Richard Stallman  founder  Free Soft..
  • 15. Information Extraction (contd.)
    As a family of techniques:
    Information Extraction =
    segmentation + classification + association + clustering
    October 14, 2002, 4:00 a.m. PT
    For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
    Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
    "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
    Richard Stallman, founder of the Free Software Foundation, countered saying…
    Microsoft Corporation
    CEO
    Bill Gates
    Microsoft
    Gates
    Microsoft
    Bill Veghte
    Microsoft
    VP
    Richard Stallman
    founder
    Free Software Foundation
  • 18. Information Extraction (contd.)
    As a family of techniques:
    Information Extraction =
    segmentation + classification + association + clustering
    NAME              TITLE    ORGANIZATION
    Bill Gates        CEO      Microsoft
    Bill Veghte       VP       Microsoft
    Richard Stallman  founder  Free Soft..
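The four stages named on these slides (segmentation, classification, association, clustering) can be walked through on the sample article with a small Python sketch. Everything here is a hand-written stand-in chosen for illustration: the `LEXICON` and `ALIASES` tables, the nearest-mention rule with its `max_gap` parameter, and the function names are assumptions, where a real system would use trained models. The article text is shortened.

```python
import re

TEXT = (
    'For years, Microsoft Corporation CEO Bill Gates railed against the economic '
    'philosophy of open-source software. Today, Microsoft claims to "love" the '
    'open-source concept. Gates himself says Microsoft will gladly disclose its '
    'crown jewels to select customers. "We love the concept of shared source," '
    'said Bill Veghte, a Microsoft VP. "That\'s a super-important shift for us '
    'in terms of code access." Richard Stallman, founder of the Free Software '
    'Foundation, countered saying...'
)

# Hypothetical hand-built lexicon standing in for learned segmenters/classifiers.
LEXICON = {
    "Microsoft Corporation": "ORGANIZATION", "Microsoft": "ORGANIZATION",
    "Free Software Foundation": "ORGANIZATION",
    "Bill Gates": "NAME", "Gates": "NAME",
    "Bill Veghte": "NAME", "Richard Stallman": "NAME",
    "CEO": "TITLE", "VP": "TITLE", "founder": "TITLE",
}
# Hypothetical alias table standing in for learned coreference clustering.
ALIASES = {"Gates": "Bill Gates", "Microsoft Corporation": "Microsoft"}

def segment(text):
    """Segmentation: find candidate spans, preferring the longest phrase."""
    phrases = sorted(LEXICON, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(map(re.escape, phrases)) + r")\b"
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

def classify(segments):
    """Classification: label each span NAME, TITLE or ORGANIZATION."""
    return [(s, e, t, LEXICON[t]) for s, e, t in segments]

def gap(a, b):
    """Number of characters separating two spans (0 if they touch)."""
    return max(a[0] - b[1], b[0] - a[1], 0)

def associate(mentions, max_gap=40):
    """Association: attach the nearest TITLE and ORGANIZATION to each NAME."""
    records = []
    for m in mentions:
        if m[3] != "NAME":
            continue
        record = {"NAME": m[2], "TITLE": None, "ORGANIZATION": None}
        for label in ("TITLE", "ORGANIZATION"):
            near = [o for o in mentions if o[3] == label and gap(m, o) <= max_gap]
            if near:
                record[label] = min(near, key=lambda o: gap(m, o))[2]
        records.append(record)
    return records

def cluster(records):
    """Clustering: merge records whose mentions refer to the same entity."""
    merged = {}
    for r in records:
        r = {k: ALIASES.get(v, v) for k, v in r.items()}
        entry = merged.setdefault(r["NAME"], r)
        for k, v in r.items():
            if entry[k] is None:
                entry[k] = v
    return list(merged.values())

def extract(text):
    return cluster(associate(classify(segment(text))))

for row in extract(TEXT):
    print(row["NAME"], "|", row["TITLE"], "|", row["ORGANIZATION"])
```

The mention "Gates" in the third sentence has no title nearby, so it only becomes a complete record once the clustering step merges it with "Bill Gates".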
  • 19. Context of Extraction
    Create ontology
    Spider
    Filter by relevance
    IE: Segment, Classify, Associate, Cluster
    Load DB
    Database
    Query, Search
    Data mine
    Document collection
    Label training data
    Train extraction models
  • 20. IE Techniques
    Running example: “Abraham Lincoln was born in Kentucky.”
    • Lexicons: is the candidate a member of a list (Alabama, Alaska, … Wisconsin, Wyoming)?
    • Classify pre-segmented candidates: a classifier decides which class each candidate belongs to.
    • Sliding window: a classifier decides which class each window belongs to; try alternate window sizes.
    • Boundary models: a classifier marks BEGIN and END boundaries.
    • Finite state machines: find the most likely state sequence.
    • Context free grammars: find the most likely parse (NNP, V, P, NP, PP, VP, S).
    …and beyond
    Any of these models can be used to capture words, formatting or both.
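To make the lexicon and sliding-window ideas concrete, the following Python sketch enumerates every window of up to a few tokens in the running example and checks each one for lexicon membership. The abbreviated `US_STATES` set (with Kentucky added so that the example matches), the `max_len` parameter, and the function names are assumptions made for illustration.

```python
# Abbreviated, hypothetical lexicon; a real one would list every state.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

def windows(tokens, max_len=3):
    """Yield (start, end, phrase) for every window of 1..max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])

def lexicon_matches(sentence, lexicon, max_len=3):
    """Return the windows whose text is a member of the lexicon."""
    tokens = sentence.rstrip(".").split()
    return [(s, e, p) for s, e, p in windows(tokens, max_len) if p in lexicon]

print(lexicon_matches("Abraham Lincoln was born in Kentucky.", US_STATES))
```

Replacing the membership test with a trained classifier over each window gives the sliding-window approach; the enumeration of candidate windows stays the same.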
  • 21. Sliding Window
    GRAND CHALLENGES FOR MACHINE LEARNING
    Jaime Carbonell
    School of Computer Science
    Carnegie Mellon University
    3:30 pm
    7500 Wean Hall
    Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
    CMU UseNet Seminar Announcement
  • 25. Naïve Bayes Model
    P(“Wean Hall Rm 5409” = LOCATION) =
      prior probability of start position
      × prior probability of length
      × probability of prefix words
      × probability of contents words
      × probability of suffix words
    Try all start positions and reasonable lengths.
    Estimate these probabilities by (smoothed) counts from labeled training data.
    If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
    Example text: 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
    prefix: w_{t-m} … w_{t-1}
    contents: w_t … w_{t+n}
    suffix: w_{t+n+1} … w_{t+n+m}
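The scoring rule on this slide can be sketched in Python as a sum of log probabilities, each estimated from add-one smoothed counts. This is a toy under stated assumptions: the `WindowNB` class, the three training announcements in `TRAIN`, the choice of add-one smoothing, and the prefix/suffix width `m` are all made up for the example.

```python
import math
from collections import Counter

class WindowNB:
    """Naive Bayes score for 'tokens[s:e] is the target field' (toy sketch)."""

    def __init__(self, m=2):
        self.m = m  # number of prefix and suffix words considered
        self.start, self.length = Counter(), Counter()
        self.prefix, self.contents, self.suffix = Counter(), Counter(), Counter()
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (tokens, s, e); the field is tokens[s:e]."""
        for tokens, s, e in examples:
            self.start[s] += 1
            self.length[e - s] += 1
            self.prefix.update(tokens[max(0, s - self.m):s])
            self.contents.update(tokens[s:e])
            self.suffix.update(tokens[e:e + self.m])
            self.vocab.update(tokens)

    @staticmethod
    def _logp(counter, key, n_outcomes):
        # Add-one (Laplace) smoothed estimate from counts.
        return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

    def log_score(self, tokens, s, e):
        v = len(self.vocab) + 1  # one extra slot for unseen words
        score = self._logp(self.start, s, len(tokens))        # prior of start position
        score += self._logp(self.length, e - s, len(tokens))  # prior of length
        for w in tokens[max(0, s - self.m):s]:                # prefix words
            score += self._logp(self.prefix, w, v)
        for w in tokens[s:e]:                                 # contents words
            score += self._logp(self.contents, w, v)
        for w in tokens[e:e + self.m]:                        # suffix words
            score += self._logp(self.suffix, w, v)
        return score

# Hypothetical labeled training data: (tokens, start, end) of the LOCATION field.
TRAIN = [
    ("Time : 3:30 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split(), 6, 10),
    ("Time : 4:00 pm Place : Wean Hall Rm 7500 Speaker : Jaime Carbonell".split(), 6, 10),
    ("Place : Doherty Hall Rm 1112 Speaker : Lawrence Saul".split(), 2, 6),
]
model = WindowNB(m=2)
model.train(TRAIN)

tokens = "Time : 2:00 pm Place : Wean Hall Rm 4623 Speaker : Tom Mitchell".split()
print(model.log_score(tokens, 6, 10))  # the true LOCATION window
print(model.log_score(tokens, 7, 11))  # the same window shifted by one token
```

Because every extra word multiplies in another probability below one, raw scores are only directly comparable between windows of the same length; in this sketch only the length prior pushes back against that, so any extraction threshold would need tuning with this in mind.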
  • 26. Hidden Markov Model
    HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
    Diagram panels: finite state model; graphical model (states S_{t-1}, S_t, S_{t+1}, … linked by transitions, with each state S_t emitting an observation O_t)
    Generates: a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8
    Parameters: for all states S={s1,s2,…}
    Start state probabilities: P(s_t)
    Transition probabilities: P(s_t | s_{t-1})
    Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
    Training:
    Maximize probability of training observations (w/ prior)
  • 27. IE with HMM
    Given a sequence of observations:
    Yesterday Lawrence Saul spoke this example sentence.
    and a trained HMM:
    Find the most likely state sequence (Viterbi):
    Yesterday Lawrence Saul spoke this example sentence.
    Any words generated by the designated “person name” state are extracted as a person name:
    Person name: Lawrence Saul
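The Viterbi step on this slide can be sketched in Python with a two-state HMM. The parameters below are set by hand purely for illustration (a trained model would estimate them from data), and `UNSEEN` is a crude floor probability for unlisted words, not a properly normalized emission distribution.

```python
import math

STATES = ("background", "person")
# Hypothetical hand-set parameters, standing in for a trained HMM.
START = {"background": 0.9, "person": 0.1}
TRANS = {
    "background": {"background": 0.8, "person": 0.2},
    "person": {"background": 0.4, "person": 0.6},
}
EMIT = {
    "background": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                   "example": 0.2, "sentence": 0.15},
    "person": {"lawrence": 0.45, "saul": 0.45},
}
UNSEEN = 0.01  # floor probability for words a state has no entry for

def viterbi(words):
    """Return the most likely state sequence for the given words."""
    obs = [w.lower().strip(".") for w in words]

    def emit(state, word):
        return math.log(EMIT[state].get(word, UNSEEN))

    best = {s: math.log(START[s]) + emit(s, obs[0]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for word in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            best[s] = prev[p] + math.log(TRANS[p][s]) + emit(s, word)
            pointers[s] = p
        back.append(pointers)
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):  # follow the back-pointers to the start
        state = pointers[state]
        path.append(state)
    return path[::-1]

def extract_person_names(sentence):
    """Join consecutive words labeled 'person' into extracted names."""
    words = sentence.split()
    names, current = [], []
    for word, state in zip(words, viterbi(words)):
        if state == "person":
            current.append(word.strip("."))
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

print(extract_person_names("Yesterday Lawrence Saul spoke this example sentence."))
```

Working in log space avoids numerical underflow when the probabilities of long sequences are multiplied together.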
  • 28. Limitations of HMM
    HMM/CRF models have a linear structure.
    Web documents have a hierarchical structure.
  • 29. Tree Based Models
    Extracting from one web site:
    • Use site-specific formatting information, e.g., “the JobTitle is a bold-faced paragraph in column 2”
    • For large well-structured sites, like parsing a formal language
    Extracting from many web sites:
    • Need general solutions to entity extraction, grouping into records, etc.
    • Primarily use content information
    • Must deal with a wide range of ways that users present data
    • Analogous to parsing natural language
    Problems are complementary:
    • Site-dependent learning can collect training data for a site-independent learner
  • 30. Stalker: Hierarchical decomposition of two web sites
  • 31. Wrapster
    Common representations for web pages include:
    • a rendered image
    • a DOM tree (tree of HTML markup & text), which gives some of the power of hierarchical decomposition
    • a sequence of tokens
    • a bag of words, a sequence of characters, a node in a directed graph, . . .
    Questions:
    • How can we engineer a system to generalize quickly?
    • How can we explore representational choices easily?
  • 32. Wrapster
    Example page: http://wasBang.org/aboutus.html
    WasBang.com contact info:
    Currently we have offices in two locations:
    • Pittsburgh, PA
    • Provo, UT
    DOM tree of the page:
    html
      head
      body
        p: “WasBang.com .. info:”
        p: “Currently..”
        ul
          li > a: “Pittsburgh, PA”
          li > a: “Provo, UT”
  • 34. Wrapster Builders
    • Compose ‘tagpaths’ and ‘brackets’
      E.g., “extract strings between ‘(’ and ‘)’ inside a list item inside an unordered list”
    • Compose ‘tagpaths’ and language-based extractors
      E.g., “extract city names inside the first paragraph”
    • Extract items based on position inside a rendered table, or properties of the rendered text
      E.g., “extract items inside any column headed by text containing the words ‘Job’ and ‘Title’”
      E.g., “extract items in boldfaced italics”
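The first kind of builder (a tag path composed with brackets) can be sketched in Python using the standard-library HTML parser. This is not Wrapster's actual implementation: the `TagPathExtractor` class, the helper names, and the sample `PAGE` (a hypothetical variant of the WasBang page with parenthesized state codes) are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

class TagPathExtractor(HTMLParser):
    """Collect text that appears under a given tag path, e.g. ("ul", "li")."""

    def __init__(self, tagpath):
        super().__init__()
        self.tagpath = list(tagpath)
        self.stack = []   # currently open tags, outermost first
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def _inside(self):
        # True if the tag path occurs as a contiguous run of open tags.
        n = len(self.tagpath)
        return any(self.stack[i:i + n] == self.tagpath
                   for i in range(len(self.stack) - n + 1))

    def handle_data(self, data):
        if data.strip() and self._inside():
            self.texts.append(data.strip())

def between_brackets(text, left="(", right=")"):
    """Return every substring found between the left and right brackets."""
    return re.findall(re.escape(left) + r"(.*?)" + re.escape(right), text)

def extract(html, tagpath=("ul", "li")):
    """Extract bracketed strings from text located under the tag path."""
    parser = TagPathExtractor(tagpath)
    parser.feed(html)
    parser.close()
    return [hit for text in parser.texts for hit in between_brackets(text)]

# Hypothetical page, for illustration only.
PAGE = """
<html><body>
<p>WasBang.com contact info (main site):</p>
<ul>
  <li><a href="pgh.html">Pittsburgh (PA)</a></li>
  <li><a href="provo.html">Provo (UT)</a></li>
</ul>
</body></html>
"""
print(extract(PAGE))
```

The parenthesized text in the paragraph is ignored because it is not under the `ul`/`li` tag path, which is the kind of structural constraint the slide's example describes.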
  • Table Based Builders
    How to represent “links to pages about singers”?
    Builders can be based on a geometric view of a page.
  • 41. Wrapster Results
    (Plot: F1 against #examples)
  • 43. Niharjyoti Sarangi
    VSSUT, Burla
    THANK YOU
