Evaluation and comparison of automatic methods to identify health queries


    Evaluation and comparison of automatic methods to identify health queries - Presentation Transcript

    1. Doctoral Symposium on Informatics Engineering (DSIE’08) Evaluation and comparison of automatic methods to identify health queries Carla Teixeira Lopes (carla.lopes@fe.up.pt) Faculdade de Engenharia da Universidade do Porto 7 February 2008
    2. Contents • Health Information Retrieval • Methods – CHV and co-occurrence methods – Implementation – Evaluation • Results and discussion • Conclusions • Future Work Evaluation and comparison of automatic methods to identify health queries 2 Carla Teixeira Lopes
    3. Health Information Retrieval (HIR) • Using the Web to find health information is a common practice – 8 in 10 American users go online for health information [Online Health Search 2006, Pew Internet & American Life Project] – 71% of online consumers use search engines to find health information [JupiterResearch Finds Strong Consumer Demand and Market Opportunity for Health Search Engines] • The information found in this type of search has a great impact on people’s lives – 74% of all health seekers said health search allowed them to make more appropriate health decisions [Online Health Search 2006, Pew Internet & American Life Project]
    4. Health Queries Identification • Health query – A query that intends to retrieve health-related information and to satisfy a health information need • Identifying health queries in a pool of queries is usually one of the first steps in HIR studies • The most frequent classification method involves human intervention – Slow process – Requires the availability of one or more human classifiers • Automatic identification methods would be useful
    5. Literature review • Eysenbach and Kohler proposed a method based on the ratio – # pages with search query and “health” / # pages with search query – Values near 1 indicate more health-related queries • No other study was found in this specific area • The nearest, but broader, topic is generic automatic query classification • Our restriction to the health domain makes us believe that simpler and more targeted strategies may be developed
    6. Proposed methods • Two categories – CHV methods (discrete) • 11 different methods that use a health vocabulary, the Consumer Health Vocabulary (CHV), which links everyday health terms to more technical terms • The methods differ in the subset of terms used to classify the queries • The presence of a term in a query is sufficient to classify it as health-related – Co-occurrence methods (continuous) • 3 different methods based on the previous work of Eysenbach and Kohler, that is, on the idea that health-related terms should occur together with the word “health” more often than non-health terms
    7. CHV methods • Variants were defined empirically, in an iterative process fed by the data analysis of the variants already defined • CHV1 (all CHV terms) • CHV2, CHV3, CHV4, CHV5, CHV6 (terms associated with the 200, 400, 600, 800, 1000 most frequent concepts) • CHV7 (UMLS preferred terms) • CHV8 (CHV preferred terms) • CHV9 (UMLS or CHV preferred terms) • CHV10, CHV11 (6000, 10000 most frequent terms)
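
The discrete rule behind the CHV methods (a query is health-related as soon as it contains one vocabulary term) can be sketched as follows. This is a minimal Python illustration, not the Perl scripts used in the study. The term set and stop-word set are tiny hypothetical stand-ins for a CHV subset and a stop-word file, and matching on single whitespace-separated words is an assumption (a real vocabulary also contains multi-word terms).

```python
# Hypothetical stand-ins: a real run would load a CHV subset (CHV1..CHV11)
# and a full stop-word list from files.
HEALTH_TERMS = {"diabetes", "symptoms", "asthma", "flu"}
STOP_WORDS = {"the", "of", "a", "in", "for"}


def tokenize(query):
    """Lowercase the query, split on whitespace, and drop stop-words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]


def is_health_query(query, health_terms=HEALTH_TERMS):
    """Discrete rule: one vocabulary term in the query is enough."""
    return any(word in health_terms for word in tokenize(query))


print(is_health_query("symptoms of the flu"))  # True
print(is_health_query("Pavarotti tickets"))    # False
```

Under this reading, switching between variants only changes which set is passed as `health_terms`.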
    8. Co-occurrence methods (1/2) • For each query, the co-occurrence rate (cooc) was calculated: $\mathrm{cooc}(Q) = \dfrac{\#\,\mathrm{results}(\mathrm{terms}_Q \cap \text{health})}{\#\,\mathrm{results}(\mathrm{terms}_Q)}$ – If $\#\,\mathrm{results}(\mathrm{terms}_Q) = 0$ then $\mathrm{cooc}(Q) = 0$ • Examples: – ‘diabetes symptoms’ has a cooc of $478000 / 929000 = 0.51$ – ‘Pavarotti’ has a cooc of $359000 / 6440000 = 0.06$
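
The co-occurrence rate and its zero-result rule can be checked with a few lines of Python. This is only a sketch: the function name and parameter names are illustrative, and the result counts are the ones quoted in the two examples on the slide rather than values fetched from a search engine.

```python
def cooc(hits_with_health, hits_total):
    """Co-occurrence rate; defined as 0 when the query alone has no results."""
    if hits_total == 0:
        return 0.0
    return hits_with_health / hits_total


# Result counts quoted on the slide.
print(round(cooc(478000, 929000), 2))   # 0.51 for 'diabetes symptoms'
print(round(cooc(359000, 6440000), 2))  # 0.06 for 'Pavarotti'
print(cooc(0, 0))                       # 0.0
```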
    9. Co-occurrence methods (2/2) • In the work of Eysenbach and Kohler, Google was used as the search engine • Here, we propose three variants: – One that uses Google results – One that uses Yahoo! results – One that combines Google and Yahoo! results • After the cooc calculation, the value was compared with several thresholds (0, 0.05, 0.1, 0.15, ..., 1) – If the cooc rate is greater than or equal to the threshold, the query is considered a health query at that threshold
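
The threshold comparison can be sketched in Python as below. The names `THRESHOLDS` and `classify_at_thresholds` are illustrative, and the sketch covers a single cooc value only; how the Google and Yahoo! values are combined in the third variant is not stated on the slide, so it is left out.

```python
# Thresholds 0, 0.05, 0.10, ..., 1, built from integers to avoid float drift.
THRESHOLDS = [i / 20 for i in range(21)]


def classify_at_thresholds(cooc_rate, thresholds=THRESHOLDS):
    """Map each threshold to True when cooc_rate >= threshold."""
    return {t: cooc_rate >= t for t in thresholds}


labels = classify_at_thresholds(0.51)
print(labels[0.5], labels[0.55])  # True False
```

Each threshold then yields one classification of the whole query pool, which is what makes it possible to plot one ROC point per threshold.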
    10. Implementation • Pool of queries – A collection of 20000 queries, randomly sampled from AOL Search in the Fall of 2004 – Each query was classified into one of 20 topical categories by a team of ten human assessors – Health is one of the topical categories, with 1197 queries • A stop-word file provided by the University of Glasgow was also used • Several Perl scripts were developed
    11. CHV methods implementation
    12. Co-occurrence methods implementation
    13. Evaluation • Human classification was considered the correct classification • Each method’s performance was evaluated by comparing its classification with the human one • Several measures were calculated: – Sensitivity (ability to automatically classify a health query as health-related) – Specificity (ability to automatically classify a non-health query as non-health-related) – Accuracy (rate of correct automatic classifications) • Two Receiver Operating Characteristic (ROC) graphs were also drawn (one for each category of methods) – X axis -> false positive rate (1 - specificity) – Y axis -> sensitivity – Depict relative trade-offs between benefits (true positives) and costs (false positives)
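
The three measures and the ROC coordinates follow directly from the confusion counts. The Python sketch below uses the standard definitions; the function names and the eight-query toy data are made up for illustration and are not results from the study.

```python
def confusion_counts(human, automatic):
    """Count TP, TN, FP, FN, treating the human labels as the truth."""
    pairs = list(zip(human, automatic))
    tp = sum(h and a for h, a in pairs)
    tn = sum((not h) and (not a) for h, a in pairs)
    fp = sum((not h) and a for h, a in pairs)
    fn = sum(h and (not a) for h, a in pairs)
    return tp, tn, fp, fn


def evaluate(human, automatic):
    """Return sensitivity, specificity, accuracy, and the ROC point (FPR, TPR)."""
    tp, tn, fp, fn = confusion_counts(human, automatic)
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    false_positive_rate = fp / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "roc_point": (false_positive_rate, sensitivity),
    }


# Toy data: True means "health query".
human = [True, True, True, False, False, False, False, False]
automatic = [True, True, False, True, False, False, False, False]
print(evaluate(human, automatic)["accuracy"])  # 0.75
```

In this toy case two of the three health queries are found (sensitivity 2/3) and four of the five non-health queries are rejected (specificity 0.8).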
    14. CHV Methods Performance (1/2)
    15. CHV Methods Performance (2/2)
    16. Cooc Methods Performance (1/3)
    17. Cooc Methods Performance (2/3)
    18. Cooc Methods Performance (3/3)
    19. Comparison to previous results • Our Google results differ from Eysenbach and Kohler’s results • Their results – A threshold of 35% was considered an optimal trade-off between sensitivity (85.2%) and specificity (80.4%) – Pool of 2985 queries • Our results – Worse sensitivity (68% or 72%) and specificity (59% or 55%) values – Different optimal threshold values (0.6 or 0.55) • The larger sample used in our study leads us to believe our results are a better portrayal of reality
    20. Conclusions • The methods indicated as optimal may be passed over in favour of others if sensitivity is preferable to specificity or vice versa • Yahoo! results are better than Google’s but worse than Eysenbach and Kohler’s results • None of the CHV methods performed better than the Yahoo! method • Sensitivity values are not very high in any method (the highest is 73%)
    21. Future Work (1/2) • Manual definition of a term list might improve CHV methods – Eliminate terms that produce false positives – Add terms that could help reduce false negatives • Use the UMLS vocabulary instead of the CHV • Definition of a continuous output based on the number of health terms in the query
    22. Future Work (2/2) • Evaluation of co-occurrence methods on Portuguese queries • Analysis of co-occurrence with words other than ‘health’ or with a set of terms separated by the OR logical operator • Identification of health queries by a health specialist (some human classifications of health queries are dubious)


    Presented at DSIE'08 - http://www.fe.up.pt/dsie08


    © All Rights Reserved
