
Analysing search engine data on socially relevant topics


Search engines are seen by their users as trustworthy and neutral intermediaries between users and the content of the web. This perception is misleading: search engine operators pursue their own commercial interests, which has led, among other things, to an antitrust case by the European Commission against Google. In addition, content providers and the search engine optimizers they commission have considerable opportunities to influence the search results of Google and other search engines in their favour.
This raises the question of which results, and which kinds of results, users get to see in the top positions of search engines. We seek answers to this question by automatically evaluating the top results for a large number of search queries on the same topic. We extract the search queries from search engine log files so that we can realistically map the query behaviour of actual users. The search results are analysed both at the level of the domains and at the level of the providers behind them (by automatically collecting the imprint data of the websites).
Alongside the software development, we analysed search queries on the subject of insurance comparisons as a first use case. Among other things, it became apparent that Google's top search results, from which users select the majority of hits, are provided by only a few companies, and that these companies can thus exert a strong influence on the perception of a topic. Further topics we will work on include gender stereotypes in search results, controversial topics such as nuclear power, and economic topics such as financing.



  1. ANALYSING SEARCH ENGINE DATA ON SOCIALLY RELEVANT TOPICS
     Prof. Dr. Dirk Lewandowski, Fakultät DMI, Department Information
     Work done in collaboration with Sebastian Sünkler
     GESIS, 19 September 2018
  2. AGENDA
     1. Background: Search engine bias, user behaviour in search engines
     2. How can we crawl data from search engines?
     3. How can we build relevant query sets?
     4. Case study
     5. Conclusion
  3. BACKGROUND: SEARCH ENGINE BIAS, USER BEHAVIOUR
  4. [Image slide: Dirk Lewandowski]
  5. [Image slide: Sebastian Sünkler]
  6. BIASES IN SEARCH ENGINE RESULTS
     Biases with regard to…
     • Race (Noble, 2018)
     • Gender (Noble, 2018; Otterbacher et al., 2017)
     • Confirmatory information for queries on conspiracy theories (Ballatore, 2015)
     • Promotion of hate speech (Bar-Ilan, 2006)
     • Health information (White and Horvitz, 2009)
     General problems with these studies:
     • Case studies, anecdotal evidence
     • Queries chosen by the researchers (not rule-based); no consideration of query popularity
     • No information on results providers; results ranking not considered
  7. WHAT MAKES USERS CLICK?
     Users’ visual attention and selection behaviour are influenced by
     • Position
     • Visible area “above the fold”
     • The relevance of the result description (“snippet”)
     • Size and design of the snippet
     Users trust search engines:
     • Results are seen as accurate and trustworthy (Purcell, Brenner & Rainie, 2012)
     • Search engine ranking is even used as a criterion for trustworthiness (Westerwick, 2013)
  8. USERS’ SELECTION BEHAVIOUR
     [Chart slide]
  9. INFLUENCING SEARCH ENGINE RESULTS
     Search engine providers
     • Search engine providers act as content providers to their own search engines, e.g., Google/YouTube.
     • Vertical search engines are integrated into the main search engine, e.g., Google Shopping.
     External influences
     • Influence on the search results through search engine optimization (SEO), now a multi-billion-euro industry
  10. WHICH SOURCES DO THE SEARCH RESULTS COME FROM?
     Studies investigating domains (“sources”)
     • About 80 per cent of all clicked results are accounted for by only 10,000 websites (Goel et al., 2010; 2.6 billion queries from Yahoo logs)
     • Low overlap between different search engines in the top 10 results (Spink et al., 2006; 22,000 queries from Infospace/Dogpile)
     • The most popular sources in the top 10 differ between search engines; search engine providers prefer their own offerings (Yahoo Answers; YouTube) (Höchstötter & Lewandowski, 2009; 1,000 queries from Ask.com logs)
     Provider level
     • To our knowledge, no studies to date
  11. HOW CAN WE CRAWL DATA FROM SEARCH ENGINES?
  12. COLLECTING SEARCH RESULTS AT SCALE
     Problem
     • Commercial search engines do not allow access to their results through APIs.
     • Data collection in scholarly studies is usually done manually.
     • Scholarly studies are usually small-scale, i.e., they use only a few sample queries.
     Our approach
     • Querying search engines automatically, using large numbers of queries
     • Using screen scraping to collect search results from the HTML pages of search engines (see the sketch below)
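For illustration only, here is a minimal screen-scraping sketch in Python; the project itself combines PHP scripts and KNIME workflows (see slide 23). The request parameters and CSS selectors are assumptions based on one historical Google results layout, not the project's actual code: live result pages change their markup frequently and rate-limit automated clients, so a real crawler needs per-engine selectors and polite throttling.

```python
# Hypothetical scraper sketch -- selectors and parameters are illustrative.
import requests
from bs4 import BeautifulSoup

def scrape_serp(query: str, num_results: int = 10) -> list[dict]:
    """Fetch one results page and extract the ranked organic results."""
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": query, "num": num_results},
        headers={"User-Agent": "Mozilla/5.0 (research crawler)"},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    # "div.g" / "h3" reflect one historical Google layout; each engine
    # (and each redesign) needs its own selectors.
    for position, hit in enumerate(soup.select("div.g"), start=1):
        link, title = hit.find("a", href=True), hit.find("h3")
        if link and title:
            results.append({
                "query": query,
                "position": position,
                "url": link["href"],
                "title": title.get_text(strip=True),
            })
    return results
```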
  13. SOFTWARE DEVELOPMENT
     Three software systems for screen-scraping and analysing search engine data:
     1. Relevance Assessment Tool (RAT) (2011–)
        - Purpose: Information retrieval evaluation (juror-based assessments)
        - User interface / crowdsourcing for collecting (relevance) judgments
     2. AAPVL (2013–2018)
        - Purpose: Identifying non-compliant food products in search engine results
        - Classifiers for shop identification, food shop identification, imprint identification
     3. N.N. (2018–)
        - Purpose: Collecting and analysing search results as users see them
        - Using relevant query sets
        - Using result positions, domains and provider information (from imprints)
  14. PROCESS DIAGRAMS
     [Diagram slide: RAT and AAPVL process diagrams]
  15. SOFTWARE ARCHITECTURE
     [Diagram slide: RAT and AAPVL architectures]
  16. OUTPUT
     Structured data (see the record sketch below) including
     • Query
     • Search engine
     • Result position
     • URL
     • Domain
     • Provider (from imprint)
     • Shop (yes/no)
     • Manual assessment (if collected)
     [Screenshot: KNIME]
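A sketch of how one output row could be typed. The field names paraphrase the slide's list and are not the software's actual schema; exporting a list of such rows as CSV would be one way to hand the data to KNIME.

```python
# Hypothetical record type mirroring the fields listed above.
from dataclasses import dataclass

@dataclass
class ResultRecord:
    query: str              # search query that produced the result
    search_engine: str      # e.g. "google" or "bing"
    position: int           # 1-based rank on the results page
    url: str                # full result URL
    domain: str             # registrable domain extracted from the URL
    provider: str | None    # legal entity taken from the imprint page
    is_shop: bool           # shop classifier output (yes/no)
    assessment: str | None  # manual relevance judgment, if collected
```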
  17. APPLICABILITY
     Some examples:
     • Search engine results evaluation using 1,000 queries, comparing Google and Bing (Lewandowski, 2015)
     • Finding non-compliant food products (several studies; some 10,000 results analysed manually by expert jurors as well as automatically)
     • Finding out what Google users get to see for certain topics (case study discussed later)
  18. HOW CAN WE BUILD RELEVANT QUERY SETS?
  19. SELECTING SEARCH QUERIES
     Why work with query sets?
     • They represent actual user behaviour
     • They address the problem of anecdotal evidence and small-scale studies
     Approaches to selecting search queries for a query set
     1. Use your own inspiration (leads to arbitrary query sets)
     2. Transaction log data from search engines (problem of access)
     3. Google Trends (no absolute numbers, but useful for comparing query popularity; see the sketch below)
     4. Google’s (or Bing’s) advertisement campaign planning tools (not tested yet)
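A sketch of option 3 using the unofficial pytrends library (my choice for illustration; the talk does not name a tool). Trends reports values scaled 0–100 rather than absolute volumes, which is exactly why it suits relative popularity comparisons.

```python
# Hypothetical popularity comparison via Google Trends. pytrends is an
# unofficial client; endpoints and quotas can change without notice.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="de-DE", tz=60)
candidates = ["autoversicherung vergleich", "haftpflichtversicherung vergleich"]
pytrends.build_payload(candidates, timeframe="today 12-m", geo="DE")

# Values are scaled 0-100 relative to the most popular query in the payload,
# so candidates can be ranked even without absolute search volumes.
interest = pytrends.interest_over_time().drop(columns="isPartial")
print(interest.mean().sort_values(ascending=False))
```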
  20. USE CASE: QUERIES ON INSURANCE COMPARISONS
  21. WHY INSURANCE COMPARISONS?
     • Searching for comparisons of all kinds is popular in web search.
     • Many portals allow users to compare every possible type of insurance with regard to providers, conditions and costs.
     • Highly competitive market; heavy use of search engine optimization.
     • Search results may contain more than just "neutral" comparison sites such as Stiftung Warentest, a not-for-profit foundation.
     • Free comparison sites may take commissions from vendors, may not include all vendors in their comparisons, or may base their comparisons on outdated prices (see Stiftung Warentest, 2017).
  22. QUERY SET
     • Based on a query log from a commercial German search portal (see Lewandowski, 2015)
     • All queries containing *versicherung* (insurance) and *vergleich* (comparison); see the filtering sketch below
     • Final query set consisting of 121 queries, e.g.,
       - autoversicherungen vergleich
       - berufsunfähigkeitsversicherung vergleich
       - haftpflichtversicherung im vergleich
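A sketch of the filtering step, under the assumption of a plain-text log with one query per line; the actual portal log and its format are not public.

```python
# Hypothetical query-set construction: keep every distinct logged query
# matching both *versicherung* and *vergleich* (substring match).
def build_query_set(log_path: str) -> list[str]:
    queries = set()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            query = line.strip().lower()
            if "versicherung" in query and "vergleich" in query:
                queries.add(query)
    return sorted(queries)

# e.g. build_query_set("portal_queries.log") -> the 121 queries of the case study
```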
  23. DATA PROCESSING
     • Query Google; screen-scrape the results pages (getting as many results as possible)
     • Crawl imprint pages for every domain found; extract and structure the data (see the sketch below)
     • Data cleansing
     • Match domains to providers
     • Descriptive statistics
     → Combination of scripting (PHP) and KNIME workflows
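A sketch of the imprint-crawling step. The candidate paths follow common German site conventions and are assumptions; the actual system uses trained classifiers to identify imprint pages rather than a fixed path list.

```python
# Hypothetical imprint lookup -- fixed paths instead of the project's
# imprint-identification classifier.
import requests

IMPRINT_PATHS = ("/impressum", "/imprint", "/kontakt")

def fetch_imprint_html(domain: str) -> str | None:
    """Try common imprint URLs for a domain; return the first page found."""
    for path in IMPRINT_PATHS:
        try:
            resp = requests.get(f"https://{domain}{path}", timeout=10)
        except requests.RequestException:
            continue
        if resp.ok:
            return resp.text  # provider name is then extracted from this HTML
    return None
```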
  24. DATA SET
     • Number of queries: 121
     • Number of search results: 22,138
     • Results per query: 183 on average [min 1, max 298]
     • Number of different domains: 3,278
     (These figures can be recomputed as sketched below.)
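The figures above can be reproduced from the raw result table with a few lines of pandas; a sketch, assuming the column names from the record type shown earlier and a hypothetical CSV export.

```python
# Hypothetical descriptive statistics over the scraped results.
import pandas as pd

results = pd.read_csv("insurance_serps.csv")  # assumed export of all rows

per_query = results.groupby("query").size()
print("queries:          ", results["query"].nunique())
print("search results:   ", len(results))
print("results per query:",
      f"{per_query.mean():.0f} [min {per_query.min()}, max {per_query.max()}]")
print("distinct domains: ", results["domain"].nunique())
```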
  25. DATA TABLE
     [Screenshot of the raw data table]
  26. RESULTS SUMMARY
     • 116 different domains in the top 10 results [10; 1210]
     • 93 different providers in the top 10 results, i.e., some providers have more than one domain
     • The 5 most popular providers account for 47% of top 10 results (and 67% of top 5 results)
     → Two thirds of the top 5 results come from only five different providers (see the calculation sketch below)
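A sketch of the concentration measure behind the 47%/67% figures: the share of top-k results accounted for by the n most frequent providers, again using the assumed column names rather than the project's real schema.

```python
# Hypothetical provider-concentration calculation for the top-k positions.
import pandas as pd

def top_provider_share(results: pd.DataFrame, k: int, n_providers: int = 5) -> float:
    """Share of top-k results that come from the n most frequent providers."""
    top_k = results[results["position"] <= k]
    counts = top_k["provider"].value_counts()
    return counts.head(n_providers).sum() / len(top_k)

# e.g. top_provider_share(results, k=10) ~ 0.47 and k=5 ~ 0.67 in the case study
```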
  27. THE FIVE MOST POPULAR PROVIDERS IN THE TOP 10 SEARCH RESULTS, WITH THEIR PRIMARY DOMAINS

     Provider                                                          Domains in top 10   Results in top 10
     finanzen.de Vermittlungsgesellschaft für Verbraucherverträge AG   8                   149
     Müller & Kollegen UG (haftungsbeschränkt)                         5                   22
     G. Zmuda                                                          4                   7
     Axel Springer SE                                                  2                   53
     Verivox GmbH                                                      2                   127
  28. THE FIVE MOST POPULAR PROVIDERS IN THE TOP 10 SEARCH RESULTS
     (cells: number of results and relative share in the top 5 / top 10)

     Provider                                                          Top 5 (n = 605)   Top 10 (n = 1,210)
     finanzen.de Vermittlungsgesellschaft für Verbraucherverträge AG   41 (6.8%)         133 (11%)
     Verivox GmbH                                                      118 (19.5%)       127 (10.1%)
     CHECK24 GmbH                                                      118 (19.5%)       120 (9.9%)
     Scout24 Holding GmbH                                              52 (8.6%)         97 (8%)
     TARIFCHECK24 GmbH                                                 80 (13.2%)        90 (7.4%)
  29. DISTRIBUTION OF PROVIDERS AMONG THE TOP 10 SEARCH RESULTS
     [Chart slide]
     When considering only the first position, Google shows results from only 10 different providers.
  30. RELATIVE DISTRIBUTION OF THE DOMAINS ACROSS POSITIONS 1–10
     [Chart slide]
     For 88% of queries, one of the three most popular providers is shown in the top position.
  31. CONCLUSION
  32. SUMMARY
     • An approach that can be used to quantify the distribution of search engine results based on actual user queries.
     • The software is readily available for conducting similar studies.
     • The case study shows the feasibility of the approach and gives a first impression of how results in different topical areas could be analysed.
     • In the insurance domain, a few domains/providers dominate the top positions. Other providers are listed, but they tend to appear in the lower positions of the top 10 search results.
  33. LIMITATIONS
     • Imprint information may not be available in all countries (in Germany, there is a legal obligation to provide an imprint).
     • Our approach does not consider personalized search results.
     • The case study is limited in terms of the number of queries and the number of result positions analysed.
     • We did not consider query frequencies in the case study.
  34. FUTURE WORK
     • Use the methods described to investigate controversial topics, e.g., nuclear power, abortion, health information.
     • Develop rules for building query sets that include query frequencies (Google AdWords tool?)
     • Find ways to deal with personalization
     • Streamline the workflow
     • Refine the classifiers
  35. THANK YOU
     Prof. Dr. Dirk Lewandowski
     Hamburg University of Applied Sciences
     dirk.lewandowski@haw-hamburg.de
     www.searchstudies.org/dirk
     Twitter: Dirk_Lew
  36. REFERENCES
     Ballatore, A. (2015), “Google chemtrails: A methodology to analyze topic representation in search engine results”, First Monday, Vol. 20 No. 7, available at: http://www.firstmonday.org/ojs/index.php/fm/article/view/5597/4652.
     Bar-Ilan, J. (2006), “Web links and search engine ranking: The case of Google and the query ‘Jew’”, Journal of the American Society for Information Science and Technology, Vol. 57 No. 12, pp. 1581–1589.
     Goel, S., Broder, A., Gabrilovich, E. and Pang, B. (2010), “Anatomy of the long tail”, Proceedings of the Third ACM International Conference on Web Search and Data Mining – WSDM ’10, ACM Press, New York, NY, USA, p. 201.
     Höchstötter, N. and Lewandowski, D. (2009), “What users see – Structures in search engine results pages”, Information Sciences, Vol. 179 No. 12, pp. 1796–1812. https://doi.org/10.1016/j.ins.2009.01.028
     Noble, S.U. (2018), Algorithms of Oppression: How Search Engines Reinforce Racism, New York University Press, New York, NY, USA.
     Otterbacher, J., Bates, J. and Clough, P. (2017), “Competent Men and Warm Women”, Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems – CHI ’17, ACM Press, New York, NY, USA, pp. 6620–6631.
     Purcell, K., Brenner, J. and Rainie, L. (2012), “Search Engine Use 2012”, Pew Research Center, Washington, DC.
     Spink, A., Jansen, B.J., Blakely, C. and Koshman, S. (2006), “A study of results overlap and uniqueness among major Web search engines”, Information Processing & Management, Vol. 42 No. 5, pp. 1379–1391.
     Stiftung Warentest (2017), “About us: An introduction to Stiftung Warentest”, available at: https://www.test.de/unternehmen/about-us-5017053-0/
     Westerwick, A. (2013), “Effects of Sponsorship, Web Site Design, and Google Ranking on the Credibility of Online Information”, Journal of Computer-Mediated Communication, Vol. 18 No. 2, pp. 80–97.
     White, R.W. and Horvitz, E. (2009), “Cyberchondria”, ACM Transactions on Information Systems, Vol. 27 No. 4, Article No. 23.
