ANALYSING SEARCH ENGINE DATA ON SOCIALLY RELEVANT TOPICS

Prof. Dr. Dirk Lewandowski, Hamburg University of Applied Sciences (Fakultät DMI, Department Information)
Work done in collaboration with Sebastian Sünkler
GESIS, 19 September 2018
AGENDA
1. Background: Search engine bias, user behaviour in search engines
2. How can we crawl data from search engines?
3. How can we build relevant query sets?
4. Case study
5. Conclusion
BACKGROUND: SEARCH ENGINE BIAS, USER BEHAVIOUR
BIASES IN SEARCH ENGINE RESULTS
Biases with regard to…
• Race (Noble, 2018)
• Gender (Noble, 2018; Otterbacher et al., 2017)
• Confirmatory results for queries on conspiracy theories (Ballatore, 2015)
• Promotion of hate speech (Bar-Ilan, 2006)
• Health information (White and Horvitz, 2009)
General problems with these studies:
• Case studies, anecdotal evidence
• Queries chosen by the researchers (not rule-based); no consideration of query popularity
• No information on results providers; results ranking not considered
WHAT MAKES USERS CLICK?
Users’ visual attention and selection behaviour are influenced by
• Position
• Visible area “above the fold”
• The relevance of the result description (“snippet”)
• Size and design of the snippet
Users trust search engines:
• Results are seen as accurate and trustworthy (Purcell, Brenner & Rainie, 2012)
• A result’s ranking is even taken as a criterion for trustworthiness (Westerwick, 2013)
USERS’ SELECTION BEHAVIOUR
INFLUENCING SEARCH ENGINE RESULTS
Search engine providers
• Search engine providers act as content providers to their own search engines, e.g., Google/YouTube.
• Vertical search engines are integrated into the main search engine, e.g., Google Shopping.
External influences
• Influence on search results through search engine optimization (SEO), now a multi-billion-euro industry
FROM WHICH SOURCES DO THE SEARCH RESULTS COME?
Studies investigating domains (“sources”)
• About 80 per cent of all clicked results are accounted for by only 10,000 websites (Goel et al., 2010; 2.6 billion queries from Yahoo logs)
• Low overlap between different search engines in the top 10 results (Spink et al., 2006; 22,000 queries from Infospace/Dogpile)
• The most popular sources in the top 10 differ between search engines; search engine providers prefer their own offerings (Yahoo Answers; YouTube) (Höchstötter & Lewandowski, 2009; 1,000 queries from Ask.com logs)
Provider level
• To our knowledge, no studies to date
HOW CAN WE CRAWL DATA FROM SEARCH ENGINES?
COLLECTING SEARCH RESULTS AT SCALE
Problem
• Commercial search engines do not allow access to their results through APIs.
• Data collection in scholarly studies is usually done manually.
• Scholarly studies are usually small-scale, i.e., they use only a few sample queries.
Our approach
• Querying search engines automatically, using large numbers of queries
• Using screen scraping to collect search results from the HTML pages of search engines (a minimal sketch follows)
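The following is an illustrative sketch of the screen-scraping step only, not the authors' RAT/AAPVL scrapers. It assumes organic results are links wrapping an <h3> title, which is a common but fragile heuristic: real results-page markup changes frequently, and engines may block automated requests.

```python
# Minimal SERP screen-scraping sketch (illustrative, not the study's system).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def scrape_serp(query: str, engine_url: str = "https://www.google.com/search"):
    """Return (position, url, domain) tuples for one results page."""
    resp = requests.get(engine_url, params={"q": query},
                        headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    results, position = [], 0
    for link in soup.select("a"):
        href = link.get("href", "")
        if link.find("h3") is None or not href.startswith("http"):
            continue  # skip navigation links and modules without a result title
        position += 1
        results.append((position, href, urlparse(href).netloc))
    return results
```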
SOFTWARE DEVELOPMENT
Three software systems for screen-scraping and analysing search engine data
1. Relevance Assessment Tool (RAT) (2011 – )
- Purpose: Information Retrieval evaluation (juror-based assessments)
- User interface / crowdsourcing for collecting (relevance) judgments
2. AAPVL (2013 – 2018)
- Purpose: Identifying non-compliant food products in search engine results
- Classifiers for shop identification, food shop identification, imprint identification
3. N.N. (2018 – )
- Purpose: Collecting and analysing search results as users see them
- Using relevant query sets
- Using results positions, domains and provider information (from imprints)
PROCESS DIAGRAMS
[Diagrams: RAT and AAPVL]
SOFTWARE ARCHITECTURE
[Diagrams: RAT and AAPVL]
OUTPUT
Structured data including
• Query
• Search engine
• Result position
• URL
• Domain
• Provider (from imprint)
• Shop (yes/no)
• Manual assessment (if collected)
• Data is further analysed in KNIME (a sketch of one record follows)
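A hypothetical shape for one output record, matching the fields listed above; the field names and example values are mine, not necessarily those used in RAT/AAPVL.

```python
# One structured output record (field names and example values are illustrative).
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchResultRecord:
    query: str                        # submitted search query
    search_engine: str                # e.g. "Google"
    position: int                     # rank on the results page, 1-based
    url: str                          # full result URL
    domain: str                       # e.g. "verivox.de"
    provider: Optional[str] = None    # operating company, from the imprint
    is_shop: Optional[bool] = None    # classifier output: shop yes/no
    assessment: Optional[str] = None  # manual judgment, if collected

record = SearchResultRecord(
    query="autoversicherungen vergleich", search_engine="Google", position=1,
    url="https://www.verivox.de/kfz-versicherung/", domain="verivox.de",
    provider="Verivox GmbH", is_shop=False)
```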
APPLICABILITY
Some examples:
• Search engine results evaluation using 1,000 queries, comparing Google and Bing (Lewandowski, 2015)
• Finding non-compliant food products (several studies; some 10,000 results analysed manually by expert jurors as well as automatically)
• Finding out what Google users get to see for certain topics (case study discussed below)
HOW CAN WE BUILD RELEVANT QUERY SETS?
SELECTING SEARCH QUERIES
Why work with query sets?
• They represent actual user behaviour
• They address the problem of anecdotal evidence and small-scale studies
Approaches to selecting search queries for a query set
1. Use your own inspiration (leads to arbitrary query sets)
2. Transaction log data from search engines (problem of access)
3. Google Trends (no absolute numbers, but useful for comparing query popularity; see the sketch below)
4. Google’s (or Bing’s) advertising campaign planning tools (not tested yet)
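As an illustration of approach 3, a sketch using the unofficial pytrends package (an assumption of this example, not a tool named in the talk) to compare the relative popularity of candidate queries. Google Trends returns values scaled 0 to 100 per request, not absolute counts, so the numbers are only useful for comparing queries against each other.

```python
# Comparing candidate queries' relative popularity via Google Trends
# (pip install pytrends; unofficial library, chosen for this sketch).
from pytrends.request import TrendReq

candidates = ["autoversicherung vergleich", "haftpflichtversicherung vergleich"]
pytrends = TrendReq(hl="de-DE", tz=60)
pytrends.build_payload(kw_list=candidates, timeframe="today 12-m", geo="DE")
interest = pytrends.interest_over_time()  # pandas DataFrame, one column per query
print(interest[candidates].mean().sort_values(ascending=False))
```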
USE CASE: QUERIES ON INSURANCE COMPARISONS
WHY INSURANCE COMPARISONS?
• Searching for comparisons of all kinds is popular in web search.
• Many portals allow users to compare every type of insurance by provider, conditions and costs.
• Highly competitive market; heavy use of search engine optimization.
• Search results contain not only “neutral” comparison sites such as Stiftung Warentest, a not-for-profit foundation.
• Free comparison sites may take commissions from vendors, may not cover all vendors in their comparisons, or may base their comparisons on outdated prices (see Stiftung Warentest, 2017).
QUERY SET
• Based on a query log from a commercial German search portal (see Lewandowski, 2015)
• All queries containing *versicherung* (insurance) and *vergleich* (comparison) were kept (see the sketch below)
• Final query set consisting of 121 queries, e.g.,
- autoversicherungen vergleich
- berufsunfähigkeitsversicherung vergleich
- haftpflichtversicherung im vergleich
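A minimal sketch of this filtering step; the log file name and one-query-per-line format are assumptions.

```python
# Build the query set: keep every distinct logged query containing both substrings.
def build_query_set(log_path: str) -> list[str]:
    with open(log_path, encoding="utf-8") as f:
        queries = {line.strip().lower() for line in f if line.strip()}
    return sorted(q for q in queries
                  if "versicherung" in q and "vergleich" in q)

query_set = build_query_set("portal_query_log.txt")  # 121 queries in the study
```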
DATA PROCESSING
• Query Google; screen-scrape the results pages (getting as many results as possible)
• Crawl the imprint page for every domain found; extract and structure the data
• Data cleansing
• Match domains to providers (see the sketch below)
• Descriptive statistics
→ Combination of scripting (PHP) and KNIME workflows.
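A sketch of the domain-to-provider matching step: read a company name out of each domain's imprint page and group domains by it. The regex is a toy heuristic for German legal-form suffixes, far simpler than the project's actual imprint classifiers.

```python
# Group domains by the provider named in their imprint (toy heuristic).
import re
from collections import defaultdict

def extract_provider(imprint_text: str) -> str | None:
    """Pull a company name ending in a German legal form out of imprint text."""
    match = re.search(r"[A-ZÄÖÜ][\w&.\- ]{2,60}?\s(?:GmbH|AG|UG|SE)\b", imprint_text)
    return match.group(0).strip() if match else None

def group_by_provider(imprints: dict[str, str]) -> dict[str, list[str]]:
    """Map provider name -> domains it operates; fall back to the domain itself."""
    providers = defaultdict(list)
    for domain, text in imprints.items():
        providers[extract_provider(text) or domain].append(domain)
    return dict(providers)
```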
DATA SET
• Number of queries: 121
• Number of search results: 22,138
• Results per query: 183 on average (range: 1 to 298)
• Number of different domains: 3,278
DATA TABLE
[Excerpt of the collected data set]
RESULTS SUMMARY
• 116 different domains in the top 10 results (n = 1,210 results)
• 93 different providers in the top 10 results, i.e., some providers have more than one domain
• The 5 most popular providers account for 47% of top 10 results (and 67% of top 5 results)
→ Two thirds of the top 5 results come from only five different providers (a sketch of this computation follows).
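A sketch of the concentration figure above: the share of top-k results accounted for by the five most frequent providers. `records` is assumed to be a list of SearchResultRecord objects as sketched in the OUTPUT section.

```python
# Share of top-k results belonging to the n most frequent providers.
from collections import Counter

def top_provider_share(records, k: int = 10, n_providers: int = 5) -> float:
    top_k = [r for r in records if r.position <= k]
    counts = Counter(r.provider for r in top_k)
    covered = sum(count for _, count in counts.most_common(n_providers))
    return covered / len(top_k)

# In the case study: top_provider_share(records, k=10) ≈ 0.47 and
# top_provider_share(records, k=5) ≈ 0.67.
```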
THE FIVE MOST POPULAR PROVIDERS IN THE TOP 10 SEARCH RESULTS, WITH THEIR PRIMARY DOMAINS

Provider                                                          Domains in top 10   Results in top 10
finanzen.de Vermittlungsgesellschaft für Verbraucherverträge AG           8                  149
Müller & Kollegen UG (haftungsbeschränkt)                                 5                   22
G. Zmuda                                                                  4                    7
Axel Springer SE                                                          2                   53
Verivox GmbH                                                              2                  127
THE FIVE MOST POPULAR PROVIDERS IN THE TOP 10 SEARCH RESULTS

Provider                                                          Results in top 5 (n = 605)   Share of top 5   Results in top 10 (n = 1,210)   Share of top 10
finanzen.de Vermittlungsgesellschaft für Verbraucherverträge AG              41                     6.8%                   133                       11.0%
Verivox GmbH                                                                118                    19.5%                   127                       10.1%
CHECK24 GmbH                                                                118                    19.5%                   120                        9.9%
Scout24 Holding GmbH                                                         52                     8.6%                    97                        8.0%
TARIFCHECK24 GmbH                                                            80                    13.2%                    90                        7.4%
DISTRIBUTION OF PROVIDERS AMONG THE TOP 10 SEARCH RESULTS
Considering only the first position, Google shows results from just 10 different providers.

RELATIVE DISTRIBUTION OF THE DOMAINS ACROSS POSITIONS 1 TO 10
For 88% of queries, one of the three most popular providers is shown in the top position.
CONCLUSION
SUMMARY
• The approach can be used to quantify the distribution of search engine results based on actual user queries.
• The software is readily available for conducting similar studies.
• The case study shows the feasibility of the approach and gives a first impression of how results in different topical areas could be analysed.
• In the insurance domain, a few domains/providers dominate the top positions. Other providers are listed, but they tend to occupy the lower positions within the top 10 search results.
LIMITATIONS
• Imprint information may not be available in all countries (providing an imprint is a legal obligation in Germany).
• Our approach does not consider personalized search results.
• The case study is limited in terms of the number of queries and the number of result positions analysed.
• We did not consider query frequencies in the case study.
FUTURE WORK
• Use the methods described to investigate controversial topics, e.g., nuclear power, abortion, health information.
• Develop rules for building query sets that include query frequencies (Google AdWords tool?)
• Find ways to deal with personalization
• Streamline the workflow
• Refine the classifiers
THANK YOU
Prof. Dr. Dirk Lewandowski
Hamburg University of Applied Sciences
dirk.lewandowski@haw-hamburg.de
www.searchstudies.org/dirk
Twitter: Dirk_Lew
REFERENCES
Ballatore, A. (2015), ‘Google chemtrails: A methodology to analyze topic representation in search engine results’, First Monday, Vol. 20 No. 7, available at: http://www.firstmonday.org/ojs/index.php/fm/article/view/5597/4652.
Bar-Ilan, J. (2006), ‘Web links and search engine ranking: The case of Google and the query “Jew”’, Journal of the American Society for Information Science and Technology, Vol. 57 No. 12, pp. 1581–1589.
Goel, S., Broder, A., Gabrilovich, E. and Pang, B. (2010), ‘Anatomy of the long tail’, Proceedings of the Third ACM International Conference on Web Search and Data Mining - WSDM ’10, ACM Press, New York, NY, USA, p. 201.
Höchstötter, N. and Lewandowski, D. (2009), ‘What users see – Structures in search engine results pages’, Information Sciences, Vol. 179 No. 12, pp. 1796–1812, https://doi.org/10.1016/j.ins.2009.01.028.
Noble, S.U. (2018), Algorithms of Oppression: How Search Engines Reinforce Racism, New York University Press, New York, NY, USA.
Otterbacher, J., Bates, J. and Clough, P. (2017), ‘Competent Men and Warm Women’, Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems - CHI ’17, ACM Press, New York, NY, USA, pp. 6620–6631.
Purcell, K., Brenner, J. and Rainie, L. (2012), ‘Search Engine Use 2012’, Pew Research Center, Washington, DC.
Spink, A., Jansen, B.J., Blakely, C. and Koshman, S. (2006), ‘A study of results overlap and uniqueness among major Web search engines’, Information Processing & Management, Vol. 42 No. 5, pp. 1379–1391.
Stiftung Warentest (2017), ‘About us: An introduction to Stiftung Warentest’, retrieved from https://www.test.de/unternehmen/about-us-5017053-0/.
Westerwick, A. (2013), ‘Effects of Sponsorship, Web Site Design, and Google Ranking on the Credibility of Online Information’, Journal of Computer-Mediated Communication, Vol. 18 No. 2, pp. 80–97.
White, R.W. and Horvitz, E. (2009), ‘Cyberchondria’, ACM Transactions on Information Systems, Vol. 27 No. 4, Article 23.