UB Utrecht HvA-MIC GO Opleidingen
searching the internet: what patent searchers should know
Eric Sieverts
WON, 11-12-2012

agenda
• searching the web
• the volatile google landscape
• smart searching
• dating and back to the past
• reliability
• google options
• beyond google
• beyond general web search
• the social landscape

agenda
[diagram: the general web ?=? everything; importance of specific material types?; general web search and specific material search: how to …, when & why]

an ever changing google landscape
• unreliable numbers
• irreproducible results
• disappearing functions
• changing interfaces

"coping" with numbers of results
in structured databases the effect of how you combine terms on the number of results generally meets expectations, but:
• with Google (and other web search) numbers are unstable, irreproducible and unreliable, with inexplicable effects
  – refining with an AND-relation may increase the number of results
  – expanding with an OR-relation may decrease the number of results
  – numbers are only extrapolations from a small part of the search index
  – depends on distribution of the index over servers
  – depends on Google version, browser, whether logged in, history, ...
  – not just Google: Bing results also depend on geographic setting
• Danny Sullivan explains why Google cannot count: "Why Google Can't Count Results Properly"
  http://searchengineland.com/why-google-cant-count-results-properly-53559

Google as a vanishing machine
some services and options disappear completely
  – timeline, wonder wheel, toolbar, ...
  – + operator
  – real time results, code search
  – google buzz, google wave, google directory, ...
others are only hidden
  – links for advanced search and for settings hidden under “cog wheel” (sometimes dependent on browser)
  – Scholar, Patents and Groups no longer mentioned in menus
  – backlink search no longer in advanced search
  – search for "similar" pages & "cache"-link are hidden in "invisible" pop-up page preview
  – …

refinements and additional functions
like in modern "web scale discovery" systems
but meanwhile this is already an "old" interface!

tools & facets
from clear left column to blurry top menu (for mobile's sake?)
google.nl [until 2 weeks ago] | google.com

all options by material type, in old interface

Google tries outsmarting us
Google tries to improve and to broaden your queries
• automatic spelling corrections (veilgheid >> veiligheid)
• searches for words with the same word stem (singular/plural, verb conjugation, inflection, …)
• expands acronyms (jfk >> john f kennedy | wwii >> world war II)
• adds synonyms (vaccination >> immunization)
• transforms separate words to compound term & vice versa (veiligheid maatregel >> veiligheidsmaatregel | catfood >> cat food)
• may leave out a term as optional if not differentiating enough
never sure what/when or not, and often more elaborate in English than in Dutch
• personalises search, based on previous search behaviour
and if you don't like all of this ... >> "verbatim"

verbatim
• new option introduced early 2012
• option recently moved to top menu
• on google.nl: "woord voor woord"

standard semantic coding allowed Google to make a recipe search engine
"embedded metadata": standardisation of property descriptions in HTML of recipe pages, with "microformats" / "rich snippets markup"
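
To make "embedded metadata" concrete, here is a minimal sketch of how a program could read standardised properties out of a recipe page. It assumes microdata-style `itemprop` attributes as one possible form of rich snippets markup; the sample HTML, the property names and the helper names (`ItempropParser`, `extract_properties`) are illustrative assumptions, not taken from any real recipe site or from Google's own processing.

```python
from html.parser import HTMLParser

# Hypothetical recipe fragment; the itemprop names are assumptions for illustration.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Apple pie</h1>
  <span itemprop="cookTime">1 hour</span>
  <span itemprop="recipeYield">8 servings</span>
</div>
"""


class ItempropParser(HTMLParser):
    """Collect the text of elements that carry an itemprop attribute.

    Simplification: assumes properties are flat (no nested tags inside them).
    """

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # itemprop name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name:
            self._current = name
            self.properties[name] = ""

    def handle_data(self, data):
        if self._current:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def extract_properties(html):
    """Return a dict mapping each itemprop name to its stripped text."""
    parser = ItempropParser()
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.properties.items()}


print(extract_properties(SAMPLE))
```

Because every recipe page describes the same properties in the same way, a search engine can filter on them (for instance on cooking time) instead of guessing from free text.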

Google's "Knowledge Graph"
knows 500 million objects
with 3.5 billion properties
(but only in English)

publication dates
• limitation while searching google
  – before search: only "past day/week/month/year"
  – after search: also limitation to a custom range "from .. to .." (under search tools)

disappeared / old versions of pages
• recently disappeared: try search engine cache
  – not just google!: Bing, Yahoo, Exalead

disappeared / old versions of pages
for older versions: try web archive (Wayback Machine)
http://archive.org
• links within the same site mostly work
• if a particular page has not been crawled, they show which other pages on that site have been crawled
• some pages/sites have only recently been crawled
• other pages/sites go far back in time
• if the domain name has changed, you must use the old name
• some sites don't want to be crawled
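
As a small how-to, the sketch below builds a link that asks the web archive for the copy of a page closest to a given date. It assumes the commonly seen Wayback Machine URL layout (`https://web.archive.org/web/<timestamp>/<original url>`); the function name `wayback_url` and the example address are hypothetical.

```python
from datetime import date


def wayback_url(original_url, when):
    """Build a Wayback Machine link for the snapshot nearest to `when`.

    Assumption: the archive accepts a YYYYMMDD timestamp in this position
    and redirects to the closest capture it actually holds.
    """
    stamp = when.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{stamp}/{original_url}"


# If the domain name has changed, pass the OLD address, as it was when crawled.
print(wayback_url("http://www.example.com/page.html", date(2005, 3, 1)))
```

Whether a capture exists for that page and date still depends on the archive's crawl history, as the points above describe.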

reliability & integrity - general
general website assessment criteria
• professional lay-out
• indication of author/organisation (“about us”)
• data about organisation: address, telephone, map/driving directions
• indication of targeted audience
• not too many advertisements and pop-ups (although every site has them)
• clear navigation
• internal search option
• speed of web server
• backlinks from well known organisations
• up to date-ness (with date given)
• language use
• interpret the URL/domain-name (eg: edu, edu.au, edu.sg, edu.ng, edu.lb, ac.uk, gov, gov.uk, gov.hk, gov.au, gov.on.ca, gob.es, gob.mx, gob.ve, gob.ec, ...)
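
The last criterion, interpreting the domain name, can be sketched as a rough heuristic. The marker labels come from the examples listed above; the mapping to "education" and "government", the function name `institution_type` and the test addresses are assumptions for illustration, and a match says nothing by itself about the quality of a site.

```python
from urllib.parse import urlparse

# Labels taken from the slide's examples (edu, ac.uk, gov, gob.*).
MARKERS = {
    "edu": "education",
    "ac": "education",
    "gov": "government",
    "gob": "government",
}


def institution_type(url):
    """Guess the kind of organisation from the right-hand part of the host name.

    Heuristic only: skips the left-most label (the site's own name or "www")
    and checks up to three right-most labels, so gov.on.ca is recognised.
    The URL must include a scheme (http:// or https://) for parsing to work.
    """
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    for label in labels[1:][-3:]:
        if label in MARKERS:
            return MARKERS[label]
    return None


for example in ("http://www.mit.edu/", "https://www.ox.ac.uk/", "http://www.example.com/"):
    print(example, "->", institution_type(example))
```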

reliability & integrity - organisation
information about organisation
• Google pagerank (backlinks); use for instance:
  http://www.prchecker.info/
  http://www.checkpagerank.net/
• Alexa rank (web traffic); see for instance:
  http://www.alexa.com/
  http://www.seomastering.com/alexa-rank-checker.php
• domain owner; use for instance:
  http://centralops.net/co/DomainDossier.aspx
  http://whois.domaintools.com/
• search for "backlinks"

reliability & integrity - backlinks
search backlinks to particular web-page/-site
• Google: link:http://www.domain.zz/folder/file.html
  – very incomplete result
• Yahoo site explorer: died last year
• DuckDuckGo: link:http://www.domain.zz/folder/file.html
  – often > google; no total numbers given
• OpenSiteExplorer: linking pages + linking domains
  – very complete; also domain & page authority
  – paid subscription if more than 3 queries/day
• Exalead: link:http://www.domain.zz/
  – no backlinks to specific page, but to whole site
• Alexa: 100 most important domains backlinking to site

the 35 sites mentioned under "reputation"
after 9 no more results

some more "how to"
• domain search:
  site:edu OR site:edu.* [for all edu (sub)domains]
  site:shell.com OR site:philips.com
• url search:
  inurl:novelty
• title search:
  intitle:catalytic
• filetype search:
  filetype:pdf
  filetype:xls OR filetype:xlsx
  filetype:doc OR filetype:docx
  filetype:rss [more than shown in advanced search drop-down menu]
• exact search:
  "greenhouses" [or VERBATIM for all words]
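
The operators above can be combined into one query string. The following is a minimal sketch, not a description of how any search engine parses queries: the helper names `any_of` and `build_query` are hypothetical, and it assumes the engine treats a space as AND and groups each OR-chain together.

```python
def any_of(operator, values):
    """OR-combine one operator over several values,
    e.g. filetype:doc OR filetype:docx."""
    return " OR ".join(f"{operator}:{value}" for value in values)


def build_query(terms=(), exact=(), sites=(), filetypes=(), intitle=None, inurl=None):
    """Assemble a query from plain terms, exact phrases and operators."""
    parts = list(terms)
    parts += [f'"{phrase}"' for phrase in exact]  # exact search
    if sites:
        parts.append(any_of("site", sites))  # domain search
    if filetypes:
        parts.append(any_of("filetype", filetypes))  # filetype search
    if intitle:
        parts.append(f"intitle:{intitle}")  # title search
    if inurl:
        parts.append(f"inurl:{inurl}")  # url search
    return " ".join(parts)


print(build_query(terms=["catalyst"],
                  sites=["shell.com", "philips.com"],
                  filetypes=["pdf"],
                  intitle="catalytic"))
```

Building queries this way keeps the variants of one operator together (for instance doc and docx), which makes it easier to repeat the same search on another engine that supports the same operators.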

general search engines besides google
• Bing: microsoft, large
• Yahoo!: content=Bing, large
• Blekko: uses hashtags to search more [domain-] selective; also many predefined hashtags, e.g. /likes for Facebook
• DuckDuckGo: assures privacy, no personalisation, no filter-bubble, rather small, !Bang-function offers many extras
• Gigablast: green search engine, rather small, some unique functions
• Exalead: french, many advanced functions, primarily demo system
• Millionshort: leaves out results from most popular sites → the long tail
• WolframAlpha: knowledge engine, facts, calculations
together, these others have 30% market share in US; in NL only 3%
• Yandex: in Russia more popular than Google
• Baidu: in China more popular than Google
• Naver, Daum: in South Korea more popular than Google
• Seznam: in Czechia more popular than Google

material type specific search
• blogs: google blogs, icerocket, technorati
• [rss]: CTRLQ, RSS SearchHub
• video: google video, youtube, youtube edu channel, bing video, blinkx, voxalead-news
• images: google image, yahoo image, bing image, flickr, tineye (ip-check), panoramio (geo-search)
• science: google scholar, microsoft academic, scirus, oaister, scientific commons, science.gov
• news: google news, yahoo news, bing news, cnn, bbc, historische kranten KB, historic american newspapers (LOC)
• tweets: twitter search, topsy, tweetzi, postpost, snapbird
• social: socialsearcher, socialmention, samepoint, whostalkin, kurrently
• forums: google groups, omgili, boardtracker

tweets & social search
• Twitter
  in 140 characters
  – often with shortened links
  – often with photo- or video-link
  – often with hashtags (#agreeduponkeyword)
  search (often limited to last 1 - 2 weeks, and ... to those 140 characters)
  – twitter-search (also advanced search), tweetzi, …
  – topsy (also older messages)
  – postpost (your own timeline, i.e. everything you're following)
  – snapbird (full tweet history of 1 person, by his/her twittername)
  – twicsy (photos on twitter)
  – ...
  overview/review of tools: All the easiest ways to search old tweets

tweets & social search
• “Real time / social search engines”
  – socialsearcher, socialmention, samepoint, whostalkin, kurrently, … (tweets + blogs + facebook + …)
  – Google personal results / Google+ ("search plus your world")
  – real-time pictures: skylines
• Forum discussions
  – omgili, boardtracker, ...
  – Google groups (also old newsgroup discussions)
for research methods:
  – advice from Henk van Ess (Dutch): "de digitale detective" (2012)
  – How to: use social media in newsgathering (2012)
  – 100+ Social Media Monitoring Tools (2010)