
The Brave New World of Free, Open Data and Open Access



Pre-conference seminar given at BIALL 2014, 11th June 2014, Harrogate



  1. 1. 15/06/14 1 Karen Blakeman, Slides available at This work is licensed under a Creative Commons Attribution License The Brave New World of Free, Open Data and Open Access Pre-conference workshop, BIALL 2014, Harrogate Wednesday, 11th June 2014 #BIALL2014
  2. 2. All change! Search engines – new algorithms, ranking and display, quality of results – EU ruling on “right to be forgotten”, how much is being censored/removed? Free government and legal resources – how reliable are they? – data and information being moved to (is it?) or web archives (really?) Open Access vs open access vs “open access” – how accessible is it? – authority, version – predatory publishers Open data – how “open” is it really – findability, ease of use, quality Changes to copyright – data/text mining
  3. 3. Impact of Social Sciences – The right to read is the right to mine: Text and data mining copyright exceptions introduced in the UK.
  4. 4. Search engines Google in particular is undergoing major change Don’t forget alternative search engines (they may still be subject to recent EU ruling on search results) Need to understand how search tools work, how the different country versions work and how they present results Google and Bing (to a lesser degree) personalise results How to assess quality of results
  5. 5. Five things you need to know about Google search 1. Google personalises your search Personalises search based on – location – device that you are using – past search history – past browsing activity – activity in other areas of Google e.g. YouTube, blogs, images – content from contacts in your personal networks may be given priority (possibly)
  6. 6. Private browsing is the quickest way to “un-personalise” search: Chrome, New Incognito window, Ctrl+Shift+N; Firefox, Ctrl+Shift+P; Internet Explorer, Ctrl+Shift+P; Opera, Ctrl+Shift+N. This will not remove country personalisation.
  7. 7. Five things you need to know about Google search 2. Google automatically looks for variations on your search terms and sometimes drops terms from your search – Google may or may not tell you that it has ignored some of your terms – “..” around terms, phrases, names, titles of documents does not always work – To force an exact match and inclusion of a term in a search prefix it with ‘intext:’ public transport intext:algal biofuels – Use Verbatim for an exact match search
  8. 8. Google Verbatim
  9. 9. Google now showing missing search terms? Not always shown – possibly still a live experiment?
  10. 10. Five things you need to know about Google search 3. Google web search does not search everything it has in its database – two indexes: main, default index and the supplemental index – supplemental index may contain less popular, unusual, specialist material – supplemental index comes into play when Google thinks your search has returned too few results – Verbatim and some advanced search commands seem to trigger a search in the supplemental index
  11. 11. Five things you need to know about Google search 4. Google changes its algorithms several hundred times a year How Google makes improvements to its search algorithm - YouTube
  12. 12. Five things you need to know about Google search 5. We are all Google’s lab rats Just Testing: Google Users May See Up To A Dozen Experiments Mostly minor effects on search but sometimes totally bizarre results
  13. 13. What I see on my screen will not be what you see on your screen, will not be what your colleagues see on theirs, will not be what your users see.
  14. 14. Hummingbird Not just an update but a completely new algorithm Tries to make “sense” of your query and put it into context, natural language queries Not just search history but also your location, device being used Now difficult to predict how Google will handle your search and how results will be displayed Layout of results and menu options depend on type of search
  15. 15. So-called “right to be forgotten” ruling Edition of Monday, January 19, 1998, page 23 - Newspaper - http://hemeroteca.lavanguardia.c
  16. 16. Information is NOT removed from the web. The subject can apply to have links in search results that point to specific information removed from the results. Only applies to searches conducted in the EU plus Norway, Switzerland, Iceland and Liechtenstein. Not automatic: the subject has to apply, and the request will be assessed to see if the information is “inadequate, irrelevant or no longer relevant, or excessive in relation to the purposes for which they were processed.” A request form is available from Google.
  17. 17. Could have a serious impact on business and legal research How far will it go? People already trying to get links to their company directorships, bankruptcy, IVAs etc removed Land Registry records Crime reports in the newspapers Unfavourable reviews of services
  18. 18. Four things we’ve learned from the EU Google judgment | ICO Blog BBC News - More Google 'forget' requests emerge after EU ruling EU Law Analysis: The CJEU's Google Spain judgment: failing to balance privacy and freedom of expression Google Offers Webform To Comply With Europe’s ‘Right To Be Forgotten’ Ruling | TechCrunch The Myths & Realities Of How The EU’s New “Right To Be Forgotten” In Google Works
  19. 19. Last Week Tonight with John Oliver (HBO) Right To Be Forgotten - YouTube
  20. 20. How to get around it? Until we see how it works in practice, difficult to give advice Google says it will indicate on the results page if information has been excluded Will we be able to use or, for example, or will Google recognise we are in the EU from our IP address and still block results? Use alternative search engine with no business footprint in Europe? New search engines starting up without a European presence? Anonymous proxy servers?
  21. 21. Search commands (Google, Bing) Think file format – PDF for research documents, government reports, industry papers, company reports – ppt or pptx for presentations, tracking down an expert on a topic – xls or xlsx for spreadsheets containing data Use the advanced search screen or the filetype: command Land registry tax evasion data mining filetype:pdf tax evasion data mining filetype:ppt tax evasion data mining filetype:pptx tax evasion UK filetype:xls tax evasion UK filetype:xlsx
  22. 22. Search commands (Google, Bing) Filetype may not always work – The filetype: command may not always work even when a site offers data, for example, in spreadsheet format – Sometimes this is because the data is held in a database and the files are created from a subset of the data when you request the file. – Instead of filetype:xlsx or filetype:xls simply include the word Excel, xls or csv in your search.
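The filetype: searches on slides 21 and 22 can be sketched as a small query builder. This is only an illustration of how the search strings are put together: the function names `filetype_queries` and `fallback_queries` are made up for this sketch, and it builds text to paste into a search box rather than calling any search engine.

```python
def filetype_queries(terms, extensions):
    """Build one search string per file extension using the filetype: command."""
    return [f"{terms} filetype:{ext}" for ext in extensions]


def fallback_queries(terms, format_words=("Excel", "xls", "csv")):
    """Build the fallback searches from slide 22: when filetype: finds nothing,
    add the format name as an ordinary search word instead."""
    return [f"{terms} {word}" for word in format_words]


if __name__ == "__main__":
    # Spreadsheet searches taken from the slide examples.
    for query in filetype_queries("tax evasion UK", ["xls", "xlsx"]):
        print(query)
    # Fallback searches for sites that generate files from a database on request.
    for query in fallback_queries("tax evasion UK"):
        print(query)
```

Running both lists of searches side by side is a quick way to see whether a site's spreadsheets are being missed by filetype:.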
  23. 23. Search commands – site: (most search engines) Site search For searching large websites, or groups of sites by type for example government, NHS, academic Can exclude sites using -site: agricultural occupational asthma UK agricultural occupational asthma UK agricultural occupational asthma UK agricultural occupational asthma UK
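The site: and -site: commands on slide 23 can be sketched in the same way. The site values from the original slide did not survive in this transcript, so the domains used below (`gov.uk`, `nhs.uk`, `ac.uk`) are assumptions chosen to match the "government, NHS, academic" groups the slide mentions, and `site_query` is a hypothetical helper name.

```python
def site_query(terms, include=None, exclude=()):
    """Build a search string limited to one site or domain group (site:)
    and/or excluding others (-site:)."""
    parts = [terms]
    if include:
        parts.append(f"site:{include}")
    # Each excluded domain gets its own -site: command.
    parts.extend(f"-site:{domain}" for domain in exclude)
    return " ".join(parts)


if __name__ == "__main__":
    terms = "agricultural occupational asthma UK"
    # Assumed example domains; substitute the sites you actually need.
    for domain in ("gov.uk", "nhs.uk", "ac.uk"):
        print(site_query(terms, include=domain))
    print(site_query(terms, exclude=["nhs.uk"]))
```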
  24. 24. Searching in multiple languages A significant amount of information is in the local language. Google has removed the extremely useful “Translated foreign pages” search option. This is how it can be done now: 1. Use Google Translate to translate your search into the required language. 2. Copy the translated search and paste it into Google search. 3. Google Chrome will offer to translate the page. If using another browser, click on the ‘Translate this page’ link next to a result to view a translation of just that page.
  26. 26. Evaluating resources Date of publication, 'last updated'. Check text for clues of publication date. Stated date for a web page or document may be automatically generated when it is put onto the web site. After a web site redesign pages are re-uploaded and are assigned a new time-stamp. Some pages are generated "on the fly" so will always have today's date. Type of web site, for example .gov, .edu. Who is really behind the site? – use a domain name register
  27. 27. Who owns a website? A domain name register is a good starting point for identifying who owns a domain name but.... – may be hiding behind an agent – may be using a service such as Privacy Protect – if in the UK and an individual with a personal page then contact details other than name are not publicly available (Data Protection) Run a Google or Bing search on the ‘registrant’ or agent and see if you can find out anything more about them
  28. 28. Domaintools check – good news
  29. 29. Domaintools check – what you do not want to see
  30. 30. Quality of statistics Read the definitions – Definitions and scope may have changed over the years and may be different for each country Political manipulation e.g. unemployment statistics Some data may not exist for some countries e.g. minimum wage Sudden jumps or flatlining in graphs should alert you to oddities in the data (see later) Official data not immune from errors May need an industry expert to help interpret and analyse sector specific data e.g. energy reserves vs resources, “liquids”/oil/petroleum
  31. 31. Some web sites......
  33. 33. UK Government Web Archive | The National Archives Browse by category or choose your organisation from an A-Z list Choose the date of the archived version of the website you want to view
  34. 34. UK Government Web Archive | The National Archives
  35. 35. Wayback Machine
  36. 36. Launched in 2010 Not all legislation has been updated Pending updates are flagged
  37. 37. UK Parliament
  38. 38. Monitoring progress of legislation
  40. 40. The Gazette | Official Public Record
  41. 41. FLARE - Foreign Law Research Group Website
  42. 42. FLARE - Union List of Official Gazettes: Europe
  43. 43. They work for you
  44. 44. WhatDoTheyKnow - Freedom of Information (FOI) requests
  45. 45. Official company information Companies House Lists of official company registers – Official Company Registers – Company registration around the world – Companies House Links US – listed companies – SEC Edgar IDEA – Interactive Data Electronic Applications Canada - listed companies – SEDAR
  46. 46. Company Check UK & Ireland companies and director search Some info free, some free with login, some free with “pro” account, some pay as you go Provides 5 years of figures and graphs for Cash at Bank, Net Worth, Total Current Liabilities and Total Current Assets Download up to 5 yrs of accounts Companies house documents (£2 or 99p depending on subscription) Credit risk, charges, CCJs, Monitor company for financial changes and when new accounts are filed Lists directors of a company - see what other directorships they have
  47. 47. Company Check
  48. 48. Company Check
  49. 49. Company Check Dashboard
  50. 50. DUEDIL
  51. 51. OpenCorporates
  52. 52. OpenCorporates
  53. 53. OpenCorporates
  54. 54. OpenCorporates and Google Refine How to use OpenCorporates to match companies in Google Refine on Vimeo
  55. 55. OpenCharities
  56. 56. OpenCharities
  57. 57. Mapping Corporate Networks With OpenCorporates
  58. 58. ICIJ Offshore Leaks Database ICIJ Releases Offshore Leaks Database Revealing Names Behind Secret Companies, Trusts | International Consortium of Investigative Journalists
  59. 59. Open Access Research Literature
  60. 60. Open Access vs open access vs “open access” Open Access: publicly funded research made available free of charge to the user. Open access journals: publications that are free to the reader; there may or may not be a charge for publishing in the journal. “Open access” predatory publishers: publications of dubious quality that have been set up purely as a means of generating money.
  61. 61. Mandated Open Access US: all research publications resulting from work funded by the US National Institutes of Health are expected to be deposited in PubMed Central – some material embargoed for up to 12 or 24 months – Europe PubMed Central is part of the PMC network of international repositories. UK: from 1st of April 2013 researchers at UK research institutions are expected to publish as open access any peer reviewed research papers and conference proceedings that acknowledge Research Councils UK funding.
  62. 62. UK Gold versus Green OA Gold OA – researchers publish their articles in journals that offer open access publishing (can be established “conventional” publishers) – articles can be made available free of charge to readers immediately – author or institution/department pays article processing fee – CC-BY Green OA – researchers deposit copies of articles in an institutional or subject-based repository, subject to copyright/license permissions – repository makes copies available to the public either immediately or embargoed (more common) – period of embargo varies – CC-BY-NC
  63. 63. Jeffrey Beall List of Predatory Publishers 2014 | Scholarly Open Access
  64. 64. Fragmentation of open access Where are the open access publications? – Individual OA articles within existing subscription journals – Separate OA journals, publishers website – Author’s website – Institutional repositories – Aggregators e.g. Scopus, Web of Science, Google Scholar – Mendeley, ResearchGate?
  65. 65. ResearchGate Increasingly used to request copies of articles from authors
  66. 66. Google Scholar
  67. 67. Google Scholar Does not cover all key journals in all subjects – no source list Top publications for subjects and languages under Metrics link on home page or Scholar indexes the full text but you may have to pay to view the whole article Groups different versions of an article together
  68. 68. Google Scholar
  69. 69. Google Scholar Includes open access material, pre-prints, institutional repositories (but not necessarily author self archived papers on personal websites) Includes material that is NOT peer reviewed but is structured and looks like an academic article (title in large font, authors, affiliations, abstract, keywords, citations) Pre-prints and IR copies may differ from final published version – charts and images may be redacted because of copyright restrictions
  70. 70. Google Scholar Does NOT use the publishers’ metadata Sometimes gets the author wrong Beware the advanced search screen and commands – Date and author search looks in the area of the document where those elements are usually found – Page numbers, part of an address, data item may be mistaken for publication year
  71. 71. Institutional repositories and open access BASE - Bielefeld Academic Search Engine CORE (COnnecting Repositories) DART-Europe E-theses Portal DOAJ: Directory of Open Access Journals Institutional Repository Search (IRS) Open DOAR RIAN - Pathways to Irish Research ROAR - Registry of Open Access Repositories OpenAIRE
  72. 72. Specialist search tools for research information A selection can be found at ArXiv BioMed Central Chemistry Central ChemSpider Deep Web Technologies Mednar Science Research WorldWideScience
  73. 73. Specialist search tools for research information Europe PubMed Central Mendeley Open Biology PhilPapers: Online Research in Philosophy PubMed Central TechXtra
  74. 74. Elsevier’s take down notices Elsevier clamps down on academics posting their own papers online (Wired UK) "Why do we send take down notices? One key reason is to ensure that the final published version of an article is readily discoverable and citable via the journal itself in order to maximise the usage metrics and credit for our authors, and to protect the quality and integrity of the scientific record. The formal publications on our platforms also give researchers better tools and links, for example to data"
  75. 75. Grey literature Literature that has been “peer reviewed” or assessed/approved in some way by colleagues or subject experts but is not easy to find or access Print run may have been small, possibly never published electronically Published on the web but page or site is no longer available Research and technical papers, government reports, pre-prints, market surveys, press releases, committee working papers, conference papers and presentations Use advanced Google commands and web archives to find documents May or may not be open access GreyNet International, Grey Literature Network Service
  76. 76. OpenGrey
  77. 77. Article may be OA but public access outside of the institution may not be allowed or may be difficult WATER report published | SCONUL Walk-in Access To E-Resources
  78. 78. BBC News - Public libraries get online access to research journals For personal research, non-commercial use.
  79. 79. Public Library Initiative by PLS and ProQuest | Access to Research List of participating libraries and publishers Public Library Initiative by PLS and ProQuest | Access To Research Search tool for the journals and articles covered by the agreement. List of journals covered by the agreement Not just Open Access articles but subscription services as well Gold Open Access articles can be viewed anywhere. Other articles can only be viewed on library premises
  80. 80. Free patent information Patent Searching 101: A Patent Search Tutorial Patents & Patent Law Compares Google patent search with other services “Holes in the database” – may not be able to find patents you know exist Does go back further than some e.g. to US patent no.1 but difficult to focus search See also Patents Searching with Esp@cenet, Google Patents and USPTO - University of Bradford
  81. 81. Google Patents Coverage: – US – Canada – European Patent Office (EPO) – Germany – China – World Intellectual Property Organisation (WIPO) Patents available in original language and English (Google Translate)
  82. 82. Google Patents
  83. 83. Google Patents advanced search
  84. 84. Statistics More open data but.... Raw data – no pretty layouts or visualisations, you have to do that Data may need weeding and cleaning before it is usable May have to do a LOT of work on the data to get anything sensible out of it To see what you could be letting yourself in for look at Tony Hirst’s blog postings on open data, for example Reshaping Horse Import/Export Data to Fit a Sankey Diagram importexport-data-to-fit-a-sankey-diagram/
  85. 85. Official statistics OFFSTATS UK National Statistics Publication Hub Office for National Statistics Welsh Government | Statistics Welsh Assembly Government StatsWales Eurostat European Union Open Data Portal
  87. 87. UK National Statistics & ONS
  88. 88. Publication Hub is an “index” to what is available and links through to other sites. ONS only shows reports since 2008 even if there are earlier editions. Use the Publication Hub to search for the report title/series. Once you have found an edition of the title click on “Current and past editions” to see the list of editions available. Then click on the relevant report.
  89. 89. Publication Hub search
  92. 92. Not all of the data on this site is open data – may be restrictions on use Download links sometimes take you to the wrong dataset Download links sometimes completely broken It’s all or nothing! May have to filter the datasets for the information you want and produce your own graphs and charts Variety of formats
  93. 93. Eurostat
  94. 94. European Union - Open Data Portal http://open-
  95. 95. European Union - Open Data Portal http://open-
  96. 96. Google Public Data Explorer One of Google's best kept secrets! Public data sets made available by Eurostat, World Bank, IMF, CSO Ireland, OECD, ITU, some national statistics offices (but not ONS), and many more. Source and date updated given. Charts and charting options can highlight oddities and missing data Look at the charts to see if there is a sudden change in the trends.
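The advice to look for sudden changes and missing data can also be applied to a downloaded series before charting it. The sketch below is one possible check, not a method from the slides: the function name `flag_oddities`, the 25% threshold and the sample figures are all assumptions made up for illustration and do not come from any real dataset.

```python
def flag_oddities(series, threshold=0.25):
    """Flag missing values and year-on-year changes larger than `threshold`.

    `series` is a list of (year, value) pairs in date order; a value of None
    means the figure is missing for that year.
    """
    flags = []
    previous = None
    for year, value in series:
        if value is None:
            flags.append((year, "missing"))
            continue
        if previous is not None and previous != 0:
            change = (value - previous) / previous
            if abs(change) > threshold:
                flags.append((year, f"{change:+.0%} change"))
        previous = value
    return flags


if __name__ == "__main__":
    # Invented figures, purely to show the output format.
    sample = [(2008, 100.0), (2009, 102.0), (2010, None), (2011, 104.0), (2012, 160.0)]
    for year, note in flag_oddities(sample):
        print(year, note)
```

A flagged year is only a prompt to go back to the definitions and source notes; a jump may reflect a change in scope rather than an error.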
  97. 97. Google Public Data Explorer Minimum Wage
  98. 98. Google Public Data Explorer Minimum Wage
  99. 99. Eurostat - Minimum Wage
  101. 101. Google Public Data Explorer – DPT immunizations
  102. 102. World Bank DPT
  103. 103. Datamarket Open portal to datasets worldwide and market research Creates visualisations of the data
  104. 104. Statista “The Statistics Portal for Market Data, Market Research and Market Studies” – 60,000 topics from over 18,000 sources – – some information free, registration (free) required – Chart of the day
  105. 105. Statista
  106. 106. Zanran Searches graphs, charts, tables, PDFs, spreadsheets Can limit by location of server, date, and filetype Title in results list is usually the title or caption to the table and not title of the document Hover over the thumbnail to see a preview of the table or page Click on the URL button next to the result to view the original URL of the document – clicking on it may take you to “page not found” 404 Click on the title of the result to see Zanran’s own copy (free registration usually required) Useful if document no longer available at its original location and can’t be found in any of the web archives
  107. 107. Zanran Zanran – great for data in tables, charts and graphs tables-charts-and-graphs/
  108. 108. Guardian Data Store Data and analysis on topics that are in the news Some data sets created from information obtained via FoI Links to the original datasets are provided
  109. 109. Migrants crossing the Mediterranean: key numbers mediterranean-key-numbers-libya-european#start-of-comments
  110. 110. Public Data Group
  111. 111. Land Registry
  112. 112. Land Registry price paid data
  113. 113. Six months ago the sold price listed was £185,000, for 2012.
  114. 114. Land Registry summary of the postcode
  115. 115. Title document
  116. 116. Missing data Error report filed with the Land Registry three weeks ago Still waiting for a response Why might a property/price paid not appear in the data? Seems not that uncommon according to discussion boards – usually data entry error (but the above example was in the public data until the last update) Absence of price – gift of property or purchase of a share Impractical to calculate price e.g. bulk purchase of properties Commercial transactions
  117. 117. Price paid data report builder
  118. 118. House price indices How the Land Registry undercooks house prices by £90k [Daily Mail alert!] 2270763/How-Land-Registry-undercooks-house-prices-90k.html Which house price index can you trust? 980/Which-house-price-index-can-you-trust.html House Prices - Who Should You Believe? believe.htm Why do average house prices differ? - prices-differ/
  119. 119. Crime statistics
  120. 120. Crime statistics
  121. 121. Crime statistics
  122. 122. Electricity micro-generation Variable Pitch
  123. 123. Electricity micro-generation Variable Pitch
  124. 124. Electricity micro-generation Variable Pitch Virginia Station is the hydroelectric installation that feeds Windsor Castle
  125. 125. FoI request generation data for Virginia Station
  126. 126. Chart and image gallery: 30+ free tools for data visualization and analysis - Computerworld llery_30_free_tools_for_data_visualization_and_analysis
  127. 127. Google Fusion Tables
  128. 128. And finally.... Per capita consumption of mozzarella cheese (US) correlates with Civil engineering doctorates awarded (US)
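The closing slide's point, that two unrelated series can correlate strongly simply because both rise over time, can be illustrated with a plain Pearson correlation. The numbers below are invented for this sketch and are not the real mozzarella or doctorate figures; `pearson` is written out by hand so the example needs nothing beyond the standard library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Made-up series that both happen to trend upwards year on year.
    cheese_per_capita = [9.0, 9.5, 9.8, 10.4, 11.0]
    doctorates_awarded = [480, 500, 540, 550, 600]
    r = pearson(cheese_per_capita, doctorates_awarded)
    print(f"r = {r:.2f}")  # a high value, despite no causal link
```

A coefficient close to 1 here says only that both series move in the same direction; it says nothing about one causing the other.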