Big and Small Web Data


Workshop session given at the Institutional Web Management Workshop 2012 (IWMW 2012) event held at the University of Edinburgh on 18th - 20th June 2012.

Published in: Education, Technology

  1. Big and Small Web Data • Marieke Guy, Institutional Support Officer, Digital Curation Centre, UKOLN, University of Bath, UK • Institutional Web Management Workshop 2012 • This work is licensed under a Creative Commons Licence Attribution-ShareAlike 2.0
  2. Who Am I? • Have worked for UKOLN for over 12 years • Worked on a variety of projects: Subject portals project, IMPACT, Good APIs, JISC Observatory, cultural heritage work, digital preservation work, etc. • Remote worker, into amplified events • Co-chair of IWMW for a number of years • Now working for the Digital Curation Centre • Institutional Support Officer helping HEIs with their RDM • New to data…
  3. The Digital Curation Centre • A consortium comprising units from the Universities of Bath (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII) • Launched 1st March 2004 as a national centre for solving challenges in digital curation that could not be tackled by any single institution or discipline • Funded by JISC with additional HEFCE funding from 2011 for the provision of support to national cloud services • Targeted institutional development
  4. Assessing Data Use
  5. Data Management Tools
  6. Advocacy and Training • Informatics: disciplinary metadata schema, standards, formats, identifiers, ontologies • Storage: file-store, cloud, data centres, funder policy • Access: embargoes, FOI • Policy: making the case • How to cite data
  7. Who Are You? • Are you part of a Web team? • Are you part of an MIS team? • Are you a researcher? • Do you know what data is? • Do you use structured data? • Do you manage data?
  8. Today’s Workshop: A Data Journey! • Presentation: What is data anyway? Looking at current data trends and what it has to do with Web managers • Break out groups: What data do you deal with? Anything goes from personnel data to key information sets and Web stats… • Presentation/Show and Tell: Taster of tools that help with data (mining, citation, visualization, analytics, etc.) • Presentation: Case study - Data @ Southampton • Discussion and buzzword bingo
  9. Today’s Resources • All URLs at: • All slides at: • Also on IWMW12 Web site
  10. What is Data Anyway?
  11. A Data Definition • Datum is / data are (!!!): – Facts and statistics collected together for reference or analysis – Typically the results of measurements – Can be qualitative or quantitative – Unstructured or structured – Raw data, field data, experimental data – Data – information – knowledge – Data is the lowest level of abstraction • Even researchers don’t know what data is…
  12. A Data Present • “Data underpins our economy and our society - data about how much is being spent and where, data about how schools, hospitals and police are performing, data about where things are and data about the weather.” Tim Berners-Lee, director of W3C.
  13. Some Flavours of Data • Big data • DIY data • Consumer data • Activity data • Crowd-sourced data • Linked data / Web of data / semantic Web • Open data
  14. Big Data • “Big data people obviously like alliteration – ‘volume, velocity, variety, value’, ‘speed, size, scope’” Andy Powell • “Data that is too big to manage using ‘normal’ (database) tools.”
  15. Big Data • “I worry there won’t be enough people around to do the analysis” Chris Ponting, University of Oxford • “Raw image files for a single human genome have been estimated at 28.8 terabytes, which is approaching 30,000 gigabytes” • “The cost of sequencing DNA has taken a nosedive...and is now dropping by 50% every 5 months” • “The 1000 Genomes Project generated more DNA sequence data in its first 6 months than GenBank had accumulated in its entire 21 year existence” • “A single sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project”
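
The figures quoted on this slide can be sanity-checked with a little arithmetic. The Python sketch below converts 28.8 terabytes to gigabytes under both the decimal and binary conventions, and works out what "dropping by 50% every 5 months" implies over a year. The function names are illustrative and do not come from the talk.

```python
def terabytes_to_gigabytes(tb, binary=False):
    """Convert terabytes to gigabytes (1 TB = 1000 GB, or 1024 GB if binary)."""
    factor = 1024 if binary else 1000
    return tb * factor


def cost_fraction_after(months, halving_period_months=5):
    """Fraction of the original cost left after `months`, if cost halves every period."""
    return 0.5 ** (months / halving_period_months)


genome_tb = 28.8  # figure quoted on the slide
print(terabytes_to_gigabytes(genome_tb))               # decimal gigabytes
print(terabytes_to_gigabytes(genome_tb, binary=True))  # binary gigabytes
print(round(cost_fraction_after(12), 3))               # share of cost left after a year
```

Under either convention 28.8 TB comes to roughly 29,000 GB, which fits the quoted "approaching 30,000 gigabytes", and a halving every 5 months leaves less than a fifth of the original cost after 12 months.
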
  16. Big Data • 3 Vs: volume, velocity and variety • Could include scientific & research data, Web logs, RFID data, social data, search data, video, e-commerce • Likely to require different tools and practices from what ‘we are used to’ • Technologies include massively parallel processing (MPP) databases, datamining grids, distributed file systems, distributed databases, cloud computing platforms and scalable storage systems • Example tools are Hadoop, NoSQL, CouchDB • Issues regarding storage, speed of access, exponential growth, infrastructure, complexity
  17. DIY Data • Kyle Machulis: “DIY” human physiology data
  18. Consumer Data
  19. Consumer Data
  20. Consumer Data • 1 in every 9 people on Earth is on Facebook • 30 billion pieces of content are shared on Facebook each month • Google has been estimated to run over 1 million servers in data centers around the world • Walmart take data from 1 million customer transactions per hour • There are over 6 billion photos on Flickr
  21. Activity Data • “Data about users’ actions and attention” • Access, attention and activity • Many systems in institutions store data about the actions of students, teachers and researchers • It’s good business • JISC Projects: – Recommender systems – Improving the student experience – Resource management • JISC Info kit – Business intelligence • Student retention
  23. Crowd Sourced Data • “Crowd-sourced” astronomy
  24. Open Data • “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.” Open Knowledge Foundation • Why? Use of public money, advancement of science • Why not? Commercial and reputation reasons, cost of preparing data • “You can do all types of stuff with data” TBL • But tricky to open access to data (cost, preparation, capturing meaning, annotations, context, etc.) • Data is more valuable when accessible • Open data on Web: CKAN, infochimps, openstreetmap, dbpedia, freebase, numbrary, etc.
  25. Linked Data • Repurposing and aggregating data in machine-readable format • Southampton • Lucero project • Linkeduniversitie • XCRI • Lincoln
  26. The Key Data Issues • Scale and complexity – data deluge – volume, pace, infrastructure • Sensitivity of data • Openness – why aren’t people sharing? • Quality of data • Reputation – FOI, DPA, computer misuse • Management – Storage, incentive, costs & sustainability • Preservation – where is your data? • Funding for researchers • Analysis • Doing something useful with it…
  27. Sensitive Data • DPA 1998 – Sensitive Personal Data: “Data regarding an individual’s race or ethnic origin, political opinion, religious beliefs, trade union membership, physical or mental health, sex life, criminal proceedings or convictions…” – Personal data • Relates to a living individual • The individual can be identified from those data and other information • Includes any expression of opinion about the individual • Data that may incriminate a person • Data a person prefers not to share with wider society
  28. Openness • Choices are made according to context, with degrees of openness reached according to: • The kinds of data to be made available • The stage in the research process • The groups to whom data will be made available • On what terms and conditions it will be provided • Default position of most: • YES to protocols, software, analysis tools, methods and techniques • NO to making research data content freely available to everyone • After all, where is the incentive? Angus Whyte, RIN/NESTA, 2010
  29. Reputation
  30. Data Storage Challenges • Scalable • Cost-effective (rent on-demand) • Secure (privacy and IPR) • Robust and resilient • Low entry barrier / ease-of-use • Has data-handling / transfer / analysis capability • What about Cloud services? The case for cloud computing in genome informatics. Lincoln D Stein, May 2010
  31. The Web Managers ask: “So what has all this got to do with me..?”
  32. Break Out Groups • What data do you deal with? – Personnel data – Admissions – Timetables – Curriculum – Key information sets – Web stats… • What do you do with this data? • Could you do more? What?
  33. Are the Web Managers still asking? “So what has all this got to do with me..?”
  34. A Data Future • “The ability to take data - to be able to understand it, to process it, to extract value from it, to visualise it, to communicate it – that’s going to be a hugely important skill in the next decades.” Hal Varian, Chief Economist, Google
  35. Web Teams and Data • Data is relevant to those working with the Web at HEIs because: • Data will affect your IT infrastructure, if it doesn’t already • Data is becoming increasingly important for the REF and for funding so it will be increasingly important to your HEI • It is getting easier to ask for data • Structured data could make your life easier • The Web itself is becoming more structured • Data can show impact • It’s all about the data…
  36. Web Teams and Data • Unstructured data accounts for more than 90% of the digital universe (2011 Digital Universe study) • Structured data has been on the rise for some time – deep web, annotation schemes, search data • In the past web pages have contained information, now is the time for them to contain data • Some key data areas Web teams need to think about: – Structure – Metrics – Patterns, data mining and analytics – Preservation (maybe one for another day?)
  37. Web Data: Structure • Move toward a Web that’s more fluid, less fixed, and more easily accessed on a multitude of devices • Brad Frost: “get your content ready to go anywhere because it’s going to go everywhere.” • Karen McGrane calls them “content blobs” – “we can embrace meaningful, modular chunks that are ready to travel” • Google Knowledge Graph: “currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects” • “a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers”
  38. Preparing for Structure • There is a need for structured content in Web sites • ‘Future ready content’ - Sara Wachter-Boettcher – 1. Get Purposeful – why do users want this content? – 2. Get Micro – get granular, break content down (microdata) – 3. Get Meaningful – considering the meaning of elements – 4. Get Organised – looking at your CMS – 5. Get Structured – DITA? XML? HTML5 (microdata) • ‘Create once, publish everywhere’ idea – mobile, APIs, etc.
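
As a rough illustration of the "create once, publish everywhere" idea, the Python sketch below keeps one structured content chunk and renders it twice: as HTML carrying microdata attributes, and as JSON for an API or mobile client. The sample event, its field names, and the choice of the schema.org `Event` type are assumptions made for this example; they are not taken from the talk.

```python
import json
from html import escape

# One structured "content chunk"; the event and its values are hypothetical.
open_day = {
    "name": "Undergraduate Open Day",
    "startDate": "2012-09-08",
    "location": "Main Campus",
}


def to_microdata(item, item_type="https://schema.org/Event"):
    """Render the chunk as HTML with microdata attributes (assumed vocabulary)."""
    props = "".join(
        f'<span itemprop="{escape(key)}">{escape(value)}</span>'
        for key, value in item.items()
    )
    return f'<div itemscope itemtype="{escape(item_type)}">{props}</div>'


def to_json(item):
    """Render the same chunk as JSON, e.g. for an API or a mobile app."""
    return json.dumps(item, sort_keys=True)


print(to_microdata(open_day))
print(to_json(open_day))
```

Both outputs come from the same source record, so a correction made once reaches every channel.
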
  39. Web Data: Metrics • Metrics – the new black? Kristen Ratan • “The more you know the more you realise you don’t know” • What should we be tracking? e.g. figures opened, downloaded, links clicked, time spent on article page, supplemental info viewed, authors’ info viewed • Look at the pathways that info travels • Data can drive tenure and promotion, grants, reputation, discovery, prioritization, attention • Issues: Missed citation data, data sources that aren’t reliable, digital addresses change, usage doesn’t mean useful
  40. Web Data: Patterns • “In other words, we no longer need to speculate and hypothesise; we simply need to let machines lead us to the patterns, trends, and relationships in social, economic, political, and environmental relationships.” Mark Graham, Big Data blog, the Guardian.
  41. Web Data: Analytics • Customers expect us to be leveraging their activity to benefit their user experience • “the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data.” Adam Cooper, CETIS • Reporting and descriptive methods vs inferential and predictive methods • Data driven decisions? “human decisions supported by the use of good tools to provide us with data-derived insights” • Don’t “let the numbers speak for themselves” – data is only one input to the decision process • Data specialists and domain specialists work together • Need to ask the right questions
  42. Web Data: Learning Analytics • “The measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs.” 1st International Conference on Learning Analytics & Knowledge • Open University Learner Analytics Project – Looked at withdrawals - e.g. when students stop study before completion of a module towards a degree – Possible to map at what points on paths of study withdrawals occur • Other uses: personalisation, recommendation, research profiles, marketing and surveys, help desk, CRM, library • Looking at disabled students/accessibility – linking learner analytics and web metrics
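
Mapping where withdrawals occur on a path of study can be sketched as a simple count per module and week. The Python below is a minimal illustration only: the records, module codes and function names are invented for the example and say nothing about how the Open University project was actually implemented.

```python
from collections import Counter

# Hypothetical records: (student id, module code, study week of withdrawal).
withdrawals = [
    ("s01", "M101", 3),
    ("s02", "M101", 3),
    ("s03", "M101", 9),
    ("s04", "M205", 3),
    ("s05", "M205", 12),
]


def withdrawals_by_point(records):
    """Count withdrawals at each (module, week) point on the study path."""
    return Counter((module, week) for _student, module, week in records)


def peak_points(records, top=1):
    """Return the `top` most common withdrawal points with their counts."""
    return withdrawals_by_point(records).most_common(top)


print(peak_points(withdrawals))
```

A count like this is descriptive reporting; deciding why a given week loses students still needs the domain specialists mentioned on the previous slide.
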
  43. Protection of Freedoms Bill • The Protection of Freedoms Bill is a UK parliamentary bill introduced in February 2011 • Has completed its readings – now passing through the House of Lords • 102 - amendments to FOIA - mandatory for public authorities to permit re-use of datasets when communicating them in response to a FOI request • Datasets are collections of information held in electronic form, i.e. raw data gathered or created in connection with the university’s functions or services • Government’s Innovation and Research Strategy for Growth - “a transformation in the accessibility of research and data”
  44. Tools that Could Help
  45. Tools: Structure • Google Rich Snippets testing tool – tests microdata, microformats, RDFa • List of tools on Semanticweb.org
  46. Tools: Metrics & Text Mining • Google Analytics • Elsevier • total-impact • altmetric.com
  47. Tools: Analytics • SNAPP: Social Networks Adapting Pedagogical Practice • GLASS (Gradient’s Learning Analytics System) • International Educational Data Mining society • Learning Analytics and Knowledge Conference
  48. Data Visualisations • Use your IT and your graphics design department • Make it interactive • Getting Awesome Results from Data Visualisation – Rich Kirk • Data visualisation strategy – Have a purpose – Have measurable KPIs vs purpose – Plan distribution in advance – Resource – Ensure visualisation matches purpose • Chart chooser (Gene Zelazny’s Saying It With Charts) • Measurement: pageviews, buzz, links, key word ranking • “Tell a story with your data” – Ewan McIntosh at IDCC11
  49. Data Visualisation Help • Great Web sites – Ewan McIntosh – Information is Beautiful – Pinterest – Guardian data blog – Flowing data – Infosthetics – information aesthetics – where form follows data • Great tools – Manyeyes – Chartsbin, icharts, Google chart tools – Google developer – Google Fusion tables – Tableau public – Datamarket – Colour Brewer
  50. Visualisations: Google Maps
  51. Data Case Study: Southampton • Not big data but small data • Got to be useful!! Chris Gutteridge
  52. Southampton Data • Places: Buildings, Rooms, Campuses, Counties, Disabled Access • Organisation Structure • Products & Services: Coffee, Sandwiches, Library Services, Recycle Points • Points of Service: Coffee Shops, Swimming Pools, Libraries, Receptions • Teaching: Courses, Modules, Statistics, Student Satisfaction • Travel: Stations, Bus-Stops, Bus-Routes, Bus Times • Resources: EPrints, Videos, Learning Objects • People: Contact Information, Experts for the Media • Events: Open Days, University History • Jargon
  53. Southampton Open Data
  54. Southampton Uses… • Google docs, Excel spreadsheets, RDF, triples • Grinder – GitHub • Graphite – PHP library • Graphite (publishing RDF). Required skills: – RDF structure – RDF/XML – XSLT • Graphite (consuming RDF). Required skills: – RDF structure – PHP
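
For readers new to RDF, the "RDF structure" skill boils down to statements of the form subject, predicate, object. The sketch below shows that structure in plain Python rather than with Graphite or PHP, which is what Southampton is described as using; it serialises two triples as N-Triples and then reads a label back out. The `example.org` URIs and the building are hypothetical placeholders, not real Southampton identifiers.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical URIs under example.org; not the real Southampton identifiers.
building = "http://example.org/building/32"
triples = [
    (building, RDF_TYPE, ("uri", "http://example.org/ns/Building")),
    (building, RDFS_LABEL, ("literal", "Building 32")),
]


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(rows):
    """Publish: serialise (subject, predicate, object) rows as N-Triples lines."""
    lines = []
    for subject, predicate, (kind, value) in rows:
        obj = f"<{value}>" if kind == "uri" else f'"{escape_literal(value)}"'
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return "\n".join(lines)


def labels(rows):
    """Consume: map each subject to its rdfs:label literal."""
    return {s: v for s, p, (kind, v) in rows if p == RDFS_LABEL and kind == "literal"}


print(to_ntriples(triples))
print(labels(triples))
```

A real deployment would normally rely on an RDF library for parsing and serialisation; the hand-rolled version here only exposes the shape of the data.
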
  55. Data Case Study: Aberdeen • “I managed the Web and then inherited MIS. These two have now converged so that Web is using much better, structured data and standardising and consolidating sources. The MIS brings discipline to the Web – much needed if you ask me, anarchist though I am...” Mike McConnell, Head of Web Services, University of Aberdeen.
  56. Student Attendance Data • Loughborough University’s Pedestal for Progression • Roehampton University’s fulCRM • Southampton Student Dashboard at the University of Southampton: tutees, directory info, whether coursework has been handed in, and attendance • University of Derby’s SETL (Student Engagement Traffic Lighting) • The ESCAPES (Enhancing Student Centred Administration for Placement ExperienceS) project at the University of Nottingham
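
The "traffic lighting" idea can be illustrated with a small threshold function over attendance data. This is a generic Python sketch: the thresholds (80% and 50%) and the function name are assumptions chosen for the example and are not taken from SETL or any of the projects listed.

```python
def traffic_light(attended, scheduled, amber_below=0.8, red_below=0.5):
    """Classify attendance as red, amber or green using assumed thresholds."""
    if scheduled <= 0:
        raise ValueError("scheduled must be positive")
    rate = attended / scheduled
    if rate < red_below:
        return "red"
    if rate < amber_below:
        return "amber"
    return "green"


# Hypothetical students: (sessions attended, sessions scheduled).
for attended, scheduled in [(9, 10), (7, 10), (4, 10)]:
    print(attended, scheduled, traffic_light(attended, scheduled))
```

In practice attendance would be one signal among several, such as coursework handed in, and the cut-offs would be a matter for each institution to set.
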
  57. Conclusions • At the moment it’s all about the data… (whether you like it or not!) • Be aware of what is happening with data at your institution – data repository, MIS, RIM, CRIS, repository etc. Where do you sit in the picture? • Structure your Web data – it makes sense • You can start with ‘little data’… • Think about what strategic questions you want to ask • Be grounded – efficiency and effectiveness • Start from the user end - think about the uses and output • Follow up from the IT end – how can you automate processes? • What can you use your data for? Can you show impact/success? • How about telling a story with it?
  58. Buzzword Bingo • data wrangler • Linked data • cloud computing • Big data • para data • Data-Driven Decision making • data mining • data journalism • data scientist • data tsunami • knowledge discovery in data (KDD) • clustering • predictive analytics
  59. What Data Can and Cannot Do • From Guardian Datablog, by Jonathan Gray • Data is not a force unto itself. • Data is not a perfect reflection of the world. • Data does not speak for itself. • Data is not power. • Interpreting data is not easy.
  60. Thanks!! • “The data that is valuable to you is already passing through your hands” Doug Cutting, Chairman, Apache Software Foundation