Big and Small Web Data


Workshop session given at the Institutional Web Management Workshop 2012 (IWMW 2012) event held at the University of Edinburgh on 18th - 20th June 2012.

Published in: Education, Technology

  • Top right: gene sequencing machines at the Beijing Genomics Institute, one of the largest genomic institutes in the world, crunching out genomic data 24/7. Field telescope scanning the night sky, streaming in vast amounts of data. Sensor equipment to monitor air quality in the desert.
  • Data is a plural word but is often used as a singular word: “The data is stored” vs “The data are stored”.
  • CERN: huge teams to deal with data. The Large Hadron Collider produces around 15 petabytes of data annually. University of Bristol/Cloudant talked about 50 terabyte datasets.
  • Data facts: 1. If you stacked a pile of CD-ROMs on top of one another until you’d reached the current global storage capacity for digital information (about 295 exabytes), it would stretch 80,000 km beyond the moon. 2. Every hour, enough information is consumed by internet traffic to fill 7 million DVDs. Side by side, they’d scale Mount Everest 95 times. 3. 247 billion e-mail messages are sent each day… up to 80% of them are spam. 4. By 2020, IT departments will be looking after 10 x more servers, 50 x more data and 75 x more files. Meanwhile, the number of IT administrators keeping track of all that data growth will increase by 1.5 times. 5. We can expect a 40-60 per cent projected annual growth in the volume of data generated, while media intensive sectors, including financial services, will see year on year data growth rates of over 120 per cent. 6. The world’s 500,000+ data centres are large enough to fill 5,955 football fields. 7. 75% of digital information is generated by individuals, whilst enterprises have liability for 80% of digital data at some point in its life. 8. There are nearly as many bits of information in the digital universe as there are stars in our actual universe. 9. Investment in digital enterprises has increased 50% since 2005. 10. There are 30 billion pieces of content shared on Facebook every day. 11. In 2010, 28% of the digital universe required some level of security… not all of it had the level of security it required. 12. People wishing each other Happy New Year drove a 500% surge in smartphone data within just one year, according to 3UK, whose customers used a whopping 80 terabytes (TB) on the 31st December 2011, compared to just 14 TB on the same day in 2010.
  • Kyle Machulis – quantified self movement – body hacking – wrist band monitors – recording bodily functions and using information to improve your lifestyle
  • Recommender systems: AEIOU – searching in Welsh repositories; RISE – recommendations services – Open University; SALT – library data – MIMAS; OpenURL activity data – EDINA. Improving the student experience: Exposing VLE activity data (EVAD) – University of Cambridge; Student Tracking And Retention (Next Generation): STAR-Trak: NG – Leeds Met; Library Impact Data Project (LIDP) – Huddersfield. Resource management: Exploiting Access Grid Activity Data (AGtivity) – University of Manchester.
  • Open science/citizen science – armchair astronomers. Indexing is a volunteer project which aims to create searchable digital indexes for scanned images of historical documents. The documents are drawn primarily from a collection of 2.4 million microfilms made of historical documents from 110 countries and principalities. Galaxy Zoo is a citizen science project that lets members of the public classify a million galaxies from the Sloan Digital Sky Survey. The project has led to numerous scientific papers and citizen scientist-led discoveries such as Hanny's Voorwerp. The Katrina PeopleFinder Project used crowdsourcing to collect data for lost persons. Over 4,000 people donated their time after Hurricane Katrina. It included 90,000 entries. Life in a Day is Kevin Macdonald's 95-minute documentary film comprising an arranged series of video clips selected from 80,000 clips (4500 hours) submitted to the YouTube video sharing website, the clips showing respective occurrences from around the world on a single day. The Open Dinosaur Project is a community research project to aggregate published measurements of ornithischian dinosaur limb bones for many different taxa in order to study the multiple evolutionary transitions from bipedality to quadrupedality in this group of dinosaurs. reCAPTCHA uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas. Humans are presented images of the book, and asked to provide the corresponding text. Twenty years of The New York Times have already been digitized. Secret London is composed mostly of Londoners who use the site to share suggestions and photos of London. Originally started as a Facebook Group in 2010 in response to a competition to win an internship at Saatchi & Saatchi, Secret London gained 150,000 members within 2 weeks.
  • One of the most impressive linked data projects in UK higher education is the Southampton Open Data Service. This project is taking data sets, used by institutional administrators, and making them available in linked data formats. A number of applications have been built on the data including an interactive university map, a catering menu search function, university telephone directories, and apps making it easier for students to navigate open days. Successful universities and colleges of the future will have to build an infrastructure that turns them into reliable data hubs, able to analyse even very large and complex datasets internally and to pass on their insights – for free to students, for a fee to business. As a technology, linked data is still work in progress and JISC is working to develop its capabilities for further and higher education. Data are only a raw material and their present and future value depends on how we can use them. Find out whether you’re already using the technology. As it works behind the scenes people don’t always know where products are built around linked data. Find out what demand exists for which data within your institution and among partners, and which are your most valuable data. Cultivate an ethos of innovation – experiment with linked data in small-scale, inexpensive projects and in close contact with internal and end users of the data. Share and reuse these innovations. If you do try out linked data within your institution ask your IT team to demonstrate how the end user can access linked data. Find out more about how to publish linked data.
  • Schema.org – search engine collaboration – Google, Bing, Yahoo, Yandex. Launched June 2011 – 300 classes, 261 properties
  • The future is flexible, and we’re bending with it. From responsive web design to future-friendly thinking, we’re moving quickly toward a web that’s more fluid, less fixed, and more easily accessed on a multitude of devices.
As we embrace this shift, we need to relinquish control of our content as well, setting it free from the boundaries of a traditional webpage to flow as needed through varied displays and contexts. In the words of Brad Frost, “get your content ready to go anywhere because it’s going to go everywhere.”
But don’t unlock the shackles just yet: our content is far from future-ready. When extracted from the carefully designed pages on which it lives today, most web content turns into undifferentiated text, its meaning lost as it spills into any container you give it.
We can do better. Rather than accept these “content blobs,” as Karen McGrane calls them, we can embrace meaningful, modular chunks that are ready to travel.
This is a content strategy problem, true. But listen up, designers, developers, and UXers: you’re not excused just yet. This job takes editorial, architectural, and technical knowledge.
This is a project for all of us.
Preparing for structure
Most conversations about structured content dive headfirst into the technical bits: XML, DITA, microdata, RDF. But structure isn’t just about metadata and markup; it’s what that metadata and markup mean. Before we start throwing around fancy acronyms, we need to get closer to the content itself, creating a framework for making smart decisions about its structure. Only then can we tackle technology in meaningful, useful ways. So hang on: this part’s important.
1. GET PURPOSEFUL
You’re already designing sites with both user and organizational goals in mind, right? Great. Now you need to translate those goals to a smaller scale, applying them to each type of content you have, like blog posts, articles, rotating features, or product descriptions.
To do this, you’ll need to be able to answer questions like: How does this kind of content support the overall site goals? Why would a user want it? What is the organization accomplishing by publishing it? What does the organization want the user to do with it?
Just as it’s critical to establish site goals before launching into design decisions, you have to know what each type of content is intended to accomplish before you can make decisions about how you need to treat it in different contexts. Otherwise, how can you ensure that content keeps doing its job as it flexes and twists to meet the needs of each device it’s displayed on?
(Now, if you realize your content isn’t accomplishing anything, or you don’t know what kinds of content you’re dealing with, you’ve got a bigger problem on your hands. Before getting friendly with the future, go cozy up to your client or boss and figure out what matters.)
2. GET MICRO
All right, you know why the articles or recipes or limericks or whatever kinds of content you’re dealing with exist. Good, because now it’s time to get even more granular, breaking these content types down into their core elements.
The specific elements you’ll need to consider will vary greatly depending on the type of content you’re working with, so start by identifying all the content chunks you can find in a given type of information. These could be things like titles, teasers, body content, ingredient lists, reviews, pull quotes, excerpts, images, videos, captions, related articles, bylines, directions, addresses, and many more.
Take a recipe for asparagus, fingerling potato, and goat cheese pizza from the popular site Epicurious, for example.
Recipes are a pretty common type of content, so you may think you’ve got this one figured out already: title, ingredients, directions.
But look again, and you’ll see a whole universe of interconnected elements contributing to this single piece of content: Title Publication Attribution Publication Date Byline Yield Teaser Description Image Ingredients Preparation Wine Pairings Ratings Reviews Main Ingredients Cuisine Type Dietary Considerations Related Recipe Collections
An information architect or content strategist sure comes in handy in determining these attributes, but everyone on the team needs to be fully engaged, because you’ll need these chunks to make major decisions about how content will respond to changes in device and display.
3. GET MEANINGFUL
Understanding which content chunks exist is just the start. Now you need to understand why each one matters to the whole, and how much it matters. This allows us to make decisions about how content is organized, prioritized, and displayed for different screen sizes, contexts, or purposes.
You can begin to do this by considering: How does this element contribute to the content’s purpose? What meaning is lost if this element goes away? What relationships exist between this element and the others?
If this were my project, I’d do some hefty research into organizational goals, current content use patterns, and user needs well before getting here. But, for example’s sake, we’ll work with assumptions. Since Epicurious is a publisher, let’s assume it wants to increase page views to bump advertising revenue.
Since it’s a recipe site, let’s assume users are there to find something suitable to cook.
This scenario could translate to a content-level goal like, “recipes should be compelling, specific, and connected, so users want to make them, can easily tell whether they meet their needs, and ultimately want to visit additional Epicurious content.”
As you hold that goal up against these content elements, some interesting questions emerge: Removing all those related items may seem like an easy way to reduce clutter for small screen sizes, but will that decrease the number of total pages a user visits? If we make sidebar content push below main content as the screen size narrows, will users be frustrated at wading through ingredients to get to the recipe’s rating? What would happen to users’ interest in the recipe if we removed the image? Does a title, if displayed elsewhere without its teaser description, tell the user enough to be meaningful?
These are difficult questions to answer. Wine pairings may be extremely compelling for the aspiring sommelier, and entirely unappealing for a teetotaler. Ingredients may be a critical first stop for someone with food allergies, but secondary to someone without.
We may never be able to anticipate each user’s personal preferences, but the more we understand the relationships between information, the more the compromises inherent in any design decision will be clear, and the better prepared we are to make tough calls.
For example, in many responsive designs, sidebars are immediately pushed beneath main content for smartphone-sized displays. But is this always the right answer? Here, ratings, reviews, and main ingredients give readers an at-a-glance means to evaluate the recipe, and pushing this information below the ingredient and preparation sections could make them all but useless.
That’s the thing about adapting content to varied layouts: each case is different.
One-size-fits-all rules about how content should react are unlikely to serve your many content types, which means they won’t serve your users’ needs or your business goals either. And as more devices and technologies emerge, you’ll need to develop new rules and make new compromises as well.
Good thing is, we don’t need a crystal ball to start taking action. We can begin today simply by improving the ways our content is stored.
4. GET ORGANIZED
The future is sexy; content management systems are not. And yet, your CMS may well be what’s standing between your carefully considered content and its ability to travel. Think about the elements we’ve identified and the relationships and priorities that define them. Are the CMSes you’ve worked with ready for this level of content? If so, you’re in the minority. The rest of us have some work to do.
One organization that’s taken great strides to future-ready its CMS is National Public Radio. Back in 2009, NPR launched a methodology it calls Create Once, Publish Everywhere. With COPE, each story is entered into a set of discrete fields within the CMS, then made available via an API to multiple platforms, such as the NPR website, device-specific applications for iPad and iPhone, the NPR music site, and local NPR affiliate stations’ sites.
NPR’s CMS supports a variety of content elements, but only four are required: a title, short slug, longer description, and date line, says Zach Brand, the head of technology for NPR’s digital media. Additional attributes (like images, audio, or bylines) are all optional. Once in the CMS, the story is distributed via API and ultimately published using various combinations of elements determined by the needs of the platform on which it’s being published.
If we want systems that can handle this kind of modular, fast-moving content, it’s time we get cozier with our CMSes, and the people who develop, integrate, and customize them.
Armed with knowledge from your in-depth analysis, you now have the tools to embrace a strategic approach to content management, which will help you to: Ensure those focused on CMS features and capabilities understand your content and what it’s intended to accomplish. Explain the types of content you’ll need and what elements they require, much like NPR has defined the attributes of its stories. Understand your CMS’s possibilities and limitations, and collaborate on how to deal with them. Ease your technical team’s burden by providing them with thoughtful, specific direction to inform the CMS’s requirements.
This groundwork will serve you well even if you’re just managing a basic website, but as you begin to share content across more devices and channels, it becomes critical. With a CMS that’s organized around modular, meaningful chunks of content, you’ll be ready to create rules for how that content should bend and shift, and have the systems in place to actually implement them.
5. GET STRUCTURED
There’s a reason this article didn’t begin with a primer on XML. Technology can’t help you make good decisions; it can only help you implement them. But content elements must eventually become code, so even if writing markup isn’t your job, we could all stand to get more comfortable with the tools out there to do it.
Structured content isn’t new. Technical communicators have been pushing DITA (Darwin Information Typing Architecture) for years, and there’s nothing particularly futuristic about it. Based on XML, a markup language that gives content components an inherent meaning when displayed beyond their database, DITA authors and publishes technical information in content modules: small pieces of information designed for reuse and categorized according to topic.
[1] Designed by IBM to manage the company’s own technical content, it’s most widely used for things like help documentation.
Many technical communicators insist DITA should be the web’s standard structuring approach, but it’s never quite caught on. It’s also not the only way to do it. HTML5 now supports semantic markup through its microdata extension, which goes beyond traditional presentational tags and allows you to mark up content with standards-compliant, semantically rich HTML. [2] Of course, HTML5 itself is still a working draft, and it’s unclear whether microdata will gain widespread use, or offer enough specificity to suit our content. For example, late last year, the “time” element was removed in favor of the more generic “data.”
There’s also Schema.org, a microdata-based approach launched in 2011 by Bing, Google, and Yahoo!. Designed to create a common language across search engines, Schema.org arranges microdata into taxonomies of content types that start broadly and branch into ever-more-specific elements. Critics, however, point out that Schema.org is a closed system: the search engines tell us which structures matter, rather than allowing content owners to define them.
Many people are passionate about which of these approaches is best, and why everyone else is doing it wrong. I’m not one of them. Fact is, we may be a long way from a definitive markup method, and none of these currently supports all kinds of content, anyway. Use the one that makes the most sense for your project right now; in fact, that could mean not even worrying about markup yet.
Giving life to structure
What matters much more than markup is the work we put in to get there: the rules and relationships determined through analyzing content closely and caring for its message and purpose. After all, “semantic” connotes meaning, typically the meaning of language.
Whatever markup language you use, it’s not semantic unless it pushes meaning forward, which is why you can’t start with markup; you end with it.
This, I think, is why structured content has often been written off as too technical and utilitarian for the mainstream web crowd: because we’ve left the editorial side, the experiential side (the part that lends content life) out of these conversations.
This needs to stop. Future-ready content isn’t about becoming an XML expert or assuming microdata will solve your problems. It’s about seeing structures through the lens of meaning and storytelling, and building relationships across disciplines so that our databases reflect this richness and complexity.
We don’t have all the answers, but we do have a clear place to start: with our content itself. As we break our content down, analyze its elements, and document the relationships that turn those elements into a meaningful whole, we can begin to create and manage content in a way that endures, wherever the future leads us.
Technology will change. Standards will evolve. But the need for understanding our content (its purpose, meaning, structure, relationships, and value) will remain. When we can embrace this thinking, we will unshackle our content, confident it will live on, heart intact, as it travels into the great future unknown.
References
[1] For an introduction to DITA from the tech-comms perspective, download the Rockley Group’s whitepaper, Preparing for DITA: What you need to know.
[2] See Microdata: HTML5’s Best-Kept Secret on Web Monkey and Brian Cray’s HTML5 Microdata: Why isn’t anyone talking about it?.
  • From Paul Miller: According to Brockmeier, the audience (of data scientists) apparently narrowly agreed that their arsenal of tools and algorithms trumped the knowledge and experience of the meteorologists, financiers, and retailers to whose domains data scientists are increasingly turning. Data scientists are an increasingly capable bunch, and the tools at their disposal sometimes appear almost magical in their capability to derive insight.
  • First 5 minutes -
  • Big and Small Web Data
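The “Create Once, Publish Everywhere” idea in the notes above (each story held as discrete fields, four of them required, then published in different combinations per platform) may be easier to picture with a small sketch. The Python below is only an illustration under stated assumptions: the `Story` class, the `PLATFORM_FIELDS` table, the `publish` function, and the example URL are all hypothetical names invented here, and nothing in it reflects how NPR's CMS or API is actually built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Story:
    """A story held as discrete fields rather than one page-shaped blob."""
    # The four elements the notes describe as required.
    title: str
    slug: str
    description: str
    date_line: str
    # Optional attributes; each platform decides which ones it uses.
    byline: Optional[str] = None
    image_url: Optional[str] = None
    audio_url: Optional[str] = None


# Hypothetical platform rules: which fields each channel publishes, in order.
PLATFORM_FIELDS = {
    "website": ["title", "byline", "date_line", "image_url", "description"],
    "mobile_app": ["title", "slug", "audio_url"],
    "affiliate_feed": ["title", "description", "date_line"],
}


def publish(story: Story, platform: str) -> dict:
    """Return only the fields a platform wants, skipping unset optional ones."""
    selected = {}
    for name in PLATFORM_FIELDS[platform]:
        value = getattr(story, name)
        if value is not None:
            selected[name] = value
    return selected


# Example data is made up for illustration.
story = Story(
    title="Example headline",
    slug="example-headline",
    description="A longer description of the example story.",
    date_line="2012-06-18",
    audio_url="https://example.org/audio/example.mp3",
)

for platform in PLATFORM_FIELDS:
    print(platform, publish(story, platform))
```

The point of the design is that the decision about what to show lives in per-platform rules, not in the stored content, so a new device only needs a new entry in the rules table.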

    1. 1. Big and Small Web Data. Marieke Guy, Institutional Support Officer, Digital Curation Centre, UKOLN, University of Bath, UK. Institutional Web Management Workshop 2012. This work is licensed under a Creative Commons Licence Attribution-ShareAlike 2.0
    2. 2. Who Am I? • Have worked for UKOLN for over 12 years • Worked on a variety of projects: Subject portals project, IMPACT, Good APIs, JISC Observatory, cultural heritage work, digital preservation work, etc. • Remote worker, into amplified events • Co-chair of IWMW for a number of years • Now working for the Digital Curation Centre • Institutional Support Officer helping HEIs with their research data management (RDM) • New to data…
    3. 3. The Digital Curation Centre • A consortium comprising units from the Universities of Bath (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII) • Launched 1st March 2004 as a national centre for solving challenges in digital curation that could not be tackled by any single institution or discipline • Funded by JISC with additional HEFCE funding from 2011 for the provision of support to national cloud services • Targeted institutional development
    4. 4. Assessing Data Use
    5. 5. Data Management Tools
    6. 6. Advocacy and Training • Informatics: disciplinary metadata schema, standards, formats, identifiers, ontologies • Storage: file-store, cloud, data centres, funder policy • Access: embargoes, FOI • Policy: making the case • How to cite data
    7. 7. Who Are You? • Are you part of a Web team? • Are you part of an MIS team? • Are you a researcher? • Do you know what data is? • Do you use structured data? • Do you manage data?
    8. 8. Today’s Workshop: A Data Journey! • Presentation: What is data anyway? Looking at current data trends and what it has to do with Web managers • Break out groups: What data do you deal with? Anything goes from personnel data to key information sets and Web stats… • Presentation/Show and Tell: Taster of tools that help with data (mining, citation, visualization, analytics, etc.) • Presentation: Case study - Data @ Southampton • Discussion and buzzword bingo
    9. 9. Today’s Resources • All URLs at: • All slides at: • Also on IWMW12 Web site
    10. 10. What is Data Anyway?
    11. 11. A Data Definition • Datum is / data are (!!!): – Facts and statistics collected together for reference or analysis – Typically the results of measurements – Can be qualitative or quantitative – Unstructured or structured – Raw data, field data, experimental data – Data – information – knowledge – Data is the lowest level of abstraction • Even researchers don’t know what data is…
    12. 12. A Data Present: “Data underpins our economy and our society - data about how much is being spent and where, data about how schools, hospitals and police are performing, data about where things are and data about the weather.” Tim Berners-Lee, director of W3C.
    13. 13. Some Flavours of Data • Big data • DIY data • Consumer data • Activity data • Crowd Sourced data • Linked data / Web of data / semantic Web • Open data
    14. 14. Big Data: “Big data people obviously like alliteration – ‘volume, velocity, variety, value’, ‘speed, size, scope’” Andy Powell • “Data that is too big to manage using ‘normal’ (database) tools.”
    15. 15. Big Data: “I worry there won’t be enough people around to do the analysis” Chris Ponting, University of Oxford • “Raw image files for a single human genome have been estimated at 28.8 terabytes, which is approaching 30,000 gigabytes” • “The cost of sequencing DNA has taken a nosedive... and is now dropping by 50% every 5 months” • “The 1000 Genomes Project generated more DNA sequence data in its first 6 months than GenBank had accumulated in its entire 21 year existence” • “A single sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project”
    16. 16. Big Data • 3 Vs: volume, velocity and variety • Could include scientific & research data, Web logs, RFID data, social data, search data, video, e-commerce • Likely to require different tools and practices from what ‘we are used to’ • Technologies include massively parallel processing (MPP) databases, datamining grids, distributed file systems, distributed databases, cloud computing platforms and scalable storage systems • Example tools are Hadoop, NoSQL, CouchDB • Issues regarding storage, speed of access, exponential growth, infrastructure, complexity
    17. 17. DIY Data: Kyle Machulis, “DIY” human physiology data
    18. 18. Consumer Data
    19. 19. Consumer Data
    20. 20. Consumer Data • 1 in every 9 people on Earth is on Facebook • 30 billion pieces of content are shared on Facebook each month • Google has been estimated to run over 1 million servers in data centers around the world • Walmart take data from 1 million customer transactions per hour • There are over 6 billion photos on Flickr
    21. 21. Activity Data • “Data about users’ actions and attention” • Access, attention and activity • Many systems in institutions store data about the actions of students, teachers and researchers • It’s good business • JISC Projects: – Recommender systems – Improving the student experience – Resource management • JISC Info kit – Business intelligence • Student retention
    22. 22.
    23. 23. Crowd Sourced Data: “Crowd-sourced” astronomy
    24. 24. Open Data • “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.” Open Knowledge Foundation • Why? Use of public money, advancement of science • Why not? Commercial and reputation reasons, cost of preparing data • “You can do all types of stuff with data” TBL • But tricky to open access to data (cost, preparation, capturing meaning, annotations, context, etc.) • Data is more valuable when accessible • Open data on Web: CKAN, infochimps, openstreetmap, dbpedia, freebase, numbrary, etc.
    25. 25. Linked Data • Repurposing and aggregating data in machine readable format • Southampton • Lucero project • Linkeduniversitie • XCRI • Lincoln
    26. 26. The Key Data Issues • Scale and complexity – data deluge – volume, pace, infrastructure • Sensitivity of data • Openness – why aren’t people sharing? • Quality of data • Reputation – FOI, DPA, computer misuse • Management – Storage, incentive, costs & sustainability • Preservation – where is your data? • Funding for researchers • Analysis • Doing something useful with it…
    27. 27. Sensitive Data • DPA 1998 – Sensitive Personal Data: “Data regarding an individual’s race or ethnic origin, political opinion, religious beliefs, trade union membership, physical or mental health, sex life, criminal proceedings or convictions…” – Personal data • Relates to a living individual • The individual can be identified from those data and other information • Includes any expression of opinion about the individual • Data that may incriminate a person • Data a person prefers not to share with wider society
    28. 28. Openness: Choices are made according to context, with degrees of openness reached according to: • The kinds of data to be made available • The stage in the research process • The groups to whom data will be made available • On what terms and conditions it will be provided. Default position of most: • YES to protocols, software, analysis tools, methods and techniques • NO to making research data content freely available to everyone. After all, where is the incentive? Angus Whyte, RIN/NESTA, 2010
    29. 29. Reputation
    30. 30. Data Storage Challenges • Scalable • Cost-effective (rent on-demand) • Secure (privacy and IPR) • Robust and resilient • Low entry barrier / ease-of-use • Has data-handling / transfer / analysis capability. What about Cloud services? “The case for cloud computing in genome informatics”, Lincoln D Stein, May 2010
    31. 31. The Web Managers ask: “So what has all this got to do with me..?”
    32. 32. Break Out Groups: What data do you deal with? • Personnel data • Admissions • Timetables • Curriculum • Key information sets • Web stats… What do you do with this data? Could you do more? What?
    33. 33. Are the Web Managers still asking? “So what has all this got to do with me..?”
    34. 34. A Data Future: “The ability to take data - to be able to understand it, to process it, to extract value from it, to visualise it, to communicate it – that’s going to be a hugely important skill in the next decades.” Hal Varian, Google’s chief economist.
    35. 35. Web Teams and Data • Data is relevant to those working with the Web at HEIs because: • Data will affect your IT infrastructure, if it doesn’t already • Data is becoming increasingly important for the REF and for funding so it will be increasingly important to your HEI • It is getting easier to ask for data • Structured data could make your life easier • The Web itself is becoming more structured • Data can show impact • It’s all about the data…
    36. 36. Web Teams and Data • Unstructured data accounts for more than 90% of the digital universe (2011 Digital Universe study) • Structured data on the rise for some time – deep web, annotation schemes, search data • In the past web pages have contained information, now is the time for them to contain data • Some key data areas Web teams need to think about: – Structure – Metrics – Patterns, data mining and analytics – Preservation (maybe one for another day?)
    37. 37. Web Data: Structure • Move toward a Web that‘s more fluid, less fixed, and more easily accessed on a multitude of devices •‘s Brad Frost, ―get your content ready to go anywhere because it‘s going to go everywhere.‖ • Karen McGrane: calls them ―content blobs‖ – ―we can embrace meaningful, modular chunks that are ready to travel‖ • Google Knowledge Graph: ―currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects‖ • ―a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers‘‖37
    38. 38. Preparing for Structure • There is a need for structured content in Web sites • ‘Future ready content’ - Sara Wachter-Boettcher – 1. Get Purposeful – why do users want this content? – 2. Get Micro – get granular, break content down ( – microdata) – 3. Get Meaningful – considering the meaning of elements – 4. Get Organised – looking at your CMS – 5. Get Structured – DITA? XML? HTML5 (microdata) • ‘Create once, publish everywhere’ idea – mobile, APIs, etc.
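The "Get Micro" and "create once, publish everywhere" ideas above can be illustrated with one structured record rendered to more than one output. This is a minimal sketch, not a method from the talk: the course record, its field names, and the choice of the schema.org `Course` type are all illustrative assumptions. Only the HTML5 microdata attributes (`itemscope`, `itemtype`, `itemprop`) are standard.

```python
import html
import json

# Hypothetical content chunk: one course held as structured fields, not as a page.
course = {
    "name": "Introduction to Web Data",
    "description": "Structure, metrics and analytics for web teams.",
}

def to_microdata(item):
    """Render the record as HTML5 microdata (itemscope/itemtype/itemprop)."""
    return (
        '<div itemscope itemtype="https://schema.org/Course">'
        f'<h2 itemprop="name">{html.escape(item["name"])}</h2>'
        f'<p itemprop="description">{html.escape(item["description"])}</p>'
        "</div>"
    )

def to_json(item):
    """Render the same record as JSON, e.g. for an API or a mobile client."""
    return json.dumps(item, sort_keys=True)

print(to_microdata(course))
print(to_json(course))
```

Because the content lives in fields rather than in a finished page, adding another output (a feed, a mobile view) would mean adding another small rendering function rather than rewriting the content.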
    39. 39. Web Data: Metrics • Metrics – the new black? Kristen Ratan • “The more you know the more you realise you don’t know” • What should we be tracking? e.g. figures opened, downloaded, links clicked, time spent on article page, supplemental info viewed, authors’ info viewed • Look at the pathways that info travels • Data can drive tenure and promotion, grants, reputation, discovery, prioritization, attention • Issues: missed citation data, data sources that aren’t reliable, digital addresses change, usage doesn’t mean useful
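As a rough illustration of the "what should we be tracking?" bullet, the sketch below counts event types per article from a small event log. The log format, article ids and event names are made up for the example; a real analytics export would look different.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (article id, event type) pairs.
events = [
    ("article-1", "figure_opened"),
    ("article-1", "download"),
    ("article-1", "download"),
    ("article-2", "link_clicked"),
    ("article-2", "download"),
]

def summarise(event_log):
    """Count each event type per article."""
    totals = defaultdict(Counter)
    for article_id, event_type in event_log:
        totals[article_id][event_type] += 1
    return totals

summary = summarise(events)
for article_id in sorted(summary):
    print(article_id, dict(summary[article_id]))
```

Counts like these show what was used, not whether it was useful, which is one of the issues listed on the slide.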
    40. 40. Web Data: Patterns “In other words, we no longer need to speculate and hypothesise; we simply need to let machines lead us to the patterns, trends, and relationships in social, economic, political, and environmental relationships.” Mark Graham, Big Data blog, the Guardian.
    41. 41. Web Data: Analytics • Customers expect us to be leveraging their activity to benefit their user experience • “the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data.” Adam Cooper, CETIS • Reporting and descriptive methods vs inferential and predictive methods • Data driven decisions? “human decisions supported by the use of good tools to provide us with data-derived insights” • Don’t “let the numbers speak for themselves” – data is only one input to the decision process • Data specialists and domain specialists work together • Need to ask the right questions
    42. 42. Web Data: Learning Analytics • “The measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs.” 1st International Conference on Learning Analytics & Knowledge • Open University Learner Analytics Project – Looked at withdrawals - e.g. when students stop study before completion of a module towards a degree – Possible to map at what points on paths of study withdrawals occur • Other uses: personalisation, recommendation, research profiles, marketing and surveys, help desk, CRM, library • Looking at disabled students/accessibility – linking learner analytics and web metrics
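Mapping where withdrawals occur along a path of study can be sketched as a simple count per point in the module. This is only an illustration under stated assumptions, not the Open University project's method: the record format and the use of "week of module" as the point on the path are hypothetical.

```python
from collections import Counter

# Hypothetical records: (student id, week of the module in which the student withdrew).
withdrawals = [("s1", 3), ("s2", 3), ("s3", 7), ("s4", 3), ("s5", 12)]

def withdrawals_by_week(records):
    """Return a mapping of module week -> number of withdrawals."""
    return Counter(week for _student, week in records)

def peak_week(records):
    """Return the week with the most withdrawals (ties: first encountered)."""
    week, _count = withdrawals_by_week(records).most_common(1)[0]
    return week

print(sorted(withdrawals_by_week(withdrawals).items()))
print(peak_week(withdrawals))
```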
    43. 43. Protection of Freedoms Bill • The Protection of Freedoms Bill is a UK parliamentary bill introduced in February 2011 • Has completed its readings – now passing through the House of Lords • 102 - amendments to FOIA - mandatory for public authorities to permit re-use of datasets when communicating them in response to an FOI request • Datasets are collections of information held in electronic form, i.e. raw data gathered or created in connection with the university’s functions or services • Government’s Innovation and Research Strategy for Growth - “a transformation in the accessibility of research and data”
    44. 44. Tools that Could Help
    45. 45. Tools: Structure • Google Rich Snippets testing tool – tests microdata, microformats, RDFa • List of tools on Semanticweb.org
    46. 46. Tools: Metrics & Text Mining • Google Analytics • Elsevier • total-impact • altmetric.com
    47. 47. Tools: Analytics • SNAPP: Social Networks Adapting Pedagogical Practice • GLASS (Gradient’s Learning Analytics System) • International Educational Data Mining society • Learning Analytics and Knowledge Conference
    48. 48. Data Visualisations • Use your IT and your graphics design department • Make it interactive • Getting Awesome Results from Data Visualisation – Rich Kirk • Data visualisation strategy – Have a purpose – Have measurable KPIs vs purpose – Plan distribution in advance – Resource – Ensure visualisation matches purpose • Chart chooser (Gene Zelazny’s Saying It With Charts) • Measurement: pageviews, buzz, links, key word ranking • “Tell a story with your data” – Ewan McIntosh at IDCC11
    49. 49. Data Visualisation Help • Great Web sites – Ewan McIntosh – Information is Beautiful – Pinterest – Guardian data blog – Flowing data – Infosthetics – information aesthetics – where form follows data • Great tools – Manyeyes – Chartsbin, icharts, Google chart tools – Google developer – Google Fusion tables – Tableau public – Datamarket – Colour Brewer
    50. 50. Visualisations: Google Maps
    51. 51. Data Case Study: Southampton • Not big data but small data • Got to be useful!! Chris Gutteridge -
    52. 52. Southampton Data • Places: Buildings, Rooms, Campuses, Counties, Disabled Access • Organisation Structure • Products & Services: Coffee, Sandwiches, Library Services, Recycle Points • Points of Service: Coffee Shops, Swimming Pools, Libraries, Receptions • Teaching: Courses, Modules, Statistics, Student Satisfaction • Travel: Stations, Bus-Stops, Bus-Routes, Bus Times • Resources: EPrints, Videos, Learning Objects • People: Contact Information, Experts for the Media • Events: Open Days, University History • Jargon
    53. 53. Southampton Open Data
    54. 54. Southampton Uses… • Google Docs, Excel spreadsheets, RDF, triples • Grinder – GitHub • Graphite – PHP library • Graphite (publishing RDF). Required skills: – RDF structure – RDF/XML – XSLT • Graphite (consuming RDF). Required skills: – RDF structure – PHP
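For readers unfamiliar with the "RDF structure" skill listed above: RDF describes things as subject, predicate, object triples. The sketch below is in Python and does not use Graphite or any Southampton tooling. It serialises two made-up triples in the N-Triples syntax, and every `example.org` URI is a placeholder. It is deliberately simplified: any string starting with `http://` or `https://` is treated as an IRI, and only backslashes and double quotes are escaped in literals.

```python
# Hypothetical triples describing a building; the example.org URIs are placeholders.
triples = [
    ("http://example.org/building/1",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Main Library"),
    ("http://example.org/building/1",
     "http://example.org/vocab/hasCoffeeShop",
     "http://example.org/shop/7"),
]

def term(value):
    """Format a value as an N-Triples IRI or a plain string literal."""
    if value.startswith("http://") or value.startswith("https://"):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_ntriples(rows):
    """Serialise (subject, predicate, object) rows, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in rows)

print(to_ntriples(triples))
```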
    55. 55. Data Case Study: Aberdeen “I managed the Web and then inherited MIS. These two have now converged so that Web is using much better, structured data and standardising and consolidating sources. The MIS brings discipline to the Web – much needed if you ask me, anarchist though I am...” Mike McConnell, Head of Web Services, University of Aberdeen.
    56. 56. Student Attendance Data • Loughborough University’s Pedestal for Progression • Roehampton University’s fulCRM • Southampton Student Dashboard at the University of Southampton: tutees, directory info, whether coursework has been handed in, and attendance • University of Derby’s SETL (Student Engagement Traffic Lighting) • The ESCAPES (Enhancing Student Centred Administration for Placement ExperienceS) project at the University of Nottingham
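The general idea of "traffic lighting" attendance data can be sketched as a threshold classification. This is not the logic of SETL or of any other project listed above: the function name, the inputs and both thresholds are assumptions chosen purely for illustration.

```python
def traffic_light(attended, scheduled, amber_below=0.8, red_below=0.5):
    """Classify an attendance rate as 'green', 'amber' or 'red'.

    The default thresholds are illustrative assumptions, not values
    taken from any of the projects named on the slide.
    """
    if scheduled <= 0:
        raise ValueError("scheduled must be positive")
    rate = attended / scheduled
    if rate < red_below:
        return "red"
    if rate < amber_below:
        return "amber"
    return "green"

# Hypothetical students: (sessions attended, sessions scheduled).
for attended, scheduled in [(9, 10), (7, 10), (4, 10)]:
    print(attended, scheduled, traffic_light(attended, scheduled))
```

In practice a flag like this would be one input for a tutor to act on, in line with the earlier point that data is only one input to a decision.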
    57. 57. Conclusions • At the moment it’s all about the data… (whether you like it or not!) • Be aware of what is happening with data at your institution – data repository, MIS, RIM, CRIS, repository etc. Where do you sit in the picture? • Structure your Web data – it makes sense • You can start with ‘little data’… • Think about what strategic questions you want to ask • Be grounded – efficiency and effectiveness • Start from the user end - think about the uses and output • Follow up from the IT end – how can you automate processes? • What can you use your data for? Can you show impact/success? • How about telling a story with it?
    58. 58. Buzzword Bingo • Linked data • data wrangler • cloud computing • Big data • paradata • Data-Driven Decision making • data mining • data journalism • data scientist • data tsunami • knowledge discovery in data (KDD) • clustering • predictive analytics
    59. 59. What Data Can and Cannot Do • From Guardian Datablog, by Jonathan Gray • Data is not a force unto itself. • Data is not a perfect reflection of the world. • Data does not speak for itself. • Data is not power. • Interpreting data is not easy.
    60. 60. Thanks!! “The data that is valuable to you is already passing through your hands” Doug Cutting, Chairman, Apache Software Foundation