Volume. Big data uses massive datasets.
Variety. Big data often involves bringing together data from different sources, e.g. tweets and sales data.
Velocity. In some contexts, it is important to analyse data as quickly as possible, even in real time, e.g. when your bank texts you about a possible fraudulent transaction.
https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
What kinds of data are we talking about? At the very least, photographs of pages, which can then be transcribed as text. We can then offer collections of metadata, text and images, for reading individually or mining as a dataset. A shift from reading pages to reading a dataset enables entirely new research questions.
If you look at dates and names, you can see that the data is sometimes fuzzy and messy. Must it be flattened to fit into precise, specific systems?
Messy data - lots of different formats; not everything uses standard vocabularies, so it's hard to be certain exactly who or what entities in the world they mean
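One way to avoid flattening a fuzzy date is to keep the uncertainty next to the parsed value. A minimal Python sketch, assuming date strings shaped like '1848?-1888'; the function name parse_life_dates and the output field names are hypothetical and not part of any catalogue system:

```python
import re

# Assumed pattern: a four-digit year, optionally followed by '?'
YEAR = re.compile(r"(\d{4})(\?)?")

def parse_life_dates(text):
    """Return one {'year': int, 'uncertain': bool} entry per year found."""
    return [
        {"year": int(year), "uncertain": mark == "?"}
        for year, mark in YEAR.findall(text)
    ]

print(parse_life_dates("1848?-1888"))
```

Keeping the flag means a later analysis can choose to include or exclude uncertain values, rather than having that choice made silently at load time.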
Thousands of UK websites have been collected since 2004. As at 30 Nov: 239.46GB; number of archived websites: 15,112; 79,276 'instances', i.e. snapshots.
The UK Web Archive is a good example of variety - web pages / sites have multiple elements, and meaning is often contained in links.
What makes it complex or hard to process?
AKA why do people get excited about it? Examples from different domains.
e.g. document review and to assist in pre-trial research; pre-crime detection, sentencing recommendations 'Symantec's eDiscovery platform is able to perform all tasks "from legal hold and collections through analysis, review, and production", and proved capable of analysing and sorting more than 570,000 documents in two days' Markoff (2011) in http://www.oxfordmartin.ox.ac.uk/downloads/reports/Citi_GPS_Technology_Work.pdf
Memorial Sloan-Kettering Cancer Center [Bassett (2014)] personalise a treatment plan with reference to a given patient's individual symptoms, genetics, family and medication history
Have you ever been creeped out by websites or marketing that seems to know a bit too much about you?
Ethics - discussion - what ethical dilemmas have you encountered in your own work, or heard of in other contexts? Should you use data just because it's now more convenient? Scale and convenience pushing at ethics.
Planning for big data (lessons from cultural heritage)
Dr. Mia Ridge, @mia_out
Digital Curator, British Library
• What is big data?
• How is it used?
• How do you prepare for working with it?
Defining 'big data'
Data that is too large or too complex to process
manually / with a desktop computer
– Number of records
– Size of files
– Mixed formats
– Unstructured data
– Relationships between datasets
Defining 'big data' - Gartner
'Volume. Data that have grown to an immense size,
prohibiting analysis with traditional tools
Variety. Multiple formats of structured and
unstructured data—such as social-media posts,
location data from mobile devices, call center
recordings, and sensor updates—that require fresh
approaches to collection, storage, and management
Velocity. Data that need to be processed in real or near-
real time in order to be of greatest value, such as
instantly providing a coupon to customers standing in
the cereal aisle based on their past cereal purchases'
The challenges of scale
• The BL holds 180-200 million items, including:
• 8 million stamps
• 310,000 manuscript volumes
• Over 4 million maps
• Legal deposit material including pamphlets,
magazines, newspapers, sheet music and maps
• Television and radio recordings
• Websites, e-books, e-journals
• Over 3 million new items are added every year
• Only 1-2% of collections digitised
The impact of scale
My experience at Cooper Hewitt: 20% of my
residency 'dealing with the sheer size of the
dataset: it's tricky to load 60mb worth of 270,000
rows into tools that are limited by:
• the number of rows (Excel),
• rows/columns (Google Docs) or
• size of file (Google Refine, ManyEyes)
'search-and-replace cleaning takes a long time'
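When a file is too big for spreadsheet tools, one option is to read it a row at a time instead of opening it whole. A minimal sketch using Python's standard csv module; the column name 'country' and the in-memory sample are made up for illustration and are not the Cooper Hewitt data:

```python
import csv
import io
from collections import Counter

def count_values(lines, column):
    """Tally the values in one column, reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column].strip()] += 1
    return counts

# Made-up sample standing in for a large file. For a real file, pass
# open("records.csv", newline="", encoding="utf-8") instead (hypothetical path).
sample = io.StringIO("id,country\n1,England\n2,United Kingdom\n3,England \n")
print(count_values(sample, "country").most_common())
```

A tally like this is also a quick way to see which variant spellings exist before deciding how to clean them.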
A splendid assortment of Gceloag
and West of England. Tweed ; also
Black Doeakin Woollen Cloths
alwaya on hand. Snit made to
order in six hoars' notice, on most
reaainable terms. Mr. M'Mohon,
Mysteries of Melbourne life
by Cameron, Donald, 1848?-1888.
Usage Public Domain Mark 1.0
Topics Australia -- Fiction
Different data, different uses
Datasets about our collections: bibliographic datasets relating to our published and archival holdings
Datasets for content mining: content suitable for use in text and data
Datasets for image analysis: image collections suitable for large-scale
Datasets from UK Web Archive: data and API services available for accessing UK Web Archive collections
Digital mapping: geospatial data, cartographic applications, digital aerial photography and scanned-in historic map materials
http://bl.uk/digital
Question: what kinds of big data are
you interested in working with?
What makes it 'big'?
Machine learning, artificial intelligence
and big data
Computational techniques that learn from
examples and/or data without being
programmed in advance
• Recruitment - shortlisting CVs to job ads
• Ecommerce - Netflix, Amazon, Spotify
Veritas case study, 'Early Case Assessment in Electronic Discovery'
Personalised treatment plans for cancer patients
• IBM Watson is used by oncologists at Memorial
Sloan-Kettering Cancer Center, with suggestions
'informed by data from 600,000 medical evidence
reports, 1.5 million patient records and clinical
trials, and two million pages of text from medical
• Microsoft similarly use machine learning and
natural language processing to sort through
Planning for big data: stages
• Identify potential sources
• Digitising (unless everything is already
available as digital text/images)
• Collecting (unless everything is already
• Reformatting (unless everything is ready to be
loaded into software)
• Storage, backup, software licences
Stages: reviewing permissions
Possible issues include:
• data protection,
• commercial in confidence,
• proprietary systems,
• other licences
Stages: what skills do you need?
• Domain knowledge
• Analytical skills
• Technical skills
(unless your data is already consistent)
• These are not the same place (if you're a
– United States of America
– United States (case)
Opportunity: time to get to know the data
e.g. Google Maps only understood museum
records that used 'United Kingdom'; tens of
thousands of records that used Great Britain,
England, Scotland, Wales, Northern Ireland etc
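The kind of clean-up described above can be sketched as a lookup from variant names to the single form a tool expects. A minimal Python sketch; the VARIANTS table and the normalise_country function are illustrative assumptions, and the original value is returned alongside the normalised one so the more specific information is not lost:

```python
# Illustrative mapping only; a real table would be built by reviewing the data.
VARIANTS = {
    "great britain": "United Kingdom",
    "england": "United Kingdom",
    "scotland": "United Kingdom",
    "wales": "United Kingdom",
    "northern ireland": "United Kingdom",
    "united states of america": "United States",
    "united states": "United States",
}

def normalise_country(value):
    """Return (normalised, original); unknown values pass through unchanged."""
    key = " ".join(value.split()).lower()  # collapse whitespace, ignore case
    return VARIANTS.get(key, value.strip()), value

for raw in ["England", "  great  britain", "France"]:
    print(normalise_country(raw))
```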
Some 'fuzziness' is unavoidable.
• Unexpectedly complex objects e.g. 'Begun in
Kiryu, Japan, finished in France'
• Permanent uncertainty e.g. 'Bali? Java?
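Fuzziness like this can be stored rather than forced into a single value. A minimal Python sketch, assuming a hypothetical PlaceClaim record with a role and an uncertainty flag; the role vocabulary and the parsing rule (a '?' after a name marks it as a candidate) are assumptions for illustration:

```python
import re
from dataclasses import dataclass

@dataclass
class PlaceClaim:
    name: str
    role: str = "made"        # hypothetical vocabulary: "begun", "finished", "made"
    uncertain: bool = False

def parse_candidates(text):
    """Turn text like 'Bali? Java?' into one claim per candidate place."""
    return [
        PlaceClaim(name=name, uncertain=(mark == "?"))
        for name, mark in re.findall(r"\s*([^?]+?)\s*(\?|$)", text.strip())
    ]

# A multi-stage object is recorded as several claims rather than one place.
textile = [
    PlaceClaim("Kiryu, Japan", role="begun"),
    PlaceClaim("France", role="finished"),
]
print(textile)
print(parse_candidates("Bali? Java?"))
```

Any analysis that needs one place per object then has to choose explicitly which claim to use, which makes the simplification visible.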
Reality check results
• Are they accurate?
• Could they do anyone any harm?
• Do they under- or over-report any factors?
• How can you contextualise and explain any
limitations of your analysis? e.g.
– provenance and qualities of original dataset(s);
– how it was transformed, cleaned to fit into
– how confident you are in matches, results;
– what's left out of the analysis, and why?
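One way to report how confident you are in matches is to keep a similarity score with every match instead of a yes/no answer. A sketch using difflib from Python's standard library; the 0.8 threshold is an arbitrary assumption, and 'Doeskin' is an assumed correct reading of an OCR error, used only as sample input:

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate.

    The candidate is None when the best score falls below the threshold.
    Expects a non-empty list of candidates.
    """
    scored = [
        (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return (candidate if score >= threshold else None), round(score, 2)

print(best_match("Doeakin", ["Doeskin", "Tweed"]))
print(best_match("Bali", ["Melbourne"]))
```

Publishing the scores and the threshold alongside the results lets readers judge the matches for themselves.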