Planning for big data
Dr. Mia Ridge, @mia_out
Digital Curator, British Library
digitalresearch@bl.uk @BL_DigiSchol
Planning for big data (lessons from cultural heritage)

Talk for the Association for Project Management's Knowledge Management SIG event, 'What does big data mean for project and knowledge managers?'

Published in: Data & Analytics

  1. Planning for big data
     Dr. Mia Ridge, @mia_out
     Digital Curator, British Library
     digitalresearch@bl.uk @BL_DigiSchol
  2. Outline
     • What is big data?
     • How is it used?
     • How do you prepare for working with it?
  3. What is big data?
  4. Defining 'big data'
     Data that is too large or too complex to process manually / with a desktop computer
     – Number of records
     – Size of files
     – Mixed formats
     – Unstructured data
     – Relationships between datasets
  5. Defining 'big data' - Gartner
     'Volume. Data that have grown to an immense size, prohibiting analysis with traditional tools
     Variety. Multiple formats of structured and unstructured data—such as social-media posts, location data from mobile devices, call center recordings, and sensor updates—that require fresh approaches to collection, storage, and management
     Velocity. Data that need to be processed in real or near-real time in order to be of greatest value, such as instantly providing a coupon to customers standing in the cereal aisle based on their past cereal purchases'
     https://www.bcgperspectives.com/content/articles/it_strategy_retail_how_to_get_started_with_big_data/
  6. 'Big data' in cultural heritage
  7. The challenges of scale
     • The BL holds 180-200 million items, including:
       • 8 million stamps
       • 310,000 manuscript volumes
       • Over 4 million maps
       • Legal deposit material including pamphlets, magazines, newspapers, sheet music and maps
       • Television and radio recordings
       • Websites, e-books, e-journals
     • Over 3 million new items are added every year
     • Only 1-2% of collections digitised
  8. The impact of scale
     My experience at Cooper Hewitt: 20% of my residency 'dealing with the sheer size of the dataset: it's tricky to load 60mb worth of 270,000 rows into tools that are limited by:
     • the number of rows (Excel),
     • rows/columns (Google Docs) or
     • size of file (Google Refine, ManyEyes)'
     'search-and-replace cleaning takes a long time'
     https://labs.cooperhewitt.org/2012/exploring-shape-collections-draft/
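
One possible way around the row and file-size limits described on this slide is to stream the file instead of opening it whole. The sketch below is only an illustration, not the approach used at Cooper Hewitt: it assumes a CSV export with a header row, and the `country` column and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_values(rows: TextIO, column: str) -> Counter:
    """Stream a CSV one row at a time and tally the values in one column."""
    counts: Counter = Counter()
    for record in csv.DictReader(rows):
        counts[record[column].strip()] += 1
    return counts


# Hypothetical sample standing in for a large export file.
sample = io.StringIO(
    "id,country\n"
    "1,USA\n"
    "2,United Kingdom\n"
    "3,USA\n"
)
country_counts = count_values(sample, "country")
```

Because only one row is held in memory at a time, the same function should work on an open file of any length, for example `count_values(open("export.csv", newline=""), "country")` with a file name of your own.
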
  9. A splendid assortment of Gceloag and West of England. Tweed ; also Black Doeakin Woollen Cloths alwaya on hand. Snit made to order in six hoars' notice, on most reaainable terms. Mr. M'Mohon, Cutter.
     Mysteries of Melbourne life by Cameron, Donald, 1848?-1888. Published 1873
     Usage Public Domain Mark 1.0
     Topics Australia -- Fiction
  10. Different data, different uses
      • Datasets about our collections: Bibliographic datasets relating to our published and archival holdings
      • Datasets for content mining: Content suitable for use in text and data mining research
      • Datasets for image analysis: Image collections suitable for large-scale image-analysis-based research
      • Datasets from UK Web Archive: Data and API services available for accessing UK Web Archive collections
      • Digital mapping: Geospatial data, cartographic applications, digital aerial photography and scanned-in historic map materials
      http://bl.uk/digital
  11. #messy data
      http://museum-api.pbworks.com/w/page/21933420/Museum%C2%A0APIs
  12. Question: what kinds of big data are you interested in working with? What makes it 'big'?
  13. How is big data used?
  14. Machine learning, artificial intelligence and big data
      Computational techniques that learn from examples and/or data without being programmed in advance e.g.
      • Recruitment - shortlisting CVs to job ads
      • Ecommerce - Netflix, Amazon, Spotify recommendations
  15. Legal
      https://www.veritas.com/content/dam/Veritas/docs/white-papers/21198622_GA_ENT_WP-Early-Case-Assessment-in-Electronic-Discovery_EN.pdf
      Veritas case study, 'Early Case Assessment in Electronic Discovery'
  16. Medical
      Personalised treatment plans for cancer patients
      • IBM Watson's used by oncologists at Memorial Sloan-Kettering Cancer Center, suggestions 'informed by data from 600,000 medical evidence reports, 1.5 million patient records and clinical trials, and two million pages of text from medical journals'
      • Microsoft similarly use machine learning and natural language processing to sort through research data
      http://news.microsoft.com/stories/computingcancer/
      https://www.mskcc.org/blog/msk-trains-ibm-watson-help-doctors-make-better-treatment-choices
      http://www.oxfordmartin.ox.ac.uk/publications/view/1883
  17. Politics, finance
      http://www.opensecrets.org/resources/learn/anomalies.php
  18. Translation
      • New version of Google Translate uses 'recurrent neural networks' to translate sentences as a whole
      https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
  19. Enhancing records: SherlockNet
      http://bit.ly/sherlocknet
  20. Question: what kinds of decisions could you support by analysing big data? What value would that add?
  21. Working with big data
  22. Planning for big data: stages
      • Identify potential sources
      • Digitising (unless everything is already available as digital text/images)
      • Collecting (unless everything is already centralised)
      • Reformatting (unless everything is ready to be loaded into software)
      • Storage, backup, software licences
  23. Stages: reviewing permissions
      Possible issues include:
      • terms of use when data collected,
      • data protection,
      • copyright,
      • commercial in confidence,
      • proprietary systems,
      • other licences
  24. Stages: what skills do you need?
      • Domain knowledge
      • Analytical skills
      • Technical skills
  25. Stages: cleaning
      (unless your data is already consistent)
      • These are not the same place (if you're a computer):
        – U.S.
        – U.S.A
        – U.S.A.
        – USA
        – United States of America
        – United States (case)
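
To a computer, the variants on this slide are six different strings. A minimal sketch of one way to reconcile them follows, assuming a hand-built lookup table. `CANONICAL_COUNTRIES` and `normalise_country` are hypothetical names, and a real project would build the table from the values actually found in its own data.

```python
import re

# Hypothetical lookup table; a real project would build this from its own data.
CANONICAL_COUNTRIES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def normalise_country(raw: str) -> str:
    """Return a canonical country name, or the trimmed input if unknown."""
    key = re.sub(r"[^a-z ]", "", raw.lower())  # drop full stops and other punctuation
    key = re.sub(r"\s+", " ", key).strip()     # collapse repeated spaces
    return CANONICAL_COUNTRIES.get(key, raw.strip())


variants = ["U.S.", "U.S.A", "U.S.A.", "USA", "United States of America", "united states"]
normalised = {normalise_country(v) for v in variants}
```

Unknown values are returned unchanged rather than guessed at, so anything the table does not cover stays visible for a person to review.
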
  26. Stages: cleaning
      http://openrefine.org/
  27. ...but be careful
  28. Stages: cleaning
      Challenge: time-consuming
      Opportunity: time to get to know the data
      e.g. Google Maps only understood museum records that used 'United Kingdom'; tens of thousands of records that used Great Britain, England, Scotland, Wales, Northern Ireland etc weren't mapped
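
Getting to know the data can start with a simple question: which values would a downstream tool fail to recognise, and how often do they occur? The sketch below assumes, purely for illustration, that the tool accepts only the exact string 'United Kingdom'. The `RECOGNISED` set and the sample values are hypothetical.

```python
from collections import Counter

# Assumption for illustration: the downstream tool only recognises this one name.
RECOGNISED = {"United Kingdom"}


def unrecognised_values(values, recognised=RECOGNISED):
    """Count values the downstream tool would skip, most frequent first."""
    counts = Counter(v.strip() for v in values if v.strip() not in recognised)
    return counts.most_common()


places = ["England", "United Kingdom", "Scotland", "England", "Wales", "Great Britain"]
report = unrecognised_values(places)
```

Sorting by frequency puts the values that affect the most records at the top, which helps decide where cleaning effort is best spent.
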
  29. Stages: cleaning
      Some 'fuzziness' is unavoidable.
      • Unexpectedly complex objects e.g. 'Begun in Kiryu, Japan, finished in France'
      • Permanent uncertainty e.g. 'Bali? Java? Mexico?'
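
Rather than forcing an uncertain value into a single answer, the uncertainty can be kept as data. This is a minimal sketch, assuming a trailing '?' marks a doubtful place name as in the example above. `PlaceCandidate` and `parse_places` are hypothetical names, and more complex descriptions such as 'Begun in Kiryu, Japan, finished in France' would still need handling by a person.

```python
from dataclasses import dataclass


@dataclass
class PlaceCandidate:
    name: str
    uncertain: bool


def parse_places(raw: str) -> list[PlaceCandidate]:
    """Split a free-text place field on '?' and keep the uncertainty flag."""
    candidates = []
    # Insert a separator after each '?' so the marker stays with its name.
    for chunk in raw.replace("?", "?|").split("|"):
        name = chunk.strip()
        if not name:
            continue
        uncertain = name.endswith("?")
        candidates.append(PlaceCandidate(name.rstrip("?").strip(), uncertain))
    return candidates


candidates = parse_places("Bali? Java? Mexico?")
```
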
  30. Cleaning: don't forget!
      • Versioning
      • Documentation
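
Versioning and documentation can be as light as a log that records what each cleaning step did, plus a fingerprint of the data before and after. The sketch below is one possible shape for such a log, not a prescribed tool. The function names and sample records are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows: list[dict]) -> str:
    """Return a short SHA-256 digest of the rows, stable across runs."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def log_step(log: list[dict], description: str, before: list[dict], after: list[dict]) -> None:
    """Append one documented cleaning step, with before/after fingerprints."""
    log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "what": description,
        "before": fingerprint(before),
        "after": fingerprint(after),
    })


# Hypothetical records before and after one cleaning step.
raw = [{"country": "U.S.A."}, {"country": "USA"}]
cleaned = [{"country": "United States"}, {"country": "United States"}]
cleaning_log: list[dict] = []
log_step(cleaning_log, "Normalised country names", raw, cleaned)
```

Saving each version of the dataset under its fingerprint would let any later result be traced back to the exact data it came from.
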
  31. Stages: enhancing
      http://nlp.stanford.edu:8080/ner/
  32. Stages: verifying
      Reality check results
      • Are they accurate?
      • Could they do anyone any harm?
      • Do they under- or over-report any factors?
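
One rough check for under- or over-reporting is to compare how often each value appears in the results with how often it appears in the full dataset. The sketch below assumes both are available as simple lists of values. The figures are invented for illustration.

```python
from collections import Counter


def share_by_value(values):
    """Return each distinct value's share of the total, as a fraction."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def representation_gap(full, subset):
    """Share in the subset minus share in the full dataset, per value.

    Positive numbers mean the value is over-represented in the subset.
    """
    full_share = share_by_value(full)
    subset_share = share_by_value(subset)
    return {v: subset_share.get(v, 0.0) - share for v, share in full_share.items()}


# Hypothetical figures: the whole collection versus the records that were mapped.
collection = ["England"] * 6 + ["United Kingdom"] * 4
mapped = ["United Kingdom"] * 4
gaps = representation_gap(collection, mapped)
```

In this invented example 'England' has a negative gap because none of its records were mapped, which is the kind of silent omission a reality check should surface.
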
  33. Stages: dissemination
      • How can you contextualise, explain any limitations of your analysis? e.g.
        – provenance and qualities of original dataset(s);
        – how it was transformed, cleaned to fit into software;
        – how confident you are in matches, results;
        – what's left out of the analysis, and why?
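
The four questions on this slide could be captured in a small record that travels with the published results. This is a sketch of one possible structure. The class name, field names and example values are all illustrative assumptions, not an established standard.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisNote:
    """Caveats to publish alongside results (field names are illustrative)."""
    source: str
    transformations: list[str] = field(default_factory=list)
    confidence: str = "not assessed"
    exclusions: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the note as plain text, one caveat per line."""
        lines = [f"Source: {self.source}"]
        lines += [f"Transformed: {step}" for step in self.transformations]
        lines.append(f"Confidence: {self.confidence}")
        lines += [f"Left out: {item}" for item in self.exclusions]
        return "\n".join(lines)


note = AnalysisNote(
    source="Hypothetical museum collection export",
    transformations=["Normalised country names"],
    confidence="Place matches spot-checked by hand",
    exclusions=["Records with no place recorded"],
)
summary = note.to_text()
```
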
  34. The only way is Ethics
  35. ICO: Big data and data protection
      https://ico.org.uk/media/for-organisations/documents/1541/big-data-and-data-protection.pdf
  36. ICO: Big data and data protection
  37. The ethics of convenience?
      • More data is digital
      • More data is retained
      • More data contains identifiers
      It's easier than ever before to make creepy decisions
  38. Question: what ethical issues might arise with big data in your field? How can you resolve them?
  39. Thank you! Questions?
      Dr. Mia Ridge, @mia_out
      Digital Curator, British Library
      digitalresearch@bl.uk @BL_DigiSchol
