Managing large and complex data sets


Presentation given by Catherine Hardman of the Archaeology Data Service in York.

The presentation was given at the 'Managing Archaeology Data' event on Monday 7th March 2011 at the University of Glasgow.

Published in: Education

Notes for slides:
  • How big is your data? This was asked in order to get an idea of the scale of the problem. You’ll see there is some quite big data being produced out there: some people are producing over 200GB for a single project.
  • We ran an online questionnaire to find out about users and uses of big data, and got 48 responses. I’ll just skim through some of the things that came out of it. This is one of the first questions we asked: we wanted to get an idea of the data collection techniques that people are using to create big data. You’ll see there’s a wide range of technologies, including the ones I mentioned on an earlier slide.
  • Of the 101 software packages entered into the online form, a staggering 52 are unique (that is, after editing for things like upper- and lower-case differences). It seems the world of ‘big data’ is very fragmented.
  • This is an interesting one. We asked if people had an archival policy for the data sets in question. Only 48% of respondents said that they have a policy in place. Of these, many noted that their policies were localised and incomplete, not a formal written policy. A proper system of digital archiving should involve continuous active management of the data; putting data on a DVD and putting it in a drawer is not really a stable archival policy. A formal archival policy, as we see it, should ideally be based on the OAIS model: continuous active management of data to ensure its survival into the future.
  • An overwhelming “yes” to this question. Some of the reasons that were cited: monitoring over time, avoiding duplication, and saving time and money. Of course, re-use just isn’t possible unless someone is archiving and providing access to this data.
    1. Managing large and complex data sets

    2. The problem… in 1996
        • My lithics report here, on floppy disc
    3. The ADS: some ancient history
        • The Archaeology Data Service:
        • set up in 1996
        • one of five AHDS subject centres
        • based within the University of York
        • Funding:
        • initially received funding from
            • Arts and Humanities Research Council (AHRC)
            • Joint Information Systems Committee (JISC)
        • Presently receives core funding from AHRC alongside cross-sectoral, project-based funding.
    4. What do we do?
        • Our remit:
        • “To support research, learning and teaching with high quality and dependable digital resources.”
        • In practice this means three key things:
        • That ADS collect and preserve datasets
        • That we allow full, easy and free access to these
        • And that we additionally provide guidance and support to data creators
    5. No need for digital preservation
        • Domesday Book. Publisher: William of Normandy (1086). Still readable.
    6. Where’s preservation when you need it?
        • Domesday Disc. Publisher: BBC (1986). Nearly lost.
    7. Why is it important?
    8. What’s the problem? Information Entropy
        • Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. 1997. Nongeospatial Metadata for the Ecological Sciences. Ecological Applications 7: 330-342.
    9. The scale of the problem in the 1990s: strategies for protecting physical media
        • Findings and Recommendations from ‘Digital Data in Archaeology: A Survey of User Needs’ (Condron et al 1999)
    10. Protecting physical media … never the twain
    11. The scale of the problem in the 1990s: the popularity of storage options
        • Findings and Recommendations from ‘Digital Data in Archaeology: A Survey of User Needs’ (Condron et al 1999)
    12. 8" Floppy, 3.5" Floppy, 5.25" Floppy, 12" Optical Disk, 5.25" Optical Disk, CD-ROM, Sparq Disk Cartridge, Zip Disk, Click!, DVD-ROM, Jaz Disk, Floptical Disk, Punch Tape, Rectangular Hole Punch Card, IBM 3480, DLT Tape, DG90M Tape, DC4_120, 8mm D-eight, QIC DC600, G2000 Tape, 4mm Tape, Ditto Max, 9-Track Reel, Cassette tape, Memory Stick, MultiMedia Card, SD Memory Card, xD Picture Card, Smart Media, CompactFlash, Travan
    13. Why is it all so difficult?
        • Deterioration of the storage medium
        • Obsolescence of the storage medium
        • Failure to document the format adequately
        • Obsolescence of the software
        • Obsolescence of the hardware
        • Long-term management
    14. How do we do it?
        • Open Archival Information System (OAIS)
    15. But that’s people…
    16. Migration-based approach and controlled ingest
        • Aim to connect with data producers early on in their project lifecycles to ensure that preservation planning is a key consideration during the project rather than an afterthought.
    17. Guides to help you do all that.
    18. It hasn’t really got much easier
        • The goal posts keep moving!
    19. The size of digital archives held by different types of archaeological bodies (Archaeology Data Service)
    20. Big Data Project
        • Roughly how much data would be generated by a single project?
    21. Which of these data collection techniques do you carry out?
        • Chart: Technologies used
        • Categories: 3D Laser Scanning, Sidescan Sonar, Multibeam Scanning, Single Beam Scanning, Geophysics, Acoustic Tracking, Sub bottom profiling, Geographic (eg GIS), Lidar, Digital Video, Video Movie Clips, Still Images, CAD (2D or 3D), Other
        • Percentages as extracted from the chart (pairing with categories not preserved): 12%, 4%, 4%, 3%, 8%, 1%, 3%, 11%, 9%, 9%, 7%, 14%, 3%, 12%
    22. What are the main software packages you use?
    23. Do you have an archiving policy for the data sets / types in question?
    24. Back-up
    25. When you start a new project … would you consider using existing datasets?
    26. This is the opportunity!
    28. Making the inaccessible accessible
        • To make available unpublished fieldwork reports in an easily retrievable fashion. There are currently 8018 reports available and this number is increasing steadily through the OASIS project in England and Scotland.
    29. Blurring the distinction … between publication and archives …
    30. Making the LEAP…
    32. What does that mean for you?
        • Plan for reuse
        • Plan for reuse
        • Plan for reuse
        • Plan for reuse
    33. How do you do that?
        • Include a data management plan (use the DCCs)
        • Order your data
        • File naming strategy
        • Version control
        • Back-up (in the field)
        • Consider your file formats
        • Dissemination plan (and its longevity)
        • What does the long term look like?
        • Discuss requirements with an archive
    34. We’re here to help
        • [email_address]
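
The "Back-up" and "Version control" points on the "How do you do that?" slide can be supported by a simple fixity check: record a checksum for every file in a project folder, then re-check later to see whether any copy has changed or gone missing. The following is a minimal Python sketch under stated assumptions, not a description of how the ADS works. The function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and SHA-256 is chosen here only as a widely available checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large survey files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or have a different checksum."""
    current = build_manifest(root)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

A manifest built when the data is first backed up can be stored alongside the back-up; running `changed_files` against a later copy then flags files that have been altered or lost. Files added after the manifest was built are not reported by this sketch.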