An introduction to XML and explanation of how it may be used to encode qualitative data produced by health researchers. Talk given by Libby Bishop of the UK Data Service at the Data Management in Practice workshop, which took place on Nov 14th 2013 at the London School of Hygiene and Tropical Medicine
Finding, searching and sharing qualitative data: the uses of XML
1. Finding, searching and sharing
qualitative data: the uses of XML
Libby Bishop
Producer Relations and
Research Ethics
Data Management in Practice
LSHTM, London, 14 November 2013
2. UK Data Service seeking to improve
• We have one of the largest qualitative data collections –
over 300 data collections in the social sciences
• Currently users find and download these from our
website – generally good, we would like to improve:
• No searching within collections
• Hard to display complex relationships among related
files within a collection (transcript, audio, image, memo)
• Cannot reliably cite parts of data
3. What researchers want from data centres
• Search - find data regardless of location
• Use – ways to use data flexibly
• Examine interview extract in context, online
• Decide before download
• Support analysis led by research questions (not technology)
• Cite – get and give credit appropriately
• Preserve – for own or others’ use later
XML is not a miracle cure,
just a (key) part of the solution
4. XML – eXtensible Mark-up Language
• Language – system for communication
• Mark-up – encoding descriptive features of text
• Tags, e.g. <u>words spoken in an interview</u>
• Extensible – set of tags is not fixed
• Text Encoding Initiative (TEI) has 100s
• Independent of specific hard/software
• Open
XML allows qual data (rich, deep, but messy,
unstructured) to benefit from computing power
typically applied to structured, numeric data.
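A minimal sketch of what "benefiting from computing power" can look like: once utterances are wrapped in <u> tags, a few lines of code can pull out everything one speaker said. The <u> tag follows the example on this slide; the "who" attribute, the <interview> wrapper and the two sample turns are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-turn interview. Only the <u> tag comes from the slide;
# the "who" attribute, wrapper element and wording are illustrative.
TRANSCRIPT = """
<interview>
  <u who="interviewer">How did you travel to the clinic?</u>
  <u who="respondent">Mostly by bus, sometimes on foot.</u>
</interview>
"""


def utterances_by_speaker(xml_text, speaker):
    """Return the text of every <u> element spoken by the given speaker."""
    root = ET.fromstring(xml_text)
    return [u.text for u in root.iter("u") if u.get("who") == speaker]


respondent_turns = utterances_by_speaker(TRANSCRIPT, "respondent")
print(respondent_turns)
```

The same transcript as plain text would need fragile pattern matching to separate speakers; with tags, the structure is explicit.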
5. Search: all types of resource available
Data collections
• studies
• variables
Case studies
• research
• teaching
ESRC outputs
• conference paper
• article
• report
• research summary
Support/‘how to’ guides
• dataset
• theme
• methods/statistics
14. Use and Cite: Digital Futures project
• Build a user-friendly system for publishing and
exploring qualitative data online
• Project includes large-scale digitisation of precious
and undigitised materials
• Browse search results in context
• Improve display of complex data
• Offer a mechanism for reliably citing data located in
the system
17. School Leaver Essay 53 – My Past
aaa In 1978 I left school, I was sixteen years old. I came straight out of school into an
apprenticeship heavy meter machanics. I served my four year apprenticeship in a garage for
another year and the left and started my own garage. At the age of twenty three I got married.
The garage was doing well so I didn’t have Much prodlems setting up a home. One year
After I had/been married my wife had her first child. When I had some spare time I made up
a car for rally cross racing but In the time I was racing I only won a few. When I was twenty
five our second child was born. Once when rally driving I had a smash and was in hospital
for five months when I was twenty nine we had our third child. I would get up at six o clock
and drive to the garage and open it at Saturdays. On some Sundays when I wasn’t rally
driving the family would go horse riding or for a picnic whilst I went fishing. In the garage I
took an apprenticship from people who had just left school. When I was thirty six we had our
fourth child. My first child would come and help in the garage at least when he left school he
would get a job. When I was forty I had an extension built on to the garage. I also bought 4
acres of land and built a racetrack and made go-karts for my second and third eldest sons
when my last child was eight I brought her a pony and taught her to ride. From when I was
forty four My mother died and my father had died when I was twenty nine.
18. Corrected spelling – for accurate searches
<sic>apprenticship</sic><corr>apprenticeship</corr>
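A rough sketch of how one document can serve both readings: the original spelling for display, the corrected spelling for search. The sentence is taken from the essay above. Wrapping the <sic>/<corr> pair in a <choice> element is the usual TEI convention, but the <choice> and <s> elements here are assumptions, not something shown on the slide.

```python
import xml.etree.ElementTree as ET

# One sentence from the school leaver essay. <s> and <choice> are assumed
# wrappers; <sic> holds the original spelling, <corr> the correction.
SENTENCE = (
    "<s>In the garage I took an "
    "<choice><sic>apprenticship</sic><corr>apprenticeship</corr></choice>"
    " from people who had just left school.</s>"
)


def render(elem, prefer):
    """Flatten an element to plain text, keeping only the preferred
    reading ("sic" or "corr") inside each <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(prefer)
            parts.append((chosen.text or "") if chosen is not None else "")
        else:
            parts.append(render(child, prefer))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


root = ET.fromstring(SENTENCE)
original = render(root, "sic")    # what the school leaver wrote
corrected = render(root, "corr")  # what a search index would use

print(original)
print(corrected)
```

A search for "apprenticeship" can then run against the corrected text while the display still shows the original.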
22. Richer metadata = richer discovery
• Use of DDI 2.5, QuDEx and TEI schema
• QuDEx allows identification of data objects:
• Interview transcript or audio recording etc.
• Relationship to another data object or part of data
• Descriptive categories at the object level, e.g. mime
type, interview characteristics, interview setting
• Capacity to capture rich annotation of parts of data
• QuDEx model in use (Schema at: www.data-archive.ac.uk/create-manage/projects/qudex/)
Object-level description = a lot of manual work!
23. Citation – of collection, and utterance
World Health Organization and International Collaborative Study
of Medical Care Utilization, WHO/ICS Medical Care Utilization
Study Data, 1968-1969 [computer file]. Colchester, Essex: UK Data
Archive [distributor], January 1981. SN:
1427, http://dx.doi.org/10.5255/UKDA-SN-1427-1
24. Preservation – benefits of XML
• Open standard
• Widely adopted as the basis for interchange of
documents and data over the Web
• Human readable
• Best for metadata; some challenges for preserving data
itself
25. How can researchers help?
• Produce and share high quality metadata and
documentation… and,
• Using XML is not that different from text processing and
spreadsheets
Main point – not to teach XML. Researchers need to locate, explore and use data. XML, behind the scenes, makes that possible. Even if you are not technical, it is useful to understand.
We have a lot to be proud of, but technologies are advancing, and we want to improve ways we provide access and disseminate data to our users.
We try to listen and learn from researchers – and respond to their/your needs. And those expectations are rising quickly.
A bit like HTML, but for marking up structural features (paragraphs), not format (bold). Hard (for me) to get in the abstract, so going to turn now to examples, cases of uses of XML, for search, use, citation and sharing.
And here is a list of the data types we are going to talk about today. We’re not going to go into a huge amount of detail, but this talk will give you a taste of the data we host.
391 hits from search on health survey
Similar search – but limited to keyword = tropical and data only. We find an LSHTM data collection!
This is just a small part of the catalogue record for this collection. Note fields like title and depositor. Pretty standard kinds of metadata (data about data) needed for any data collection. Note the upper right – get DDI XML record…
No need to be scared – this is exactly the same info… but you can see its XML structure – using tags.
The use of XML for metadata, and doing it in a standardised way, greatly increases the power of searching for data. DDI (Data Documentation Initiative) – standardised set of tags for handling social science data.
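To make the point about tagged metadata concrete, here is a small sketch of field-level searching. The title, study number and distributor come from the citation slide; the tag names (<record>, <title>, <studyNumber>, <distributor>) are simplified stand-ins invented for this example and are not the actual DDI element names.

```python
import xml.etree.ElementTree as ET

# Simplified catalogue record. The values come from the citation slide;
# the tag names are illustrative, NOT real DDI elements.
RECORD = """
<record>
  <title>WHO/ICS Medical Care Utilization Study Data, 1968-1969</title>
  <studyNumber>1427</studyNumber>
  <distributor>UK Data Archive</distributor>
</record>
"""


def record_fields(xml_text):
    """Turn a flat metadata record into a {tag: value} dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def matches(fields, field, keyword):
    """Case-insensitive keyword test restricted to one named field."""
    return keyword.lower() in fields.get(field, "").lower()


fields = record_fields(RECORD)
print(fields["title"])
print(matches(fields, "title", "medical care"))
```

Because every archive that follows the same standard uses the same tags, the same field-level query can run across many catalogues.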
Now, a quick look at some other capabilities made possible by this XML-structured metadata.
The ability to find and locate variables across surveys and decide if they are “close enough” for your purposes.
And the ability to search for health data across archives – for example, this portal for European data archives.
Digital Futures (DF) – intended to address a couple of those areas where UKDS wants to improve.
Search results show all interviews with the search term. With context – surrounding sentence. And with metadata about the interviewee – age, gender, region, SES.
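The kind of result described here (hit, surrounding sentence, interviewee metadata) can be sketched roughly as follows. This is not the Digital Futures implementation; the element and attribute names and the two sample interviews are hypothetical, invented only to show the idea.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical collection: interviewee metadata as attributes, speech in <u>.
COLLECTION = """
<collection>
  <interview id="int01" age="34" gender="female" region="North West">
    <u who="respondent">I go to the clinic every month. The nurse there knows me well.</u>
  </interview>
  <interview id="int02" age="61" gender="male" region="London">
    <u who="respondent">We never had a clinic nearby. You had to travel.</u>
  </interview>
</collection>
"""


def search(xml_text, term):
    """Return one hit per sentence containing the term, together with
    the metadata of the interview it came from."""
    hits = []
    root = ET.fromstring(xml_text)
    for interview in root.iter("interview"):
        for u in interview.iter("u"):
            # Naive sentence split: break after ., ! or ? followed by space.
            for sentence in re.split(r"(?<=[.!?])\s+", u.text or ""):
                if term.lower() in sentence.lower():
                    hits.append({
                        "interview": interview.get("id"),
                        "age": interview.get("age"),
                        "gender": interview.get("gender"),
                        "region": interview.get("region"),
                        "context": sentence,
                    })
    return hits


hits = search(COLLECTION, "clinic")
for hit in hits:
    print(hit["interview"], hit["age"], hit["gender"], hit["region"], "|", hit["context"])
```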
Here is an example of some of the paper being digitised – essays about future life written by 16-year-old school leavers, Sheppey. For this collection there is value in retaining an image file – the handwriting itself is data.
We are scanning, then OCR – to fully digitise text. This version shows original spelling. What to do – show correct or original?
XML allows us to keep both in the same document.
Typical transcript user could download. Good, but cannot be browsed online, and can’t modify display.
Available online. Can use formatting to display turn-taking and can modify speaker tags with multiple versions of metadata.
This is the QuDEx Schema (a bit like DDI) – Qualitative Data Exchange – a family of tags created specifically for qualitative data.
Made possible by GUID – Globally Unique ID – for every utterance
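A minimal sketch of the idea: give every utterance its own identifier so that a citation can point to one exact utterance rather than to the whole transcript. Generating the identifiers with Python's uuid4 and storing them in an "id" attribute are assumptions for illustration; the talk does not say how the GUIDs are produced or stored. The sample turns are also made up.

```python
import uuid
import xml.etree.ElementTree as ET

# Hypothetical transcript without identifiers.
TRANSCRIPT = (
    "<interview>"
    "<u who='interviewer'>How did you travel to the clinic?</u>"
    "<u who='respondent'>Mostly by bus, sometimes on foot.</u>"
    "</interview>"
)


def assign_ids(root):
    """Attach a random UUID to every <u> that does not already have one."""
    for u in root.iter("u"):
        if u.get("id") is None:
            u.set("id", str(uuid.uuid4()))
    return root


def find_utterance(root, utterance_id):
    """Look up a single utterance by its identifier, or return None."""
    for u in root.iter("u"):
        if u.get("id") == utterance_id:
            return u
    return None


root = assign_ids(ET.fromstring(TRANSCRIPT))
ids = [u.get("id") for u in root.iter("u")]
cited = find_utterance(root, ids[1])
print(ids[1], "->", cited.text)
```

Existing identifiers are left untouched, so once an utterance has been cited its ID stays stable if the function is run again.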