Managing ‘Big Data’ in the Social Sciences:
The Contribution of an Analytico-Synthetic
Classification Scheme
Suzanne Barbalet
CESSDA Thesaurus Development and
Bibliographic Services Officer
UK Data Service, University of Essex
Nathan Cunningham
Functional Director for Big Data Support, UK
Data Service, University of Essex
Innovation and Discovery, CIG 2016,
University of Swansea
31st August - 2 September 2016
Overview
• What is different about data?
• What difference will New and Novel Forms of Data
(NNfD) make for data management?
• New Tricks? Can old tools solve new problems –
Universal Decimal Classification (UDC) and UKDS pilot
project
• Options for a Vocabulary Service to manage ‘big data’
Background
• UK Data Service holds the largest collection of digital
research data in the social sciences and humanities in
the UK
• long history of developing knowledge organisation tools
(KOTs)
• currently manages two related social science thesauri:
HASSET and ELSST
• working on the architecture for an open vocabulary
service
What Data Do We Curate?
UK Survey Data
Cross-National Data
Business Microdata
Census Data
International Macrodata
Longitudinal Data
Qualitative Data
How and Why of Data Curation
Points of deposit, re-use and re-creation of data
• research data is curated and preserved for
secondary analysis by academic
researchers and students, government
analysts, charities and foundations,
business consultants, independent
research centres and think tanks
• our users search for data and its
documentation as a package (studies)
and/or they search for variables across the
entire collection of studies (Variable and
Question Bank)
Using data
• via a single point of access, data users can browse the
UK’s largest collection of social, economic and
population data (referred to as ‘studies’)
• data is available for download and analysis after
registration (open data requires no registration)
• the UK Data Service Secure
Lab provides Approved and
Accredited Researchers with
controlled access to sensitive
or confidential data
What is Different about Data?
An index describes the content of an information source
but when the information source:
• is a measurement not an idea, i.e. in its raw state it has
numeric form
• can be dynamic, i.e. social
science research captures
attitudes and behaviour in a
time frame
then the construction of an index is a complex task…
Data is a Special Type of Digital Resource
• the content of ‘studies’ will not necessarily relate
semantically to the title of the study without knowledge of
the research indicators and research question – not easy
to index
• an indexer asks what is being measured + what might
be a useful measurement i.e. both variables and
research concepts are indexed
BUT
studies are not difficult to classify… clear subject
information is the general rule … for example…
An example:
Title: Understanding the Importance of Work Histories in Determining Poverty
in Old Age: Variables Derived from the English Longitudinal Study of Ageing,
2002-2007
Abstract: This study found that for the most part life-course events as
measured here are not strongly associated with the chances of being on a low
income in retirement.
Topics covered include length of time spent in paid work and in
marriage, the timing of retirement, the number of children and timing
of childbirth, and whether ill-health as an adult or as a child had been
experienced. The modelling looks at the influence of these factors
alongside a range of other characteristics such as social class and
educational attainment
Subject
Variable concepts are complex
Classification of Data?
The potential of, or need for, classification schemes to
manage subject access to data collections has not been
explored because:
• flat lists of subject categories are adequate for small
collections
• ‘free text’ search by subject is a peculiarly powerful tool for
social science data searches due to the fact that:
• Titles tend to be generic e.g. Family Resources Survey, Crime
Survey for England and Wales
• Abstracts clearly set out the findings and main topics
…but … need to refine access strategies
DATA CURATION
AT THE CROSSROADS
Big Data
Administrative data
Secure data
Business Microdata
A working definition of Big Data
The value is in Smart Data
“In 2016, the world of big data will focus more on smart data,
regardless of size. Smart data are wide data (high variety), not
necessarily deep data (high volume).
Data are “smart” when they consist of feature-rich content and context
(time, location, associations, links, interdependencies, etc.) that enable
intelligent and even autonomous data-driven processes, discoveries,
decisions, and applications.”
Kirk Borne, Principal Data Scientist at Booz Allen Hamilton
How We Curate Big Data
• Big Data Network Support team at the UK Data Service
is engaged in a major project to develop Data Service as
a Platform (DSaaP)
• DSaaP is a Hadoop-based data lake system
• the platform will store deposited data and NNfD drawn
from current and future UK Data Service holdings
• all data curation and access provision will be governed
by established Research Data Management principles
Planning access strategies
Scoping
• Understand big data landscape
• Develop user types
Users
• Different users for different aspects of the Big Data
• Big Data user services
Research Value
• Develop big data analytics for research areas UKDS users are addressing
Accessing Big Data
• Privacy by design
• End-to-end capacity building through exemplar projects
Impact of ‘Big Data’ Curation?
Big Data Analytics will require us to play a more proactive
role.
Staff are acquiring additional skills to enable them to:
• identify and scale the data for access, linking and analysis
• ensure data conforms to Research Data Management
(RDM) principles which underpin all UK Data Service
archiving and curation work
BUT
• while data collection methodologies may witness a ‘sea
change’
• methodological principles of data analysis in the social
sciences remain unchanged
The end of theory!
• Chris Anderson, editor-in-chief of Wired Magazine, wrote
a provocative article entitled, “The End of Theory: The
Data Deluge Makes the Scientific Method Obsolete”
(2008).
• He argued that hypothesis testing is no longer necessary
with Google’s petabytes of data, which provide all of the
answers to how society works. Correlation now
“supersedes” causation.
Google Flu
• Google Flu Trends is no longer good at predicting flu,
scientists find
• Researchers warn of 'big data hubris' and the
importance of updating analytical models, claiming
Google has made inaccurate forecasts for 100 of 108
weeks.
Google's own autosuggest feature may have driven more
people to make flu-related searches - and misled its Flu Trends
forecasting system. (The Guardian)
We are still doing science
Pigliucci (2009:534) in response to
Anderson’s Wired article:
“But, if we stop looking for models and
hypotheses, are we still really doing
science? Science, unlike advertising, is not
about finding patterns—although that is
certainly part of the process—it is about
finding explanations for those patterns.”
Implications for Data Management
• NNfD (or ‘big data’) may be drawn from sources that
maintain their own controlled vocabularies or subject
categories e.g. we have some large education datasets
and the Department for Education has an in-house
thesaurus
• NNfD (or ‘big data’) may come without documentation
• open NNfD (or ‘big data’) that is scaled to size for
analysis on a PC will need to be linked to its source.
KOTs
– Looking to 2017 and Beyond
• we want to offer our users a
vocabulary service that will
provide access to open
resources with good
provenance
• ongoing importance of our
existing KOTs for indexing and
retrieval
• greater standardisation of
subject access will be required
‘Future-Proof’ Subject Control?
• In 2014-2015 a pilot study was undertaken to assess the
feasibility of classifying the whole data collection with the
aim to use a classification scheme to:
• ‘future proof’ subject searches in a fast growing collection
• mediate the size of the returns a subject search query produces
• introduce new topics without the legacy work that a flat subject
list requires
• UDC classification scheme was chosen
Pilot study to assess the feasibility of
classifying the data collection
http://seminar.udcc.org/2015/images/TOC_UDCSeminar2015_Proceedings.pdf
Suzanne Barbalet “Enhancing subject
authority control at the UK Data Archive: a
pilot study using UDC”
New Tricks: Old Tools for New Problems?
• advantages of UDC for Subject Authority Control
• covers all subjects
• enables broadening and narrowing searches
• enables language independent coding, multilingual
access
• UDC can be adapted to all kinds of collections
• new subjects can be easily covered
• citation order of the elements in the synthesised
numbers can be changed to meet particular needs of
collections
Reference: Balikova, M. (2009) The Role of UDC Classification in the Czech Subject Authority File.
UDC Consortium, Classification at the Crossroads, The Hague, 29-30 October 2009
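The broadening and narrowing of searches listed above can be illustrated with a minimal Python sketch. It assumes simple main numbers only (no auxiliaries or relation signs) and relies on the convention that each added digit narrows the class, with the dot after every third digit serving readability. The catalogue, its placeholder titles and the helper names are hypothetical; only the three codes come from this deck.

```python
def normalise(code: str) -> str:
    """Remove the readability dots so hierarchy becomes a plain prefix relation."""
    return code.replace(".", "")


def with_dots(digits: str) -> str:
    """Re-insert a dot after every third digit, as in printed UDC numbers."""
    return ".".join(digits[i:i + 3] for i in range(0, len(digits), 3))


def broaden(code: str) -> str:
    """Step one level up the hierarchy by dropping the last digit."""
    return with_dots(normalise(code)[:-1])


def search(catalogue: dict[str, list[str]], query: str) -> list[str]:
    """Return titles with at least one code at or below `query` in the hierarchy."""
    prefix = normalise(query)
    return sorted(
        title
        for title, codes in catalogue.items()
        if any(normalise(code).startswith(prefix) for code in codes)
    )


# Hypothetical catalogue: titles are placeholders, codes are taken from this deck.
CATALOGUE = {
    "Study A": ["331.25"],
    "Study B": ["331.108.4"],
    "Study C": ["364.176"],
}

narrow = search(CATALOGUE, "331.25")  # a specific class
broad = search(CATALOGUE, "331")      # the same query broadened to its parent class
```

A query on the shorter number returns everything filed beneath it, so the size of a result set can be adjusted by adding or removing digits instead of re-indexing the studies.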
Why Analytico-Synthetic Classification?
The code
• reflects the hierarchical location within the general schema
• the synthesis creates a detailed numeric representation of the
subject with each of its parts retaining its identity i.e. codes can
be constructed and de-constructed (parsed) in the process of
indexing and information retrieval
• human-readable and machine-readable
• language independent and labels translated into 56 languages
• universally recognisable
Use of Classification for the Organisation of
Open Data Repositories
SUBJECT GATEWAYS
RESEARCH REPOSITORIES
• Use E-print software (UK)
• Option to apply LC classification
An example:
Title: Understanding the Importance of Work Histories in Determining Poverty
in Old Age: Variables Derived from the English Longitudinal Study of Ageing,
2002-2007
Abstract: This study found that for the most part life-course events as
measured here are not strongly associated with the chances of being on a low
income in retirement.
Topics covered include length of time spent in paid work and in
marriage, the timing of retirement, the number of children and timing
of childbirth, and whether ill-health as an adult or as a child had been
experienced. The modelling looks at the influence of these factors
alongside a range of other characteristics such as social class and
educational attainment
Subject
Variable concepts are complex
The Code: an Example for Managing Simple
Subject Categories
Title: Understanding the Importance of Work Histories in Determining Poverty
in Old Age: Variables Derived from the English Longitudinal Study of Ageing,
2002-2007
Abstract: This study found that for the most part life-course events as
measured here are not strongly associated with the chances of being on a low
income in retirement.
We can see that “poverty in old age” is the key subject; retirement and
employment (career) are also important
Thus the number is constructed:
364.176-053.9:331.25:331.108.4
Poverty in old age
Retirement
Career
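As a rough illustration of how such a number can be de-constructed and re-constructed by machine, the Python sketch below splits the number on the colon (the UDC relation sign) and separates a trailing `-05…` element (a common auxiliary of persons) from the main number. The function names and the dictionary layout are hypothetical, and the pattern covers only the notation used in this one example, not the full UDC syntax.

```python
import re

# One element of the compound: a main number, optionally followed by a
# "-05..." common auxiliary of persons. Other auxiliaries are out of scope here.
ELEMENT = re.compile(r"^(?P<main>[\d.]+)(?P<persons>-05[\d.]*)?$")


def parse(number: str) -> list[dict]:
    """Split a synthesised number into its colon-related elements."""
    parts = []
    for element in number.split(":"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"notation not covered by this sketch: {element!r}")
        parts.append({"main": match.group("main"), "persons": match.group("persons")})
    return parts


def build(parts: list[dict]) -> str:
    """Re-assemble the elements into a single synthesised number."""
    return ":".join(p["main"] + (p["persons"] or "") for p in parts)


parsed = parse("364.176-053.9:331.25:331.108.4")
rebuilt = build(parsed)
```

Because each element keeps its identity, a retrieval system could index the three elements separately (poverty in old age, retirement, career) while still storing the full number as the authority record.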
Common Auxiliary Numbers Useful for
Managing Vocabulary Service for NNfD
• e.g. Special Subject Thesauri [025.4.06], Controlled Vocabularies [025.43],
Case studies [001.87] and Photographic Images [084.12]
• All resources, whatever their form, can be connected by their subject content and,
whatever the data subject, content can be described by form
Use of UDC notation for Subject Authority
• option to create statistics of studies for new subject
category
• option to subdivide a category which has become too
large for browsing without additional legacy work
• could create temporary and specific categories for
outreach events
• could use the notation as an authority file to link
resources e.g. case studies, illustrations/photos,
bibliographical references
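The first two options above (statistics per category, and subdividing a category that has grown too large) can be sketched in Python as a count of studies at a chosen depth of the hierarchy. This is a hedged illustration: the list of codes is invented for the example, the function names are hypothetical, and only simple main numbers are handled.

```python
from collections import Counter


def with_dots(digits: str) -> str:
    """Re-insert a dot after every third digit, as in printed UDC numbers."""
    return ".".join(digits[i:i + 3] for i in range(0, len(digits), 3))


def count_by_level(codes: list[str], depth: int) -> Counter:
    """Count codes grouped by their first `depth` digits (dots ignored)."""
    return Counter(with_dots(code.replace(".", "")[:depth]) for code in codes)


# Hypothetical codes assigned to four studies (one code per study).
CODES = ["331.25", "331.108.4", "331.25", "364.176"]

top_level = count_by_level(CODES, 3)    # statistics per three-digit category
subdivided = count_by_level(CODES, 4)   # the same studies, one level deeper
```

Moving from depth 3 to depth 4 splits the largest category into narrower ones without touching the codes already assigned, which is the sense in which no additional legacy work is needed.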
Pilot Study
– Findings
• efficient (25 per hour without UDC Online) and faster with
UDC Online
• UDC Online can facilitate the creation of an authority file
• visualisation provided useful analysis of category content
according to UDC schema
• some indexing training required
• useful auxiliaries e.g. file type
Visualisation
– Subject Categories
The visualisation
• Offers the possibility of
data exploration with
three different
approaches
▫ Specific Subject
Categories.
▫ UDC number builder
authority file.
▫ HASSET keywords.
Visualisation
– Subject Categories
New Tricks: Can Old Tools Solve New
Problems?
Pre 1993: Ordering a collection + search & retrieval of information
1993-2006: UI subject organisation + automatic categorization of resources
2006+: supporting semantic linking, control and vocabulary mapping between
different indexing systems in subject hubs and federated SGs
Reference: Slavic, A. (2006) UDC in Subject Gateways: Experiment or Opportunity? Knowledge
Organization, 33 (2) p. 67-85
Conclusions
• data is relatively easy and thus inexpensive to classify
• data is a measurement, thus subject access for analysis
and re-use benefits from a careful choice of knowledge
organisation tools
• the auxiliary tables of analytico-synthetic
classification provide for
multidimensional subject links
…And the future
• provide a powerful discovery tool for
researchers, enabling them to analyse
complex data
• support a research community which
includes the international scientific
community, commercial users of big
data and citizens
• a trusted source of information on the
use of new and novel forms of data in
developing impactful research
Thank you!
Questions
Suzanne Barbalet
sbarba@essex.ac.uk
Thesaurus Team, UK Data Archive
Nathan Cunningham
njcunna@essex.ac.uk
http://ukdataservice.ac.uk/about-us/our-rd/big-data-network-support
