
Reproducibility Analytics Lab

Jisc Analytics Labs is an approach to the development of decision-making tools underpinned by data. This presentation will briefly outline this approach and then focus on the results of the reproducibility lab which used data from articles on animal-based research to assess the degree to which factors affecting research reproducibility are reported

  1. What is Analytics Labs?
  2. We have worked with partners to create a business intelligence shared service for UK education. Runner-up in the 2018 National Technology Awards: http://nationaltechnologyawards.co.uk/
  3. What is Analytics Labs? (www.jisc.ac.uk/analytics-labs)
     • Unique CPD opportunity
     • Teams of analysts from across UK HE
     • Expertise in policy, data, and visualisation
     • One day a week for 13 weeks
     • Access to a range of data sources including HESA
     • Aim is to produce proof-of-concept dashboards
     • Remote working using agile project management methods
     • Secure data processing environment
     • 289 participants from 109 UK universities (including 11 APs)
  4. Analytics Labs – Working in an Agile way – Activity, Team Roles and Approach
  5. Makeup of an Analytics Labs team
     • Product Owner – brings an understanding of the policy context and the needs of customers
     • Data Analyst – expertise in data and analysis, especially from an HE perspective
     • Scrum Master – keeps the project on track and removes impediments
     • Data & Viz Support – specialist knowledge in tools such as Alteryx and Tableau
     • Meta Product Owner – provides expertise and guidance in the specific theme
  6. Lab – the environment and tools
  7. Lab environment and tools used
     • Secure data processing environment with team and shared data prep areas (GDPR)
     • Tools include Alteryx and Tableau, plus Prep, Power BI, PowerPoint, Word, Python, R, Pentaho, Firefox, Chrome, Sublime Text
     • User Stories, Sprint Goals, Data Sources – Backlog, In Progress, Blocked, Done
  8. Curriculum – the Analytics Labs curriculum focuses on 2 of our 5 competencies:
     • Participating in agile development
     • Visualising data
     • Transforming data
     • Digital collaboration
     • Understanding policy and the data landscape
  9. Research Analytics – Team: Conquest
     • Theme: Evolve – downstream effects of research funding
     • Theme: Reproducibility – bias in experimental research
  10. REPRODUCIBILITY
  11. Reproducibility and transparency of published preclinical research involving animal models of human disease are alarmingly low
  12. Aim: understand where improvements might be made
     • Evaluate the current state of published preclinical research and explore what initiatives by the scientific community might have an impact on this, from the perspective of institution, funder and journal
     • Provide a tool to allow service users to: 1. benchmark against other users; 2. see where targets for improvement might be set; 3. track this progress
  13. How do we measure reproducibility?
     • Threats to reproducibility are thought to include: lack of scientific rigour, low statistical power, questionable research integrity
     • Evaluated the reporting of 5 key quality measures: random allocation of animals to group; blinded assessment of outcome in these animals; performance of a sample size calculation; compliance with animal welfare regulations; potential conflicts of interest
  14. Data Sources
     • CORE – the world’s largest collection of open access full texts, containing aggregated content of all research disciplines
     • Animal model studies identified by text mining and machine learning
     • Journal impact factor – proxy for relative importance of a journal within its field
     • TOP Guidelines for Journals – promote Open Research Culture and alignment of scientific ideals with practices
     • UKRIO – supports researchers and organisations to further good practice and promote integrity and high ethical standards in research
     • TRAC – provides the ‘full economic cost’ of activities, including how much institutions spend on research
     • ARRIVE guidelines – intended to improve the reporting of research using animals
     • Concordat on Openness on Animal Research in the UK – encourages organisations to be clear about their use of animals in research and enhance their communications
  15. Data Source
  16. More information on the machine learning carried out by the Edinburgh team: the algorithm was developed by James Thomas at UCL. It starts with a dataset of studies, a subset of which is classified manually (i.e. a study is included because it reports on an animal model of human disease, or excluded because it doesn’t report anything of relevance). That subset is fed to the machine as a training set, so it can “learn” by identifying patterns between the data and the manual decisions. The more studies that are classified manually and fed into the algorithm, the more patterns the machine can detect, and the better its performance at reproducing the human decisions. The method is not 100% accurate, as there is a lot of noise in the data, but it is a very useful tool when there are thousands of papers to screen, which would otherwise take months or even years by hand (made worse by the fact that the gold standard is for two people to screen independently and a third to resolve disagreements). It therefore saves time and is a good method to use when resources are limited.
     More technical information about the algorithm: it uses a bag-of-words model for feature selection and a support vector machine with stochastic gradient descent for text classification to identify animal publications. More on this: https://www.biorxiv.org/content/10.1101/280131v1
     Performance of the machine learning algorithm for selecting animal studies: sensitivity 95.5%, specificity 83.5%, accuracy 84.7%
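     The slide above names the building blocks of the classifier: a bag-of-words representation and a support vector machine trained with stochastic gradient descent. The sketch below shows what that kind of text classifier can look like in scikit-learn; the toy training examples and parameter choices are placeholders for illustration, not the Edinburgh/UCL pipeline itself.

     # Minimal sketch of the approach described above: bag-of-words features
     # fed to a linear SVM trained with stochastic gradient descent (scikit-learn).
     # The tiny toy dataset and all parameter choices are placeholders.
     from sklearn.feature_extraction.text import CountVectorizer
     from sklearn.linear_model import SGDClassifier
     from sklearn.pipeline import Pipeline

     # Manually screened examples: 1 = animal model of human disease, 0 = exclude.
     train_texts = [
         "mice were randomised to receive the candidate compound or vehicle",
         "a rat model of focal cerebral ischaemia was used to assess infarct volume",
         "we surveyed undergraduate students about their revision habits",
         "this review summarises recent advances in battery chemistry",
     ]
     train_labels = [1, 1, 0, 0]

     classifier = Pipeline([
         ("bow", CountVectorizer(lowercase=True, stop_words="english")),  # bag-of-words features
         ("svm", SGDClassifier(loss="hinge", random_state=0)),            # linear SVM via SGD
     ])
     classifier.fit(train_texts, train_labels)

     # New, unscreened abstracts can then be classified automatically.
     print(classifier.predict(["a mouse model of Parkinson's disease was treated daily"]))

     In practice the training set would be thousands of manually screened records, and performance would be checked against a held-out set to give figures like the sensitivity and specificity quoted above.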
  17. More information on the text mining used by the Edinburgh-based team: “Text mining is a method used to explore and analyse large amounts of unstructured text to identify concepts, patterns, keywords and phrases in the data. The team used regular expressions, which are essentially strings of rules that tell the computer which combinations of words to look for when searching a piece of text. At its simplest, the computer is told to find the expression ‘animals randomly allocated to group’ and, if it does, to class the publication as having reported random allocation of animals to group. It is slightly more sophisticated in that when such a statement is preceded by ‘not’, for example, the computer should not class it as a match. These expressions are still a work in progress and, like the machine learning, are not 100% accurate, but reading and classing these publications manually is an incredibly time-consuming process, so automating it is very useful, and the imperfect accuracy doesn’t affect the overall conclusions that much. In fact, we have found that in some cases the computer identifies publications that should be classed as TRUE but the human has falsely classified as FALSE, so there is error in both directions.”
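     As a rough illustration of the regular-expression approach described in the quote above, the sketch below checks one measure (random allocation) and skips matches preceded by a nearby negation such as “not”; the patterns are simplified stand-ins, not the team’s actual rules.

     # Simplified stand-in patterns; the team's real rules would be far more extensive.
     import re

     RANDOMISATION = re.compile(r"random(ly|ised|ized)?\s+(allocat|assign)\w*", re.IGNORECASE)
     NEGATION = re.compile(r"\b(not|no|without)\b[^.]{0,40}$", re.IGNORECASE)

     def reports_randomisation(text: str) -> bool:
         """True if the text appears to report random allocation to groups,
         unless the phrase is preceded by a nearby negation such as 'not'."""
         match = RANDOMISATION.search(text)
         if not match:
             return False
         return not NEGATION.search(text[:match.start()])

     print(reports_randomisation("Animals were randomly allocated to treatment groups."))  # True
     print(reports_randomisation("Animals were not randomly allocated to groups."))        # False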
  18. Tableau Visualisations and potential dashboards
  19. Note: findings for illustrative purposes only due to small and example prototype research areas explored
  20. Note: findings for illustrative purposes only due to small and example prototype research areas explored
  21. Note: findings for illustrative purposes only due to small and example prototype research areas explored
  22. Note: findings for illustrative purposes only due to small and example prototype research areas explored
  23. Journal Anonymised. Note: findings for illustrative purposes only due to small and example prototype research areas explored
  24. Note: findings for illustrative purposes only due to small and example prototype research areas explored

Editor's Notes

  • What is Analytics Labs?
  • Who have Jisc worked with
  • What is labs and what are our aims?
  • In a nutshell, Analytics Labs provides a CPD opportunity: teams work on commonly felt problem spaces, explore the wider national data landscape, acquire HESA and non-HESA data, and cleanse, link and transform it to create new proof-of-concept dashboards.
  • How do we run these analytics labs?
  • Makeup of a team
  • Secure processing environment
  • Secure data processing environment

    Within the secure environment we have a number of cutting-edge data manipulation tools for team members to use. These do change, but as of May 2019 they included Tableau Desktop and Server, Excel, Alteryx, Pentaho, R, Microsoft Power BI and several others.
  • Curriculum

    The Analytics Labs curriculum was developed in response to participant feedback. It’s designed to help you get up to speed with Alteryx and Tableau quickly by signposting some of the many resources available online.
  • Feb – May 2019 – Research analytics lab

    Two key themes 1. Downstream effects of research funding
    2. Reproducibility
  • Team
  • In recent years it has become increasingly clear that the reproducibility and transparency of published preclinical research involving animal models of human disease are alarmingly low

    Unfortunately, it’s been shown time and time again, especially in more recent years, that research has some major reproducibility issues.
    We focused on animal models of human disease because research shows that reproducibility in this domain is especially low. In terms of transparency, research (e.g. systematic reviews) has shown that researchers don’t report enough detail about their experiments to make them reproducible by other researchers.
    This contributes to what has now been recognised by the media as the “reproducibility crisis”, which plays a role in the translational failure between animal models of human disease and the clinic, where drugs tested in animals often don’t work when taken forward to human studies and therefore arguably waste money and resources.

  • The team’s aim was to understand where improvements might be made.

    To do this, we can evaluate the current state of published preclinical research and explore what initiatives by the scientific community might have an impact on this at the level of institution, funder and journal.
    The team wanted to design a tool that these service users could use to benchmark themselves against their competitors, see where they might set targets for improvement, and ultimately track their progress and any change in relation to changes in practice over time.
  • Threats to reproducibility are thought to include a lack of scientific rigour, low statistical power and questionable research integrity among other things.

    We can attempt to measure these threats to reproducibility by looking at published research papers themselves and assessing the reporting of concepts that are intended to reduce these threats in the design, performance or reporting of research studies.

    The 5 measures we focused on were:
    Random allocation of animals to group: describes whether or not treatment was randomly allocated to animal subjects so that each subject has an equal likelihood of receiving the intervention. Not doing this introduces selection bias into the experiment.
    Blinded assessment of outcome: relates to whether the investigator performing the experiment, collecting data and/or assessing the outcome was unaware of which subjects received the treatment and which did not. Not doing this introduces detection bias into the experiment.
    Sample size calculation: describes how the total number of animals used in the study was determined, so that we can check the study is adequately powered to detect a true biological effect.
    Compliance with animal welfare regulations: describes whether or not the research investigators complied with animal welfare regulations.
    Reporting of any conflicts of interest: describes whether the investigators disclosed any conflicts of interest, for example financial ones.
    *Useful paper if you want more explanation on these and why they are important: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3764080/
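    To make these five measures concrete as data, a hypothetical per-publication record might look like the sketch below; the field names are illustrative only and are not the lab’s actual data model.

    # Hypothetical record for one screened publication, with the five quality
    # measures stored as simple reported / not-reported flags.
    from dataclasses import dataclass

    @dataclass
    class ReportingRecord:
        doi: str
        random_allocation: bool        # random allocation of animals to group
        blinded_assessment: bool       # blinded assessment of outcome
        sample_size_calculation: bool  # sample size calculation reported
        welfare_compliance: bool       # compliance with animal welfare regulations
        conflict_of_interest: bool     # conflict-of-interest statement present

    # Example: a paper reporting randomisation, welfare compliance and a
    # conflict-of-interest statement, but not blinding or a sample size calculation.
    example = ReportingRecord("10.0000/example", True, False, False, True, True)  # placeholder DOI
    print(example)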
  • For our project we started off with the CORE dataset as the main source of our data: a Jisc service described as the world’s largest collection of open access full texts, containing aggregated content of published research articles from a wide range of disciplines.
    1 – Using the CORE API we extracted three years of data, covering publications from 2016, 2017 and 2018.
    2 – We filled in any missing fields where possible by linking articles to Crossref. We then used machine learning to narrow the studies down to animal model studies specifically. This provided the backbone of our dataset, and we supplemented it with various other bits of information (a rough sketch of this step is included at the end of this note).
    3 - As we were interested in reproducibility we brought in information about journals, institutions and funders on a number of initiatives that promote open and transparent research culture and encourage robust performance and reporting of science involving animal models of human disease.
    These included things like:
    TOP Guidelines for journals - that promote Open Research Culture
    UKRIO - that support high ethical standards in research
    ARRIVE guidelines - that encourage improved reporting of animal research
    CONCORDAT - who encourage good communication about research
    Alongside other information such as journal impact factor (a proxy measure of how prestigious a journal is in its field) and the TRAC groups for institutions (an indicator of how much institutions spend on research out of their total funding).
    We also really wanted to look at things like training offered by institutions for example, but these data were difficult to find and therefore fell outside of the scope of this project.
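    As a rough sketch of steps 1 and 2 above (extracting records via the CORE API and filling gaps from Crossref), the code below uses the public CORE v3 and Crossref REST APIs; the endpoint paths, parameters and response fields shown are assumptions for illustration and may not match what the team actually ran.

    # Hedged sketch: pull open-access records from CORE, then fill missing
    # metadata from Crossref. Endpoints, parameters and field names are
    # assumptions about the public APIs, not the team's actual pipeline.
    import requests

    CORE_API_KEY = "YOUR-CORE-API-KEY"  # placeholder

    def search_core(query: str, limit: int = 10) -> list:
        """Search CORE for open-access works matching the query."""
        response = requests.get(
            "https://api.core.ac.uk/v3/search/works",   # assumed CORE v3 search endpoint
            params={"q": query, "limit": limit},
            headers={"Authorization": f"Bearer {CORE_API_KEY}"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json().get("results", [])       # assumed response field

    def crossref_metadata(doi: str) -> dict:
        """Fetch Crossref metadata for a DOI to fill in missing fields."""
        response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
        response.raise_for_status()
        return response.json()["message"]

    # Enrich CORE records that are missing a publisher field (illustrative only).
    for work in search_core("animal model of stroke", limit=5):
        doi = work.get("doi")
        if doi and not work.get("publisher"):
            work["publisher"] = crossref_metadata(doi).get("publisher")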

  • Data loss
  • Team Tableau Outputs
  • Overview
    5 key performance factors
    Sample size
    Blinding
    Compliance with regulations
    Conflict of interest
    Randomness

    Modifiers
    TRAC group
    Policies and endorsements
  • Provider comparison tool (benchmarking)
  • What policies and endorsements are associated with improvements in the research we fund
  • Overall impact factor (all measures combined into a single measure)

    By Journal
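    As a rough illustration of combining the five reporting flags into a single measure and breaking it down by journal for a dashboard, here is a hedged pandas sketch; the column names, toy data and the simple mean aggregation are assumptions, not the lab’s actual calculation.

    # Hedged sketch: combine the five per-publication flags into one score,
    # then average it per journal. Column names and toy data are illustrative.
    import pandas as pd

    publications = pd.DataFrame({
        "journal": ["Journal A", "Journal A", "Journal B"],
        "random_allocation": [True, False, True],
        "blinded_assessment": [False, False, True],
        "sample_size_calculation": [False, False, False],
        "welfare_compliance": [True, True, True],
        "conflict_of_interest": [True, False, True],
    })

    measure_columns = publications.columns.drop("journal")
    publications["overall"] = publications[measure_columns].mean(axis=1)  # fraction of measures reported

    # Average reporting score per journal, ready to benchmark or plot.
    print(publications.groupby("journal")["overall"].mean())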
