Ordering the chaos: creating websites using imperfect data
Andrew Stretton 
Oxford University Web SIG November 2014
Who am I, what is ChemBio Hub? 
• Andrew Stretton – Data Architect and Developer 
github.com/strets123 
@strets123 
LinkedIn (google me) 
• ChemBio Hub 
http://chembiohub.ox.ac.uk (feel free to link to us!) 
@oxchembiohub 
github.com/thesgc
ChemBio Hub exists to support research at the interface of chemistry and biology 
by enabling sharing of reagents, expertise and data across 20+ departments
Who are we trying to connect and how? 
User 1: Scientist at Oxford 
User 2: Potential collaborator (could be in industry or anywhere in academia) 
Stored and curated by ChemBio Hub 
Unpublished results 
Negative data 
Methods 
Equipment 
Reagents 
? Not sure yet 
Areas of expertise 
Questions and answers 
Contacts 
Publications 
Held on other sites or social networks 
Organised/linked to by ChemBio Hub
All of these parts require tagging entities in text. 
How can we do it cheaply and sustainably? 
What sorts of messy data are we working with? 
• Full text from procedures, biographies, web sites 
• Raw CSV/Excel formats from multiple machines 
or departmental processes 
• “Standard” XML and JSON formats from various 
sources that do not map perfectly to our 
application 
• Multiple external databases to submit data to
How do most of our users like their web-based tools? 
• Simple search 
• Flexible data management 
• Comprehensive, overlapping tagging 
• Clear progress, seamless experience
What do we sometimes give them? 
• Incomplete or many-to-one tagging 
• Hyperlinks instead of the right information 
from the other site 
• Dumb search 
• Inflexible schemas 
• Lack of linking between datasets
What strategies do we have to deal with messy data? 
• Create more helpful data management apps 
• Fill in gaps in tagging by using search engines 
• Consider creating databases of flat files 
• Create map reduce / database views for schema normalisation and data analysis 
• Web crawling - not as hard or messy as it used to be
Let’s look at filling in gaps in tagging with search engines first; 
happy to discuss the other areas later… 
How do we fill in gaps on un-tagged 
data? 
Let’s do an experiment… 
github.com/strets123/web-sig-2014/
Elasticsearch - information extraction on-the-fly 
• Take a dataset of 18801 companies 
– ~50% tagged 
– >80% have some text data 
[Bar chart: percentage of companies (0% to 100%) that have Tags, Description, Overview, and Overview or description] 
Source data: http://jsonstudio.com/resources/ 
github.com/strets123/web-sig-2014/
Use the “significant terms” feature… 
• Which description/overview words are most strongly 
linked to each tag? 
travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel] 
education: [students, teachers, learning, education, student, educational] 
music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans] 
realestate: [estate, real, agents, property, listings] 
Search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay] 
jobs: [job, jobs, employers, career] 
onlinemarketing: [marketing, seo, agency, optimization] 
projectmanagement: [project, projects, task, collaboration, teams, management]
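
Word lists like the ones above are what a `significant_terms` aggregation returns. The following is a minimal sketch of the request body for one tag. The field names `tag_list` and `description` and the aggregation name `linked_words` are assumptions for illustration, not taken from the talk's repository. On recent Elasticsearch versions an analysed text field may need the `significant_text` aggregation instead.

```python
import json


def significant_terms_query(tag, tag_field="tag_list",
                            text_field="description", size=10):
    """Build a search body asking which words are unusually common in
    documents carrying `tag`, compared with the index as a whole.

    tag_field and text_field are hypothetical field names.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the hits
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "linked_words": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_query("music")
print(json.dumps(body, indent=2))
```

The printed body can be sent to the `_search` endpoint of an index with whichever client is in use.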
Now let’s test these queries 
• Which companies have no tag but are most 
likely to need tagging with “music”… 
uPlaya 
Description uPlaya provides independent or unsigned musicians with immediate 
feedback on their music…. 
Category games_video 
Tags - 
Webceleb 
Description Webceleb is music marketplace and community where musicians 
and fans engage and profit from discovering, purchasing and 
downloading the latest independent music.…. 
Category games_video 
Tags -
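
One way to phrase "no tag, but probably music" is a boolean query: exclude documents that have a tag field, and rank the rest by how many of the tag's significant words they mention. This is a sketch in the query DSL of recent Elasticsearch versions, not necessarily how the talk's code does it. The field names are assumptions, and a dataset that stores an empty string instead of a missing field would need a different exclusion clause.

```python
def untagged_candidates_query(words, tag_field="tag_list",
                              text_field="description"):
    """Find documents without tags whose text mentions any of `words`.

    tag_field and text_field are hypothetical field names.
    """
    return {
        "query": {
            "bool": {
                # keep only documents where the tag field is absent
                "must_not": [{"exists": {"field": tag_field}}],
                # each matching word raises the relevance score
                "should": [{"match": {text_field: w}} for w in words],
                "minimum_should_match": 1,
            }
        }
    }


# Words taken from the "music" list on the previous slide.
music_query = untagged_candidates_query(["music", "artists", "musicians", "songs"])
```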
But what if we have 
NO TAGS?
A process to extract tags from text… 
• Index data 
• Assign resources (e.g. Amazon spot instance for large dataset) 
• List word counts with the least frequent first 
• Exclude lowest counts 
• Aggregate the significant terms for each word 
• Filter words that have a lot of high-scoring significant terms
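
The process above can be imitated in memory on a toy corpus. The sketch below is an assumption-laden stand-in, not the talk's code: it replaces Elasticsearch with plain Python, scores co-occurring words with a simplified JLH-style measure (absolute change times relative change in document frequency), and reads "filter" as "keep words that have several high-scoring terms". The thresholds and the example documents are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Simplified JLH-style score: 0 unless the word is more common
    in the foreground set than in the whole corpus."""
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)


def extract_tag_candidates(docs, min_df=2, min_score=1.0, min_terms=3):
    doc_words = [set(tokenize(d)) for d in docs]            # "index" the data
    df = Counter(w for words in doc_words for w in words)   # document counts
    vocab = {w for w, n in df.items() if n >= min_df}       # exclude lowest counts
    candidates = {}
    # Walk the vocabulary with the least frequent words first.
    for word in sorted(vocab, key=lambda w: (df[w], w)):
        fg_docs = [words for words in doc_words if word in words]
        fg_df = Counter(w for words in fg_docs for w in words if w in vocab)
        scored = [(jlh_score(n, len(fg_docs), df[w], len(doc_words)), w)
                  for w, n in fg_df.items()]
        strong = sorted(((s, w) for s, w in scored if s >= min_score),
                        key=lambda sw: (-sw[0], sw[1]))
        # Keep only words with several high-scoring significant terms.
        if len(strong) >= min_terms:
            candidates[word] = [w for _, w in strong]
    return candidates


docs = [  # hypothetical company descriptions
    "indie music labels and artists",
    "artists sell music to fans",
    "music labels sign indie artists",
    "cloud computing infrastructure",
    "hybrid cloud computing",
    "private cloud infrastructure and computing",
    "cheap flights and hotels",
    "book flights online",
]
candidates = extract_tag_candidates(docs)
for word, terms in candidates.items():
    print(f"{word}: {terms}")
```

On this corpus "and" and "flights" drop out because few other words are strongly linked to them, which is the effect the final filtering step aims for.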
What does this give us? 
athletes: [athletes, coaches, athlete, coach, sports, fans] 
avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game] 
clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure] 
dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features] 
dial: [dial, calling, calls, voip, number, call, voice, phone] 
exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health] 
indie: [indie, labels, artists, music] 
logos: [logos, branding, flash, design] 
pci: [pci, dss, hipaa, compliance, sensitive, compliant] 
portland: [portland, oregon, inc, founded] 
ringtones: [ringtones, ringtone, personalization, games] 
traders: [traders, forex, trader, trading, quotes, stock, trade] 
yellow: [yellow, pages, directory, local] 
abc: [abc, cnn, nbc, television] 
argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin] 
aviation: [aviation, aircraft, aerospace, defense, transportation] 
airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
What else can we do with this? 
• Filter words that have a lot of high-scoring significant terms 
• De-duplicate where large overlaps exist 
• Assign levels of tags in order of frequency 
• Use to categorise new data on the fly using percolate 
• Curate manually 
• Generate a sidebar menu 
• Use the Elasticsearch phrase suggester to create phrase tags 
github.com/strets123/web-sig-2014/
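
Two of these follow-up steps, de-duplicating overlapping tags and categorising new text on the fly, can be sketched without a search engine. The code below is an in-memory stand-in under stated assumptions: overlap is measured with Jaccard similarity, the 0.6 threshold and `min_hits` value are arbitrary, and the `labels` entry is a hypothetical near-duplicate added for the demonstration. In Elasticsearch itself the matching step would be handled by registering queries with the percolator.

```python
import re


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def deduplicate(tag_terms, threshold=0.6):
    """Keep the first tag of every group whose term lists overlap heavily."""
    kept = {}
    for tag, terms in tag_terms.items():
        if all(jaccard(terms, other) < threshold for other in kept.values()):
            kept[tag] = terms
    return kept


def categorise(text, tag_terms, min_hits=2):
    """Return tags whose term list shares at least min_hits words with text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(tag for tag, terms in tag_terms.items()
                  if len(words & set(terms)) >= min_hits)


tag_terms = {
    "indie": ["indie", "labels", "artists", "music"],   # from the slide
    "labels": ["labels", "indie", "music", "artists"],  # hypothetical duplicate
    "exercise": ["exercise", "sleep", "nutrition", "fitness",
                 "weight", "healthy", "health"],         # from the slide
}
unique_tags = deduplicate(tag_terms)
new_tags = categorise("An indie music store for unsigned artists", unique_tags)
print(list(unique_tags), new_tags)
```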
Advantages over direct curation / supervised learning: 
• Simplicity and pragmatism 
• Applicable to novel domains 
– e.g. Chemical Biology 
• Auto-generated tags choose more appropriate 
word combinations than manual curators 
• No need for complex data formats like RDF 
• Data from many sources can be mixed 
– e.g. categories from other universities’ sites…
Where might this technology lead? 
• How about a tag-based file system? 
• How about an implicit social network? 
• Elasticsearch is really easy to scale… 
• Which websites, filesystems and datasets do 
you need to categorise? 
– Do you really need RDF ontologies, curators etc. or 
can you just do something simple?
Summary 
• We now have many options to categorise and 
tidy up messy data 
• Managing variations on schemas takes a lot of 
resources – leave it to the data owners if you 
can! 
• When it comes to tagging… 
– Perfection is in the eye of the beholder 
– Sustainability is really important
Thanks 
• Thanks to the Research 
informatics team at the NDM 
Structural Genomics 
Consortium 
– Paul Barrett 
– Karen Porter 
– Michael O’Hagan 
– Brian Marsden 
– David Damerell 
– Sefa Garsot 
– Anthony Bradley 
• Thanks to the InfoDev team 
at IT services for answering 
my endless questions about 
webauth 
• Funders: 
– John Fell Fund 
– NDM Strategic 
– Wellcome Trust 
– Higher Education Funding 
Council 
• To everyone here for listening
Any Questions? 
• Andrew Stretton 
github.com/strets123 
@strets123 
LinkedIn (google me) 
• ChemBio Hub 
http://chembiohub.ox.ac.uk 
@oxchembiohub 
github.com/thesgc 
Simple example categorisation 
code in Python available here: 
github.com/strets123/web-sig-2014/
Appendix of other messy 
data techniques
How do we make it easy to 
add spreadsheet data to a 
system?
Working with flat files 
• Sometimes a flat file is the right schema for a 
dataset 
– User defined formats 
– Different types of research 
– Only some of the fields are relevant when 
comparing experiments 
– Data is not in memory unless needed 
• Pandas and HDF allow SQL-like queries on flat 
files
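
To show what a SQL-like query over a flat file looks like, here is a small sketch. It deliberately swaps the tools named on the slide (pandas and HDF) for the standard library's `csv` and `sqlite3` modules so that it runs without extra packages. The column names and values are invented example data.

```python
import csv
import io
import sqlite3

# Hypothetical flat file as it might come off an instrument.
FLAT_FILE = """sample,assay,ic50_um,operator
cmpd-1,kinase,0.8,ab
cmpd-2,kinase,12.5,cd
cmpd-3,binding,3.1,ab
"""

rows = list(csv.DictReader(io.StringIO(FLAT_FILE)))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (sample TEXT, assay TEXT, ic50_um REAL, operator TEXT)"
)
conn.executemany(
    "INSERT INTO results VALUES (:sample, :assay, :ic50_um, :operator)", rows
)

# Only the fields relevant to this comparison are selected.
potent = conn.execute(
    "SELECT sample, ic50_um FROM results "
    "WHERE assay = ? AND ic50_um < ? ORDER BY ic50_um",
    ("kinase", 5.0),
).fetchall()
print(potent)
```

With pandas the same idea would be a filter expression on a DataFrame or on an HDF store; the point is that the file keeps its own columns and the query picks what it needs.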
Helpful data management 
• Data Wrangler 
– https://player.vimeo.com/video/19185801 
• Raw 
– http://raw.densitydesign.org 
• Take these as inspiration for our tool for re-shaping 
biochemistry data
Simplifying web crawling 
• Modern web crawling patterns use class 
selectors instead of XPath 
– Less likelihood of change 
• Content can be crawled using a backend web 
browser 
– Dynamic JavaScript elements are included 
• Using a website’s data for classification is 
more acceptable than wholesale reproduction
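
As an illustration of selecting by class instead of by position in the document, the sketch below collects the text of every element carrying a given class. It uses only the standard library's `html.parser`; a real crawler would more likely use a CSS-selector library or a headless browser. The HTML snippet and class names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # no closing tag


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that has `wanted_class`."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0        # > 0 while inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1               # nested element inside a match
        elif self.wanted_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace in the collected text.
            self.results.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


PAGE = """
<div class="profile">
  <h2 class="name">Dr A. Example</h2>
  <p class="expertise">Protein <em>crystallography</em> and fragment screening</p>
  <p class="expertise other">Chemical probes</p>
</div>
"""

extractor = ClassTextExtractor("expertise")
extractor.feed(PAGE)
print(extractor.results)
```

If the site later wraps the paragraphs in another container, a position-based path would break while the class match keeps working.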
Managing multiple JSON schemas with views 
• PostgreSQL – also supported by Rails/ActiveRecord 
• Couchbase
Why views over JSON can be useful 
• Expose only required fields from e.g. RDF 
• Input format may change but we don’t want 
crawler to break 
• Required fields may change 
• Versions are easy to support if format 
normalisation is in the database layer 
• Storage is cheap 
• View code is executed only once
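
The idea behind a view over JSON can be shown at application level: store each document as received and let one small function expose only the fields the application needs, whatever version of the input format it came in. This is a plain-Python analogue of a map-style view, not PostgreSQL or Couchbase code. Both input formats and all field names are hypothetical.

```python
import json

# Two hypothetical versions of the same crawler output, stored as received.
STORED = [
    '{"name": "uPlaya", "tag_list": "music, feedback", "extra": {"crawl": 1}}',
    '{"title": "Webceleb", "tags": ["music", "marketplace"], "schema": 2}',
]


def company_view(doc):
    """Expose only the required fields, normalising both input versions."""
    name = doc.get("name") or doc.get("title")
    raw_tags = doc.get("tags", doc.get("tag_list", []))
    if isinstance(raw_tags, str):
        # Older format: comma-separated string instead of a list.
        raw_tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
    return {"name": name, "tags": sorted(raw_tags)}


view = [company_view(json.loads(raw)) for raw in STORED]
print(view)
```

When the input format changes again, only `company_view` needs a new branch; the stored documents and the code that consumes the view stay as they are.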

Population Growth in Bataan: The effects of population growth around rural pl...
 

Ordering the chaos: Creating websites with imperfect data

  • 1. Ordering the chaos: creating websites using imperfect data Andrew Stretton Oxford University Web SIG November 2014
  • 2. Who am I, what is ChemBio Hub? • Andrew Stretton – Data Architect and Developer github.com/strets123 @strets123 linkedin (google me) • Chembio Hub http://chembiohub.ox.ac.uk (feel free to link to us!) @oxchembiohub github.com/thesgc
  • 3. Chembio Hub exists to support research at the interface of chemistry and biology by enabling sharing of reagents, expertise and data across 20+ departments
  • 4. Who are we trying to connect and how? User 1: Scientist at Oxford User 2: Potential collaborator Could be in industry or anywhere in academia Stored and curated by ChemBio Hub Unpublished results Negative Data Methods Equipment Reagents ? Not sure yet Areas of expertise Questions and answers Contacts Publications Held on other sites or social networks Organised/linked to by ChemBio Hub
  • 5. All of these parts require tagging entities in text. How can we do it cheaply and sustainably? Who are we trying to connect and how? User 1: Scientist at Oxford User 2: Potential collaborator Could be in industry or anywhere in academia Stored and curated by ChemBio Hub Unpublished results Negative Data Methods Equipment Reagents ? Not sure yet Areas of expertise Questions and answers Contacts Publications Held on other sites or social networks Organised/linked to by ChemBio Hub
  • 6. What sorts of messy data are we working with? • Full text from procedures, biographies, web sites • Raw CSV/ Excel formats from multiple machines or departmental processes • “Standard” XML and JSON formats from various sources that do not map perfectly to our application • Multiple external databases to submit data to
  • 7. How do most of our users like their web-based tools? Simple Search Flexible data management Comprehensive, overlapping tagging Clear progress, seamless experience
  • 8. What do we sometimes give them? • Incomplete or many-to-one tagging • Hyperlinks instead of the right information from the other site • Dumb search • Inflexible schemas • Lack of linking between datasets
  • 9. What strategies do we have to deal with messy data? Create more helpful data management apps Fill in gaps in tagging by using search engines Consider creating databases of flat files Create map reduce / Database views for schema Normalisation and data analysis Web crawling - not as hard or messy as it used to be
  • 10. Let’s look at this one first, happy to discuss other areas later… What strategies do we have to deal with messy data? Create more helpful data management apps Fill in gaps in tagging by using search engines Consider creating databases of flat files Create map reduce / Database views for schema Normalisation and data analysis Web crawling - not as hard or messy as it used to be
  • 11. How do we fill in gaps on un-tagged data? Let’s do an experiment… github.com/strets123/web-sig-2014/
  • 12. Elasticsearch - information extraction on-the-fly • Take a dataset of 18801 companies: ~50% tagged, >80% have some text data [Chart: share of records, 0% to 100%, with Tags, Description, Overview, Overview or description] Source data: http://jsonstudio.com/resources/ github.com/strets123/web-sig-2014/
  • 13. Use the “significant terms” feature… • Which description/overview words are most strongly linked to each tag? travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel] education: [students, teachers, learning, education, student, educational] music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans] realestate: [estate, real, agents, property, listings] Search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay] jobs: [job, jobs, employers, career] onlinemarketing: [marketing, seo, agency, optimization] projectmanagement: [project, projects, task, collaboration, teams, management]
  • 14. Now let’s test these queries • Which companies have no tag but are most likely to need tagging with “music”… uPlaya Description uPlaya provides independent or unsigned musicians with immediate feedback on their music…. Category games_video Tags - Webceleb Description Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music.…. Category games_video Tags -
  • 15. But what if we have NO TAGS?
  • 16. A process to extract tags from text… Index Data Assign resources (e.g. Amazon spot instance for large dataset) List word counts with the least frequent first Exclude lowest counts Aggregate the significant terms for each word Filter words that have a lot of high scoring significant terms
  • 17. What does this give us? athletes: [athletes, coaches, athlete, coach, sports, fans] avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game] clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure] dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features] dial: [dial, calling, calls, voip, number, call, voice, phone] exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health] indie: [indie, labels, artists, music] logos: [logos, branding, flash, design] pci: [pci, dss, hipaa, compliance, sensitive, compliant] portland: [portland, oregon, inc, founded] ringtones: [ringtones, ringtone, personalization, games] traders: [traders, forex, trader, trading, quotes, stock, trade] yellow: [yellow, pages, directory, local] abc: [abc, cnn, nbc, television] argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin] aviation: [aviation, aircraft, aerospace, defense, transportation] airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
  • 18. What else can we do with this? Filter words that have a lot of high scoring significant terms De duplicate where large overlaps exist Assign levels of tags in order of frequency Use to categorise new data on the fly using percolate Curate manually Generate a sidebar menu github.com/strets123/web-sig-2014/ Use elasticsearch phrase suggester to create phrase tags
  • 19. Advantages over direct curation / supervised learning: • Simplicity and pragmatism • Applicable to novel domains – e.g. Chemical Biology • Auto-generated tags choose more appropriate word combinations than manual curators • No need for complex data formats like RDF • Data from many sources can be mixed – e.g. categories from other universities’ sites…
  • 20. Where might this technology lead? • How about a tag-based file system? • How about an implicit social network? • Elasticsearch is really easy to scale… • Which websites, filesystems and datasets do you need to categorise? – Do you really need RDF ontologies, curators etc. or can you just do something simple?
  • 21. Summary • We now have many options to categorise and tidy up messy data • Managing variations on schemas takes a lot of resources – leave it to the data owners if you can! • When it comes to tagging… – Perfection is in the eye of the beholder – Sustainability is really important
  • 22. Thanks • Thanks to the Research informatics team at the NDM Structural Genomics Consortium – Paul Barrett – Karen Porter – Michael O’Hagan – Brian Marsden – David Damerell – Sefa Garsot – Anthony Bradley • Thanks to the InfoDev team at IT services for answering my endless questions about webauth • Funders: – John Fell Fund – NDM Strategic – Wellcome Trust – Higher Education Funding Council • To everyone here for listening
  • 23. Any Questions? • Andrew Stretton github.com/strets123 @strets123 linkedin (google me) • Chembio Hub http://chembiohub.ox.ac.uk @oxchembiohub github.com/thesgc Simple example categorisation code available here in python github.com/strets123/web-sig-2014/
  • 24. Appendix of other messy data techniques
  • 25. How do we make it easy to add spreadsheet data to a system?
  • 26. Working with flat files • Sometimes a flat file is the right schema for a dataset – User defined formats – Different types of research – Only some of the fields are relevant when comparing experiments – Data is not in memory unless needed • Pandas and HDF allow SQL-like queries on flat files
  • 27. Helpful data management • Data Wrangler – https://player.vimeo.com/video/19185801 • Raw – http://raw.densitydesign.org • Take these as inspiration for our tool for re-shaping biochemistry data
  • 28. Simplifying web crawling • Modern web crawling patterns use class selectors instead of XPath – Less likelihood of change • Content can be crawled using a backend web browser – Dynamic JavaScript elements are included • Using a website’s data for classification is more acceptable than wholesale reproduction
  • 29. Managing multiple JSON schemas with views PostgreSQL – also supported by Rails/ActiveRecord Couchbase
  • 30. Why views over JSON can be useful • Expose only required fields from e.g. RDF • Input format may change but we don’t want crawler to break • Required fields may change • Versions are easy to support if format normalisation is in the database layer • Storage is cheap • View code is executed only once
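
The idea in slides 29 and 30 (store the raw JSON untouched, and keep format normalisation in a view) can be sketched in a few lines. This is a plain-Python stand-in rather than PostgreSQL or Couchbase view code, and the schema versions, field names and documents are all invented for illustration.

```python
import json

# Hypothetical raw documents, stored exactly as they arrived from two sources.
RAW = [
    '{"schema": 1, "name": "Acme", "tag_list": "music, travel", "extra": "ignored"}',
    '{"schema": 2, "title": "Beta", "tags": ["jobs"], "meta": {"crawled": true}}',
]


def _from_v1(doc):
    # Hypothetical schema 1 keeps tags as one comma-separated string.
    tags = [t.strip() for t in doc.get("tag_list", "").split(",") if t.strip()]
    return {"name": doc["name"], "tags": tags}


def _from_v2(doc):
    # Hypothetical schema 2 renames the name field and stores tags as a list.
    return {"name": doc["title"], "tags": list(doc.get("tags", []))}


# One small mapping function per input format; supporting a new version
# means adding an entry here, not rewriting the stored data.
VIEWS = {1: _from_v1, 2: _from_v2}


def normalised_view(raw_docs):
    """Yield only the required fields, whatever schema a document uses."""
    for raw in raw_docs:
        doc = json.loads(raw)
        yield VIEWS[doc["schema"]](doc)


if __name__ == "__main__":
    for row in normalised_view(RAW):
        print(row)
```

If the required fields change, only the mapping functions change; the crawler and the stored documents stay as they are. A database view would play the same role as `normalised_view` here.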

Editor's Notes

  1. Real-world data is not: perfectly tagged; in one place; in one format; in one technology stack. Spreadsheet processes don’t just disappear when you build a tool.
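
Since real data is rarely perfectly tagged, the gap-filling experiment from slides 13 and 14 can be illustrated without a search cluster. The sketch below is not the code in the talk's repository and does not call Elasticsearch. It scores words with a formula modelled on the JLH heuristic, which as far as I know is the default for the significant terms aggregation, and then looks for untagged records that contain the top words. The company names, descriptions and the `min_docs` cut-off are invented.

```python
import re
from collections import Counter

# Hypothetical records: some tagged, some not.
DOCS = [
    {"name": "Alpha", "description": "Streaming music for indie artists and fans", "tags": ["music"]},
    {"name": "Bravo", "description": "Tools for musicians to share music with fans", "tags": ["music"]},
    {"name": "Charlie", "description": "Book flights and hotels for your next trip", "tags": ["travel"]},
    {"name": "Delta", "description": "Compare hotels and flights for travel", "tags": ["travel"]},
    {"name": "Echo", "description": "Feedback for unsigned musicians on their music", "tags": []},
    {"name": "Foxtrot", "description": "Project task tracking for teams", "tags": []},
]


def tokenize(text):
    """Lower-case a text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def significant_terms(docs, tag, top_n=5, min_docs=2):
    """Rank words that are over-represented in descriptions carrying `tag`."""
    background, foreground = Counter(), Counter()
    n_fg = 0
    for doc in docs:
        words = tokenize(doc["description"])
        background.update(words)
        if tag in doc["tags"]:
            foreground.update(words)
            n_fg += 1
    if n_fg == 0:
        return []
    scores = {}
    for word, fg_count in foreground.items():
        if fg_count < min_docs:  # exclude the lowest counts, as on slide 16
            continue
        fg_pct = fg_count / n_fg
        bg_pct = background[word] / len(docs)
        # JLH-style score: absolute change multiplied by relative change.
        scores[word] = (fg_pct - bg_pct) * (fg_pct / bg_pct)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, score in ranked if score > 0][:top_n]


def suggest_untagged(docs, terms):
    """Names of untagged records matching any of `terms`, best match first."""
    hits = []
    for doc in docs:
        if doc["tags"]:
            continue
        overlap = len(tokenize(doc["description"]) & set(terms))
        if overlap:
            hits.append((overlap, doc["name"]))
    return [name for _, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


if __name__ == "__main__":
    terms = significant_terms(DOCS, "music")
    print(terms, suggest_untagged(DOCS, terms))
```

Words that appear everywhere (such as "for" in this toy data) score zero and drop out, which is why no stop-word list is needed in the sketch. On a real dataset the same counting would be left to the search engine.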