Adding Open Data Value
to 'Closed Data' Problems
Dr Simon Price
Research Fellow, University of Bristol
Data Scientist, Capgemini Insights & Data
Who am I?
• 30 years software development and leadership roles
• Moved into Data Science via PhD in Machine Learning (2014)
• Research Fellow in Machine Learning group
  - ~20 Machine Learning researchers
• Led project to establish Bristol’s open research data repository
• One of the organisers of Open Data Institute (ODI) Bristol
• Data Scientist in Big Data Analytics team
  - ~100 Data Scientists, Big Data Engineers and Data Analysts
• Focus on Open Source and Big Data technologies to solve client problems
Outline
1. Case study: open data + ‘closed data’
2. Deriving value from open data
3. Data Science with ‘closed data’
Case study: SubSift
Conferences using SubSift
• ECML-PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
• KDD: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• SDM: SIAM International Conference on Data Mining
Journals using SubSift
• Machine Learning
• Data Mining and Knowledge Discovery
https://doi.org/10.1145/2979672
Initial problem addressed by SubSift
Matching submitted conference papers to possible reviewers in Programme Committee
confidential
‘closed data’
open data
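A minimal sketch of the matching problem above, assuming a vector space model with tf-idf weighting and cosine similarity. The paper and reviewer texts, the identifiers and the function names are made up for illustration; this is not SubSift's actual code or API.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf so that terms found in every document keep a small weight.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reviewers(papers, reviewers):
    """For each paper id, list (reviewer id, similarity) pairs, best match first."""
    paper_ids, reviewer_ids = list(papers), list(reviewers)
    texts = [papers[p] for p in paper_ids] + [reviewers[r] for r in reviewer_ids]
    vectors = tfidf_vectors(texts)
    paper_vecs = vectors[:len(paper_ids)]
    reviewer_vecs = vectors[len(paper_ids):]
    ranking = {}
    for pid, pvec in zip(paper_ids, paper_vecs):
        scores = [(rid, cosine(pvec, rvec)) for rid, rvec in zip(reviewer_ids, reviewer_vecs)]
        ranking[pid] = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return ranking


# Made-up example: a confidential abstract and openly published reviewer text.
papers = {
    "paper-1": "Kernel methods for graph classification with support vector machines",
}
reviewers = {
    "reviewer-a": "support vector machines kernel learning classification margins",
    "reviewer-b": "bayesian topic models for text corpora and document clustering",
}
ranking = rank_reviewers(papers, reviewers)
print(ranking["paper-1"])
```

In this sketch the confidential abstracts never need to be published: only the similarity scores leave the matching step.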
Initial SubSift workflow
Generic SubSift workflow
Personalised session recommendations
Expert finding
Why did SubSift recommend this person?
Profiling our organisation
Profiling staff at meetings
Open data opportunities?
Open research data
• data.bris.ac.uk
• Research data storage facility
• Each researcher gets 10 TB "forever"
Open government data
• 140+ datasets live on opendata.bristol.gov.uk
• Mostly static but some real-time data
• Examples
  - Government: Elections since 2007
  - Community: Quality of Life survey
  - Education: School Results
  - Energy: Installed PV, Energy Use in Council Buildings
  - Environment: Real-time & Historic Air Quality, Flood Alerts (EA)
  - Land use: 2013 Planning applications
  - Health: Life expectancy / Mortality, Obesity, NHS Spend
Deriving value from open data
1. Data Science
2. Using open data to enrich and connect ‘closed data’
[Diagram: data science shown in relation to statistics, software engineering and machine learning, and to application domains and research domains]
Big Data Analytics
Insights & Data
www.capgemini.com/insights-data
Copyright © Capgemini 2017. All Rights Reserved
June 2017
Example Data Science application
Assurance Scoring
http://ow.ly/4nbEUI
Using existing enterprise data plus any
useful open data, detect potentially
fraudulent transactions
Machine Learning
Machine Learning Framework (Python, Scala, Spark)
• Input Data
• Manipulate, Explore Data
• Vector Build
• Feature Extraction and Selection: Transform, Selection
• Model Building: Model (Training, Validation, Test)
• Variety of output files: logs, graphics, saved models, etc.
• Testing: Unit tests, monitoring tests and integration tests
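The Training, Validation and Test stages named on this slide can be sketched as a three-way split of the input data. The split proportions, the seed and the function name are assumptions for illustration, not details of the framework described in the talk.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test sets.

    The test set is whatever remains after the training and validation shares.
    """
    if train <= 0 or validation <= 0 or train + validation >= 1:
        raise ValueError("train and validation must be positive and sum to less than 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


rows = list(range(100))  # stand-in for feature vectors
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Fit the model on the training set, tune it on the validation set, and use the test set only once for the final estimate.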
Graph Links - Matching
Key part of assurance scoring: bringing data together from disparate sources
Probability of Match: 80%

Attribute        Data Source 1    Data Source 2
Name             Richard Smith    Rich Smith
Phone Number     07123 456 789    07123 456 798
Favourite Sport  Football         Cricket
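A minimal sketch of how a match score between two such records might be computed, assuming simple string similarity from Python's difflib and hypothetical attribute weights. It does not reproduce the 80% figure on the slide, and it is not the method used in the assurance scoring work.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each attribute counts towards the overall score.
WEIGHTS = {"name": 0.5, "phone": 0.4, "sport": 0.1}


def normalise(value):
    """Lower-case a value and remove all whitespace before comparison."""
    return "".join(value.lower().split())


def similarity(a, b):
    """String similarity in the range 0 to 1, using difflib's ratio."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def match_score(record_a, record_b, weights=WEIGHTS):
    """Weighted average of per-attribute similarities between two records."""
    total = sum(weights.values())
    weighted = sum(
        weight * similarity(record_a[field], record_b[field])
        for field, weight in weights.items()
    )
    return weighted / total


source_1 = {"name": "Richard Smith", "phone": "07123 456 789", "sport": "Football"}
source_2 = {"name": "Rich Smith", "phone": "07123 456 798", "sport": "Cricket"}
unrelated = {"name": "Ana Lopez", "phone": "07999 000 111", "sport": "Tennis"}  # made up

print(match_score(source_1, source_2))
print(match_score(source_1, unrelated))
```

A weighted average is only one option; a trained classifier over the per-attribute similarities would give a calibrated probability instead of a raw score.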
Advanced matching
Related to:
- record linkage
- duplicate detection
- reference resolution
- object identity
- entity matching
Connect graph descriptions using background knowledge from open data sources, e.g. Linked Open Data
Linked Open Data
Data Science with ‘closed data’

The information contained in this presentation is proprietary.
© 2012 Capgemini. All rights reserved.
www.capgemini.com
About Capgemini
With more than 120,000 people in 40 countries, Capgemini is one
of the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2011 global revenues
of EUR 9.7 billion.
Together with its clients, Capgemini creates and delivers
business and technology solutions that fit their needs and drive
the results they want. A deeply multicultural organization,
Capgemini has developed its own way of working, the
Collaborative Business Experience™, and draws on Rightshore®,
its worldwide delivery model.
Rightshore® is a trademark belonging to Capgemini
Problems of opening up ‘closed data’
Research data now open by default - including sensitive data
Funders
Journals
data.bris has 3 levels of access:
Data Science with ‘closed data’
• Custom R server running inside secure data repository / warehouse
• Enables non-disclosive, remote analysis of sensitive research data
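One way to picture non-disclosive analysis: the server returns only aggregate statistics and suppresses any group too small to hide an individual. This is a sketch in Python, not the R server from the talk; the threshold of 5, the field names and the function name are assumptions, not DataSHIELD settings.

```python
from statistics import mean

MIN_GROUP_SIZE = 5  # hypothetical disclosure threshold, not a DataSHIELD default


def safe_group_means(rows, group_key, value_key, min_size=MIN_GROUP_SIZE):
    """Return count and mean per group, suppressing groups smaller than min_size."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    summary = {}
    for group, values in groups.items():
        if len(values) < min_size:
            # Too few individuals: a mean could reveal something about one person.
            summary[group] = None
        else:
            summary[group] = {"count": len(values), "mean": mean(values)}
    return summary


# Made-up sensitive records held inside the secure repository.
rows = (
    [{"ward": "A", "age": age} for age in (31, 42, 55, 47, 38, 60)]
    + [{"ward": "B", "age": age} for age in (29, 71)]
)
summary = safe_group_means(rows, "ward", "age")
print(summary)
```

The analyst sees the summary only; the individual rows stay inside the repository.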
Non-disclosive visualisation
[Figure: plots of Number of Words and Number of Letters, comparing non-disclosive and disclosive versions]
Single-partition DataSHIELD
Multiple-partition DataSHIELD
DataSHIELD partition models: horizontal, vertical, ideal
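A minimal sketch of analysis over a horizontal partition, assuming that "horizontal" means each site holds the same variables for different individuals. Each site releases only a sum and a count, and the analyst pools them. The function names and the values are illustrative, not DataSHIELD's API.

```python
def local_summary(values):
    """Run inside one site: return only the sum and count, never the raw values."""
    return {"total": sum(values), "count": len(values)}


def pooled_mean(summaries):
    """Run by the analyst: combine per-site aggregates into one overall mean."""
    count = sum(s["count"] for s in summaries)
    if count == 0:
        raise ValueError("no observations in any partition")
    return sum(s["total"] for s in summaries) / count


# Made-up measurements held at two separate sites.
site_a = [4.0, 6.0, 8.0]
site_b = [10.0, 12.0]

summaries = [local_summary(site_a), local_summary(site_b)]
overall = pooled_mean(summaries)
print(overall)
```

The pooled result equals the mean of the combined data, yet neither site has to share individual-level records.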
http://www.simonprice.info
simon.price@capgemini.com
@simonprice_info

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 

Adding Open Data Value to 'Closed Data' Problems

  • 1. Adding Open Data Value to 'Closed Data' Problems Dr Simon Price Research Fellow, University of Bristol Data Scientist, Capgemini Insights & Data
  • 2. Who am I? • 30 years software development and leadership roles • Moved into Data Science via PhD in Machine Learning (2014) • Research Fellow in Machine Learning group (~20 Machine Learning researchers) • Led project to establish Bristol’s open research data repository • One of the organisers of Open Data Institute (ODI) Bristol • Data Scientist in Big Data Analytics team (~100 Data Scientists, Big Data Engineers and Data Analysts) • Focus on Open Source and Big Data technologies to solve client problems
  • 3. Outline 1. Case study: open data + ‘closed data’ 2. Deriving value from open data 3. Data Science with ‘closed data’
  • 4. Case study: SubSift Conferences using SubSift • ECML-PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases • KDD: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining • PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining • SDM: SIAM International Conference on Data Mining Journals using SubSift • Machine Learning • Data Mining and Knowledge Discovery https://doi.org/10.1145/2979672
  • 5. Initial problem addressed by SubSift Matching submitted conference papers to possible reviewers in Programme Committee
  • 11. Why did SubSift recommend this person?
  • 13. Profiling staff at meetings
  • 15. Open research data • data.bris.ac.uk • Research data storage facility • Each researcher gets 10TB "forever"
  • 17. Open government data • 140+ datasets live on opendata.bristol.gov.uk • Mostly static but some real-time data • Examples: Government: Elections since 2007; Community: Quality of Life survey; Education: School Results; Energy: Installed PV, Energy Use in Council Buildings; Environment: Real time & Historic Air Quality, Flood Alerts (EA); Land use: 2013 Planning applications; Health: Life expectancy/Mortality, Obesity, NHS Spend
  • 19. Deriving value from open data 1. Data Science 2. Using open data to enrich and connect ‘closed data’
  • 23. Big Data Analytics Insights & Data www.capgemini.com/insights-data
  • 24. Example Data Science application: Assurance Scoring http://ow.ly/4nbEUI Using existing enterprise data plus any useful open data, detect potentially fraudulent transactions. Copyright © Capgemini 2017. All Rights Reserved. June 2017
  • 25. Example Data Science application: Assurance Scoring http://ow.ly/4nbEUI
  • 26. Machine Learning Framework (Python, Scala, Spark) • Diagram stages: Input Data; Manipulate, Explore Data; Feature Extraction and Selection (Transform, Selection); Vector Build; Model Building (Model Training, Validation, Test) • Variety of output files: logs, graphics, saved models, etc. • Testing: unit tests, monitoring tests and integration tests
  • 27. Graph Links - Matching. Key part of assurance scoring: bringing data together from disparate sources.
      Attribute | Data Source 1 | Data Source 2
      Name | Richard Smith | Rich Smith
      Phone Number | 07123 456 789 | 07123 456 798
      Favourite Sport | Football | Cricket
      Probability of Match: 80%
  • 28. Advanced matching • Connect graph descriptions using background knowledge from open data sources, e.g. Linked Open Data • Related to: record linkage, duplicate detection, reference resolution, object identity, entity matching
  • 29. Linked Open Data
  • 30. Data Science with ‘closed data’ • The information contained in this presentation is proprietary. © 2012 Capgemini. All rights reserved. www.capgemini.com • About Capgemini: With more than 120,000 people in 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2011 global revenues of EUR 9.7 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model. Rightshore® is a trademark belonging to Capgemini.
  • 31. Problems of opening up ‘closed data’
  • 32. Research data now open by default - including sensitive data • Funders • Journals • data.bris has 3 levels of access:
  • 37. Data Science with ‘closed data’
  • 38. Data science with ‘closed data’ • Custom R server running inside secure data repository / warehouse • Enables non-disclosive, remote analysis of sensitive research data.
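The matching example on slide 27 (two records for what may be the same person, scored with a probability of match) can be illustrated with a minimal Python sketch. This is not the method used in the presentation: the attribute weights, the `similarity` and `match_score` helpers, and the choice of the standard-library `difflib.SequenceMatcher` are all assumptions made for illustration, and the resulting score is a heuristic rather than a calibrated probability.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights (not from the slides): identifying
# attributes count for more than incidental ones. They sum to 1.
WEIGHTS = {"name": 0.5, "phone": 0.4, "sport": 0.1}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 string similarity after lowercasing and removing whitespace."""
    norm_a = "".join(a.lower().split())
    norm_b = "".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted average of per-attribute similarities, in the range 0..1."""
    return sum(w * similarity(rec1[k], rec2[k]) for k, w in WEIGHTS.items())


# The two records shown on the slide.
source_1 = {"name": "Richard Smith", "phone": "07123 456 789", "sport": "Football"}
source_2 = {"name": "Rich Smith", "phone": "07123 456 798", "sport": "Cricket"}

score = match_score(source_1, source_2)
print(f"match score: {score:.2f}")
```

A production record-linkage system would typically learn the weights from labelled pairs and could draw on background knowledge from open data sources, as slide 28 suggests, rather than rely on string similarity alone.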