The LinkedGov extension for Google Refine

@danpaulsmith

What is LinkedGov?

A community project aiming to make public data more usable

  Cleaning
  Improving access
  Enriching
  Linking

Data flow

Import existing data (CSV, Excel, XML)
  -> LinkedGov database & core components (data is stored as machine-readable data)
       -> Cleaning tasks
       -> Question site
       -> data.linkedgov.org




What is Google Refine?

 “A power tool for working with messy data”
              “cleaning it up”,
             “ transforming it”,
               “extending it”,
              “and linking it”


Spreadsheet software

Spreadsheet software     Google Refine

Single-cell editing      Bulk-editing
Create & input data      Use & transform existing data
Document-based           Data-based
                         Allows extensions to be installed




Transposition, multi-valued cells, clustering, faceting, filtering




What does the LinkedGov extension do?

Image courtesy of http://download.chip.eu

Typing wizards

Date & time   Measurements   Geolocations   Addresses




Other wizards




Columns to rows   Rows to columns   Blank values   Codes and symbols




Cleaning




Enriching




What a machine understands before (CSV, TSV, Excel)

       Column   Column   Column   Column   Column   Column   Column
Row    number   word     number   word     date     number   number
Row    number   word     number   word     date     number   number
Row    number   word     number   word     date     number   number
Row    number   word     number   word     date     number   number
Row    number   word     number   word     date     number   number




What a machine understands after (machine-readable format)

           Temp      Name     Gas/hour   Postcode   Date   Water/hour   Height
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres
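
As a rough illustration of the before/after contrast, the sketch below attaches a type and a unit to each column and converts raw strings accordingly. The column names mirror the slide; the `COLUMN_TYPES` mapping, the `type_row` helper, the sample values and the ISO date format are all assumptions made for this example, not part of the LinkedGov extension.

```python
from datetime import date

# Hypothetical column descriptions, loosely following the slide's "after" table.
COLUMN_TYPES = {
    "Temp": ("number", "Celsius"),
    "Name": ("string", None),
    "Gas/hour": ("number", "kWh"),
    "Postcode": ("postcode", None),
    "Date": ("date", None),
    "Water/hour": ("number", "m3"),
    "Height": ("number", "metres"),
}


def type_row(raw_row):
    """Convert a row of raw strings into (value, unit) pairs using COLUMN_TYPES."""
    typed = {}
    for column, text in raw_row.items():
        kind, unit = COLUMN_TYPES[column]
        if kind == "number":
            value = float(text)
        elif kind == "date":
            value = date.fromisoformat(text)  # assumes ISO dates such as 2011-10-03
        elif kind == "postcode":
            value = " ".join(text.upper().split())  # normalises case and spacing only
        else:
            value = text
        typed[column] = (value, unit)
    return typed


# Invented sample row for illustration only.
raw = {
    "Temp": "21.5",
    "Name": "Town Hall",
    "Gas/hour": "12",
    "Postcode": "sw1a  1aa",
    "Date": "2011-10-03",
    "Water/hour": "0.4",
    "Height": "18",
}
typed = type_row(raw)
```

Once every value carries a type and a unit, a machine can tell a temperature from a height even though both arrive as plain numbers in the original file.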




The power of linking

Latitude & longitude   Postcodes                 Dates             Measurements

NHS geo data           GP Surgery address data   NHS events data   GP Surgery energy use data
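
To make the idea concrete, here is a minimal sketch of linking two datasets on a shared postcode column. The surgery names, postcodes and energy figures are invented, and `link_on` is a hypothetical helper rather than anything provided by LinkedGov or Google Refine.

```python
def link_on(key, left_rows, right_rows):
    """Join two lists of dicts on a shared key (a simple inner join)."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    linked = []
    for row in left_rows:
        for match in index.get(row[key], []):
            linked.append({**match, **row})
    return linked


# Invented sample data for illustration only.
addresses = [
    {"postcode": "AB1 2CD", "surgery": "Example Surgery A"},
    {"postcode": "EF3 4GH", "surgery": "Example Surgery B"},
]
energy_use = [
    {"postcode": "AB1 2CD", "kwh": 1200},
    {"postcode": "ZZ9 9ZZ", "kwh": 800},
]

linked = link_on("postcode", addresses, energy_use)
```

Only rows whose postcode appears in both datasets survive the join, which is why a consistently formatted shared column matters so much for linking.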



Data flow

Import existing data (CSV, Excel, XML)
  -> LinkedGov database & core components (data exists as linked data)
       -> Cleaning tasks
       -> Question site
       -> data.linkedgov.org




Cleaning tasks




Data flow

Import existing data (CSV, Excel, XML)
  -> LinkedGov database & core components (data exists as linked data)
       -> Cleaning tasks
       -> Question site
       -> data.linkedgov.org




Question site




Data flow

Import existing data (CSV, Excel, XML)
  -> LinkedGov database & core components (data exists as linked data)
       -> Cleaning tasks
       -> Question site
       -> data.linkedgov.org




data.linkedgov.org




Feedback & questions



  http://linkedgov.org - Website

  http://wiki.linkedgov.org - Wiki

  @LinkedGov - Twitter

  #linkedgov - IRC (Freenode.net)






Editor's Notes

  1. Me. Recent graduate. I have been building interfaces and visualisations for the last two years on government projects themed on transparency, big data, open data and linked, machine-readable data. This is a presentation on an interface I’ve been building for LinkedGov recently.
  2. When you’re looking for public data, it can be quite hard to find (you need to create accounts, you arrive at broken download links, searches fail due to a lack of metadata). Once you’ve found the data, it can be in the wrong format (so you then begin the time-consuming process of converting it into a format you can work with). Then, once you’ve started working with the data, you can find it mysterious and lacking in explanation. So LinkedGov makes life easier by: 1. Cleaning data (spelling mistakes, formats…). 2. Improving access (format of choice, APIs, high-quality metadata). 3. Enriching data (labels and descriptions for the data at a fine-grained level, using online vocabularies to describe what the data contains). 4. Linking datasets to each other.
  3. The purple block here is Google Refine, which is used to import the data. The imported data is then cleaned and enriched by the LinkedGov extension. The final step of the import process is to store the data in LinkedGov’s database in a machine-understandable format. With the data stored, we can then do a few things: 1. Create “cleaning tasks” for the community that help fix errors in the data. 2. Power a “question site” that lets non-technical users build queries against datasets. 3. Power a technical search site, aimed at developers, that helps them find the data they want.
  4. Free. Open source. Runs in the web browser.
  5. This is what Refine looks like. It is a little bit like spreadsheet software: you have columns and rows. But you don’t have any toolbars allowing you to edit the style, insert charts or generate reports. That’s because…
  6. Refine has some key differences to spreadsheet software. Spreadsheet software focuses on single-cell editing and inputting of data; Refine focuses on editing hundreds of rows and columns at the same time. Spreadsheet software is largely for creating and capturing data; Refine is for reshaping and transforming existing data. Spreadsheet software is very document-based, allowing you to style the data, use multiple pages or insert media; Refine is data-based, only allowing you to alter the structure and values of the data. Refine also allows people to build extensions for it!
  7. However. Cleaning and transforming data *is*complicated. A non-technical personwill get confused. Google Refine is designed for programmers / frequent data-wranglers…It would be useful if the people who create or own the data are able to clean the data themselves (they after all should know the most about it).
  8. Hides the technical stuff! Instead, asks the user questions about their data… Creates clean, formatted, machine-readable data.
  9. So what are we asking the user? We ask them “can you spot any of these things in your data?”. Why do we ask these things? These four types of data are a good starting ground for linking datasets, as they are common across most datasets. --------- If multiple datasets contain the same time span, you can try to compare them to see if there’s anything that connects. If multiple datasets contain the same measurements (e.g. kilowatts per hour), it’s a good starting point to see if any of them relate. If multiple datasets contain latitude and longitude values, you can gather and compare data spatially and begin to plot things on maps, which everybody seems to love. If multiple datasets contain postcodes, and if any of them match, you automatically have a number of different types of information for each postcode. --------- These questions come in the form of “wizards”, which lead the user through a small number of tasks: asking them to select a column and specify how the data is currently formatted, and then they press “Done”!
  10. There are also a few other wizards: The “columns to rows” & “rows to columns” wizards help the user reshape their data in a way that helps us store the data. These are currently the most problematic wizards with regard to the wording and to conveying the benefit or reason behind asking the user to do this. The “blank” values wizard BLANKS out any values in the data that represent “NULL” values. Each dataset has its own way of doing this: I’ve come across dashes, full stops and words like “missing” or “none”. The “codes and symbols” wizard asks the user to replace any codes or symbols with what they actually mean. So for example, in some NHS data, a column was filled with lots of A’s, C’s, D’s and P’s; after googling about, I found out that they actually meant Active, Closed, Dormant and Proposed. So having their actual meaning present in the data is obviously a lot more helpful to people trying to use the data.
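To give a feel for what the “blank” values and “codes and symbols” wizards are doing underneath, here is a minimal Python sketch. It is not the extension’s actual code: the function names and the list of blank markers are assumptions made for illustration, and only the A/C/D/P meanings come from the NHS example above.

```python
# Hypothetical sketch: none of these names come from the LinkedGov extension.

# Placeholder strings that stand in for "no value" (examples from the talk).
BLANK_MARKERS = {"", "-", ".", "missing", "none"}

# Code-to-meaning lookup for one column (the NHS example from the talk).
STATUS_CODES = {"A": "Active", "C": "Closed", "D": "Dormant", "P": "Proposed"}


def blank_out(value):
    """Return None when the value is only a placeholder for a missing value."""
    if value is None or value.strip().lower() in BLANK_MARKERS:
        return None
    return value.strip()


def decode(value, codes):
    """Replace a known code with its meaning; leave anything else untouched."""
    if value is None:
        return None
    return codes.get(value, value)


def clean_column(values, codes):
    """Blank out placeholders first, then expand the remaining codes."""
    return [decode(blank_out(v), codes) for v in values]


cleaned = clean_column(["A", "-", "P", "Missing", " C ", "D", "X"], STATUS_CODES)
print(cleaned)
```

Unknown codes such as “X” are deliberately left alone in this sketch, so nothing is silently guessed on the user’s behalf.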
  11. So, this is what Refine looks like before the extension has been installed… and after the LinkedGov extension is installed. The main addition to the interface being a new panel called the “Typing” panel – which houses the wizards. So, I’ll just walk you through a couple of wizards… Imagine I have some dates in my data and I click on the Date & Time wizard…
  12. The wizard appears and it asks me to select any columns that contain dates… So I select two columns “open date” and “close date” by clicking on their headers…
  13. We ask the user to specify each date part for each column, as the values could be in any combination: year-month-day, year-month, day-month, month-day… You can see the column contains a day, month and year, but in a mixture of formats. You have words, dashes and slashes as separators… which the user doesn’t have to worry about. They then press “Finish” and the magic happens. The values are all formatted properly using the ISO standard, they are also linked to an online definition and breakdown of that specific date, and they are finally stored as machine-readable linked data.
  14. This is the measurements wizard. I select the “Avg. Temp” column. It then asks me to search for a measurement type by typing into a text box, which searches an online database of measurements. I click “Finish” after I’ve found the right measurement, “Celsius”, and then the measurements are stored using their online definition, which comes bundled with Wikipedia-like information such as alternative names, a description or related measurements (e.g. centimeters, meters, kilometers). So not only is the measurement being stored as an actual measurement, but because we’re using an online database to define it, it comes bundled with a lot of other relevant and potentially useful information to the end user.
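The date step can be sketched roughly like this in Python. This is only an illustration under assumptions: I’m assuming “the ISO standard” means ISO 8601 dates (YYYY-MM-DD), the function names are made up, and the sketch only handles full day, month and year values, not the partial combinations.

```python
import re
from datetime import date

# Month names mapped to their numbers, used for values such as "12 March 2011".
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def parse_month(token):
    """Accept a month number, a full month name or a three-letter abbreviation."""
    token = token.lower()
    if token.isdigit():
        return int(token)
    for name, number in MONTHS.items():
        if token == name or token == name[:3]:
            return number
    raise ValueError(f"unrecognised month: {token!r}")


def to_iso(value, order):
    """Format a date as YYYY-MM-DD.

    `order` is what the user told the wizard about the column, for example
    ("day", "month", "year"). Separators are ignored entirely.
    """
    tokens = re.findall(r"[A-Za-z]+|\d+", value)
    if len(tokens) != len(order):
        raise ValueError(f"expected {len(order)} date parts in {value!r}")
    parts = dict(zip(order, tokens))
    return date(int(parts["year"]),
                parse_month(parts["month"]),
                int(parts["day"])).isoformat()


mixed = ["12 March 2011", "12-03-2011", "12/Mar/2011"]
iso_dates = [to_iso(v, ("day", "month", "year")) for v in mixed]
print(iso_dates)
```

Because the user supplies the order of the date parts, the code never has to guess whether “03/04/2011” means March or April.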
  15. Here’s an example of what a machine understands about the data before and after using our extension. After saving a file in spreadsheet software, a machine, at best, only understands that the data is a bunch of columns and rows, containing numbers, words and dates. The ability for machines to understand the data is the magic that powers the question site, the dataset directory and makes linking datasets together a breeze.
  16. After using the wizards, machines are able to understand a little bit more about the data. Now machines have a more in-depth understanding of what the data actually means. The guesswork and inaccuracy are removed when searching and querying the data.
  17. An example of how datasets can link… The red dataset contains latitude/longitudes. The blue dataset contains postcodes and latitude/longitudes. The green dataset contains postcodes and dates. And the orange dataset contains dates and measurements… All four datasets can be linked together by those linkable values. When you’re able to start linking datasets together like this, NEW information is created from a NEWLY acquired sense of UNDERSTANDING of those datasets.
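As a rough illustration of that chain of links, here is a small Python sketch. The records are invented purely for the example and the `link` helper is hypothetical; it is a plain join on a shared field, not how LinkedGov stores or links its data.

```python
def link(left, right, key):
    """Join two lists of records wherever they share the same value for `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Invented sample records, one per coloured dataset on the slide.
red = [{"latlong": (51.5, -0.12), "site": "Clinic 1"}]
blue = [{"latlong": (51.5, -0.12), "postcode": "AB1 2CD"}]
green = [{"postcode": "AB1 2CD", "date": "2011-03-12"}]
orange = [{"date": "2011-03-12", "avg_temp_celsius": 9.5}]

# red -> blue by location, blue -> green by postcode, green -> orange by date.
linked = link(link(link(red, blue, "latlong"), green, "postcode"), orange, "date")
print(linked)
```

The red dataset never mentions a temperature, yet after three links each of its rows carries one. The join only works if the shared values are written identically in every dataset, which is exactly what the typing wizards are for.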
  18. So that’s what the LinkedGov extension is and does. I’ll briefly finish off with what happens to the machine-readable data. Cleaning tasks can now be created for the community – asking them to use their expertise and judgement to correct problematic data. For example, a column may contain cryptic codes that represent types of NHS walk-in-clinics. So a task may be to decode one of these values and replace it with what it actually means.
  19. Here’s a screenshot of an example task. It’s asking the user to try to fix a value that contains two dashes instead of a decimal point. The user has the options to say “Yes I can fix this”, “Refer this to an expert”, “It’s actually fine” etc.
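How such a task might be generated can be sketched in a few lines of Python. This is guesswork for illustration only: I’m assuming the problem values look like “7--25”, and the task fields and function names are hypothetical. The suggestion is only a hint, since the person answering the task still makes the call.

```python
import re

# Assumed shape of the problem value: digits, two dashes, digits (e.g. "7--25").
TWO_DASHES = re.compile(r"^(\d+)--(\d+)$")


def suggest_fix(value):
    """Return the value with the two dashes swapped for a decimal point, or None."""
    match = TWO_DASHES.match(value)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def make_tasks(column):
    """Create one task per suspicious value, leaving the data itself unchanged."""
    tasks = []
    for row, value in enumerate(column):
        suggestion = suggest_fix(value)
        if suggestion is not None:
            tasks.append({"row": row, "value": value, "suggestion": suggestion})
    return tasks


tasks = make_tasks(["12.5", "7--25", "n/a"])
print(tasks)
```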
  20. The question site
  21. The question site is aimed at non-technical users. It allows them to form queries to retrieve data, without requiring any knowledge of query languages. They form the question in a human-readable way, using a mixture of selectable question fragments together with free text input. An example: Give me ALL … GP SURGERIES … in … LONDON…
  22. And finally, the data site.
  23. The data site is targeted at the developer community and is powered by the enriching parts of the data, such as: the datasets’ metadata; what types of data are actually in the datasets (postcodes, dates, measurements); and what they could potentially link to…
  24. So that’s where we are so far. Feedback & questions?