How we built the largest open database of companies in the world




Thursday, 7 June 2012
A simple (huge) goal: an entry (and URI) for every corporate legal entity in the world
       URI is based on the company register ID, meaning it’s open and IP-free




       Also importing public data: trademarks, government spending,
       official registers & gazette notices...




All Openly Licensed, allowing free reuse, even commercially

5 core uses




1. An open identifying system
               URIs can be used as common identifiers among a
               variety of organisations
               Can be used without reference to OpenCorporates
               Because they map to the id issued by the company
               register the corresponding entry in the registry (and
               associated info) can be found, and vice versa
               Fits the new EU Business Vocabulary
               Can even be used for companies in jurisdictions we
               haven’t yet imported
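
               A minimal sketch of the idea above, in Python: the identifier is
               just a jurisdiction code plus the ID issued by the register, so
               it can be built and taken apart again without a lookup. The base
               URL, path layout, function names and sample ID are assumptions
               for illustration.

```python
from urllib.parse import urlparse

# Assumed base URL and path layout, for illustration only.
BASE = "http://opencorporates.com/companies"


def company_uri(jurisdiction_code: str, register_id: str) -> str:
    """Build a URI from a jurisdiction code and the ID the register issued."""
    return f"{BASE}/{jurisdiction_code.lower()}/{register_id}"


def parse_company_uri(uri: str) -> tuple[str, str]:
    """Recover (jurisdiction_code, register_id) from a URI built above."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "companies":
        raise ValueError(f"not a company URI: {uri}")
    return parts[1], parts[2]


# Sample values only; the ID stands in for whatever the register issued.
uri = company_uri("GB", "00102498")
jurisdiction, register_id = parse_company_uri(uri)
```

               Because the mapping runs in both directions, the register entry
               can be found from the URI and the URI from the register entry.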

2. The simple search

               Not to be underestimated
               Massively reduces friction
               (how long will it take you
               to find and search
               multiple jurisdictions)
               Allows what if questions
               Potentially generates
               stories in its own right

3. Source for additional info
               Addresses, filings,
               status, websites...
               Intl trademarks, UK
               govt spending, official
               notices, health & safety
               violations...
               Other IDs: SEC, CAGE,
               etc – allows reverse
               mapping queries, e.g.
               show me the legal entity
               mapped to a CIK code
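
               A reverse mapping query of this kind can be sketched as an index
               keyed on (scheme, value). The records, URIs and ID values below
               are hypothetical, and this is an illustration rather than how
               the site stores the data.

```python
# Hypothetical records: each legal entity carries IDs from other schemes.
companies = [
    {"uri": "/companies/us_de/1111111", "other_ids": {"CIK": "0000000001"}},
    {"uri": "/companies/gb/2222222", "other_ids": {"CAGE": "X0001"}},
]


def build_reverse_index(records):
    """Map every (scheme, value) pair back to the legal entity that holds it."""
    index = {}
    for rec in records:
        for scheme, value in rec["other_ids"].items():
            index[(scheme, value)] = rec["uri"]
    return index


index = build_reverse_index(companies)
# "Show me the legal entity mapped to this CIK code."
match = index.get(("CIK", "0000000001"))
```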
4. Reconciliation
         (matching names to legal entities)

         Clean up messy
         company names
         (& prev names)
         to legal entity,
         and from there
         to other data
         Google Refine
         reconciliation
         service (specific
         to jurisdiction)
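
         As a rough illustration of matching messy names to legal entities,
         the sketch below normalises case, punctuation and a few legal-form
         suffixes, then compares against current and previous names within
         one jurisdiction. The suffix map, register contents and function
         names are hypothetical; a real reconciliation service would need far
         more than exact matching on normalised names.

```python
import re

# Illustrative suffix map only; real legal forms vary by jurisdiction.
SUFFIXES = {
    "limited": "ltd",
    "incorporated": "inc",
    "corporation": "corp",
    "company": "co",
}


def normalise(name: str) -> str:
    """Lowercase, strip punctuation and canonicalise common suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


def reconcile(messy_name, register):
    """Return the IDs whose current or previous names match after normalising."""
    key = normalise(messy_name)
    return [
        company_id
        for company_id, names in register.items()
        if key in {normalise(n) for n in names}
    ]


# Hypothetical register for one jurisdiction: ID -> current and previous names.
register = {
    "gb/0000001": ["Acme Widgets Limited", "Acme Trading Co. Ltd"],
    "gb/0000002": ["Bolt & Nut Incorporated"],
}

current = reconcile("ACME WIDGETS LTD.", register)
previous = reconcile("Acme Trading Company Limited", register)
```

         Once a name resolves to a legal entity, the entity's identifier is
         the key to the other data attached to it.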

5. The platform

               API: allows all
               information to be
               retrieved as data,
               even searches
               Users can now
               add data too
               Coming soon: the
               option to match
               data to
               companies
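
               To show what retrieving a search as data might look like, the
               sketch below only builds a request URL and makes no network
               call. The endpoint and parameter names are assumptions for
               illustration and should be checked against the API
               documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names, for illustration only.
API_BASE = "https://api.opencorporates.com/companies/search"


def search_url(query, jurisdiction_code=None):
    """Build a search URL, optionally restricted to one jurisdiction."""
    params = {"q": query}
    if jurisdiction_code:
        params["jurisdiction_code"] = jurisdiction_code
    return f"{API_BASE}?{urlencode(params)}"


url = search_url("barclays bank", "gb")
```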
New feature: directors/officers

         We’ve just started importing & indexing company
         directors & officers, allowing search by name, &
         finding links between them and other companies

How have we done it?
         1. Started small,
         with just three
         countries and
         3 million
         companies
         2. Increasingly
         using official
         sources, where
         this is possible (i.e.
         the company
         registers are open
         and make data
         available)

How have we done it?
          3. Leveraged the
          open data
          community and
          ScraperWiki to
          scrape company
          registers around
          the world
          4. Worked with
          governments to
          help understand
          the problems – EU,
          World Bank, G20
          Financial Stability
          Board, etc

The technology
         Vanilla, commodity open-source software, hosted on our
         own UK-based servers

         Database     MySQL (but considering PostgreSQL)
         Search       Solr (but considering ElasticSearch)
         Code         Ruby (Ruby on Rails main app, Sinatra API,
                      vanilla Ruby for various internal libraries)
         Webserver    Nginx (webserver) + Memcached (caching)
                      + Redis (queue + persistence)

How do we pay for all this?


               Unlike many open data projects, we’re a for-profit
               company – the open data movement needs successful
               companies if it’s going to have a diverse ecosystem
               But we’re a company whose business model is
               dependent on making more data open, and we have an
               advisory board to make sure we do the right thing
               Not yet looking for customers, but...


How do we pay for all this?
         Two projected sources of income

               Services model, especially around data cleansing and
               reconciliation. Of course, you can use our API and
               reconciliation service without asking us, but it may be
               cheaper to pay us to do it. Ditto custom extracts and
               verticals
               Dual-licence model – contribute back to the community
               either with data, or financial support, e.g. if you have a
               proprietary database you may not want to be bound by
               the share-alike attribution restrictions
               And we already have some (small) customers

The problems




         Getting the data: Company registers have forgotten their
         main role is as public record, and actively work to prohibit
         free and open access to the data
The problems




         Understanding the data: Language, legal and cultural
         issues, not to mention the complexity of the subject
The problems




         Normalising the data: How do we abstract company
         types, status, industry codes, addresses, etc?
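
         One possible way to abstract a field such as status is a lookup from
         (jurisdiction, raw value) to a small shared vocabulary, keeping the
         raw value alongside so nothing is lost. The mapping entries and the
         vocabulary below are hypothetical, not the scheme actually used.

```python
# Hypothetical mapping from (jurisdiction, raw status) to a shared vocabulary.
STATUS_MAP = {
    ("gb", "active"): "active",
    ("gb", "dissolved"): "inactive",
    ("us_de", "good standing"): "active",
    ("us_de", "void"): "inactive",
}


def normalise_status(jurisdiction, raw_status):
    """Return the shared status plus the untouched raw value from the register."""
    key = (jurisdiction, raw_status.strip().lower())
    return {"normalised": STATUS_MAP.get(key, "unknown"), "raw": raw_status}


result = normalise_status("us_de", " Good Standing ")
```

         Unmapped values fall through to "unknown" rather than being guessed,
         which makes gaps in the mapping visible.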
W3C Business Vocabulary

               What are
               we doing?
               Why are we
               doing it?
               What does
               it mean?
               Where is it
               going?

The problems




         Handling the data: Over 150 million rows in some tables
         (slow schema changes), heavy reading and writing,
         evolving understanding of the problems and solutions
Now over 43 million companies in 50 jurisdictions
Including 23 US states





More Related Content

Similar to EDF2012 Chris Taggart - How the biggest Open Database of Companies was built

Open Data 4 Startups
Open Data 4 StartupsOpen Data 4 Startups
Open Data 4 Startups
CSI Piemonte
 
Open data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival TorinoOpen data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival Torino
mzaglio
 
Lessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas TribuneLessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas Tribune
Elise Hu-Stiles
 
You rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LODYou rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LOD
Mateja Verlic
 
M12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part TwoM12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part Two
MER Conference
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
Varad Meru
 
Code sharing at MediaEval
Code sharing at MediaEvalCode sharing at MediaEval
Code sharing at MediaEval
Adam Rae
 
CII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud ComputingCII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud ComputingAnand Deshpande
 
Looking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SMELooking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SME
smespire
 
Learn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data CollectionLearn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data Collection
IQPC Exchange
 
Automated indexing - Hyland Onbase
Automated indexing - Hyland OnbaseAutomated indexing - Hyland Onbase
Automated indexing - Hyland Onbase
AMS Imaging
 
Productivity Future Vision
Productivity Future VisionProductivity Future Vision
Productivity Future Vision
Micro Focus SRL
 
Open Data for Transportation Agencies
Open Data for Transportation AgenciesOpen Data for Transportation Agencies
Open Data for Transportation Agencies
Novavia Solutions
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
Amanda Gray
 
Open data: what's in it for business?
Open data: what's in it for business?Open data: what's in it for business?
Open data: what's in it for business?
Chris Taggart
 
Website Usability | Class 1
Website Usability | Class 1Website Usability | Class 1
Website Usability | Class 1studiokandm
 
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Jari Koister
 
Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...
Peter Wells
 
ORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple RepositoriesORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple RepositoriesEDINA, University of Edinburgh
 

Similar to EDF2012 Chris Taggart - How the biggest Open Database of Companies was built (20)

Open Data 4 Startups
Open Data 4 StartupsOpen Data 4 Startups
Open Data 4 Startups
 
Open data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival TorinoOpen data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival Torino
 
Lessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas TribuneLessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas Tribune
 
You rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LODYou rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LOD
 
Story spaces pitch
Story spaces pitchStory spaces pitch
Story spaces pitch
 
M12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part TwoM12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part Two
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
 
Code sharing at MediaEval
Code sharing at MediaEvalCode sharing at MediaEval
Code sharing at MediaEval
 
CII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud ComputingCII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud Computing
 
Looking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SMELooking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SME
 
Learn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data CollectionLearn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data Collection
 
Automated indexing - Hyland Onbase
Automated indexing - Hyland OnbaseAutomated indexing - Hyland Onbase
Automated indexing - Hyland Onbase
 
Productivity Future Vision
Productivity Future VisionProductivity Future Vision
Productivity Future Vision
 
Open Data for Transportation Agencies
Open Data for Transportation AgenciesOpen Data for Transportation Agencies
Open Data for Transportation Agencies
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
 
Open data: what's in it for business?
Open data: what's in it for business?Open data: what's in it for business?
Open data: what's in it for business?
 
Website Usability | Class 1
Website Usability | Class 1Website Usability | Class 1
Website Usability | Class 1
 
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
 
Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...
 
ORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple RepositoriesORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple Repositories
 

More from European Data Forum

EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
European Data Forum
 
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
European Data Forum
 
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
European Data Forum
 
EDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro PresentationEDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro Presentation
European Data Forum
 
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
European Data Forum
 
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
European Data Forum
 
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
European Data Forum
 
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
European Data Forum
 
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
European Data Forum
 
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
European Data Forum
 
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
European Data Forum
 
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
European Data Forum
 
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
European Data Forum
 
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
European Data Forum
 
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
European Data Forum
 
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
European Data Forum
 
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
European Data Forum
 
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
European Data Forum
 
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
European Data Forum
 

More from European Data Forum (20)

EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
 
Barbato leit ict 15-16-17
Barbato leit ict 15-16-17Barbato leit ict 15-16-17
Barbato leit ict 15-16-17
 
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
 
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
 
EDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro PresentationEDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro Presentation
 
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
 
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
 
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
 
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
 
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
 
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
 
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
 
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
 
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
 
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
 
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
 
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
 
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
 
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
 
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
 

Recently uploaded

Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
EDF2012 Chris Taggart - How the biggest Open Database of Companies was built

  • 1. How we built the largest open database of companies in the world Thursday, 7 June 2012
  • 2. A simple (huge) goal: an entry (and URI) for every corporate legal entity in the world. URI is based on the company register ID, meaning it’s open and IP-free. Also importing public data – trade marks, government spending, official registers & gazette notices...
  • 3. All Openly Licensed, allowing free reuse, even commercially
  • 4. 5 core uses
  • 5. 1. An open identifying system: URIs can be used as common identifiers among a variety of organisations. Can be used without reference to OpenCorporates. Because they map to the ID issued by the company register, the corresponding entry in the registry (and associated info) can be found, and vice versa. Fits the new EU Business Vocabulary. Can even be used for companies in jurisdictions we haven’t yet imported.
  • 6. 2. The simple search: Not to be underestimated. Massively reduces friction (how long will it take you to find and search multiple jurisdictions?). Allows what-if questions. Potentially generates stories in its own right.
  • 7. 3. Source for additional info: Addresses, filings, status, websites... Intl trademarks, UK govt spending, official notices, health & safety violations... Other IDs: SEC, CAGE, etc – allows reverse-mapping queries, e.g. show me the legal entity mapped to a CIK code.
  • 8. 4. Reconciliation (matching names to legal entities): Clean up messy company names (& previous names) to a legal entity, and from there to other data. Google Refine reconciliation service (specific to jurisdiction).
  • 9. 5. The platform. API: allows all information to be retrieved as data, even searches. Users can now add data too. Coming soon: the option to match data to companies.
  • 10. New feature: directors/officers. We’ve just started importing & indexing company directors & officers, allowing search by name, & other resources finding links between them and other similarly named companies.
  • 11. How have we done it? 1. Started small, with just three countries and 3 million companies. 2. Increasingly using official sources, where this is possible (i.e. the company registers are open and make data available).
  • 12. How have we done it? 3. Leveraged the open data community and ScraperWiki to scrape company registers around the world. 4. Worked with governments to help understand the problems – EU, World Bank, G20 Financial Stability Board, etc.
  • 13. The technology: Vanilla, commodity open-source software, hosted on our own UK-based servers. Database: MySQL (but considering PostgreSQL). Search: Solr (but considering ElasticSearch). Code: Ruby (RubyOnRails main app, Sinatra API, vanilla Ruby for various internal libraries). Webserver: Nginx (webserver) + Memcached (caching) + Redis (queue + persistence).
  • 14. How do we pay for all this? Unlike many open data projects, we’re a for-profit company – the open data movement needs successful companies if it’s going to have a diverse ecosystem. But we’re a company whose business model is dependent on making more data open, with an advisory board to make sure we do the right thing. Not yet looking for customers, but...
  • 15. How do we pay for all this? Two projected sources of income. Services model, especially around cleansing data/reconciliation: of course, you can use our API and reconciliation service without asking us, but it may be cheaper to pay us to do it. Ditto custom extracts, and verticals. Dual-licence model: contribute back to the community either with data or financial support, e.g. if you have a proprietary database you may not want to be bound by the share-alike attribution restrictions. And we already have some (small) customers.
  • 16. The problems: Getting the data. Company registers have forgotten their main role is as public record, and actively work to prohibit free and open access to the data.
  • 17. The problems: Understanding the data. Language, legal and cultural issues, not to mention the complexity of the subject.
  • 18. The problems: Normalising the data. How do we abstract company types, status, industry codes, addresses, etc?
  • 19. W3C Business Vocabulary: What are we doing? Why are we doing it? What does it mean? Where is it going?
  • 20. The problems: Handling the data. Over 150 million rows in some tables (slow schema changes), heavy reading and writing, evolving understanding of the problems and solutions.
  • 21. Now over 43 million companies in 50 jurisdictions, including 23 US states.
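
As a rough illustration of slides 5 and 8, the sketch below shows how an identifier URI might be derived from a jurisdiction code plus the register-issued company number, and how a messy name might be reconciled to that URI. It is not taken from the talk: the URI pattern, the suffix table, the function names and the tiny register extract are all assumptions made for the example, and the matching is far simpler than a real reconciliation service would need.

```python
import re

# Assumption: URIs follow <base>/companies/<jurisdiction code>/<register ID>.
BASE_URI = "http://opencorporates.com/companies"

# Hypothetical table of legal-form words treated as equivalent when matching.
SUFFIXES = {"limited": "ltd", "incorporated": "inc",
            "corporation": "corp", "company": "co"}


def company_uri(jurisdiction_code, company_number):
    """Build an identifier URI from the jurisdiction and the register-issued ID."""
    return f"{BASE_URI}/{jurisdiction_code.lower()}/{company_number}"


def parse_company_uri(uri):
    """Reverse mapping: recover (jurisdiction code, register ID) from a URI."""
    jurisdiction_code, company_number = uri.rstrip("/").split("/")[-2:]
    return jurisdiction_code, company_number


def normalise_name(name):
    """Lower-case, drop punctuation and collapse common legal-form words.

    ASCII-only simplification: accented letters are discarded here.
    """
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


def reconcile(messy_name, jurisdiction_code, register):
    """Return the URI of the entity whose current or previous name matches."""
    target = normalise_name(messy_name)
    for company_number, names in register.get(jurisdiction_code, {}).items():
        if any(normalise_name(n) == target for n in names):
            return company_uri(jurisdiction_code, company_number)
    return None


# Made-up extract: jurisdiction -> company number -> [current name, previous names]
REGISTER = {"gb": {"01234567": ["Example Widgets Limited", "Acme Widgets Ltd"]}}

if __name__ == "__main__":
    uri = reconcile("ACME WIDGETS, LTD.", "gb", REGISTER)
    print(uri)
    print(parse_company_uri(uri))
```

Because the URI is built only from the jurisdiction and the register's own ID, anyone holding those two values can construct or decode it without calling a central service, which is the property slide 5 relies on.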