Relationship Discovery
Across Public Data Webinar
Ontotext, 2016
Presentation Outline
• Use cases: Relation discovery and Media monitoring
• FactForge-News open-data playground
• Relationship Discovery Examples
• Media Monitoring Examples
• Panama Papers and Global Legal Entity Identifier as Open Data
• Mapping Datasets to DBpedia with the GraphDB Lucene Connector
• Tracing Panama Papers entities in the news
Relationship Discovery Webinar Aug 2016
Relation Discovery Case
• Find suspicious relationships like:
− Company in USA controls
− Another company in USA
− Through a company in an off-shore zone
• Show news relevant to them
Linking News to Big Knowledge Graphs
• The DSP platform links text to knowledge graphs
• One can navigate from news to concepts, entities and topics, and from there to other news
Try it at
http://now.ontotext.com
Semantic Media Monitoring
For each entity:
• Popularity trends
• Relevant news
• Related entities
• Knowledge graph information
Try it at
http://now.ontotext.com
Our approach to Big Data
1. Integrate relevant data from many sources
− Build a Big Knowledge Graph from proprietary databases and taxonomies, integrated with millions of facts from Linked Data
2. Infer new facts and unveil relationships
− Perform reasoning across data from different sources
3. Interlink text with big data
− Use text mining to automatically discover references to concepts and entities
4. Use a NoSQL graph database for metadata management, querying and search
FF-NEWS: Data Integration and Loading
• DBpedia (the English version only) 496M statements
• Geonames (all geographic features on Earth) 150M statements
− owl:sameAs links between DBpedia and Geonames 471K statements
• Company registry data (GLEI) 3M statements
• Panama Papers DB (#LinkedLeaks) 20M statements
• News metadata (from NOW) 145M statements
• Total size: 1 026M statements
− Mapped to FIBO; 724M explicit statements + 302M inferred statements
News Metadata
• Metadata from Ontotext’s Dynamic Semantic Publishing platform
− Automatically generated as part of the NOW.ontotext.com semantic news showcase
• News stream from Google since Feb 2015, about 10k news/month
− ~70 tags (annotations) per news article
• Tags link text mentions of concepts to the knowledge graph
− Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases
News Metadata
Category                    Count
International              52 074
Science and Technology     23 201
Sports                     20 714
Business                   15 155
Lifestyle                  11 684
Total                     122 828

Mentions / entity type      Count
Keyphrase               2 589 676
Organization            1 276 441
Location                1 260 972
Person                  1 248 784
Work                      309 093
Event                     258 388
RelationPersonRole        236 638
Species                   180 946
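The two tables above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the counts shown on the slide; it assumes nothing about how the counts were produced. Note that the mention table may not list every entity type, so the resulting average can sit below the "~70 tags per article" quoted earlier.

```python
# Counts copied from the two tables above.
articles_by_category = {
    "International": 52_074,
    "Science and Technology": 23_201,
    "Sports": 20_714,
    "Business": 15_155,
    "Lifestyle": 11_684,
}
mentions_by_type = {
    "Keyphrase": 2_589_676,
    "Organization": 1_276_441,
    "Location": 1_260_972,
    "Person": 1_248_784,
    "Work": 309_093,
    "Event": 258_388,
    "RelationPersonRole": 236_638,
    "Species": 180_946,
}

# Sum each table and derive the average number of listed mentions per article.
total_articles = sum(articles_by_category.values())
total_mentions = sum(mentions_by_type.values())
mentions_per_article = total_mentions / total_articles

print(total_articles)
print(total_mentions)
print(round(mentions_per_article, 1))
```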
Class Hierarchy Map (by number of instances)
Left: The big picture
Right: dbo:Agent class (2.7M organizations and persons)
Sample queries at http://ff-news.ontotext.com
F1: Big cities in Eastern Europe
F2: Airports near London
F3: People and organizations related to Google
F4: Top-level industries by number of companies
Available as Saved Queries at http://ff-news.ontotext.com/sparql
Note 1: Open Saved Queries with the folder icon in the upper-right corner
Note 2: FF-NEWS is still in beta testing, but it is available to play with
Offshore control example
Query: Find companies that control other companies in the same country through a company in an off-shore zone
How it works:
1. Establish the control relationship
2. Establish a company-to-country mapping that is good enough for the purpose
3. Establish an “off-shore criterion”
4. SPARQL it
Off-shore company control example
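# The onto:, fibo-fnd-rel-rel: and ff-map: prefixes are assumed to be declared or predefined in the repository
SELECT *
FROM onto:disable-sameAs    # GraphDB pseudo-graph: do not expand owl:sameAs equivalents in the results
WHERE {
    # Two-step control chain: c1 controls c2, which controls c3
    ?c1 fibo-fnd-rel-rel:controls ?c2 .
    ?c2 fibo-fnd-rel-rel:controls ?c3 .
    # c1 and c3 are in the same country ...
    ?c1 ff-map:orgCountry ?c1_country .
    ?c3 ff-map:orgCountry ?c1_country .
    # ... while the intermediary c2 is in a different one
    ?c2 ff-map:orgCountry ?c2_country .
    FILTER (?c1_country != ?c2_country)
    # The intermediary's country is flagged as an off-shore zone
    ?c2_country ff-map:hasOffshoreProvisions true .
}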
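To make the graph pattern concrete, here is a minimal Python sketch of the same logic over in-memory data. It is not how GraphDB evaluates the query; the company names, countries and the offshore set are invented toy data, and the function name offshore_chains is hypothetical.

```python
# Hypothetical toy data; company names and countries are invented for illustration.
controls = {
    ("AcmeUS", "ShellCo"),
    ("ShellCo", "WidgetsUS"),
    ("AcmeUS", "MapleCo"),
    ("MapleCo", "GadgetsUS"),
    ("EuroHold", "EuroSub"),
}
org_country = {
    "AcmeUS": "USA", "WidgetsUS": "USA", "GadgetsUS": "USA",
    "ShellCo": "Panama", "MapleCo": "Canada",
    "EuroHold": "Germany", "EuroSub": "Germany",
}
offshore_countries = {"Panama"}


def offshore_chains(controls, org_country, offshore_countries):
    """Return (c1, c2, c3) where c1 controls c2 controls c3, c1 and c3 share
    a country, and c2 sits in a different country flagged as off-shore."""
    # Index: controller -> set of directly controlled companies.
    controlled_companies = {}
    for controller, controlled in controls:
        controlled_companies.setdefault(controller, set()).add(controlled)

    chains = []
    for c1, c2 in controls:
        for c3 in controlled_companies.get(c2, ()):
            country1 = org_country.get(c1)
            country2 = org_country.get(c2)
            country3 = org_country.get(c3)
            if None in (country1, country2, country3):
                continue  # skip companies without a known country
            if (country1 == country3 and country1 != country2
                    and country2 in offshore_countries):
                chains.append((c1, c2, c3))
    return sorted(chains)


print(offshore_chains(controls, org_country, offshore_countries))
```

In this toy data the chain through MapleCo is excluded because Canada is not in the offshore set, which mirrors the role of the hasOffshoreProvisions pattern in the query.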
Semantic Media Monitoring/Press-Clipping
• We can trace references to a specific company in the news
− This is fairly standard, but we can also handle syntactic variations in names, because we use state-of-the-art Named Entity Recognition technology
− More importantly, we correctly determine whether a given mention of “Paris” refers to Paris (the capital of France), Paris in Texas, Paris Hilton or Paris (the Greek hero)
• We can trace and consolidate references to daughter companies
• We have a comprehensive industry classification
− It is the one from DBpedia, refined to accommodate identifier variations and specialization (e.g. a company classified as dbr:Bank is also considered classified as dbr:FinancialServices)
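The specialization rule in the last bullet can be illustrated with a small sketch. The dbr:Bank to dbr:FinancialServices link comes from the slide; the dbr:Investment_banking entry and the function name expand_industries are assumptions added for illustration, and the real refinement is done in the repository rather than in Python.

```python
# Child industry -> broader industry. Only the dbr:Bank link is from the slide;
# the dbr:Investment_banking entry is a hypothetical extra level.
broader = {
    "dbr:Bank": "dbr:FinancialServices",
    "dbr:Investment_banking": "dbr:Bank",
}


def expand_industries(industry, broader):
    """Return the industry followed by all of its broader industries."""
    chain = []
    current = industry
    # Walk up the hierarchy; the membership check guards against cycles.
    while current is not None and current not in chain:
        chain.append(current)
        current = broader.get(current)
    return chain


print(expand_industries("dbr:Bank", broader))
print(expand_industries("dbr:Investment_banking", broader))
```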
Media Monitoring Queries
F5: Mentions in the news of an organization and its related entities
F7: Most popular companies per industry, including children
F8: Regional exposition of company – normalized
News Popularity Ranking: Automotive
Rank  Company                    News #    Rank  Company incl. mentions of child companies  News #
1     General Motors             2722      1     General Motors                             4620
2     Tesla Motors               2346      2     Volkswagen Group                           3999
3     Volkswagen                 2299      3     Fiat Chrysler Automobiles                  2658
4     Ford Motor Company         1934      4     Tesla Motors                               2370
5     Toyota                     1325      5     Ford Motor Company                         2125
6     Chevrolet                  1264      6     Toyota                                     1656
7     Chrysler                   1054      7     Renault-Nissan Alliance                    1332
8     Fiat Chrysler Automobiles  1011      8     Honda                                      864
9     Audi AG                    972       9     BMW                                        715
10    Honda                      717       10    Takata Corporation                         547
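The difference between the two rankings is a roll-up of child-company mentions to the parent. A minimal sketch of such a roll-up follows, assuming each article is counted once per parent even when it mentions several of its children. The article IDs and mention sets are invented toy data, and the helper names are hypothetical; the actual FF-NEWS queries work over the knowledge graph.

```python
from collections import Counter

# Toy parent links and article mentions, for illustration only.
parent_of = {
    "Chevrolet": "General Motors",
    "Audi AG": "Volkswagen Group",
}
mentions = {
    "a1": {"General Motors"},
    "a2": {"Chevrolet"},
    "a3": {"General Motors", "Chevrolet"},
    "a4": {"Audi AG"},
    "a5": {"Tesla Motors"},
}


def ultimate_parent(company, parent_of):
    """Follow parent links to the top of the chain (with a cycle guard)."""
    seen = set()
    while company in parent_of and company not in seen:
        seen.add(company)
        company = parent_of[company]
    return company


def rollup_counts(mentions, parent_of):
    """Return (direct, rolled_up) article counts per company."""
    direct = Counter()
    rolled_up = Counter()
    for companies in mentions.values():
        direct.update(companies)
        # A set ensures an article counts once per parent, however many
        # of that parent's children it mentions.
        rolled_up.update({ultimate_parent(c, parent_of) for c in companies})
    return direct, rolled_up


direct, rolled_up = rollup_counts(mentions, parent_of)
print(direct.most_common())
print(rolled_up.most_common())
```

Counting distinct articles rather than adding the per-company counts is one plausible reason why a rolled-up figure need not equal the sum of the parent's and children's individual counts.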
News Popularity: Finance
Rank  Company            News #    Rank  Company incl. mentions of controlled  News #
1     Bloomberg L.P.     3203      1     Intra Bank                            261667
2     Goldman Sachs      1992      2     Hinduja Bank (Switzerland)            49731
3     JP Morgan Chase    1712      3     China Merchants Bank                  38288
4     Wells Fargo        1688      4     Alphabet Inc.                         22601
5     Citigroup          1557      5     Capital Group Companies               4076
6     HSBC Holdings      1546      6     Bloomberg L.P.                        3611
7     Deutsche Bank      1414      7     Exor                                  2704
8     Bank of America    1335      8     Nasdaq, Inc.                          2082
9     Barclays           1260      9     JP Morgan Chase                       1972
10    UBS                694       10    Sentinel Capital Partners             1053
Note: Including investment funds, stock exchanges, agencies, etc.
News Popularity: Banking
Rank  Company            News #    Rank  Company incl. mentions of controlled  News #
1     Goldman Sachs      996       1     China Merchants Bank *                38288
2     JP Morgan Chase    856       2     JP Morgan Chase                       1972
3     HSBC Holdings      773       3     Goldman Sachs                         1030
4     Deutsche Bank      707       4     HSBC                                  966
5     Barclays           630       5     Bank of America                       771
6     Citigroup          519       6     Deutsche Bank                         742
7     Bank of America    445       7     Barclays                              681
8     Wells Fargo        422       8     Citigroup                             630
9     UBS                347       9     Wells Fargo                           428
10    Chase              126       10    UBS                                   347
Relations extracted from text
Subject                          Object                                         Count
dbr:Chrysler                     dbr:Fiat_Chrysler_Automobiles                  455
dbr:NASA                         dbr:Goddard_Space_Flight_Center                69
dbr:Time_Warner_Cable            dbr:Comcast                                    44
dbr:National_Football_League     dbr:New_England_Patriots                       40
dbr:DirecTV                      dbr:AT&T                                       33
dbr:Alcatel-Lucent               dbr:Nokia                                      31
dbr:AOL                          dbr:Verizon_Communications                     30
dbr:University_of_Pennsylvania   dbr:Perelman_School_of_Medicine_at_... UPEN    29
dbr:Time_Warner_Cable            dbr:Charter_Communications                     27
dbr:Continental_Airlines         dbr:United_Airlines                            26
Note: relation types "RelationOrganizationAffiliatedWithOrganization", "RelationAcquisition" and "RelationMerger"
Global Legal Entity Identifier (GLEI) data
• Global Markets Entity Identifier (GMEI) Utility data
− The GMEI utility is DTCC's legal entity identifier solution, offered in collaboration with SWIFT
− We downloaded it as an XML data dump from https://www.gmeiutility.org/
• RDF-ized company records
− Fields: LEI#, legal name, ultimate parent, registered country
− 3M explicit statements for 211 thousand organizations
▪ For comparison, there are 490 000 organizations in DBpedia, and D&B covers more than 200 million
− 10 821 ultimate parent relationships and 1 632 ultimate parents
• 2 800 organizations from the GLEI dump mapped to DBpedia
GLEI Company Data Sample: ABN-AMRO
lei:businessRegistry "Kamer van Koophandel"^^xsd:string
lei:businessRegistryNumber "34334259"^^xsd:string
lei:duplicateReference data:549300T5O0D0T4V2ZB28
lei:entityStatus "ACTIVE"^^xsd:string
lei:headquartersCity "Amsterdam"^^xsd:string
lei:headquartersState "Noord-Holland"^^xsd:string
lei:legalForm "NAAMLOZE VENNOOTSCHAP"^^xsd:string
lei:legalName "ABN AMRO Bank N.V."^^xsd:string
lei:lei "BFXS5XCH7N0Y05NIXW11"^^xsd:string
lei:registeredCity "Amsterdam"^^xsd:string
lei:registeredCountry "NL"^^xsd:string
lei:registeredPostCode "1082 PP"^^xsd:string
lei:registeredState "Noord-Holland"^^xsd:string
Global Legal Entity Identifier (GLEI) data
Rank  Ultimate parent                    Children  Country
1     The Goldman Sachs Group, Inc.      1 851     US
2     United Technologies Corporation    427       US
3     Honeywell International Inc.       341       US
4     Morgan Stanley                     228       US
5     Cargill, Incorporated              217       US
6     1832 Asset Management L.P.         202       CA
7     Aegon N.V.                         174       NL
8     Union Bancaire Privée, UBP SA      138       CH
9     Citigroup Inc.                     135       US
10    State Street Corporation           128       US

Rank  Country               Companies
1     dbr:United_States     103 548
2     dbr:Canada            17 425
3     dbr:Luxembourg        13 984
4     dbr:Sweden            7 934
5     dbr:United_Kingdom    7 421
6     dbr:Belgium           6 868
7     dbr:Ireland           4 762
8     dbr:Australia         4 385
9     dbr:Germany           3 039
10    dbr:Netherlands       2 561
Offshore Leaks Database from ICIJ
• Published by the International Consortium of Investigative Journalists (ICIJ) on 9 May 2016
• A “searchable database” of about 320 000 offshore companies
− 214 000 extracted from the Panama Papers (valid until 2015)
− More than 100 000 from the 2013 Offshore Leaks investigation (valid until 2010)
• A CSV extract from a graph database is available for download
• https://offshoreleaks.icij.org/
Offshore Leaks DB as Linked Open Data
• Ontotext published the Offshore Leaks DB as Linked Open Data
• Available for exploration, querying and download at
http://data.ontotext.com
• ONTOTEXT DISCLAIMERS
We use the data as is provided by ICIJ. We make no representations and warranties of any kind,
including warranties of title, accuracy, absence of errors or fitness for particular purpose. All
transformations, query results and derivative works are used only to showcase the service and
technological capabilities and not to serve as basis for any statements or conclusions.
Enrichment and structuring of the data
• Relationship type hierarchy
− About 80 relationship types in the original dataset were organized into a property hierarchy
• Classification of officers into Person and Company
− The original database provides no way to tell whether an officer is a physical person or a company
• Mapping to DBpedia:
− 209 countries referred to in the Offshore Leaks DB are mapped to DBpedia
− About 3000 persons and 300 companies mapped to DBpedia
• Overall size of the repository: 22M statements (20M explicit)
The RDF-ization Process
• Linked data variant produced without programming
− The raw CSV files are RDF-ized using TARQL, http://tarql.github.io/
− Data was further interlinked and enriched in GraphDB using SPARQL
• The process is documented in this README file
• All relevant artifacts are open-source, available at
https://github.com/Ontotext-AD/leaks/
• The entire publishing and mapping effort took about 15 person-days
− Including data.ontotext.com portal setup, promotion, documentation, etc.
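For readers unfamiliar with RDF-ization, the sketch below shows the general idea of turning CSV rows into triples. It is plain Python, not TARQL, and it is not the mapping used for #LinkedLeaks: the column names (node_id, name, country_code), the http://example.org/leaks/ namespace and the function names are all assumptions made for illustration.

```python
import csv
import io

BASE = "http://example.org/leaks/"  # hypothetical namespace, not the real one

# Hypothetical CSV extract; the column names are assumptions.
csv_text = """node_id,name,country_code
1001,"ACME ""GLOBAL"" LTD",PAN
1002,Widgets Inc,
"""


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def rows_to_ntriples(csv_text):
    """Emit one name triple per row, plus a country triple when a code is present."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<" + BASE + "entity/" + row["node_id"] + ">"
        name = escape_literal(row["name"])
        lines.append(subject + " <" + BASE + "name> \"" + name + "\" .")
        code = row["country_code"]
        if code:  # empty cells produce no triple
            lines.append(subject + " <" + BASE + "country> <" + BASE + "country/" + code + "> .")
    return lines


for line in rows_to_ntriples(csv_text):
    print(line)
```

TARQL expresses the same kind of row-to-triple mapping declaratively, as a SPARQL CONSTRUCT query over the CSV columns, which is why the slide can describe the process as done without programming.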
Sample queries at http://data.ontotext.com
Q1: Countries by number of entities related to them
Q2: Country pairs by ownership statistics
Q3: Statistics by incorporation year
Q4: Officers and entities by number of capital relations
Q5: Countries in Eastern Europe by number of owners
Q6: Intermediaries in Asia by name
Q7: The best connected officers
Q8: Countries by number of Person and Company officers
Mapping datasets to DBpedia
• The task: map people, organizations and locations to IDs in DBpedia
− This lets us analyze the original data with the help of the extra information available in DBpedia and in other datasets linked to it, e.g. GeoNames
− For instance, #LinkedLeaks contains no extra information about the companies, such as industry sector or controlling and controlled companies
• Specific conditions: we had to map by names
− Other than names, the information about the entities in the source datasets couldn’t help the mapping
▪ Address and country attributes are present, but they proved only marginally useful for mapping
− In both cases we mapped locations only at the country level, not at finer granularity
▪ For this purpose DBpedia geographic data is sufficient, and it is also well mapped to GeoNames
Mapping datasets to DBpedia (2)
• We used the GraphDB connector to Lucene for these mappings
− Using the GraphDB connector, a Lucene index was created for Organizations and People from DBpedia, indexing all sorts of names, descriptions and other textual information for each entity
− The mapping process consists mostly of using the name of the entity from the third-party dataset (in this case Panama Papers or GLEI) as a full-text search (FTS) query, embedded in a SPARQL query
• What does Lucene do better than SPARQL?
− When there is little information other than the name, we benefit from Lucene's free-text indexing, because it deals well with minor syntactic variations and sorts the results by relevance
− When mapping 300 000 organizations against another 500 000 organizations without a key, a SPARQL query has to consider 300 000 × 500 000 candidate pairs, which is slower than running 300 000 Lucene queries
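The cost argument can be illustrated with a toy inverted index. The sketch below is not Lucene and does not use the GraphDB connector: it is a stdlib stand-in that looks up candidates by shared name tokens and ranks them by Jaccard overlap, which is a much simpler score than Lucene's relevance ranking. The entity IDs, names and function names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Pairs a naive name-by-name comparison would have to consider.
pairwise_comparisons = 300_000 * 500_000
print(pairwise_comparisons)

# Hypothetical target entities (ID -> name).
names = {
    "dbr:ABN_AMRO": "ABN AMRO Bank",
    "dbr:Deutsche_Bank": "Deutsche Bank",
    "dbr:Goldman_Sachs": "Goldman Sachs",
}


def tokens(name):
    """Lowercase alphanumeric tokens, so punctuation and case do not matter."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def build_index(names):
    """Inverted index: token -> set of entity IDs whose name contains it."""
    index = defaultdict(set)
    for entity_id, name in names.items():
        for tok in tokens(name):
            index[tok].add(entity_id)
    return index


def best_match(query, names, index):
    """Score only entities sharing a token with the query; return (id, score)."""
    query_tokens = tokens(query)
    candidates = set()
    for tok in query_tokens:
        candidates |= index.get(tok, set())
    scored = []
    for entity_id in candidates:
        entity_tokens = tokens(names[entity_id])
        score = len(query_tokens & entity_tokens) / len(query_tokens | entity_tokens)
        scored.append((score, entity_id))
    if not scored:
        return None
    score, entity_id = max(scored)
    return entity_id, score


index = build_index(names)
print(best_match("ABN AMRO Bank N.V.", names, index))
```

Each lookup touches only the entities that share at least one token with the query name, instead of all 500 000 targets, which is the intuition behind running one index query per source entity.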
#LinkedLeaks Mapping Queries
Companies mapped by industry
Companies mapped in the Finance sector
Politicians mapped
Available as Saved Queries at http://ff-news.ontotext.com/sparql
Note 1: Open Saved Queries with the folder icon in the upper-right corner
Note 2: FF-NEWS is still in beta testing, but it is available to play with
Tracing Panama Papers entities in the news
• After mapping #LinkedLeaks entities to DBpedia identifiers, we can load them, together with the mappings, into the FF-NEWS repository
• This way we have in a single repository, mapped to one another: #LinkedLeaks data, DBpedia and news metadata
• We can make queries like: Give me news mentions of entities which appear in the Panama Papers dataset
• This way the mapping enables media monitoring at no extra cost
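The query idea above is essentially a join through the DBpedia identifier. A minimal sketch with invented data follows; the leaks: and news: identifiers, the dbr: names and the function name are hypothetical, and in FF-NEWS the join is expressed in SPARQL over the repository rather than in Python.

```python
from collections import defaultdict

# Hypothetical mapping from #LinkedLeaks entities to DBpedia identifiers.
leaks_to_dbpedia = {
    "leaks:officer/1": "dbr:Person_A",
    "leaks:entity/2": "dbr:Company_B",
}
# Hypothetical news metadata: (article, mentioned DBpedia entity).
news_mentions = [
    ("news:1", "dbr:Company_B"),
    ("news:2", "dbr:Person_A"),
    ("news:3", "dbr:Company_C"),
    ("news:4", "dbr:Company_B"),
]


def news_for_leaks_entities(leaks_to_dbpedia, news_mentions):
    """For each mapped leaks entity, list the articles mentioning its DBpedia ID."""
    articles_by_entity = defaultdict(list)
    for article, entity in news_mentions:
        articles_by_entity[entity].append(article)
    return {
        leaks_id: articles_by_entity.get(dbpedia_id, [])
        for leaks_id, dbpedia_id in leaks_to_dbpedia.items()
    }


print(news_for_leaks_entities(leaks_to_dbpedia, news_mentions))
```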
Thank you!
Experience the technology with NOW: Semantic News Portal
http://now.ontotext.com
Play with open data at
http://data.ontotext.com and http://ff-news.ontotext.com

Gain Super Powers in Data Science: Relationship Discovery Across Public Data

