2. SOURCE
2.6 TB of data: relational databases, emails,
various banking documents and company records
relating to the 215,000 offshore companies
that were clients of the Panamanian legal services
firm Mossack Fonseca between 1977 and 2015.
3. THE PROCESS
1. Acquire documents
2. Classify documents
a. Scan / OCR - Tesseract
b. Extract document metadata - Apache Tika https://tika.apache.org
3. Whiteboard domain
a. Determine entities and their relationships
b. Determine potential entity and relationship properties
c. Determine sources for those entities and their properties
4. Work out analyzers, rules, parsers and named entity recognition for documents - Apache Solr, Blacklight
http://projectblacklight.org, Nuix https://www.nuix.com
5. Parse and store document metadata and document and entity relationships - Talend
http://www.talend.com
a. Parse by author, named entities, dates, sources and classification
6. Infer entity relationships
7. Compute similarities, transitive cover and triangles
8. Analyze data using graph queries and visualizations - Neo4j, Linkurious http://linkurio.us
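Step 7 can be illustrated with a minimal sketch in plain Python. It is not the tooling used in the project: the edge list, the node names and the function names are hypothetical, "transitive cover" is read here as transitive closure (reachability), and Jaccard overlap of neighbours stands in for whatever similarity measure was actually applied.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def triangles(adj):
    """Return every set of three mutually connected nodes, each exactly once."""
    found = set()
    for node, neighbours in adj.items():
        for u, v in combinations(sorted(neighbours), 2):
            if v in adj[u]:
                found.add(frozenset((node, u, v)))
    return found


def reachable(adj, start):
    """Nodes reachable from start over any number of hops (start excluded)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen - {start}


def jaccard(adj, a, b):
    """Neighbourhood overlap of two nodes, between 0 and 1."""
    union = adj[a] | adj[b]
    return len(adj[a] & adj[b]) / len(union) if union else 0.0


# Hypothetical sample data, for illustration only.
EDGES = [
    ("officer_a", "company_x"),
    ("officer_b", "company_x"),
    ("officer_a", "officer_b"),
    ("officer_b", "company_y"),
    ("officer_c", "company_z"),
]
ADJ = build_adjacency(EDGES)
```

On this sample, `triangles(ADJ)` finds the single ring formed by `officer_a`, `officer_b` and `company_x`, while `reachable(ADJ, "officer_a")` also picks up `company_y`, which is two hops away. In a graph database the same questions would normally be asked as graph queries rather than computed in application code.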
4. ENTITIES
• Clients
• Companies
• Addresses
• Officers (both natural people
and companies)
5. RELATIONSHIPS
• (:Officer)-[:is officer of]->(:Company)
• (:Officer)-[:registered address]->(:Address)
• (:Client)-[:registered]->(:Company)
• (:Officer)-[:has similar name and address]->(:Officer)
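The last relationship is inferred rather than read from the source data. One way it could be derived is sketched below, assuming fuzzy string matching on normalised names and addresses. The slides do not say how similarity was computed, so the `difflib`-based measure, the 0.85 threshold, the function names and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(text):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity ratio of two strings after normalisation, between 0 and 1."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def similar_officers(officers, threshold=0.85):
    """Return (id_a, id_b) pairs whose name and address both match closely."""
    pairs = []
    for id_a, id_b in combinations(sorted(officers), 2):
        a, b = officers[id_a], officers[id_b]
        if (similarity(a["name"], b["name"]) >= threshold
                and similarity(a["address"], b["address"]) >= threshold):
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical records, not taken from the leaked data.
OFFICERS = {
    "o1": {"name": "John A. Smith", "address": "12 Harbour Road, Panama City"},
    "o2": {"name": "JOHN A SMITH", "address": "12 Harbour Rd., Panama City"},
    "o3": {"name": "Maria Gonzalez", "address": "4 Calle 50, Panama City"},
}
```

Each pair returned by `similar_officers(OFFICERS)` would become one `has similar name and address` edge between two `Officer` nodes. Comparing every pair is quadratic in the number of officers, so a data set of this size would need some form of blocking (for example, comparing only officers that share a postcode or surname) before a comparison like this is practical.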
8. FLEXIBLE DATA MODEL
New entities:
Documents: E-Mail, PDF, Contract, DB-Record, …
Money Flow: Accounts / Banks / Intermediaries
New relationships:
Family / business ties
Conversations
Peer Groups / Rings
Similar Roles
Mentions / Topic-Of
Money Flow
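The flexibility comes from the property graph model itself: labels and relationship types are just values attached to nodes and edges, so new kinds can be added without changing a schema. The toy in-memory class below is a sketch of that idea only. It is not Neo4j's API, and the class, the relationship type names and the sample records are hypothetical.

```python
class PropertyGraph:
    """Tiny in-memory property graph; labels and relationship types are plain strings."""

    def __init__(self):
        self.nodes = {}   # node id -> {"label": str, "props": dict}
        self.edges = []   # (source id, relationship type, target id, props)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, rel_type, target, **props):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges.append((source, rel_type, target, props))

    def labels(self):
        """All node labels currently in use, sorted alphabetically."""
        return sorted({node["label"] for node in self.nodes.values()})

    def neighbours(self, node_id, rel_type):
        """Targets reached from node_id over edges of the given type."""
        return [t for s, r, t, _ in self.edges if s == node_id and r == rel_type]


graph = PropertyGraph()

# Initial model: officers and companies.
graph.add_node("o1", "Officer", name="Example Officer")
graph.add_node("c1", "Company", name="Example Holdings Ltd.")
graph.add_edge("o1", "IS_OFFICER_OF", "c1")

# Later extension: documents and money flow, added with no schema change.
graph.add_node("d1", "Email", subject="Example subject")
graph.add_node("b1", "Bank", name="Example Bank")
graph.add_edge("d1", "MENTIONS", "o1")
graph.add_edge("c1", "MONEY_FLOW", "b1", amount=1000)
```

After the extension, `graph.labels()` lists `Bank` and `Email` alongside the original `Company` and `Officer`, and the earlier nodes and edges are untouched.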
9. DISCOVERY
Once the database was set up, it was a simple
matter to install and configure Linkurious to
essentially provide a GUI (graphical user interface)
atop the database. Having the visual depiction of
the graph of names and addresses was critical in
making sense of the data, especially for non-
technical reporters.