Visualizing Relationships: Journalistic Problems in a Digital Age
Summary
1. Introduction
2. The problem we are solving
3. Involved issues
4. Problems we found
5. The Challenge
© Copyright 2014. 3Pillar | All rights reserved Strictly Confidential

WHO ARE WE?
• Mariano Blejman is a technology editor
and youth editor at the Argentine newspaper
Página/12, and a co-founder of Hacks/Hackers
Buenos Aires. @blejmanevel

• Marcos Vanetta is a biomedical engineer,
software developer at 3Pillar Global, and
hacker at Hacks/Hackers Buenos Aires.
@malev

HACKS/HACKERS BUENOS AIRES

THE PROBLEM
• 1976: A dictatorship began in Argentina.
• 30,000 people were kidnapped and disappeared.
• 1985: The first trials took place in Argentina. They judged the
bad guys, but then the trials had to stop.
• 2003: The courts started judging the bad guys again.
• 2012: A large number of judicial documents had accumulated.

No one can read all of them.

INVOLVED ISSUES
• Semantic Analytics

• Ontology
• Data Mining
• Social Network Analysis

• Visualizations

Who was already dealing with documents?

DocumentCloud, Overview, Open Calais, NLTK, GATE

FIRST APPROACH
Read all the documents

A software solution based on regular expressions,
built with Ruby, Padrino and a MySQL database.

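require 'docsplit'
require 'tmpdir'

# Extracts a document's plain text with Docsplit (OCR disabled) and cleans it.
def self.extract_plain_text(path)
  # File name without directory or final extension, e.g. "a.b.pdf" -> "a.b"
  basename = File.basename(path, '.*')
  tmp_dir = Dir.tmpdir
  Docsplit.extract_text(path, :output => tmp_dir, :ocr => false)
  # File.read closes the file handle, unlike File.open(...).read
  text = File.read(File.join(tmp_dir, "#{basename}.txt"))
  self.clean_text(text)
end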
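
To give a sense of what a regular-expression approach to the extracted text might look like, here is a minimal sketch. It is not the project's actual code: the two patterns and the `extract_dates` and `extract_names` helpers are hypothetical, and it assumes dates are written as day/month/year.

```ruby
require 'date'

# Hypothetical pattern: numeric dates such as 24/03/1976 (day/month/year).
DATE_PATTERN = /\b(\d{1,2})\/(\d{1,2})\/(\d{4})\b/

# Hypothetical pattern: two or more consecutive capitalized words,
# a crude stand-in for proper names.
NAME_PATTERN = /(?<!\p{L})\p{Lu}\p{Ll}+(?:\s\p{Lu}\p{Ll}+)+/

# Returns a Date for every match that is a real calendar date.
def extract_dates(text)
  text.scan(DATE_PATTERN).map do |day, month, year|
    y, m, d = year.to_i, month.to_i, day.to_i
    Date.new(y, m, d) if Date.valid_date?(y, m, d)
  end.compact
end

# Returns each distinct run of capitalized words found in the text.
def extract_names(text)
  text.scan(NAME_PATTERN).uniq
end
```

On a sentence like "On 24/03/1976 Juan Perez was detained in Buenos Aires." the name pattern returns both the person and the place, which hints at why entity extraction and co-reference resolution turn out to be hard with regular expressions alone.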

THE PROBLEMS WE FOUND
• Converting PDF files to text
• Extracting entities from documents
• Parsing dates and addresses
• Resolving name co-references
• How to store relations
• Contextual information about documents
• Confidence in data on a crowdsourcing platform
• Visualizing relationships over time
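
Several of these problems (storing relations, keeping the source document as context, tracking confidence, and looking at relationships over time) can be pictured with one small data structure. The sketch below is only an illustration under assumptions: the `Relation` record, its fields, and the `relations_on` helper are hypothetical and are not taken from mapa76 or analice.me.

```ruby
require 'date'

# Hypothetical record: one relation between two entities, the period it
# covers, the document it came from, and a confidence score in 0.0..1.0.
Relation = Struct.new(:subject, :kind, :object, :from, :to, :source, :confidence)

# Relations in effect on a given date that meet a minimum confidence.
# A nil end date is treated as "still open".
def relations_on(relations, date, min_confidence = 0.0)
  relations.select do |r|
    r.from <= date && (r.to.nil? || date <= r.to) && r.confidence >= min_confidence
  end
end

relations = [
  Relation.new('Person A', :detained_at, 'Place X',
               Date.new(1976, 5, 1), Date.new(1976, 9, 30), 'doc-1.pdf', 0.9),
  Relation.new('Person B', :worked_at, 'Place X',
               Date.new(1976, 1, 1), nil, 'doc-2.pdf', 0.4)
]

# Only the high-confidence relation is active on this date.
relations_on(relations, Date.new(1976, 6, 15), 0.5).map(&:subject)
# => ["Person A"]
```

Filtering by date gives one "frame" of a timeline view, and the confidence threshold is one possible way to handle crowdsourced data of uneven quality.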

WHAT DO WE HAVE NOW?

Prototype for a single (and local) use case: mapa76
Platform for different use cases: analice.me
THE VISUALIZATIONS THAT WE IMAGINED

THE VISUALIZATIONS THAT WE FOUND

THE #MOZFEST CHALLENGE
Find a big journalistic issue that involves:

• Lots of documents with unstructured data
• Lots of data to find inside them
• Relationships that you want to find

