School of Data - Mapping OpenCorporates Networks Using OpenRefine and Gephi


Published in: Business, Technology

  • Some slide prompts to support a data framing investigation around corporate data – originally prepared for the OGP Festival, London, October 2013. For more information, contact:
  • These notes provide a worked example of how to download company ownership relationship data from OpenCorporates using the cross-platform data cleaning tool OpenRefine, and then visualise the data using the cross-platform network visualisation tool Gephi.
  • OpenCorporates is a private company that has set itself the ambitious task of building a database of registered company information for every legal corporate entity in the world. One of the views OpenCorporates offers over at least some of the data in its database shows how companies are connected by beneficial ownership or shareholder relationships. Although complex, this diagram is “human readable” – the data is presented in a way that is intended to make some sort of meaningful sense to us.
  • But as well as publishing data for us humans to read, OpenCorporates also makes data available in a way that machines can read - machine readable data. You may have heard of the term “API” in the context of data publishing websites. To all intents and purposes, an API is an interface that computers can use to get information out of websites in a way that they, and the databases they work with, can understand. The data is published in a format known as JSON – JavaScript Object Notation. But you don’t really need to know much more than that – just that it’s called JSON, and tools that can parse and work with JSON can parse and work with the data that the OpenCorporates API publishes.
  • If you aren’t a programmer, here’s a way of getting the data out of OpenCorporates and into a tabular form you may be more comfortable with, and which we can use to generate a network diagram to display in a tool such as Gephi… You can download the OpenRefine application from its website. When you run it on your computer, it will launch an application that runs inside a browser tab using your default web browser.
  • We can get company ownership data (subsidiary relations, major shareholdings, etc.) from OpenCorporates by hacking the web address/URL of a company page on OpenCorporates. From a company page on OpenCorporates, add the following to the end of the web address/URL: /network.json?depth=2 to give something of the form JURISDICTION/COMPANY_ID/network.json?depth=2. (Note that company network data may not be available in all jurisdictions or for all companies.)
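
The URL hack described above can be sketched in a few lines of Python. This is only an illustration: the function name network_url and the example.org address are placeholders, not part of OpenCorporates, and you would substitute the address of a real company page.

```python
def network_url(company_url: str, depth: int = 2) -> str:
    """Append the network.json suffix to a company page URL."""
    # Drop any trailing slash so we do not end up with "//network.json".
    base = company_url.rstrip("/")
    return f"{base}/network.json?depth={depth}"


# Placeholder address: replace it with the URL of a real company page.
print(network_url("https://example.org/JURISDICTION/COMPANY_ID"))
```

The depth parameter is passed through unchanged; the notes only use depth=2, so other values are untested here.
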
  • In OpenRefine, select the option to Create [a new] Project using the web address – or URL – to the JSON data page that reveals the data relating to the corporate ownership network of the company we are interested in on OpenCorporates. Note that you can import data into OpenRefine from several web addresses all in one go, though the data returned from each URL should have the same format or structure. Using multiple URLs results in a combined data set, which can be quite handy.
  • Being machine readable, the data makes more sense to OpenRefine than it probably does to us! Select a block of data in the preview view that is typical of a set of data that you want to map into a single row in a “traditional” spreadsheet-like view. Data blocks are typically contained within braces (curly brackets); these things: { } Note that in some machine readable data, some data blocks may be contained within other data blocks… Each of the items in a single data block can be mapped into a separate cell – that is, a separate column – in a single row of data. So each data block is a row, and each item in the block is a column… OpenRefine will give you a preview of how the data will look if you click the right button!
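
For readers who do program, the "each data block is a row, each item is a column" idea might look something like the sketch below. The sample JSON and its field names (relationships, parent_name, child_name, type) are invented for illustration; the real OpenCorporates response will be structured and named differently.

```python
import json

# Invented sample data, NOT the real OpenCorporates format.
SAMPLE = """
{"relationships": [
  {"parent_name": "Parent Co", "child_name": "Child One Ltd", "type": "subsidiary"},
  {"parent_name": "Parent Co", "child_name": "Child Two Ltd", "type": "shareholding"}
]}
"""


def blocks_to_rows(blocks):
    """Map a list of JSON objects (data blocks) to a header and table rows."""
    # The columns are the union of all keys, in first-seen order.
    header = []
    for block in blocks:
        for key in block:
            if key not in header:
                header.append(key)
    # Each block becomes one row; missing items become empty cells.
    rows = [[block.get(key, "") for key in header] for block in blocks]
    return header, rows


header, rows = blocks_to_rows(json.loads(SAMPLE)["relationships"])
print(header)
for row in rows:
    print(row)
```
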
  • You can preview the effect of making particular block selections using Update Preview. To return to the block highlighter, use ‘Pick Record Nodes’. When you are happy with your selection, you are ready to “Create Project”.
  • Once we’re happy with the data preview, we can import the data into a more familiar looking layout. The arrows at the top of each column pop up menus that allow us to run a wide variety of operations on a column. One of the operations lets us change the column name, so I’m going to rename the child company and parent company columns to what Gephi expects: Source and Target.
  • This is the format that Gephi wants to see when we import data from a simple two column, comma-separated values (CSV) text file. One of the columns needs to be called Source, another needs to be called Target. When constructing the network diagram, Gephi then knows to draw a line going from each Source element to the corresponding Target.
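
As a rough illustration of that two column format, the following sketch writes a list of (source, target) pairs as CSV text using Python's standard csv module. The function name edges_to_csv and the company names are made up for the example, and treating the parent as the Source is just one of the two possible choices.

```python
import csv
import io


def edges_to_csv(edges):
    """Serialise (source, target) pairs as CSV text with a Source,Target header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    # Names that contain commas are quoted automatically by the csv module.
    writer.writerows(edges)
    return buffer.getvalue()


# Illustrative names only; here the parent is the Source and the child the Target.
edges = [("Parent Co", "Child One Ltd"), ("Parent Co", "Child Two Ltd")]
print(edges_to_csv(edges))
```
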
  • The OpenCorporates network data in tabulated form. The default column names are not necessarily as human readable as they could be! In particular, we can identify the name of the parent company and the child company for each ownership relation. We also have access to the OpenCorporates IDs for all of those companies. The type of relationship between the companies is also described. For the moment, we will treat them all equally. (If you want to view just those company connections that relate to a particular type of relation, use the Facet or Text Filter tool applied to the appropriate column.)
  • From the appropriate column menu, select “Edit Column” and then “Rename this column” to change the column name.
  • We can now export the data using the Custom Tabular Exporter. Deselect all the columns then select just the Source and Target columns – we will only export data from these two columns.
  • Preview your data to check that it looks like the sort of data you expect to export. From the Download tab, select the CSV output type and export your data – it should be saved into the default download directory used by your browser, with a file name that corresponds to the OpenRefine project name. You should now have the two column data saved to your computer, ready to load into Gephi.
  • Gephi is a powerful cross-platform desktop tool for visualising data that describes networks, such as social networks or corporate ownership networks. You can import data into Gephi using specialised graph/network representation formats, or from simple two column data files where each row describes a simple connection between two elements (e.g. thing1, thing2 would say that thing1 connects to thing2). You can download the Gephi application from its website. When you run it on your computer, it will launch a desktop application. Note that Gephi requires Java – if you are on a Mac, you may need to download and install Java yourself.
  • Launch Gephi (download it first if you don’t already have it installed) and select Data Laboratory. If the Data Table toolbar is empty, go to the application’s File menu and select ‘New Project’. A new project will be created and you should see several toolbar options appear in the Data Table.
  • Load the data in using the “Import Spreadsheet” tool option. Make sure that you select Edges table as the table type. If your data file does not have Source and Target column names, an error will occur and you will not be able to import the data file. (In such a case, you could always open the file in a text editor, change the column names in the file, save it, and try again. Alternatively, go into OpenRefine, change the column names there, and re-export the custom tabulated data…)
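
If you would rather check the file before trying the import, a small sketch like this tests whether the first row of some CSV text contains both column names. The function name has_edge_headers is hypothetical, and the exact-capitalisation match is an assumption about what Gephi requires.

```python
import csv
import io


def has_edge_headers(csv_text):
    """Return True if the first CSV row contains both Source and Target."""
    reader = csv.reader(io.StringIO(csv_text))
    # An empty file has no header row at all.
    header = next(reader, [])
    return {"Source", "Target"} <= set(header)


print(has_edge_headers("Source,Target\nParent Co,Child One Ltd\n"))
print(has_edge_headers("parent,child\nParent Co,Child One Ltd\n"))
```

To check a saved file, read its contents into a string first and pass that string to the function.
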
  • The final stage of the import gives some additional information about how uploaded data will be treated. Because we are simply loading in data that describes how one company (identified by its name) is connected to another company (also identified by its name), we need to get Gephi to automatically create a node each time it sees a new company (as identified by its company name…).
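
The idea of creating a node each time a new company name is seen can be sketched as follows. This illustrates the principle only and is not how Gephi implements it; the function name nodes_from_edges is made up for the example.

```python
def nodes_from_edges(edges):
    """List each distinct company name once, in the order first seen."""
    nodes = []
    seen = set()
    for source, target in edges:
        for name in (source, target):
            # Only create a node the first time a name appears.
            if name not in seen:
                seen.add(name)
                nodes.append(name)
    return nodes


edges = [("Parent Co", "Child One Ltd"), ("Parent Co", "Child Two Ltd")]
print(nodes_from_edges(edges))
```

Because the company name is acting as the node ID, two different companies that share a name would be merged into a single node.
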
  • When the data is imported, we can preview it, either by looking at a list of nodes that have been created, or ‘edges’ – that is, connections between two companies.
  • So now let’s see where we can start to view this data as a network visualisation. Click on the top palette Overview button to get an overview of the network in visual form. This is the area where we can interactively visualise the network.
  • The default Overview layout has three main areas. In the middle is the canvas where we can see the current layout of the network; along the left hand side of the central panel are several tools for operating on the elements shown on the canvas; along the bottom of the central panel are several tools for controlling how text labels are displayed. To the left are several tools for manipulating what the network looks like: tools for laying out the network (that is, positioning the nodes) automatically, as well as colouring and sizing the nodes. To the right are several tools that allow us to analyse and process the graph (that is, the mathematical structure that defines the network); for example, we can run various statistics on the network, or filter the nodes that are displayed according to one or more specified criteria.
  • Let’s start by laying out the network. There are several layout tools provided by default (you can install more from the Tools->Plugins menu) which each have slightly different behaviours and can be differently effective at laying out networks with different sorts of structure. A couple of good all-round layout algorithms are ForceAtlas2 and Yifan Hu. If you imagine connected nodes held together by springs, you can think of these layout tools as trying to position the nodes so that the springs are stretched as little as possible. Sort of.
  • At the moment, we don’t know what each node represents. By default, when labels are switched on, Gephi looks for a label column value associated with a node and displays that. But we can also display other values. In this case, we are using a company name as the node ID, so we can select id as the element to display when we switch labels on. Click on the clipboard icon on the toolbar at the bottom of the screen to raise the label selector. To actually switch labels on, click on the leftmost/darkest T button on the toolbar at the bottom of the screen. The slider on the right controls the text label size.
  • We can also change the size of labels proportional to the size of a node – but how do we size nodes? Whilst it is possible to load in data that describes various attributes associated with each node (for example, in the case of a company node it might be the turnover or profit in the last financial year), we can also generate information about each node based on various network properties. For example, the degree of a node says how many connections it has with other nodes. Where connections are ‘directed’ – that is, represented by arrows – the number of arrows that leave a node is referred to as the out-degree of the node, and the number of arrows that come into a node as the in-degree.
  • We can use the Average Degree statistic tool to calculate the degree, in-degree and out-degree values for each node. We can then use these values as the basis for sizing the nodes in the network visualisation.
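
To make the degree definitions concrete, here is a minimal sketch that counts in-degree, out-degree and total degree from a list of directed (source, target) pairs. It mirrors the definitions given above rather than Gephi's own code, and the node names are placeholders.

```python
from collections import Counter


def degrees(edges):
    """Return in-degree, out-degree and degree for every node in a directed edge list."""
    # Arrows leaving a node count towards its out-degree...
    out_degree = Counter(source for source, _ in edges)
    # ...and arrows arriving at a node count towards its in-degree.
    in_degree = Counter(target for _, target in edges)
    nodes = set(out_degree) | set(in_degree)
    return {
        node: {
            "in": in_degree[node],
            "out": out_degree[node],
            "degree": in_degree[node] + out_degree[node],
        }
        for node in nodes
    }


edges = [("A", "B"), ("A", "C"), ("C", "B")]
for node, values in sorted(degrees(edges).items()):
    print(node, values)
```
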
  • Here we have sized the nodes by Degree. The min and max size parameters can be set as required to scale the size of the nodes.
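
One way to picture what the min and max size parameters do is as a linear rescaling of the degree values into the chosen size range. The sketch below assumes a straight linear interpolation, which may not be exactly what Gephi does; the function name and default sizes are illustrative.

```python
def scale_sizes(values, min_size=10.0, max_size=50.0):
    """Linearly map each node's value into the range [min_size, max_size]."""
    lowest = min(values.values())
    highest = max(values.values())
    if highest == lowest:
        # Every node has the same value, so give them all the minimum size.
        return {node: min_size for node in values}
    span = highest - lowest
    return {
        node: min_size + (value - lowest) / span * (max_size - min_size)
        for node, value in values.items()
    }


print(scale_sizes({"A": 1, "B": 3, "C": 5}))
```
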
  • We can set the label size so that it is proportional to the node size – from the black/dark A label on the toolbar at the bottom of the screen, select the [proportional to] Node Size menu option.
  • As well as tools for generating grand-scale layouts, there are also layout tools for tweaking a particular layout. The Expansion tool just stretches (or shrinks) the layout in the x and y directions. This can be good for just putting a bit of space into a layout. The Label Adjust tool juggles nodes so that their labels don’t overlap. Note that this tool may move some nodes quite a distance compared to their neighbours and so may upset any meaningful spatial relationships obtained using the other layout tools.
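
The stretching done by an expansion step can be sketched as scaling every node position away from (or towards) a fixed point. This sketch assumes the fixed point is the centroid of the layout, which is a guess rather than a documented Gephi behaviour; the function name and scale factor are illustrative.

```python
def expand(positions, factor=1.2):
    """Scale (x, y) positions about their centroid; factor < 1 shrinks the layout."""
    xs = [x for x, _ in positions.values()]
    ys = [y for _, y in positions.values()]
    centre_x = sum(xs) / len(xs)
    centre_y = sum(ys) / len(ys)
    # Relative positions are preserved; only the spacing changes.
    return {
        node: (centre_x + (x - centre_x) * factor, centre_y + (y - centre_y) * factor)
        for node, (x, y) in positions.items()
    }


print(expand({"A": (0.0, 0.0), "B": (2.0, 0.0)}, factor=2.0))
```
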
  • We can colour and size nodes according to a wide range of properties obtained from running various network statistics. As you work with network data more and more, you start to get a feel for which tools to use to help you look for particular patterns, structures and stories within the data. But that is a tutorial for another day…
  • We can use various tools in concert to tweak the layout of the network. In this example, I have: sized the nodes by degree; set the label sizes proportional to the node size (and hence to degree); tweaked the scale using the text-size slider; used the Authority value (obtained via the HITS statistic) to colour the nodes; laid out the network using a ForceAtlas2 algorithm, a bit of Expansion and a dash of Label Adjust.
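
For the curious, the Authority and Hub values from HITS can be approximated with a simple repeated-update loop: a node's authority is the sum of the hub scores of nodes pointing at it, and its hub score is the sum of the authority scores of the nodes it points to. The sketch below is a textbook-style version with scores normalised to sum to 1; Gephi's implementation may normalise or stop differently, so the numbers need not match.

```python
def hits(edges, iterations=50):
    """Approximate HITS authority and hub scores for a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes that point to this node.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # Hub: sum of the authority scores of the nodes this node points to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        # Normalise so each set of scores sums to 1 (guarding against empty graphs).
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(new_hub.values()) or 1.0
        auth = {node: value / auth_total for node, value in auth.items()}
        hub = {node: value / hub_total for node, value in new_hub.items()}
    return auth, hub


authority, hubs = hits([("A", "C"), ("B", "C")])
print(authority)
print(hubs)
```

In this toy network C is pointed to by both A and B, so it takes all of the authority score, while A and B share the hub score equally.
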
  • If you want to know more, contact us…

    1. Mapping Corporate Networks - Intro
    2. A two-part recipe for downloading company ownership data from OpenCorporates using OpenRefine, and then visualising it with Gephi
    3.
    4. How to grab the data using OpenRefine. Visit the OpenRefine website to download the application
    5. Where’s the data? Add /network.json?depth=2 to the end of the web address
    6. URL of the form: JURISDICTION/COMPANY_ID/network.json?depth=2
    7. What data block makes a row?
    8. Create project; toggle selection and preview
    9. Nicely tabulated data
    10. What Gephi Expects…
    11. Child; Parent
    12. What Gephi Expects… Parent -> Source; Child -> Target (You may find the network analyses work better if you use the parent as the Target and the child as the Source…)
    13. How to visualise the data using Gephi. Visit the Gephi website to download the application
    14. Getting Started with Gephi
    15. Import as Edges table
    16. Colour/Size; View; Stats/Filters; Layout; Label tools
    17. “Spacing”
    18. Label display selector; Turn labels on; Label size
    19. A matter of degree… Example nodes: Degree 2 (In-degree 2, Out-degree 0); Degree 3 (In-degree 0, Out-degree 3); Degree 3 (In-degree 1, Out-degree 2)
    20. Size by degree… Calculate in-degree and out-degree; Set node size; The color wheel/palette is used to colour the nodes.
    21. Label Sizing
    22. Tweaking the layout: “Expand” the layout (stretch it in two dimensions); “Adjust” the labels so that they don’t overlap - may change relative position of nodes
    23. Network Stats: HITS – Authority and Hub values: authoritative nodes are pointed to, hub nodes point to others; Measure the ‘influence’ of a node in the network
    24. Use the tools in concert… Colour based on Authority (HITS statistic); Label adjust tweaks the layout so we can read the labels; Fine tune label sizing using text-size slider
    25.