Uncharted specializes in creating visual analytic solutions for a variety of industries.
Today I’ll be drawing on elements from a few of the projects we’re working on, namely:
PanTera for large multidimensional and geo data, and
Influent for big graph data.
I’ll also be pulling from related research and open source initiatives being developed as part of a DARPA program for reasoning over big data.
There are many reasons to visualize node-link graph data. Often the need stems from assessing believability or trust in an ML computed answer, or seeking context for it, looking for nuance or patterns that may have been missed. But beyond that, what does visual analysis entail?
Frequently we’re looking for clusters and communities; groups of nodes that are closely connected and may suggest similarity or affinity. We also look for the relationships / structures between nodes and communities …
In this map of science, what is it that connects organic chemistry to international studies?
Ultimately we’re creating a spatial representation or a 2d map of a complex dataset. This helps us, as spatial thinkers, build mental models of how the data is organized. We use our well-developed visual perception to see anomalies and patterns that may beg further questions or lead to new insights.
As the volume of data that we’re collecting and analyzing grows to millions of nodes or billions of edges, actually seeing our graphs becomes increasingly problematic:
The computational cost of laying out and rendering the graph becomes prohibitive.
Layouts often result in an unresolved hairball that reveals little.
The size and cost limit our interactions.
And ineffective labeling makes it hard to understand the nature of communities and relationships.
Let’s look at a graph of Twitter users who tweeted about the Chelsea FC soccer team during the 2014 season.
Demo:
We start by side-stepping two of the issues that plague big graphs – computation cost of layout and hairball results. We do this by using the actual geo-locations of the nodes (the users) as our layout. This is visualizing a graph as a map in the most literal sense and focuses on spatial communities and patterns.
The arcs we see are Twitter mentions between users and are drawn in a clockwise fashion to show link directionality. At this level we get an immediate sense of the high-level geo structure of the data – a large community in England but also high engagement in Spain and West Africa.
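One way to picture the clockwise arc convention is as a curved link whose bend always falls on the same side of the direction of travel, so that a mention from A to B and a mention from B to A curve in opposite directions. The sketch below is only an illustration of that idea and not the renderer used in the demo: the function name, the `curvature` parameter, and the choice of a quadratic Bezier curve in y-up coordinates are all assumptions.

```python
def clockwise_arc_control(src, dst, curvature=0.25):
    """Control point for a quadratic Bezier from src to dst that bends clockwise.

    Assumes y-up (mathematical) coordinates; on a y-down screen the sign of
    the offset would need to flip. `curvature` is a hypothetical tuning knob.
    """
    (x0, y0), (x1, y1) = src, dst
    dx, dy = x1 - x0, y1 - y0
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # A clockwise path keeps the circle's centre on its right-hand side, so
    # the curve bulges along the left-hand normal (-dy, dx) of the chord.
    return (mx - curvature * dy, my + curvature * dx)


# The two directions of the same link bend to opposite sides.
outbound = clockwise_arc_control((0.0, 0.0), (2.0, 0.0))
inbound = clockwise_arc_control((2.0, 0.0), (0.0, 0.0))
```

With this convention a reader can tell the direction of a link from its shape alone, without arrowheads, which matters when many thousands of links are aggregated.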
To address render time and interactivity costs, we use a heatmap aggregate view of all links and show an appropriate level of detail at each scale. Adopting a Google Maps paradigm, we serve only the data necessary for a given view, keeping render speed high in the browser.
So at this level our focus is on link patterns, and many are immediately obvious - such as a high one-sided volume of traffic from Guatemala to Spain, and from England to West Africa. And I can drill down from the global to the local level, revealing patterns at different scales.
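To make the map-style tiling concrete, here is a minimal sketch of how geo-located points could be assigned to tiles and counted for a heatmap. It assumes the common Web Mercator "slippy map" tile addressing; the talk does not say which scheme the demo uses, and the function names and sample coordinates are illustrative only.

```python
import math
from collections import Counter


def lonlat_to_tile(lon, lat, zoom):
    """Map a longitude/latitude pair to (x, y) tile indices at a zoom level.

    Assumes Web Mercator tile addressing with tile (0, 0) at the top-left.
    """
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    lat_rad = math.radians(lat)
    y = (1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n
    # Clamp so that points on the far edge still land in a valid tile.
    return min(int(x), n - 1), min(int(y), n - 1)


def heat_tile_counts(points, zoom):
    """Count how many points fall into each tile at the given zoom level."""
    return Counter(lonlat_to_tile(lon, lat, zoom) for lon, lat in points)


# Approximate coordinates, for illustration only.
london = (-0.13, 51.51)
accra = (-0.19, 5.60)
coarse = heat_tile_counts([london, london, accra], zoom=0)
finer = heat_tile_counts([london, london, accra], zoom=3)
```

At zoom 0 everything collapses into a single tile, while at zoom 3 the two cities separate. A client that requests only the tiles covering its current viewport never needs to download the full dataset.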
But we also need to address labeling - ways of understanding the character of the communities we see (beyond geography). Here is a layer showing player mention sentiment by region.
Here is another with trending topics that may help characterize the communities we see. In this case we see a very specific hashtag, “we need west cumberland hospital”, common to the users in this region. Drilling down reveals a single-day campaign around a local issue.
Here we’ve plotted the same graph but laid out based on network structure. Many of the elements remain the same as the geo layout: directional arc links and approach to rendering. At this zoom level some of the most prominent communities are outlined and labeled.
Just like zooming in on a map reveals finer details of cities and streets, zooming here reveals additional communities and labels.
Spatial proximity between clusters helps show us strength of relationship and the heatmap of links gives us a sense of structure.
We approach the labeling issue by layering in additional information to help understand the nature of our clusters. In this case we have a cluster centered around BBC Sports that also has strong mention of the soap opera “Coronation Street”. Manchester United appears as a common theme in this region of the graph.
Technical Approach:
We start with raw source data and run it through a series of steps to produce a set of web-map-style data tiles for consumption by the client.
We first detect communities in the graph. We apply a distributed Louvain Modularity algorithm, which starts with the raw graph and detects increasingly higher-level community structures. We address the typically large computational costs of this by running it on the distributed computing platform Apache Spark with its GraphX library.
We then take these communities and run a hierarchical graph layout algorithm, again using Spark. This starts by laying out the highest-level aggregate community graph with a distributed force-directed layout. Once the top-level communities are positioned, the next level down is laid out within the constraints of the parent communities, and so on down to the raw nodes.
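Louvain-style methods work by greedily improving a modularity score, which rewards community assignments that have many internal links relative to what node degrees alone would predict. The sketch below only computes that score for a small undirected, unweighted graph on a single machine. It is not the distributed Spark/GraphX implementation described here, and the function name and toy graph are assumptions made for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of a community assignment.

    edges: list of (u, v) pairs for an undirected, unweighted graph.
    community: dict mapping each node to a community label.
    """
    m = len(edges)
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the two triangles scores higher than lumping every node together, which is the signal a Louvain pass follows when deciding whether to move a node. Repeating the process on the graph of communities yields the increasingly higher-level structure mentioned above.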
In this manner we are addressing two of the challenges of big graph visualization that we avoided in the geo-case:
- Computation cost, as hierarchical layout is easily data parallelizable across a cluster of computers.
- Hairball results, as this community-based layout pulls communities apart and focuses on the relationships between them.
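The top-down containment idea can be sketched in a few lines. In this simplified version each community is given a circle, and its children are placed on a ring inside that circle, each with a smaller circle of its own. The ring placement is a stand-in for the force-directed step, the code runs on one machine rather than on Spark, and the names `layout_children`, `layout_hierarchy`, and `shrink` are hypothetical.

```python
import math


def layout_children(center, radius, children, shrink=0.5):
    """Place child ids evenly on a ring inside the parent's circle."""
    cx, cy = center
    n = len(children)
    ring = radius * shrink if n > 1 else 0.0
    positions = {}
    for i, child in enumerate(children):
        angle = 2.0 * math.pi * i / n
        positions[child] = (cx + ring * math.cos(angle),
                            cy + ring * math.sin(angle))
    return positions


def layout_hierarchy(tree, root, center=(0.0, 0.0), radius=1.0, shrink=0.5):
    """Recursively position every node within its parent's circle.

    tree maps a community id to its list of children; leaves have no entry.
    """
    positions = {root: center}
    children = tree.get(root, [])
    if not children:
        return positions
    placed = layout_children(center, radius, children, shrink)
    # Sized so that sibling circles touch at most and stay inside the parent.
    child_radius = radius * shrink * math.sin(math.pi / max(len(children), 2))
    for child, pos in placed.items():
        # Each subtree depends only on its own circle, so these calls are
        # independent of one another and could run in parallel.
        positions.update(layout_hierarchy(tree, child, pos, child_radius, shrink))
    return positions


tree = {"root": ["A", "B"], "A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
positions = layout_hierarchy(tree, "root")
```

Because a subtree never needs to know where its siblings' descendants end up, the work splits cleanly by community, and nodes from different communities cannot interleave into a hairball.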
In the final stage we again borrow concepts from web geographic maps and apply a “data tile” approach - the information from earlier stages is stored using a hierarchical division of space.
Again, we’re addressing interaction and labeling here by efficiently delivering appropriate level-of-detail and domain-specific data representations to the client.
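The hierarchical division of space can be illustrated as a tile pyramid in which each tile at one zoom level is the aggregate of its four children at the next level down. The sketch below uses plain counts per tile as a stand-in for whatever the real tiles carry (link heatmaps, labels, topics); the actual tile format is not described in this talk, and the function names are assumptions.

```python
from collections import Counter


def parent_level(tiles):
    """Aggregate tiles at zoom z into their parents at zoom z - 1."""
    parent = Counter()
    for (x, y), count in tiles.items():
        # Halving both indices maps each 2x2 block of tiles onto one parent.
        parent[(x // 2, y // 2)] += count
    return parent


def build_pyramid(leaf_tiles, max_zoom):
    """Build every zoom level from max_zoom down to 0 from the finest tiles."""
    pyramid = {max_zoom: Counter(leaf_tiles)}
    for zoom in range(max_zoom - 1, -1, -1):
        pyramid[zoom] = parent_level(pyramid[zoom + 1])
    return pyramid


leaf_tiles = {(0, 0): 1, (1, 0): 2, (3, 3): 5}
pyramid = build_pyramid(leaf_tiles, max_zoom=2)
```

The total is preserved at every level, so a zoomed-out view shows the same data in coarser form, and the client only ever fetches the handful of tiles that cover its viewport at the current zoom.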
Final Demo: This application looks at a far larger graph - a 2M node, 10M link graph of Amazon products, purchases, and reviews.
We use the same community-based approach to big graph layout. Here Amazon products are our nodes and links are co-reviews and co-purchases. Communities tend to suggest product affinity. Despite an order of magnitude increase in data size, the speed and interactivity remain the same.
But one issue I didn’t discuss with the Twitter graph was how we can navigate the graph at the individual node level.
By selecting a node or even a cluster of interest we can switch to a different view of the data – an interactive lens into the graph with the ability to trace a path through links of interest either individually or in aggregate. As we do so we can see the strength and temporal distribution of connections.
This is all linked back to the global graph.
…where I can see them in context.
Zooming in ….
and again, we see the big cluster from earlier break up into a constellation of smaller communities.
Zooming further still, we see the position of these products in the context of other clusters, showing interesting co-purchase groups by proximity – The Lord of the Rings, not surprisingly, is near Harry Potter.
In summary, our approach seeks to meet some of the challenges of visualizing big graphs:
Computation cost through distributed computing and data parallelism;
Hairball layouts through a hierarchical community-based layout;
Interaction through the application of web-map-style data tiles;
And finally, effective labeling through the use of scale-sensitive domain-specific data stored with the tiles.
Finally, we’ve applied this technology to several other domains such as financial networks, doctor referral networks, and biological data.