When we explain to people what we do, they often compare us to Google. What we present today, however, is a novel way of approaching search and discovery.
Eric and I have known each other for some time as colleagues developing software for writing research at the University of Antwerp. The technology we present here is the result of Eric’s PhD research in computational linguistics and my technical background in software engineering. By combining our skills, we were able to implement this technique.
We believe that it is a very powerful way to quickly unlock valuable insights into your unstructured data. We will show you how our application allows users to easily navigate between unfamiliar sources.
There are three standard ways of fulfilling an information need:
• Retrieval is well understood in the world of relational databases.
• Search is the approach used by mainstream search engines.
• Information discovery, however, is more complex and less developed.
An important aspect of the bag-of-words model is the assumption that words are used independently of each other. Statistics and probabilities based on word counts are then used to predict the appearance of words in a text. The technique we use is called Topical Facets. It is a graph-based approach that transforms the text into a network of words: each word is connected to another word if they appear in sequence within a text. This lets us detect all relevant semantic relations during the preprocessing of the text, without any human intervention. No ontologies are required, there is no need for manual tagging, and no training phase is required.
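To make the contrast concrete, here is a minimal Python sketch. It is an illustration only, not the Topical Facets implementation: the function names `tokenize`, `bag_of_words` and `word_graph` and the two example sentences are assumptions made up for this example, and the graph simply links each word to the word that follows it.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def word_graph(text):
    """Link each word to the word that directly follows it.

    Returns a dict mapping a word to a Counter of its successors,
    so graph[a][b] is the number of times b follows a in the text.
    """
    tokens = tokenize(text)
    graph = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        graph[left][right] += 1
    return dict(graph)


# Hypothetical example sentences: same words, different order.
a = "the council approved the park"
b = "the park approved the council"

print(bag_of_words(a) == bag_of_words(b))  # True: the word counts are identical
print(word_graph(a) == word_graph(b))      # False: the sequences differ
```

The two sentences are indistinguishable as bags of words, while the graph keeps the information about which word follows which.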
We present a case in collaboration with AG VESPA, the Autonomous Municipal Company for Real Estate and City Projects in Antwerp. Along a dock in the north of the city once stood the main abattoir of Antwerp. It has been closed for almost ten years now. This picture sketches how this 18-hectare deserted area will look in a couple of years.
Why is our technology of any use here? Well, these projects require a lot of interaction with many different stakeholders.
Setting up a large city development project involves many different parties, each with their own wishes, concerns, ambitions, and so on. Of course, the authorities have rules that are mandatory. Understanding the position of each group is one of the tasks of the project manager. In the end, all actors and their views need to be aligned with one another.
Ms Meijsmans is a civil engineer and architect, and she is in charge. She knows how to calculate the amount of steel required to build a concrete roof, and she knows how to read a building plan. But she also needs to read a lot of other documents.
She knows that a project of this size needs public support. We all know from experience that projects get stuck when that support is lost. Even small disputes can cause a lot of friction.
And that is the point where we can help: what is everybody talking about and how is it connected?
For this case we got the actual data from Vespa: different kinds of sources, ranging from press feeds to council decisions to legal documents and social media. The donut shows the share of each source in the total volume. From the top, clockwise, we have advisory reports at 3%. These are official opinions on matters such as mobility, stability and public utilities. Then comes the Apache news agency with 5%. Council decisions make up 13% of the volume, and so on.
There’s even a bit of Wikipedia that we included to offer background knowledge to the system, which is self-learning. All this is unstructured, digitized text going straight from their computer folders into the database without further human intervention. The technology is language independent.
Ms Meijsmans has seen all the groups, has talked to the citizens and the contractors and the city council. She knows her data, so it would be unfair to suggest that this is a typical question she would ask.
But we don’t have a clue. We use this deliberately vague query to show that even vague queries lead to relevant responses and also to illustrate what happens when we are confronted with unfamiliar datasets.
So this is the answer we get from our system.
Just a side note: answering this query took about 50 milliseconds.
How is that possible? With topical facets we precalculate all possible semantic relations between all texts. A query merely triggers a chain reaction that collects relevant associations while pruning all off-topic links. That is why we get a sensible result even when there are no keywords in the query.
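One generic way to picture such a chain reaction is spreading activation over a precomputed, weighted graph. The sketch below is an assumption for illustration, not a description of how topical facets work internally: the function `spread`, the parameters `decay`, `threshold` and `max_steps`, and the toy graph with its weights are all hypothetical.

```python
def spread(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from the seed nodes through a weighted graph.

    graph maps a node to a dict of {neighbour: weight}, with weights
    assumed to be precomputed. Activation weakens by `decay` at each hop,
    and any link that would pass on less than `threshold` is pruned.
    """
    activation = {node: 1.0 for node in seeds}
    frontier = dict(activation)
    for _ in range(max_steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                if passed < threshold:
                    continue  # prune weak, off-topic links
                if passed > activation.get(neighbour, 0.0):
                    activation[neighbour] = passed
                    next_frontier[neighbour] = passed
        if not next_frontier:
            break
        frontier = next_frontier
    return activation


# Hypothetical precomputed relations with made-up weights.
relations = {
    "concerns": {"parking": 0.9, "traffic": 0.8, "football": 0.1},
    "traffic": {"speed": 0.9, "footbridge": 0.6},
    "parking": {"residents": 0.7},
}

result = spread(relations, ["concerns"])
print(sorted(result, key=result.get, reverse=True))
```

Because the expensive part, the weights between nodes, is computed ahead of time in this sketch, answering a query only involves walking a small part of the graph. That is consistent with a fast response, although the actual mechanism may differ.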
On the left hand side, we show some snippets from the top two documents in the result set. For instance “We are…” which is spot on despite the fact that there is no relation between the terminology in the query and the answer.
On the right-hand side, we get six clusters of topics that group together based on this query. Of course, the original result was in Dutch. We took the liberty of translating everything into English.
The clusters are interconnected. It means that together they constitute the answer to the question. So while we only asked about civic concerns in our question, we get a set of six interrelated topics. Although these topics are not strictly about civic concerns alone, all of them are related to them.
This set of topics is an entry point for exploration. To be clear: the labels for the clusters are manually added for this presentation.
Let’s look at what happens when we try to discover the actual civic concerns.
When taking a deeper look at the relationships of this topic, we get a fine-grained view of the concerns as reported by the citizens.
We identify a number of subjects here, but they do not all receive the same attention. There is a marked discrepancy between the press and other sources of information about citizen participation. For instance, when talking about the development project, Optima & Land Invest get an unreasonably large share of coverage in the media.
When we look back at traffic infrastructure, we find for instance that the communication on this issue is strongly biased, to say the least.
Let’s examine this by inspecting the Mobility cluster.
The press mostly reports the action groups’ dissatisfaction with the speed limits. However, when looking at how this topic is handled in the planning documents and in council decisions, we find that in reality it has been given extensive thought and careful consideration, which the press has apparently completely overlooked. By linking different sources on the same subject, we discover a communication issue. We illustrate here how the press clippings are linked to other sources, for instance project reports and city council decisions.
Translation of some snippets from the masterplan: “We have therefore included a footbridge in the design to increase the ease of crossing. We propose to eliminate the existing Kalverstraat and to build a new access road next to the former Slaughterhouse Hall. This allows the neighborhood park ‘Kalverwei’ to be free of traffic. Access to the area is designed to be as independent of the Slachthuislaan as possible and to keep traffic pressure as limited as possible. Incidentally, it is expected that the reconstruction of the IJzerlaan will help reduce the amount of transit traffic on the Lange Lobroekstraat.”
This concludes our example for this case study. Of course more can be done with the data and our techniques.
In this case we have only scratched the surface of the possibilities of our technology.
For instance, we can capture the general mood of the press clippings over the period August 2016 to February 2017, or plot readership impact on a timeline. Something different is the reconstruction of parliamentary debates from transcriptions, such as the Europarl transcription files for September 2006 in the European Parliament.
This opens up the door for all sorts of advanced analytics and dashboards.
The semantic relations allow for easy … This allows you to jump from one collection to the next, touching only those documents that are relevant to your information need, even if you had no knowledge of their existence or their semantic relationship, or did not even know how to begin a search for such information.
Discovering relevant information from datasets without any prior knowledge of their contents. This can be achieved by simply following semantic relationships, as described above, or by the powerful query mechanism, which can return relevant information even for vaguely worded questions.
… and drilling down. The first two points focus on search and discovery. This last point focuses on exploring and analysing the data. The system precalculates all semantic relationships, and these are available in the background to help you.
We hope that you found this presentation informative and that you have gotten some ideas on how this technique could be applied to your own unstructured data. If there are any questions, we would love to take them right now, and thank you all for your attention.
Analyzing stakeholder interaction using 'unlock'
Unlock aggregates retrieval, search and
• RETRIEVAL gathering available information from a data store:
“get the address of employee Marc Smith”
• SEARCH replying to someone having domain knowledge:
“compare this fund’s volatility relative to the benchmark using beta”
• DISCOVERY exploring unknown data:
“what are the concerns of the citizens in the building project at the
Bag-of-words Text graph
Preprocessing
Unstructured Data Input
Open Data Sources
Meta Data Extraction
Internal & External Reports
Let’s visualize this…
Stakeholders in the project
• Regulatory authorities: Europe, state, region, province, municipality
• Private owners of the site
• Architects and master planners
• Property developers
• Experts on mobility, public utilities, environment, financing
• Neighborhood committees
• General press
Ir.-arch. Nancy Meijsmans
Project Manager at AG Vespa for the
Slachthuislaan development area
A project manager
needs to marshal the
and concerns of
amounts of text.
“What are the concerns of the citizens in the
building project at the Slachthuislaan?”
“Wat zijn de bezorgdheden van de
burgers bij het bouwproject aan
GVA – 11 APRIL 2016
Is our feedback even being used?
“We are pulling the emergency brake
because our input does not seem to
matter a lot”
“We are especially worried about
qualitative, green, public space and
affordable housing, but also about
parking space and mobility”
GVA – 11 FEBRUARY 2017
It’s good to have a fresh breeze in this
“I inquired around the neighbourhood
and believe very strongly in the
Youth & Culture
Other advisory boards
Optima & Land Invest; secret deal
Traffic infrastructure; safe street crossing
Parking spaces; number of residents
Kids; green areas; playgrounds
Emissions; air pollution
Joris Giebens requests external audit
GVA – 11 APRIL 2016
“… while the plan prescribes a 70 km
zone. It is inconceivable that the
maximum speed on this already
dangerous road is being further
More is possible…
Capturing the sentiment (mood) of the messages
Reconstruction of debates in parliament
Apache DM DS GVA HLN NB Other
• The semantic relations allow for easy navigation between documents
and document sets based on your information need.
• Discovering relevant information from data sets without any prior
knowledge of its contents.
• Fluid views on your knowledge by zooming out and drilling down.