Slides in HD: http://www.jianxu.net/en/files/HON_Xu_HONS_2017.ppsx
Paper on Science Advances: http://advances.sciencemag.org/content/advances/2/5/e1600028.full.pdf
Our world produces massive data every day; they exist in diverse forms, from pairwise data and matrix to time series and trajectories. Meanwhile, we have access to the versatile toolkit of network analysis. Networks also have different forms; from simple networks to higher-order network, each representation has different capabilities in carrying information. For researchers who want to leverage the power of the network toolkit, and apply it beyond networks data to sequential data, diffusion data, and many more, the question is: how to represent big data and networks? The higher-order network (HON) makes a first step to answering the question: it is a critical piece for representing higher-order interaction data. In the slides, we introduce a scalable algorithm for building the network, and visualization tools for interactive exploration. Finally, we presents broad applications of the higher-order network in the real-world.
Representing higher-order dependencies in networks: methods, applications, and visualizations
1. Representing Higher-order Dependencies in Networks:
Methods, Applications, and Visualizations
Jian Xu
Department of Computer Science and Engineering
University of Notre Dame
i@jianxu.net
June 20th, 2017
18. 18
Variable orders in HON
Fixed-order Variable-order
Assuming a fixed order
beyond the second order
becomes impractical
because “higher-order
Markov models are more
complex” due to
combinatorial explosion
--- Rosvall et al. (Nature
Comm. 2014)
20. 20
Scalability
Network representation
(global shipping data)
Number of edges Number of nodes Network density Clustering
time (mins)
Ranking
time (s)
Conventional first-order 31,028 2,675 4.3×10-3 4 1.3
Fixed second-order 116,611 19,182 3.2×10-4 73 7.7
HON, max order two 64,914 17,235 2.2×10-4 45 4.8
HON, max order three 78,415 26,577 1.1×10-4 63 6.2
HON, max order four 83,480 30,631 8.9×10-5 67 7.0
HON, max order five 85,025 31,854 8.4×10-5 68 7.6
21. 21
Scalability
Network representation
(global shipping data)
Number of edges Number of nodes Network density Clustering
time (mins)
Ranking
time (s)
Conventional first-order 31,028 2,675 4.3×10-3 4 1.3
Fixed second-order 116,611 19,182 3.2×10-4 73 7.7
HON, max order two 64,914 17,235 2.2×10-4 45 4.8
HON, max order three 78,415 26,577 1.1×10-4 63 6.2
HON, max order four 83,480 30,631 8.9×10-5 67 7.0
HON, max order five 85,025 31,854 8.4×10-5 68 7.6
22. 22
Scalability
Network representation
(global shipping data)
Number of edges Number of nodes Network density Clustering
time (mins)
Ranking
time (s)
Conventional first-order 31,028 2,675 4.3×10-3 4 1.3
Fixed second-order 116,611 19,182 3.2×10-4 73 7.7
HON, max order two 64,914 17,235 2.2×10-4 45 4.8
HON, max order three 78,415 26,577 1.1×10-4 63 6.2
HON, max order four 83,480 30,631 8.9×10-5 67 7.0
HON, max order five 85,025 31,854 8.4×10-5 68 7.6
24. 24
How to construct HON?
Raw data
• Sequential data
Rule
extraction
• Which nodes to
split into higher-
order nodes, and
how high the
orders are
Network
wiring
• Connect nodes
representing
different orders
HON
• Use HON like
the conventional
network for
analyses
30. 30
A
• Convert all first-order rules into edges
B
• Convert higher-order rules
• Add higher-order nodes when necessary
C
• Rewire edges
• The edge weights are preserved
D
• Rewire remaining edges
Raw data
Rule
extraction
Network
wiring
HON
31. 31
A
• Convert all first-order rules into edges
B
• Convert higher-order rules
• Add higher-order nodes when necessary
C
• Rewire edges
• The edge weights are preserved
D
• Rewire remaining edges
Raw data
Rule
extraction
Network
wiring
HON
32. 32
A
• Convert all first-order rules into edges
B
• Convert higher-order rules
• Add higher-order nodes when necessary
C
• Rewire edges
• The edge weights are preserved
D
• Rewire remaining edges
Raw data
Rule
extraction
Network
wiring
HON
33. 33
A
• Convert all first-order rules into edges
B
• Convert higher-order rules
• Add higher-order nodes when necessary
C
• Rewire edges
• The edge weights are preserved
D
• Rewire remaining edges
Raw data
Rule
extraction
Network
wiring
HON
45. 45
Higher-order dependencies revealed by HON
Data # Records
Dependencies
revealed
Similar observations
Ship movement 3,415,577 Up to 5th order N/A
Clickstream 3,047,697 Up to 3rd order
“… appear to saturate at k = 3 for
Yahoo… browsing behavior
across websites is definitely not
Markovian but can be captured
reasonably well by a not-too-high
order Markov chain.”
--- Chierichetti et al. (2012)
Retweet 23,755,810 N/A
Assuming the second order has
“marginal consequences for
disease spreading”
--- Rosvall et al. (2014)
46. 46
Higher-order dependencies revealed by HON
Data # Records
Dependencies
revealed
Similar observations
Ship movement 3,415,577 Up to 5th order N/A
Clickstream 3,047,697 Up to 3rd order
“… appear to saturate at k = 3 for
Yahoo… browsing behavior
across websites is definitely not
Markovian but can be captured
reasonably well by a not-too-high
order Markov chain.”
--- Chierichetti et al. (2012)
Retweet 23,755,810 N/A
Assuming the second order has
“marginal consequences for
disease spreading”
--- Rosvall et al. (2014)
47. 47
Higher-order dependencies revealed by HON
Data # Records
Dependencies
revealed
Similar observations
Ship movement 3,415,577 Up to 5th order N/A
Clickstream 3,047,697 Up to 3rd order
“… appear to saturate at k = 3 for
Yahoo… browsing behavior
across websites is definitely not
Markovian but can be captured
reasonably well by a not-too-high
order Markov chain.”
--- Chierichetti et al. (2012)
Retweet 23,755,810 N/A
Assuming the second order has
“marginal consequences for
disease spreading”
--- Rosvall et al. (2014)
48. 48
Higher-order dependencies revealed by HON
Data # Records
Dependencies
revealed
Similar observations
Ship movement 3,415,577 Up to 5th order N/A
Clickstream 3,047,697 Up to 3rd order
“… appear to saturate at k = 3 for
Yahoo… browsing behavior
across websites is definitely not
Markovian but can be captured
reasonably well by a not-too-high
order Markov chain.”
--- Chierichetti et al. (2012)
Retweet 23,755,810 N/A
Assuming the second order has
“marginal consequences for
disease spreading”
--- Rosvall et al. (2014)
52. 52
Compatible with existing tools
Conventionally: every node
represents a single entity
(location, state, etc.)
Now: break down nodes into
higher-order nodes that carry
different dependency relationships
1
1 1( | ) t t
t t t t
tj
W i i
P X i X i
W i j
1
( | )
| ( | )
( | )
t t
k
W i h j
P X j X i h
W i h k
Only change the node labeling
54. 54
Invasive species
Zebra mussels @ Great Lakes
Clogging water pipes, attach to boats
Photos: Great Lakes Environmental Research Lab; TIME & LIFE Images, Getty Images
$120 billion / year
damage & control costs
66. 66
Ranking on clickstream network
• 26% pages show more
than 10% changes in
ranking
• More than 90% pages lose
PageRank scores, while a
few pages gain significant
scores
No changes
to the ranking algorithm
73. 73
Anomaly detection with dynamic network
Capturing complex
anomalous patterns
No changes in
Network topology
74. 74
Synthetic data with 11 billion movements
100,000 ships, each moving 100 steps;
11 scenarios, each repeating 100 times;
Total: 11,000,000,000 movements
75. 75
Higher-order anomalies captured by HON
First-order network HON
Injection and alternation of 1st order dependencies
Networkdifferences
Day Day
76. 76
Higher-order anomalies captured by HON
First-order network HON
Injection and alternation of 2nd order dependencies
Networkdifferences
Day Day
77. 77
Higher-order anomalies captured by HON
First-order network HON
Injection and alternation of 3rd order dependencies
Networkdifferences
Day Day
78. 78
Higher-order anomalies captured by HON
First-order network HON
Networkdifferences
Day Day
Injection and alternation of higher-order dependencies
Fails to capture certain anomalies
88. 88
Varieties of data
Application fields Input Trajectories Nodes Edges
Transportation Ship trajectories Ports Ship traffic
Computer network Clickstreams Web pages Web traffic
Human interactions
Phone call or
message cascades
People Information flow
Human behavior Human movements POIs Traffic
Healthcare Patient records Diseases Disease evolutions
NLP Sentences Words # word pairs
96. 96
Collaborators
The HON paper received valuable comments from Yuxiao Dong, Reid Johnson and David Lodge.
The HON introduction video is designed by Mayra Duarte in iCeNSA.
HON
HoNVis
Anomaly
Bruno Ribeiro
100. 100
References
• Rosvall, Martin, Alcides V. Esquivel, Andrea Lancichinetti, Jevin D. West, and Renaud Lambiotte. "Memory in network flows and its effects on
spreading dynamics and community detection." Nature communications 5 (2014).
• Chierichetti, Flavio, Ravi Kumar, Prabhakar Raghavan, and Tamas Sarlos. "Are web users really markovian?." In Proceedings of the 21st
international conference on World Wide Web, pp. 609-618. ACM, 2012.
• Pons, Pascal, and Matthieu Latapy. "Computing communities in large networks using random walks." J. Graph Algorithms Appl. 10, no. 2
(2006): 191-218.
• Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: bringing order to the web." (1999).
• Gleich, David F., Lek-Heng Lim, and Yongyang Yu. "Multilinear PageRank." SIAM Journal on Matrix Analysis and Applications 36, no. 4
(2015): 1507-1541.
• Akoglu, Leman, Mary McGlohon, and Christos Faloutsos. "Anomaly detection in large graphs." In In CMU-CS-09-173 Technical Report. 2009.
• Seebens, H., M. T. Gastner, and B. Blasius. "The risk of marine bioinvasion caused by global shipping." Ecology Letters 16, no. 6 (2013): 782-
790.
• Rosvall, Martin, and Carl T. Bergstrom. "Maps of random walks on complex networks reveal community structure." Proceedings of the National
Academy of Sciences 105, no. 4 (2008): 1118-1123.
• Ducruet, César. "Network diversity and maritime flows." Journal of Transport Geography 30 (2013): 77-88.
• Ducruet, César, Sung-Woo Lee, and Adolf KY Ng. "Centrality and vulnerability in liner shipping networks: revisiting the Northeast Asian port
hierarchy." Maritime Policy & Management 37, no. 1 (2010): 17-36.
• Ducruet, César, Céline Rozenblat, and Faraz Zaidi. "Ports in multi-level maritime networks: evidence from the Atlantic (1996–2006)." Journal of
Transport Geography 18, no. 4 (2010): 508-518.
• Ducruet, César, and Theo Notteboom. "The worldwide maritime network of container shipping: spatial structure and regional dynamics." Global
Networks 12, no. 3 (2012): 395-423.
• Kaluza, Pablo, Andrea Kölzsch, Michael T. Gastner, and Bernd Blasius. "The complex network of global cargo ship movements." Journal of the
Royal Society Interface 7, no. 48 (2010): 1093-1103.
• Hu, Yihong, and Daoli Zhu. "Empirical analysis of the worldwide maritime transportation network." Physica A: Statistical Mechanics and its
Applications 388, no. 10 (2009): 2061-2071.
101. 101
Takeaways
Higher-order network is:
More accurate in capturing dynamics in raw data.
More scalable than fixed-order networks.
3
Compatible with existing network algorithms.
111. 111
Comparison with VOM
HON VOM
In HON but not in
VOM
In VOM but not
in HON
0th order 0 3,029 0 3029
1st order 31,028 31,028 0 0
2nd order 32,960 35,288 427 2,755
3rd order 15,642 21,536 550 6,444
4th order 4,632 8,973 302 4,643
5th order 763 2,084 23 1,344
Total 85,025 101,938 1,302 18,215
112. 112
Higher-order dependencies revealed by HON
Data # Records
Inject known variable-
order dependencies
Synthetic 10,000,000
10 second-order
10 third-order
10 fourth-order
• Effectiveness: correctly captures all 30 of the higher-order dependencies
• Accuracy: does not extract false dependencies beyond the fourth order
• Compactness: determines that all other dependencies are first-order
113. 113
Clustering: higher-order network
• 45% of ports belong to more than one cluster
• 44 ports (1.7% of all) belong to five clusters
o New York, Shanghai, Hong Kong, Gibraltar, Hamburg, etc.
• Panama Canal belongs to six clusters
• Highlighting ports that may be invaded by species from multiple regions
115. 115
Ranking on clickstream network
Pages that gain PageRank scores ΔPageRank Pages that lose PageRank scores ΔPageRank
South Bend Tribune - Home. +0.0119 KTUU - Home. −0.0057
Hagerstown News / obituaries - Front. +0.0115 KWCH - Home. −0.0031
South Bend Tribune - Obits - 3rd Party. +0.0112 Imperial Valley Press - Home. −0.0011
South Bend Tribune / sports / notredame - Front. +0.0102 Hagerstown News / sports - Front. −0.0005
Aberdeen News / news / obituaries - Front. +0.0077 Imperial Valley Press / classifieds / topjobs - Front. −0.0004
WDBJ7 - Home. +0.0075 Gaylord - Home. −0.0004
KY3 / weather - Front. +0.0075 WDBJ7 / weather / web-cams - Front. −0.0004
Hagerstown News - Home. +0.0072 KTUU / about / meetnewsteam - Front. −0.0003
Daily American / lifestyle / obituaries - Front. +0.0054 Smithsburg man faces more charges following salvag −0.0003
WDBJ7 / weather / closings - Front. +0.0048 KWCH / about / station / newsteam - Front. −0.0003
WSBT TV / weather - Front. +0.0041 South Bend Tribune / sports / highschoolsports - Front. −0.0003
Daily American - Home. +0.0036 Hagerstown News / opinion - Front. −0.0002
WDBJ7 / weather / radar - Front. +0.0036 WDBJ7 / news / anchors-reporters - Front. −0.0002
WDBJ7 / weather / 7-day-planner - Front. +0.0031 Petoskey News / news / obituaries - Front. −0.0002
WDBJ7 / weather - Front. +0.0019 KWCH / news - Front. −0.0002
No changes
to the ranking algorithm
116. 116
Network representation Number of
edges
Number of
nodes
Network
density
Prob. of
returning
after two
steps
Prob. of
returning
after three
steps
Entropy
rate (bits)
Clustering
time (mins)
Ranking
time (s)
Conventional first-order 31,028 2,675 4.3×10-3 10.7% 1.5% 3.44 4 1.3
Fixed second-order 116,611 19,182 3.2×10-4 42.8% 8.0% 1.45 73 7.7
HON, max order two 64,914 17,235 2.2×10-4 41.7% 7.3% 1.46 45 4.8
HON, max order three 78,415 26,577 1.1×10-4 45.9% 16.4% 0.90 63 6.2
HON, max order four 83,480 30,631 8.9×10-5 48.9% 18.5% 0.68 67 7.0
HON, max order five 85,025 31,854 8.4×10-5 49.3% 19.2% 0.63 68 7.6
Good afternoon, my name is Jian, I’m from University of Notre Dame.
Today I will discuss an approach to represent higher-order dependencies in networks.
Overall, we have placed an important piece for representing raw data with networks
Overall, we have placed an important piece for representing raw data with networks
Now let’s begin with the higher-order network.
Our world is complex. A single ship may sail among hundreds of ports in the world
Consider hundreds of thousands of such ships.
They naturally weave a global shipping network, that transports cargos intentionally, and transports invasive species unintentionally.
Suppose we have raw data about global shipping movements,
which in every row of the data we know which vessel goes from which port to which other port at what time.
We see sequential data like this very frequently.
In order to perform network analyses, we first need to construct a network from this data.
The conventional way is to look at the data, count how many ships sailed between port pairs, and use these as edge weights to build a network.
This conventional network gives an overview of port-to-port traffic.
In this network, the edges appear to be the same, since port-to-port traffic are equal.
When we simulate ship movements on this network, according to the network structure, where a ship will go from Singapore will not depend on where it came to Singapore.
In reality, according to the raw data, the ship is more likely to continue onto LA if it came from Shanghai to Singapore, and more likely to go to Seattle if it came from Tokyo to Singapore.
Such complex movement patterns are not captured by the conventional first order network, which only retains pairwise interactions between ports, and loses important information about higher-order dependencies.
We know the network representation derived from the raw data is the foundation for further analyses,
if the network captures only pairwise connections in the raw data, and loses information such as higher-order dependencies,
then for all following network analyses, the results may not be accurate.
The question is…
We propose the higher-order network.
The basic idea is to split nodes into higher-order nodes.
In this example, based on the movement patterns in the raw data,
Singapore is split into Singapore given Shanghai and Singapore given Tokyo.
When ships follow different paths to Singapore, their choices for the next step will be different, therefore being able to reproduce ship movements more accurately.
The idea of higher-order Markov is not new, but it has not been discussed in the network context until very recently.
In 2014, Rosvall’s group proposed the fixed second order network, which counts the number of trigrams in the raw trajectories,
And uses these memory nodes to remember the one more previous step. This is a powerful approach that captures all second-order dependencies.
Although the fixed-order network is relatively easy to build, going beyond second order is impractical due to the possible combinations of multiple previous steps.
Instead, we propose to use a variable-order representation, that uses the first order when it is sufficient, and uses higher orders only when necessary.
As a result, we can represent variable orders of dependencies in the same network.
Here we see the first order node of Singapore, second order, third order, fourth order, and so on.
While assuming a fixed second order significantly increases the size of networks
By using variable orders in HON, the information in the network is comparable, but the network size is greatly reduced, since higher orders are used only when necessary.
Even the 5th order higher-order network is smaller than the fixed second order network. Therefore, HON has good scalability.
Now how to construct HON?
The workflow is as follows:
Given the raw sequential data, we first decide which nodes need to be split into higher-order nodes, and how high the orders are;
Then we connect nodes representing different orders of dependency.
Finally the output is the higher-order network, which can be used like conventional network for analyses.
The rule extraction process works as follows. In the shipping example, from the raw ship movement trajectories,
first count how many ships are going from Singapore to the next port
normalize the distributions for the next step given the current port being Singapore;
Then given one more previous step, say Shanghai, see how that changes the distribution of the next step from Singapore;
If the change is significant, then there is second order dependency here; this process is then repeated recursively for higher-orders.
After deciding the orders of dependencies, we first construct a first-order network in the conventional way,
then for every second order rule, add the corresponding node,
and rewire the previous first-order link to connect to the new node;
Then repeat the process for third order rules and so on.
Finally, for the highest order nodes, rewire their out-edges to connect to nodes with the highest orders possible.
Our latest work has made the algorithm parameter-free and more scalable.
In practice, we do not have a lot of higher-order dependencies.
To identify higher-order dependencies, we need to iterate every order and test the significance. However, the number of possible dependencies grows fast with orders.
Is there a way to optimize the cost in space and time?
Given the probability distribution for the next step, we can actually pre-determine if a higher-order will make a significant difference. Therefore, the algorithm does not have to iterate all possibilities for higher orders.
Let us see how HON affects dynamic processes on the network.
Suppose we have these ship trajectories,
the first two ships are going back and forth between port A and port B.
The third ship goes between port A and port C.
We can construct both the first-order network and the higher-order network from these trajectories.
Then based on the network structure, we predict where the ship is going.
On the first-order network, if a ship used to go between a and b, it is not guaranteed to go back to b again.
But on the higher-order network, if a ship goes from b to a, according to the network structure, it will go back to b for sure.
We can compare the predicted trajectories with the testing set, and know how accurately movements are simulated on these two networks.
We see that compared with the first-order network, fixed second-order network improves the accuracy, but HON with maximum order of 5 can further improve the accuracy, with a smaller size than the fixed 2nd order network.
We take diverse real-world data sets as the input, including the ship movement data, user clickstream data on websites, and Weibo retweet data reflecting diffusion process.
Interestingly, HON shows that a ship’s movement can depend on up to five previous ports, while all existing papers on shipping network simply ignore these higher-order dependencies.
The clickstream data shows up to 3rd order, which matches observations in other literatures.
The retweet data as a diffusion process, very interestingly, shows no higher-order dependency.
Given diverse raw data with higher-order dependencies, existing analysis methods are divided into two directions.
The first one is simply ignore higher-order patterns in the data and use the first-order network representation, which is inaccurate, and may potentially undermine the subsequent network analyses.
The other approach, is to modify existing algorithms to incorporate higher-order movement patterns in the analyses,
Such as the recent papers in Multilinear PageRank, or clustering using supervised random walks.
However, such approaches do not generalize, that we cannot modify all existing algorithms to incorporate higher-order movement patterns.
The higher-order network resolves this dilemma, by fundamentally improving the network itself,
such that network algorithms, without any modification, will be aware of higher-order dependencies.
The reason behind direct compatibility is that HON keeps the data structure consistent with FON. Here shows an example of random walks, we see that the only change is node labeling.
Now we see how HON influences a wide range of applications.
Thirty years ago zebra mussels did not even exist in the U.S., but now they are all over the Great Lakes. They do significant economical harms.
One of the major vectors of species introduction is through global shipping; (clicks) here shows an example of how species may be taken up with ballast water at one location, and discharged at another.
Can we discover the species flow patterns by studying the global shipping network, in order to inform the policy makers to devise effective invasive species control policies?
One of the major vectors of species introduction is through global shipping; (click) here shows an example of how species may be taken up with ballast water at one location, and discharged at another.
Can we discover the species flow patterns by studying the global shipping network, in order to inform the policy makers to devise effective invasive species control policies?
Clustering methods based on flow dynamics, such as Infomap and walktrap, can help us identify clusters of ports that are loosely connected to each other by shipping traffic.
Controlling the weak links between clusters can prevent species propagation from one cluster of ports to another.
If we do clustering on the first-order network, these nodes appear to be equivalent.
In reality, given ship movements, species are potentially more likely to be introduced to Los Angeles indirectly from Shanghai.
Clustering on HON will automatically take this into account and cluster Shanghai and Los Angeles together.
If we cluster the ports on the first-order network, it gives reasonable results, that ports geographically close to each other are likely to bring invasive species to each other.
We see that for Malta, a small country in the Mediterranean, these two ports belong to the same Mediterranean cluster.
On the higher-order network, however, clustering shows that Malta Freeport belongs to as many as five clusters, very different from Valletta, even they are geographically close to each other.
The reason is that Valletta is a small local port, while Malta Freeport is an international port with lots of traffic and is potentially susceptible to multiple sources of invasions.
Clustering on HON is able to reflect such differences, with no changes to the clustering algorithm itself.
**Our biology collaborators commented that the ability to capture indirect species invasions is critical for devising effective species control policies.
**Currently we are focusing on the Arctic region, which is overseeing increased traffic due to the reduced sea ice coverage caused by climate change.
Another application, ranking web page clickstreams
Given the clickstreams of web pages (click),
one could build a first-order network like this.
According to the network structure, after a user views or uploads a photo, the user is very likely to go back to the home page. The higher-order network representation of the same data, however, shows additional information.
It keeps the nodes and connections of the first-order network (click), but in addition (click),
it adds additional higher-order nodes, representing the natural scenario that once a user uploads and views a photo, the user is very likely to keep uploading more photos rather than going back to the home page.
Given that PageRank is essentially random walks with resets on a network, the random walker is more likely to get trapped in the strong loop in HON, leading to differences in PR scores.
More than 90% pages lose PageRank scores, while a few pages gain significant scores.
In the raw data, we see weather forecast and obituaries gain the most scores on HON, which is natural considering the data set is from a news company.
By using the HON, PR is able to simulate complex user browsing behaviors, with no changes to the PR algorithm.
Let’s see how HON can help unveil anomalies that are otherwise hidden.
The network approach of anomaly detection is to build a dynamic network from a raw data, which is a series of networks for each time stamp.
Then monitor when the network shows significant changes.
Suppose we have trajectories like this. Ships at c will randomly go to d or e, no matter where the ship comes to c.
We can build both the FON and HON. Since there are no higher-order dependencies, HON looks the same as FON.
At the second time stamp, there is still no higher-order dependency, the two networks remain the same.
In time frame 1 and 2, the network distance is 0.
Now that a second-order pattern has emerged, that all ships coming from a to c will go to d, and all ships from b to c will go to e.
Since the aggregated pairwise traffic remains the same, the first-order network shows no changes.
On the HON, the network topology will change to reflect the emergence of the second-order patterns, therefore the rising network difference signals an anomaly here.
When the second-order patterns change, now that all ships from a through c will go to e and all ships from b through c will go to d, the first-order network still shows no change
But again, the HON can capture complex anomalous patterns.
We use a more comprehensive data to test the effectiveness of the HON based anomaly detection scheme.
We produce a big synthetic data with known first-order dependencies, 2nd order, 3rd order, and mixed order dependencies.
The 11 cases correspond to 11 billion movements.
We represent this data with dynamic FON and dynamic HON.
The alternation of first-order movements is reflected by both networks.
But for 2nd, 3rd and more complex movement behavior changes, FON may completely fail to capture such anomalies.
But for 2nd, 3rd and more complex movement behavior changes, FON may completely fail to capture such anomalies.
Even if it reflects some changes, the signal can be very weak, compared to HON.
We also test the anomaly detection framework on the Porto taxi gps trajectory data.
Because the graph distances of HON reflect changes in higher-order movement patterns and first-order patterns,
HON is able to amplify anomalous patterns.
Despite the rich information in HON, there is not an existing tool for visualizing it. We have developed a toolkit for that.
Here is the framework. We map the raw trajectories with useful metadata such as time and location.
Then we build both the FON and HON from the input, and create a mapping between the two.
Given the networks, we provide three levels of exploration, including the global level to identify nodes of interest, individual level, and local level
Design the interface based on five different views
This is how the global shipping network is represented as HON in our visualization software. Users can import heterogeneous data, quickly identify interactive patterns of interest, explore at different granularities, trace propagation pathways and see how they expand the subgraph of ports invaded by certain species, test targeted species control strategies for given ports / shipping routes, and drill down to the raw data to understand the formation and evolution of higher-order dependencies.
This software package expedites the discovery of complex patterns and facilitates real-world applications in decision making.
I strongly believe HON has great potentials to be further explored.
Serves as a feature for fraud detection in signatures, where each signature is a trajectory.
Eye tracking data.
Patient records.
Human trafficking.
For example, HON can not only take trajectories as the input, but also diffusion data.
We can follow the pathways in the diffusion trees to build observations.
Therefore, it is possible to analyze problems like information diffusion with HON.
We can also discretize the time series as discrete sequential data.
Making it possible to analyze stock market or health data with HON.
It is also possible to chain pairwise interactions together. For example, if we assume phone calls made within 10 minutes are related, we can build phone call cascades, and use that as the input of HON.
Overall, we have placed an important piece for representing raw data with networks
Thanks to our funding agencies,
Thanks to our funding agencies,
Unlike the hypergraph, which requires network algorithms to be designed specifically for the network,
Use this knowledge, we propose the optimized algorithm.
It is a lazy algorithm, that it builds first-order observations only
And try the expansion
Higher-order observations are only generated when necessary.
Now that if the maximum possible divergence has already fallen below the threshold, the algorithm stops trying to look for even higher orders.
We construct a synthetic data set with known higher-order dependencies as the ground truth.
We generate 10M ship movements, and inject variable orders of dependencies.
The result shows that our method correctly captures these variable orders of dependencies, does not extract false dependencies beyond the fourth order, and keeps all other dependencies the first-order for compactness.