Exploiting Latent Features of Text and Graphs
Thesis Proposal
Justin Sybrandt
Clemson University
April 30, 2019
1
Peer-Reviewed Work:
• Sybrandt, J., Shtutman, M., & Safro, I. (2017, August). Moliere: Automatic biomedical hypothesis
generation system. In Proceedings of the 23rd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (pp. 1633-1642). ACM.
• Acceptance rate 8.8%.
• 1 Non-author citation.
• Sybrandt, J., Shtutman, M., & Safro, I. (2018, December). Large-scale validation of hypothesis
generation systems via candidate ranking. In 2018 IEEE International Conference on Big Data (Big
Data) (pp. 1494-1503). IEEE.
• Acceptance rate 18%.
• Sybrandt, J., Carrabba, A., Herzog, A., & Safro, I. (2018, December). Are abstracts enough for
hypothesis generation? In 2018 IEEE International Conference on Big Data (Big Data) (pp.
1504-1513). IEEE.
• Acceptance rate 18%.
2
Pending Works:
• Sybrandt, J., & Safro, I. Heterogeneous Bipartite Graph Embeddings. In-submission and not
available online.
• Sybrandt, J., & Shaydulin, R., & Safro, I. Partition Hypergraphs with Embeddings. Not available
online.
• Aksenova M., Sybrandt J., Cui B., Sikirzhytski V., Ji H., Odhiambo D., Lucius M., Turner J. R.,
Broude E., Pena E., Lizzaraga S., Zhu J., Safro I., Wyatt M. D., Shtutman M. (2019). Inhibition of
the DDX3 prevents HIV-1 Tat and cocaine-induced neurotoxicity by targeting microglia activation.
https://www.biorxiv.org/content/10.1101/591438v1
• Locke, W., Sybrandt, J., Safro, I., & Atamturktur, S. (2018, November 12). Using Drive-by Health
Monitoring to Detect Bridge Damage Considering Environmental and Operational Effects.
https://doi.org/10.31224/osf.io/ntfdp
3
Other Works:
Peer-Reviewed Extended Abstracts:
• Aksenova, M., Sybrandt, J., Cui, B., Lucius, M., Ji, H., Wyatt, M., Safro, I., Zhu, J., & Shtutman,
M. (2019). Inhibition of the DEAD Box RNA Helicase 3 prevents HIV-1 Tat- and cocaine-induced
neurotoxicity by targeting microglia activation. In 2019 Meeting of the NIDA Genetic Consortium.
Extended Abstract & Poster
• Sybrandt, J., & Hick, J. (2015). Rapid replication of multi-petabyte file systems. Work in progress
in the 2015 Parallel Data Storage Workshop. Poster in 2015 Super Computing.
Online Work:
• Shaydulin, R., & Sybrandt, J. (2017). To Agile, or not to Agile: A Comparison of Software
Development Methodologies. arXiv preprint arXiv:1704.07469.
• 5 Non-author citations.
4
Overview
• Latent Variables
• Unobservable qualities of a dataset.
• Text Embeddings
• Transferable textual latent features.
• Correspond to semantic properties of words.
• Graph Embeddings
• Underlying network features.
• Correspond to roles, communities, and unobserved node-features.
5
Outline
Hypothesis Generation
Moliere: Automatic Biomedical Hypothesis Generation
Validation via Candidate Ranking
Are Abstracts Enough?
Graph Embedding
Heterogeneous Bipartite Graph Embedding
Partition Hypergraphs with Embeddings
Proposed Work
6
Hypothesis Generation Background
Drug Discovery Problem Overview
• Medical research is expensive and risky.
• Text mining can identify fruitful research directions before expensive
experiments.
Wall Street Journal: “Pfizer Ends Hunt for Drugs to Treat Alzheimer’s and Parkinson’s”
8
Hypothesis Generation Overview
• PubMed contains over 27 million abstracts.
• 2-4k added daily.
• Hypothesis generation finds implicitly published relationships.
9
• Automatic Biomedical Hypothesis Generation System
• Basic Pipeline
• Data collection
• Network construction
• Abstract identification
• Topic modeling
10
Data Collection
• Titles & Abstracts
Tumours evade immune control by creating hostile microenvironments that perturb T cell
metabolism and effector function.
• Phrases (n-grams)
• “T cell metabolism”
• Predicates
• tumours → evade → immune control
• Unified Medical Language System
Neoplasms
tumor, tumour, oncological abnormality
11
Medical Text Embedding
• Use fasttext to capture latent features [3].
• Semantically similar items are nearest
neighbors.
12
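To ground this step, here is a minimal sketch of training such an embedding with the fasttext Python package; the corpus file name, preprocessing, and hyperparameters are illustrative assumptions rather than the exact Moliere configuration.

```python
import fasttext  # pip install fasttext

# Assumed input: one preprocessed sentence per line, with multi-word terms
# (UMLS keywords, n-grams) already joined by underscores.
model = fasttext.train_unsupervised(
    "pubmed_sentences.txt",  # hypothetical corpus file
    model="skipgram",        # skip-gram objective
    dim=100,                 # embedding dimensionality
    minCount=5,              # ignore rare tokens
)

# Semantically similar items should appear as nearest neighbors.
print(model.get_nearest_neighbors("tumour"))
```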
Network Construction
• Connections between data types.
13
Abstract Identification
• Queries in form (a, c).
• Find shortest path.
• Identify abstracts near path.
14
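A small networkx sketch of this query step; the node "type" attribute and the one-hop cutoff are assumptions made for illustration, not the system's exact neighborhood rule.

```python
import networkx as nx

def abstracts_near_path(G, a, c, hops=1):
    """Return the shortest a-c path and the abstract nodes within `hops` of it.

    Assumes G is the heterogeneous network with a 'type' attribute on each
    node, abstracts being marked type == 'abstract' (hypothetical schema).
    """
    path = nx.shortest_path(G, source=a, target=c, weight="weight")
    nearby = set()
    for node in path:
        reachable = nx.single_source_shortest_path_length(G, node, cutoff=hops)
        for neighbor in reachable:
            if G.nodes[neighbor].get("type") == "abstract":
                nearby.add(neighbor)
    return path, nearby
```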
Topic Modeling
• Run LDA topic modeling [2].
• Analyze word patterns across topics.
• Example: Venlafaxine interacts with HTR1A.
15
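A minimal gensim sketch of the topic-modeling step; the toy documents and topic count are illustrative, and the real system fits many more topics over the abstracts retrieved near the path.

```python
from gensim import corpora
from gensim.models import LdaModel

# Hypothetical tokenized abstracts gathered near the a-c path.
documents = [
    ["venlafaxine", "serotonin", "reuptake", "inhibitor"],
    ["htr1a", "receptor", "serotonin", "binding"],
    ["venlafaxine", "htr1a", "receptor", "interaction"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, terms in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in terms])
```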
Automating Analysis
• Studying topics manually is infeasible.
• We propose plausibility ranking criteria.
• Drug discovery is a ranking problem.
• Allows for large-scale numerical evaluation.
16
Validation Through Ranking
• Evaluate sorting criteria through ROC.
• “Synthetic” experiment, similar to drug
discovery.
• Identify recent discoveries.
• Create negative samples.
• Propose ranking criteria.
17
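The ROC evaluation itself is straightforward; a scikit-learn sketch using made-up scores and labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical plausibility scores from one ranking criterion:
# label 1 = recently published connection, label 0 = random negative pair.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.92, 0.81, 0.44, 0.67, 0.55, 0.31, 0.12, 0.25])

print("ROC AUC:", roc_auc_score(labels, scores))
fpr, tpr, _ = roc_curve(labels, scores)  # points for plotting the ROC curve
```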
Validation Data
• Set “cut-date.”
• Training Data:
• All papers published prior.
• Validation Data:
• Recent discoveries: SemMedDB pairs
first occurring after.
• Negative samples: Random UMLS
pairs never occurring.
18
Proposed Ranking: Embedding
• Measure distance between query terms a and c.
• Calculate weighted centroid for each topic.
• Measure distances between terms and topics.
19
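A numpy sketch of the two ingredients on this slide, the weighted topic centroid and the term-to-topic comparison; the variable names and the commented usage are hypothetical.

```python
import numpy as np

def weighted_centroid(word_vectors, word_weights):
    """Topic centroid: word vectors weighted by their within-topic probability."""
    vecs = np.asarray(word_vectors)              # shape: (n_words, dim)
    weights = np.asarray(word_weights)[:, None]  # shape: (n_words, 1)
    return (vecs * weights).sum(axis=0) / weights.sum()

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical usage: compare query terms a and c against each topic centroid.
# a_vec, c_vec = embedding["term_a"], embedding["term_c"]
# centroids = [weighted_centroid(vectors, weights) for vectors, weights in topics]
# per_topic = [(cosine_similarity(a_vec, t), cosine_similarity(c_vec, t)) for t in centroids]
```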
Proposed Ranking: Topics
• Create nearest-neighbors network from topic embeddings.
• Measure network statistics of shortest path a − c.
20
Combination Ranking
• Embedding measures
• CSim(a, c): Cosine Similarity of Embeddings
• L2(a, c): Euclidean Distance of Embeddings
• BestCentrCSim(a, c, T ): Maximum joint topic similarity.
• BestCentrL2(a, c, T ): Minimum joint topic distance.
• BestTopPerWord(a, c, T ): Max of minimum joint similarity.
• TopicCorr(a, c, T ): Correlation of Topic Distances
• Topic network measures
• TopWalkLength(a, c, T ): Length of shortest path a ∼ c
• TopWalkBtwn(a, c, T ): Avg. a ∼ c betweenness centrality
• TopWalkEigen(a, c, T ): Avg. a ∼ c eigenvalue centrality
• TopNetCCoef(a, c, T ): Clustering coefficient of N
• TopNetMod(a, c, T ): Modularity of N
PolyMultiple(a, c, T) = α1 · L2(a, c)^β1 + α2 · BestCentrL2(a, c, T)^β2
+ α3 · BestTopPerWord(a, c, T)^β3 + α4 · TopicCorr(a, c, T)^β4
+ α5 · TopWalkBtwn(a, c, T)^β5 + α6 · TopNetCCoef(a, c, T)^β6
21
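To make the combined criterion concrete, here is a sketch of its functional form; the metric values, α, and β shown are placeholders, since the real coefficients are fit against the validation set.

```python
def poly_multiple(metrics, alphas, betas):
    """Weighted polynomial combination of ranking metrics.

    Each metric value is raised to a fitted exponent and scaled by a fitted
    coefficient; the values used below are placeholders, not the learned ones.
    """
    return sum(alphas[name] * (metrics[name] ** betas[name]) for name in metrics)

# Illustrative inputs only.
metrics = {"L2": 0.80, "BestCentrL2": 0.62, "BestTopPerWord": 0.71,
           "TopicCorr": 0.33, "TopWalkBtwn": 0.21, "TopNetCCoef": 0.48}
alphas = {name: 1.0 for name in metrics}
betas = {name: 1.0 for name in metrics}
print(poly_multiple(metrics, alphas, betas))
```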
Validation Results
• Top ranking criteria:
• PolyMultiple
• L2
• BestCentrCSim
22
Testing in the Real World
• Apply ranking criteria for
laboratory experiments.
• HIV-associated Neurodegenerative
Disorder
• 30% of HIV patients over 60 develop
dementia.
• We searched 40k gene relationships.
• Identified DDX3X.
23
Are Abstracts Enough?
Are Abstracts Enough?
• Determine relationship between input and output.
• Rebuild Moliere using different corpora.
• Evaluate using ranking method.
• Explore effect of:
• Corpus size.
• Document length.
• Abstracts vs. full-texts.
25
Full-text Challenges
• Expensive
• Often requires licensing.
• Longer documents
• 15.6x more words-per-document.
• Parsing
• Figures, tables, references, PDFs.
26
Experiments
• Create separate instances of Moliere
• Each rebuilt end-to-end, from training the vector space through fitting the polynomial ranking criterion.
• Datasets
• Iterative halves of PubMed.
• PubMed Central full texts & abstracts.
• Evaluation
• Create shared set of positive and negative hypotheses.
• Cut year of 2015.
• Rank and calculate ROC curves.
27
Dataset Details
Corpus | Total Words | Unique Words | Corpus Size (documents) | Median Words per Document
PMC Abstracts 109,987,863 673,389 1,086,704 102
PMC Full-Text 1,860,907,606 6,548,236 1,086,704 1594
PubMed 1,852,059,044 2,410,130 24,284,910 71
1/2 PubMed 923,679,660 1,505,672 12,142,455 71
1/4 PubMed 460,384,928 920,734 6,071,227 71
1/8 PubMed 229,452,214 565,270 3,035,613 71
1/16 PubMed 114,385,607 349,174 1,517,806 71
28
Abstract vs Full Text Results
• Summarize performance with L2
and Polynomial metrics.
• Polynomial evaluates whole system.
• L2 evaluates embedding.
29
Beyond Quality
• Full text improves ranking quality by about 10%.
• Full text increases runtime from roughly 2 minutes to 1.5 hours.
• Topic modeling:
• Primary runtime increase.
• Less interpretable topics.
30
Effects
• Corpus size:
• A larger corpus helps slightly.
• Document length:
• Longer documents significantly improve word embeddings.
• Increase runtime.
• Abstracts vs. Full text
• Content in full texts not found in abstracts.
31
Hypothesis Generation Summary
• Moliere
• Use embedding to make network, find abstracts, perform topic modeling.
• Validation via Candidate Ranking
• Propose metrics to quantify embedding and topic-model quality.
• Evaluate recently published results, extend to real-world experiments.
• Are Abstracts Enough?
• Compare performance of system across different corpus qualities.
• Full texts improve performance at large runtime penalty.
32
Graph Embedding Background
Two Different Graph Similarities [10]
• Structural Similarity • Homophilic Similarity
34
Skip-Gram Model [17]
• Sample windows centered on target
word.
• Predict leading & trailing context
from target embedding.
• Assumption: “Similar words share
similar company.”
35
Deepwalk [19]
• Sample random walks from graph.
• Interpret walks as “sentences.”
• Apply Skip-Gram model.
36
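A compact sketch of the DeepWalk recipe, pairing uniform random walks with gensim's Skip-Gram Word2Vec; the walk counts, walk length, and karate-club example graph are illustrative choices, and gensim ≥ 4 is assumed (where the dimensionality argument is named vector_size).

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def sample_walks(G, walks_per_node=10, walk_length=40, seed=0):
    """Sample uniform random walks and return them as 'sentences' of node ids."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in G.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(node) for node in walk])
    return walks

G = nx.karate_club_graph()  # small example graph
sentences = sample_walks(G)
# Skip-Gram over the walks (sg=1), as DeepWalk prescribes.
model = Word2Vec(sentences, vector_size=64, window=5, sg=1, min_count=0, epochs=5)
print(model.wv.most_similar("0", topn=5))
```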
LINE [21]
• Sample first- & second-order neighbors.
• Fit observed samples to embeddings.
• Observed probability between u & v:
p(u, v) = w_uv / Σ_{(i,j)∈E} w_ij
• Predicted probability:
p̂(u, v, ε) = σ( ε(u) · ε(v) )
• Minimize KL-Divergence between p and p̂.
37
Node2Vec [10]
• Combines structural and homophilic similarity.
• Blends breadth- and depth-first walks.
• Adds return- and out-parameters.
38
Heterogeneous Bipartite Graph Embedding
Heterogeneous Bipartite Graphs
• Contains two node types.
• GB = (V, E)
• V = A ∪ B
• A and B are disjoint.
• Neighborhood Γ(i).
• If i ∈ A, then Γ(i) ⊆ B.
40
Proposed Methods
• Boolean Heterogeneous Bipartite
Embedding
• Weight all samples equally.
• Sample direct and first-order
relationships.
• Algebraic Heterogeneous Bipartite
Embedding
• Weight sampling with algebraic
distance.
• Sample direct, first-, and second-order
relationships.
Both Methods:
• Enable type-specific latent features.
• Make only same-type comparisons.
41
Algorithm Outline
• Observed similarities:
• SA(i, j), SB(i, j), SAB(i, j)
• Predicted similarities w.r.t. the embedding ε : V → R^k:
• ŜA(i, j, ε), ŜB(i, j, ε), ŜAB(i, j, ε)
• Optimize:
• Minimize the difference between each SX and its corresponding ŜX.
42
Boolean Heterogeneous Bipartite Embedding
Boolean Observations
• Observed cross-type relationships:
SAB(i, j) = 1 if i ∈ Γ(j), 0 otherwise
• Observed same-type relationships:
SA(i, j) = SB(i, j) = 1 if Γ(i) ∩ Γ(j) ≠ ∅, 0 otherwise
44
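These observations are easy to materialize for a small graph; a dense-matrix sketch (real graphs would rely on sparse sampling, and the toy edge list is illustrative).

```python
import numpy as np

def boolean_observations(edges, n_a, n_b):
    """Build the boolean similarity matrices from a bipartite edge list.

    `edges` holds (a, b) index pairs. Cross-type similarity records a direct
    edge; same-type similarity records at least one shared neighbor.
    """
    S_ab = np.zeros((n_a, n_b), dtype=np.int8)
    for a, b in edges:
        S_ab[a, b] = 1
    S_a = (S_ab @ S_ab.T > 0).astype(np.int8)   # A nodes sharing a B neighbor
    S_b = (S_ab.T @ S_ab > 0).astype(np.int8)   # B nodes sharing an A neighbor
    return S_a, S_b, S_ab

S_a, S_b, S_ab = boolean_observations([(0, 0), (1, 0), (1, 1)], n_a=2, n_b=2)
```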
Boolean Predictions
• Predicted same-type relationships:
ŜA(i, j, ε) = σ( ε(i) · ε(j) )
• Predicted cross-type relationships:
• Decompose into same-type relationships.
ŜAB(i, j, ε) = E_{k∈Γ(j)}[ ŜA(i, k, ε) ] · E_{k∈Γ(i)}[ ŜB(k, j, ε) ]
45
Boolean Optimization
• Loss for a particular similarity:
OX = Σ_{i,j∈V} ŜX(i, j, ε) · log( SX(i, j) / ŜX(i, j, ε) )
• Optimization:
min OA + OAB + OB
46
Algebraic Heterogeneous Bipartite Embedding
Algebraic Distance I
• Stationary iterative relaxation.
• Algebraic distance for hypergraphs [20], adapted for bipartite graphs:
a0(i) ∼ U[0, 1]
a_{t+1}(i) = λ · a_t(i) + (1 − λ) · ( Σ_{j∈Γ(i)} a_t(j) |Γ(j)|⁻¹ ) / ( Σ_{j∈Γ(i)} |Γ(j)|⁻¹ )
• Between each iteration, rescale to [0, 1].
48
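A numpy sketch of one algebraic-distance trial on a bipartite graph; the alternating update order, joint rescaling, and iteration count are simplifying assumptions, and isolated nodes are assumed away.

```python
import numpy as np

def algebraic_coordinates(adjacency, iterations=20, lam=0.5, seed=0):
    """One trial of the bipartite algebraic-distance relaxation (sketch).

    `adjacency` is an |A| x |B| 0/1 array with no isolated nodes. Both sides
    start from random coordinates in [0, 1]; each iteration moves a node
    toward a degree-weighted average of its neighbors' coordinates, then all
    coordinates are rescaled back to [0, 1].
    """
    rng = np.random.default_rng(seed)
    a = rng.random(adjacency.shape[0])   # coordinates for A nodes
    b = rng.random(adjacency.shape[1])   # coordinates for B nodes
    deg_a = adjacency.sum(axis=1)
    deg_b = adjacency.sum(axis=0)
    for _ in range(iterations):
        a = lam * a + (1 - lam) * (adjacency @ (b / deg_b)) / (adjacency @ (1 / deg_b))
        b = lam * b + (1 - lam) * (adjacency.T @ (a / deg_a)) / (adjacency.T @ (1 / deg_a))
        low, high = min(a.min(), b.min()), max(a.max(), b.max())
        a, b = (a - low) / (high - low), (b - low) / (high - low)
    return a, b
```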
Algebraic Distance II
• Run T = 20 algebraic distance trials until stabilization.
• Summarize distances across all trials:
d(i, j) = Σ_{t'=1}^{T} ( a_∞^{(t')}(i) − a_∞^{(t')}(j) )²
• Summarize similarity between nodes:
s(i, j) = √(T − d(i, j)) / √T
49
Algebraic Observations
• Same-type:
• Two A nodes are similar if any B node is highly similar to both.
SA(i, j) = SB(i, j) = max_{k∈Γ(i)∩Γ(j)} min( s(i, k), s(k, j) )
• Cross-type, sample both direct and second-order neighbors:
• Decompose cross-type comparisons into same-typed neighborhoods.
SAB(i, j) = max( max_{k∈Γ(j)} SA(i, k), max_{k∈Γ(i)} SB(k, j) )
50
Algebraic Predictions
• Predicted same-type relationships:
ŜA(i, j, ε) = max(0, ε(i) · ε(j))
• Predicted cross-type relationships:
ŜAB(i, j, ε) = E_{k∈Γ(j)}[ ŜA(i, k, ε) ] · E_{k∈Γ(i)}[ ŜB(k, j, ε) ]
51
Algebraic Optimization
• Loss for a particular similarity:
OX = E_{i,j∈V}[ ( SX(i, j) − ŜX(i, j, ε) )² ]
• Optimization:
min OA + OAB + OB
52
New Embedding Methods
• Boolean Heterogeneous Bipartite
Embedding (BHBE)
• Observes existence of relationships.
• Predicts using σ(ε(i) · ε(j)).
• Minimizes KL-Divergence.
• Algebraic Heterogeneous Bipartite
Embedding (AHBE)
• Observes relationships weighted
through algebraic distance.
• Predicts using max(0, ε(i) · ε(j)).
• Minimizes Mean-Squared Error.
53
Combination Embedding
• Learn joint representation to
combine AHBE & BHBE.
• Direct encoding predicts links through
joint embedding.
• Auto-regularized encoding also enforces that all latent features are captured.
54
Evaluation I
• Link prediction task.
• Select graph, delete % of edges.
• Embed remaining graph.
• Use embeddings to recover removed edges.
• Explore varying hold-out percentages.
• From 10% to 90% splits.
• Increments of 10%.
55
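The hold-out protocol reduces to a shuffled split of the edge list; a sketch (negative sampling of non-edges, which the evaluation also needs, is omitted, and the example edges are illustrative).

```python
import random

def edge_holdout(edges, test_fraction=0.5, seed=0):
    """Split an edge list into training edges (kept for embedding) and held-out test edges."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    cut = int(len(edges) * (1 - test_fraction))
    return edges[:cut], edges[cut:]

# Hold out 50% of the edges; the remaining graph is embedded,
# and the held-out edges form the positive test set.
train_edges, test_edges = edge_holdout([(0, 0), (0, 1), (1, 1), (2, 0), (2, 1)], 0.5)
```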
Evaluation II
• Models:
• A- and B-Personalized.
• Train an SVM for each node in the training set.
• Each SVM detects an embedding region
containing neighbors.
• Unified.
• Train a shallow neural network.
• Predict links given one embedding of each
type.
A-Personalized Model
56
Benchmark
Graph        |A| / |B|         Γ(i ∈ A) md / max   Γ(j ∈ B) md / max   SR      LCP
Amazon       16,716 / 5,000    3 / 49              8 / 328             75.8    1.6
DBLP         93,432 / 5,000    1 / 12              8 / 7,556           174.7   81.7
Friendster   220,015 / 5,000   1 / 26              133 / 1,612         80.3    58.3
Livejournal  84,438 / 5,000    1 / 20              16 / 1,441          100.9   27.0
MadGrades    11,951 / 6,462    3 / 39              4 / 393             57.3    99.7
YouTube      39,841 / 5,000    1 / 54              4 / 2,217           113.3   80.6
Table: Graph summary. We report the median (md) and max degree for each node set, as well as the Spectral Radius (SR) and the percentage of the largest connected component (LCP).
57
MadGrades: UW. Instructor-Course Network
(|A| = 11, 951, |B| = 6, 462)
[Figure: link-prediction accuracy vs. training-test split for the A-Personalized, B-Personalized, and Unified models, comparing BHBE, AHBE, Direct Comb., Auto-Reg. Comb., Deepwalk, LINE, Node2Vec, BiNE, and Metapath2Vec++.]
58
Results
• Algebraic HBE
• Best detects trends among high-degree nodes.
• Stability issues with larger graphs.
• Boolean HBE
• Outperforms typical state-of-the-art methods, competitive with other
heterogeneous methods.
• Robust across trials.
• Combinations
• BHBE and AHBE find different latent features.
• Bootstrap performance above state-of-the-art.
59
Partition Hypergraphs with Embeddings
Hypergraphs
• Generalization: hyperedges contain any
number of nodes.
• H = (V, E)
• V = {v1, v2, . . . , vn}
• E = {e1, e2, . . . , em}
• ei ⊆ V
61
Hypergraph Star Expansion
• Map hyperedges to nodes.
62
Hypergraph Partitioning: Problem Description I
• Problem: Split V into k disjoint sets...
• of approximately equal size.
• minimizing an objective of cut hyperedges.
63
Hypergraph Partitioning: Problem Description II
• Partition:
• V = V1 ∪ V2 ∪ · · · ∪ Vk
• Vi ∩ Vj = ∅ for all i ≠ j
• Ecut = {e ∈ E : ∄ Vi such that e ⊆ Vi}
• Metrics:
λcut := |Ecut|
λk−1 := Σ_{e∈Ecut} ( |{Vi : Vi ∩ e ≠ ∅}| − 1 )
64
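Both objectives are simple to compute from a block assignment; a short sketch with a toy hypergraph.

```python
def cut_metrics(hyperedges, block):
    """Compute the cut and (k-1) objectives for a hypergraph partition.

    `hyperedges` is an iterable of node collections; `block` maps each node
    to its part id.
    """
    lambda_cut = 0
    lambda_k_minus_1 = 0
    for e in hyperedges:
        spanned = {block[v] for v in e}
        if len(spanned) > 1:                 # hyperedge is cut
            lambda_cut += 1
            lambda_k_minus_1 += len(spanned) - 1
    return lambda_cut, lambda_k_minus_1

# One hyperedge spans blocks {0, 1}, the other stays inside block 0.
print(cut_metrics([{0, 1, 2}, {0, 1}], block={0: 0, 1: 0, 2: 1}))
```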
Hypergraph Partitioning: Problem Description III
• Hypergraph partitioning is NP-Hard...
• to solve [15].
• to approximate [5].
65
Multilevel Heuristic
• Steps:
• Coarsen
• Initial Solution
• Uncoarsen: Interpolate &
local search.
• Paradigms:
• (log n)-Level: Each level, pair
almost all nodes.
• n-Level: Each level, pair two
nodes.
66
Coarsening
• Desired coarsening properties [12]:
• Reduce number of nodes & hyperedges.
• Remain structurally similar.
Contribution
Use hypergraph embeddings to better
coarsen nodes.
67
Typical Coarsening
• Assign all nodes & hyperedges uniform weights: wi = 1.
• Measure similarities (e.g., hyperedge inner product):
SE(u, v) = Σ_{e∈E : u,v∈e} we / (|e| − 1)
• Match nodes into (u, v).
• Contract u and v into x.
• wx = wu + wv
• x participates in all hyperedges of u and v.
68
Embedding-based Coarsening
• Quantify similarity within the embedding:
Sε(u, v) = ε(u) · ε(v)
• Prioritize nodes with highly similar neighbors.
• Measure node similarities:
S(u, v) = SE(u, v) · Sε(u, v) / (wu · wv)
69
Embedding-based Coarsening Algorithm
• Sort V by each node’s highest neighbor similarity.
SortingCriteria(u) = max_{v∈Γ(u)} Sε(u, v)
• In sorted order, pair nodes:
Partner(u) = argmax_{v∈Γ(u)} SE(u, v) · Sε(u, v) / (wu · wv)
• Merge (u, v) to coarse node x.
• Assign ε(x) to be the centroid of its contained nodes’ embeddings.
70
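A greedy-matching sketch of the procedure above; the function and variable names (neighbors, hyperedge_sim, weight) are hypothetical stand-ins for the partitioner's internal data structures.

```python
import numpy as np

def match_pairs(nodes, neighbors, emb, hyperedge_sim, weight):
    """Greedy matching sketch for embedding-based coarsening.

    `neighbors[u]` lists candidate partners, `emb[u]` is u's embedding,
    `hyperedge_sim(u, v)` is the classical inner-product similarity, and
    `weight[u]` is the node weight. All names are illustrative.
    """
    def emb_sim(u, v):
        return float(np.dot(emb[u], emb[v]))

    # Visit nodes in order of their strongest embedding similarity to any neighbor.
    order = sorted(nodes,
                   key=lambda u: max((emb_sim(u, v) for v in neighbors[u]),
                                     default=float("-inf")),
                   reverse=True)
    matched, pairs = set(), []
    for u in order:
        if u in matched:
            continue
        candidates = [v for v in neighbors[u] if v not in matched and v != u]
        if not candidates:
            continue
        partner = max(candidates,
                      key=lambda v: hyperedge_sim(u, v) * emb_sim(u, v) / (weight[u] * weight[v]))
        matched.update((u, partner))
        pairs.append((u, partner))
    return pairs
```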
Evaluation Method I
• Proposed Implementations:
• Zoltan: (log n)-Level, fast and highly parallel.
• KaHyPar: n-Level, high-quality partitioning.
• KaHyPar Flow: n-Level, best known algorithm.
• Considered Embeddings:
• MetaPath2Vec++
• Node2Vec
• AHBE, BHBE
• Combinations (AHBE+BHBE), (All)
71
Evaluation Method II
• Baseline Algorithms:
• hMetis [14]
• PaToH [6]
• Zoltan [8]
• KaHyPar (w/ community-based coarsening) [13]
• KaHyPar Flow (w/ flow-based refinement) [12]
72
Evaluation Method III
• Benchmark Graphs
• 86 graphs from SuiteSparse Matrix Collection.
• 10 graphs designed to interfere with typical coarsening.
• k values: 2, 4, 8, . . . , 128.
• 20 trials per combination.
• Imbalance tolerance of 3%.
73
Partitioning Results I
74
Partitioning Results II
75
Partitioning Results III
76
Partitioning Summary
• Embeddings improve multilevel hypergraph partitioning.
• Latent features indicate relevant structural similarities.
• Embedding similarity prioritizes and matches nodes.
• Most important for small partition counts (k).
• Embeddings capture key clusters.
• Centroids during coarsening smooth some finer details.
• Latent features are most important for particular graphs.
• Social networks.
• Synthetic graphs.
77
Proposed Work
Future Directions
• Bias in Scientific Embeddings
• Detect confirmation bias.
• Normalize effect of “group think.”
• Hybrid Knowledge Graph Mining
• Train text and graph embeddings.
• Formulate hypothesis generation for
deep learning.
79
Bias in Word Embeddings
• Recent work finds gender stereotypes in word embeddings [4].
• Biases exist in science.
• Confirmation bias [18]
• Over-interpreting noise [7].
• P-hacking [11].
• Example: p53
• “[A]valanche of research” [22]
• What connections to focus on? Which are noise?
80
Hybrid Knowledge Graph Mining for Hypothesis Generation
• Knowledge Graphs
• Triplets, similar to SemMedDB Predicates
• Typed relationships
• Specialized Techniques
• Edge2Vec [9].
• Use text to augment graphs [16].
• SciBERT [1].
81
Attention-Based Hypothesis Generation
• Attention-mechanism creates interpretable results.
• Assigns weights to relevant input.
• Unified deep-learning model.
• Input: Embeddings for text and network features.
• Potential Outputs:
• Connection strength.
• Connection type.
• Automatic summary.
82
Timeline
Date Accomplishment
April 2019 Dissertation proposal.
August 2019 Return from summer internship. Begin exploring bias in scientific embeddings.
November 2019 Complete analysis of bias in scientific embeddings. Begin exploring deep learning on knowledge graphs for hypothesis generation.
April 2020 Complete analysis of deep learning and knowledge graphs for hypothesis generation.
June 2020 Dissertation defense.
Table: Timeline of proposed work.
83
Acknowledgments
• NSF MRI #1725573
• NSF NRT-DESE #1633608
84
Bibliography I
[1] Beltagy, I., Cohan, A., and Lo, K.
SciBERT: Pretrained contextualized embeddings for scientific text.
arXiv preprint arXiv:1903.10676 (2019).
[2] Blei, D. M., Ng, A. Y., and Jordan, M. I.
Latent Dirichlet allocation.
Journal of Machine Learning Research 3 (2003), 993–1022.
[3] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.
Enriching word vectors with subword information.
Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4] Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T.
Man is to computer programmer as woman is to homemaker? debiasing word embeddings.
In Advances in neural information processing systems (2016), pp. 4349–4357.
85
Bibliography II
[5] Bui, T. N., and Jones, C.
Finding good approximate vertex and edge partitions is np-hard.
Information Processing Letters 42, 3 (1992), 153–159.
[6] Çatalyürek, Ü., and Aykanat, C.
Patoh (partitioning tool for hypergraphs).
Encyclopedia of Parallel Computing (2011), 1479–1487.
[7] de Groot, A. D.
The meaning of “significance” for different types of research [translated and annotated by eric-jan
wagenmakers, denny borsboom, josine verhagen, rogier kievit, marjan bakker, angelique cramer,
dora matzke, don mellenbergh, and han lj van der maas].
Acta psychologica 148 (2014), 188–194.
86
Bibliography III
[8] Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H., and Catalyurek,
U. V.
Parallel hypergraph partitioning for scientific computing.
In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International (2006),
IEEE, pp. 10–pp.
[9] Gao, Z., Fu, G., Ouyang, C., Tsutsui, S., Liu, X., and Ding, Y.
edge2vec: Learning node representation using edge semantics.
arXiv preprint arXiv:1809.02269 (2018).
[10] Grover, A., and Leskovec, J.
node2vec: Scalable feature learning for networks.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and
data mining (2016), ACM, pp. 855–864.
87
Bibliography IV
[11] Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., and Jennions, M. D.
The extent and consequences of p-hacking in science.
PLoS biology 13, 3 (2015), e1002106.
[12] Heuer, T., Sanders, P., and Schlag, S.
Network Flow-Based Refinement for Multilevel Hypergraph Partitioning.
In 17th International Symposium on Experimental Algorithms (SEA 2018) (2018), pp. 1:1–1:19.
[13] Heuer, T., and Schlag, S.
Improving coarsening schemes for hypergraph partitioning by exploiting community structure.
In 16th International Symposium on Experimental Algorithms, (SEA 2017) (2017),
pp. 21:1–21:19.
[14] Karypis, G.
hmetis 1.5: A hypergraph partitioning package.
http://www.cs.umn.edu/~metis (1998).
88
Bibliography V
[15] Lengauer, T.
Combinatorial algorithms for integrated circuit layout.
Springer Science & Business Media, 2012.
[16] Li, C., Lai, Y.-Y., Neville, J., and Goldwasser, D.
Joint embedding models for textual and social analysis.
In The Workshop on Deep Structured Prediction (2017).
[17] Mikolov, T., Chen, K., Corrado, G., and Dean, J.
Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781 (2013).
89
Bibliography VI
[18] Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D.,
Du Sert, N. P., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., and Ioannidis,
J. P.
A manifesto for reproducible science.
Nature human behaviour 1, 1 (2017), 0021.
[19] Perozzi, B., Al-Rfou, R., and Skiena, S.
Deepwalk: Online learning of social representations.
In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining (2014), ACM, pp. 701–710.
[20] Shaydulin, R., Chen, J., and Safro, I.
Relaxation-based coarsening for multilevel hypergraph partitioning.
Multiscale Modeling & Simulation 17, 1 (2019), 482–506.
90
Bibliography VII
[21] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q.
Line: Large-scale information network embedding.
In Proceedings of the 24th International Conference on World Wide Web (2015), International
World Wide Web Conferences Steering Committee, pp. 1067–1077.
[22] Vogelstein, B., Lane, D., and Levine, A. J.
Surfing the p53 network.
Nature 408, 6810 (2000), 307.
91
Outline
Hypothesis Generation
Moliere: Automatic Biomedical Hypothesis Generation
Validation via Candidate Ranking
Are Abstracts Enough?
Graph Embedding
Heterogeneous Bipartite Graph Embedding
Partition Hypergraphs with Embeddings
Proposed Work
92
More Related Content

What's hot

Semantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and SummarizationSemantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and Summarization
Gong Cheng
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
Polytechnic University of Bari
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Johann Petrak
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Polytechnic University of Bari
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
Kai Li
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationSonia Haiduc
 
EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...
EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...
EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...
Francesco Osborne
 
Extracting keywords from texts - Sanda Martincic Ipsic
Extracting keywords from texts - Sanda Martincic IpsicExtracting keywords from texts - Sanda Martincic Ipsic
Extracting keywords from texts - Sanda Martincic Ipsic
Institute of Contemporary Sciences
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Saeedeh Shekarpour
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
Frank van Harmelen
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentMaribel Acosta Deibe
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval GESIS
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Isabelle Augenstein
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
Anubhav Jain
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Marina Santini
 
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Marina Santini
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Anubhav Jain
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
Arnab Bhadury
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Anubhav Jain
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
Tao Xie
 

What's hot (20)

Semantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and SummarizationSemantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and Summarization
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept Location
 
EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...
EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...
EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts ...
 
Extracting keywords from texts - Sanda Martincic Ipsic
Extracting keywords from texts - Sanda Martincic IpsicExtracting keywords from texts - Sanda Martincic Ipsic
Extracting keywords from texts - Sanda Martincic Ipsic
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 

Similar to Sybrandt Thesis Proposal Presentation

Big data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big DataBig data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big Data
Christos Hadjinikolis
 
Slides ecir2016
Slides ecir2016Slides ecir2016
Slides ecir2016
Fattane Zarrinkalam
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
مركز البحوث الأقسام العلمية
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Databricks
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
Sujit Pal
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
Ian Foster
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
Neo4j
 
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
William Gunn
 
لتحليل الدراسات السابقة Nails محاضرة برنامج
  لتحليل الدراسات السابقة Nails محاضرة برنامج  لتحليل الدراسات السابقة Nails محاضرة برنامج
لتحليل الدراسات السابقة Nails محاضرة برنامج
مركز البحوث الأقسام العلمية
 
Search quality in practice
Search quality in practiceSearch quality in practice
Search quality in practice
Alexander Sibiryakov
 
naveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agilenaveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agileNaveed Kamran
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker, Inc.
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
Carlos Badenes-Olmedo
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
What to read next? Challenges and Preliminary Results in Selecting Represen...
What to read next? Challenges and  Preliminary Results in Selecting  Represen...What to read next? Challenges and  Preliminary Results in Selecting  Represen...
What to read next? Challenges and Preliminary Results in Selecting Represen...
MOVING Project
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
Neo4j
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
Stefan Schlobach
 
An Empirical Study on the Adequacy of Testing in Open Source Projects
An Empirical Study on the Adequacy of Testing in Open Source ProjectsAn Empirical Study on the Adequacy of Testing in Open Source Projects
An Empirical Study on the Adequacy of Testing in Open Source Projects
Pavneet Singh Kochhar
 
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph MiningDetection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
Mikel Emaldi Manrique
 

Similar to Sybrandt Thesis Proposal Presentation (20)

Big data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big DataBig data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big Data
 
Slides ecir2016
Slides ecir2016Slides ecir2016
Slides ecir2016
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
 
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
 
لتحليل الدراسات السابقة Nails محاضرة برنامج
  لتحليل الدراسات السابقة Nails محاضرة برنامج  لتحليل الدراسات السابقة Nails محاضرة برنامج
لتحليل الدراسات السابقة Nails محاضرة برنامج
 
Search quality in practice
Search quality in practiceSearch quality in practice
Search quality in practice
 
naveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agilenaveed-kamran-software-architecture-agile
naveed-kamran-software-architecture-agile
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce Hoff
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
What to read next? Challenges and Preliminary Results in Selecting Represen...
What to read next? Challenges and  Preliminary Results in Selecting  Represen...What to read next? Challenges and  Preliminary Results in Selecting  Represen...
What to read next? Challenges and Preliminary Results in Selecting Represen...
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
Ngsp
NgspNgsp
Ngsp
 
An Empirical Study on the Adequacy of Testing in Open Source Projects
An Empirical Study on the Adequacy of Testing in Open Source ProjectsAn Empirical Study on the Adequacy of Testing in Open Source Projects
An Empirical Study on the Adequacy of Testing in Open Source Projects
 
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph MiningDetection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
 

Recently uploaded

insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
anitaento25
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
AlguinaldoKong
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
Health Advances
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
kumarmathi863
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
aishnasrivastava
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SELF-EXPLANATORY
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
muralinath2
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 

Recently uploaded (20)

insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 

Sybrandt Thesis Proposal Presentation

  • 1. Exploiting Latent Features of Text and Graphs Thesis Proposal Justin Sybrandt Clemson University April 30, 2019 1
  • 2. Peer-Reviewed Work: • Sybrandt, J., Shtutman, M., & Safro, I. (2017, August). Moliere: Automatic biomedical hypothesis generation system. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1633-1642). ACM. • Acceptance rate 8.8%. • 1 Non-author citation. • Sybrandt, J., Shtutman, M., & Safro, I. (2018, December). Large-scale validation of hypothesis generation systems via candidate ranking. In 2018 IEEE International Conference on Big Data (Big Data) (pp. 1494-1503). IEEE. • Acceptance rate 18%. • Sybrandt, J., Carrabba, A., Herzog, A., & Safro, I. (2018, December). Are abstracts enough for hypothesis generation? In 2018 IEEE International Conference on Big Data (Big Data) (pp. 1504-1513). IEEE. • Acceptance rate 18%. 2
  • 3. Pending Works: • Sybrandt, J., & Safro, I. Heterogeneous Bipartite Graph Embeddings. In-submission and not available online. • Sybrandt, J., & Shaydulin, R., & Safro, I. Partition Hypergraphs with Embeddings. Not available online. • Aksenova M., Sybrandt J., Cui B., Sikirzhytski V., Ji H., Odhiambo D., Lucius M., Turner J. R., Broude E., Pena E., Lizzaraga S., Zhu J., Safro I., Wyatt M. D., Shtutman M. (2019). Inhibition of the DDX3 prevents HIV-1 Tat and cocaine-induced neurotoxicity by targeting microglia activation. https://www.biorxiv.org/content/10.1101/591438v1 • Locke, W., Sybrandt, J., Safro, I., & Atamturktur, S. (2018, November 12). Using Drive-by Health Monitoring to Detect Bridge Damage Considering Environmental and Operational Effects. https://doi.org/10.31224/osf.io/ntfdp 3
  • 4. Other Works: Peer-Reviewed Extended Abstracts: • Aksenova, M., Sybrandt, J., Cui, B., Lucius, M., Ji, H., Wyatt, M., Safro, I., Zhu, J., & Shtutman, M. (2019). Inhibition of the DEAD Box RNA Helicase 3 prevents HIV-1 Tat- and cocaine-induced neurotoxicity by targeting microglai activation. In 2019 Meeting of the NIDA Genetic Consortium. Extended Abstract & Poster • Sybrandt, J., & Hick, J. (2015). Rapid replication of multi-petabyte file systems. Work in progress in the 2015 Parallel Data Storage Workshop. Poster in 2015 Super Computing. Online Work: • Shaydulin, R., & Sybrandt, J. (2017). To Agile, or not to Agile: A Comparison of Software Development Methodologies. arXiv preprint arXiv:1704.07469. • 5 Non-author citations. 4
  • 5. Overview • Latent Variables • Unobservable qualities of a dataset. • Text Embeddings • Transferable textual latent features. • Correspond to semantic properties of words. • Graph Embeddings • Underlying network features. • Correspond to roles, communities, and unobserved node-features. 5
  • 6. Outline Hypothesis Generation Moliere: Automatic Biomedical Hypothesis Generation Validation via Candidate Ranking Are Abstracts Enough? Graph Embedding Heterogeneous Bipartite Graph Embedding Partition Hypergraphs with Embeddings Proposed Work 6
  • 8. Drug Discovery Problem Overview • Medical research is expensive and risky. • Text mining can identify fruitful research directions before expensive experiments. Wall Street Journal Pfizer Ends Hunt for Drugs to Treat Alzheimer’s and Parkinson’s 8
  • 9. Hypothesis Generation Overview • PubMed contains over 27-million abstracts. • 2-4k added daily. • Hypothesis generation finds implicitly published relationships. 9
  • 10. • Automatic Biomedical Hypothesis Generation System • Basic Pipeline • Data collection • Network construction • Abstract identification • Topic modeling 10
  • 11. Data Collection • Titles & Abstracts Tumours evade immune control by creating hostile microenvironments that perturb T cell metabolism and effector function. • Phrases (n-grams) • “T cell metabolism” • Predicates • tumours → evade → immune control • Unified Medical Language System Neoplasms tumor, tumour, oncological abnormality 11
  • 12. Medical Text Embedding • Use fasttext to capture latent features [3]. • Semantically similar items are nearest neighbors. 12
  • 13. Network Construction • Connections between data types. 13
  • 14. Abstract Identification • Queries in form (a, c). • Find shortest path. • Identify abstracts near path. 14
  • 15. Topic Modeling • Run LDA topic modeling [2]. • Analyze word patterns across topics. • Example: Venlafaxine interacts with HTR1A. 15
Automating Analysis
• Studying topics manually is infeasible.
• We propose plausibility ranking criteria.
• Drug discovery is a ranking problem.
• Allows for large-scale numerical evaluation.
16
Validation Through Ranking
• Evaluate sorting criteria through ROC curves.
• "Synthetic" experiment, similar to drug discovery.
• Identify recent discoveries.
• Create negative samples.
• Propose ranking criteria.
17
Validation Data
• Set a "cut date."
• Training data: all papers published prior to the cut date.
• Validation data:
• Recent discoveries: SemMedDB pairs first occurring after the cut date.
• Negative samples: random UMLS pairs that never co-occur.
18
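A sketch of how such a cut-date split could be assembled. Here `predications` and `umls_terms` are hypothetical stand-ins for SemMedDB records and the UMLS vocabulary; this is illustrative code, not the published pipeline.

```python
# Sketch of the validation split.  `predications` is assumed to be a list of
# (term_a, term_c, year_first_published) tuples and `umls_terms` a list of
# UMLS terms; both are hypothetical stand-ins.
import random

CUT_DATE = 2015  # papers up to this year form the training corpus

def build_validation_pairs(predications, umls_terms, n_negatives):
    ever_seen = {frozenset((a, c)) for a, c, _ in predications}

    # Positives: pairs that appear after the cut date but never before it.
    after = {frozenset((a, c)) for a, c, year in predications if year > CUT_DATE}
    before = {frozenset((a, c)) for a, c, year in predications if year <= CUT_DATE}
    positives = after - before

    # Negatives: random UMLS pairs that never co-occur at any date.
    negatives = set()
    while len(negatives) < n_negatives:
        pair = frozenset(random.sample(umls_terms, 2))
        if pair not in ever_seen:
            negatives.add(pair)

    return positives, negatives
```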
Proposed Ranking: Embedding
• Measure distance between query terms a and c.
• Calculate a weighted centroid for each topic.
• Measure distances between terms and topics.
19
Proposed Ranking: Topics
• Create a nearest-neighbors network from topic embeddings.
• Measure network statistics of the shortest path a − c.
20
Combination Ranking
• Embedding measures
• CSim(a, c): cosine similarity of embeddings.
• L2(a, c): Euclidean distance of embeddings.
• BestCentrCSim(a, c, T): maximum joint topic similarity.
• BestCentrL2(a, c, T): minimum joint topic distance.
• BestTopPerWord(a, c, T): max of minimum joint similarity.
• TopicCorr(a, c, T): correlation of topic distances.
• Topic network measures
• TopWalkLength(a, c, T): length of shortest path a ∼ c.
• TopWalkBtwn(a, c, T): avg. betweenness centrality along a ∼ c.
• TopWalkEigen(a, c, T): avg. eigenvector centrality along a ∼ c.
• TopNetCCoef(a, c, T): clustering coefficient of the topic network N.
• TopNetMod(a, c, T): modularity of the topic network N.
• Polynomial combination:
$$\text{PolyMultiple}(a, c, T) = \alpha_1 \text{L2}(a,c)^{\beta_1} + \alpha_2 \text{BestCentrL2}(a,c,T)^{\beta_2} + \alpha_3 \text{BestTopPerWord}(a,c,T)^{\beta_3} + \alpha_4 \text{TopicCorr}(a,c,T)^{\beta_4} + \alpha_5 \text{TopWalkBtwn}(a,c,T)^{\beta_5} + \alpha_6 \text{TopNetCCoef}(a,c,T)^{\beta_6}$$
21
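To make the simplest criteria concrete, the numpy sketch below computes CSim, L2, one reasonable reading of BestCentrL2 ("joint distance" taken as the sum of the two term-to-topic distances), and the general shape of the polynomial combination. The alpha and beta coefficients are placeholders, not the fitted values from the validation study.

```python
# Numpy sketch of a few embedding-based ranking criteria and the general
# form of the polynomial combination; coefficients are placeholders.
import numpy as np

def csim(a_vec, c_vec):
    """CSim(a, c): cosine similarity of the query-term embeddings."""
    return float(a_vec @ c_vec / (np.linalg.norm(a_vec) * np.linalg.norm(c_vec)))

def l2(a_vec, c_vec):
    """L2(a, c): Euclidean distance of the query-term embeddings."""
    return float(np.linalg.norm(a_vec - c_vec))

def best_centr_l2(a_vec, c_vec, topic_centroids):
    """BestCentrL2(a, c, T): minimum joint distance to any topic centroid
    (joint distance read here as the sum of both term-to-centroid distances)."""
    return min(l2(a_vec, t) + l2(c_vec, t) for t in topic_centroids)

def poly_combination(features, alphas, betas):
    """Weighted sum of powered features, mirroring PolyMultiple's form."""
    return sum(a * (x ** b) for a, x, b in zip(alphas, features, betas))
```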
Validation Results
• Top ranking criteria:
• PolyMultiple
• L2
• BestCentrCSim
22
Testing in the Real World
• Apply ranking criteria to select laboratory experiments.
• HIV-associated Neurodegenerative Disorder
• 30% of HIV patients over 60 develop dementia.
• We searched 40k gene relationships.
• Identified DDX3X.
23
Are Abstracts Enough?
• Determine the relationship between input and output quality.
• Rebuild Moliere using different corpora.
• Evaluate using the ranking method.
• Explore the effect of:
• Corpus size.
• Document length.
• Abstracts vs. full texts.
25
Full-Text Challenges
• Expensive
• Often requires licensing.
• Longer documents
• 15.6x more words per document.
• Parsing
• Figures, tables, references, PDFs.
26
Experiments
• Create separate instances of Moliere.
• From training the vector space through fitting the polynomial.
• Datasets
• Iterative halves of PubMed.
• PubMed Central full texts & abstracts.
• Evaluation
• Create a shared set of positive and negative hypotheses.
• Cut year of 2015.
• Rank and calculate ROC curves.
27
Dataset Details

Corpus          Total Words      Unique Words   Documents    Median Words per Document
PMC Abstracts     109,987,863         673,389    1,086,704      102
PMC Full-Text   1,860,907,606       6,548,236    1,086,704    1,594
PubMed          1,852,059,044       2,410,130   24,284,910       71
1/2 PubMed        923,679,660       1,505,672   12,142,455       71
1/4 PubMed        460,384,928         920,734    6,071,227       71
1/8 PubMed        229,452,214         565,270    3,035,613       71
1/16 PubMed       114,385,607         349,174    1,517,806       71
28
Abstract vs. Full-Text Results
• Summarize performance with the L2 and Polynomial metrics.
• Polynomial evaluates the whole system.
• L2 evaluates the embedding.
29
Beyond Quality
• Full text improves performance quality by about 10%.
• Full text increases runtime from 2 minutes to 1.5 hours.
• Topic modeling:
• Primary source of the runtime increase.
• Less interpretable topics.
30
Effects
• Corpus size:
• A larger corpus helps slightly.
• Document length:
• Longer documents significantly improve word embeddings.
• Longer documents increase runtime.
• Abstracts vs. full text:
• Full texts contain content not found in abstracts.
31
Hypothesis Generation Summary
• Moliere
• Use embeddings to build the network, find abstracts, and perform topic modeling.
• Validation via Candidate Ranking
• Propose metrics to quantify embedding and topic-model quality.
• Evaluate recently published results; extend to real-world experiments.
• Are Abstracts Enough?
• Compare system performance across corpora of different qualities.
• Full texts improve performance at a large runtime penalty.
32
Two Different Graph Similarities [10]
• Structural Similarity
• Homophilic Similarity
34
Skip-Gram Model [17]
• Sample windows centered on a target word.
• Predict leading & trailing context from the target embedding.
• Assumption: "Similar words share similar company."
35
Deepwalk [19]
• Sample random walks from the graph.
• Interpret walks as "sentences."
• Apply the Skip-Gram model.
36
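A minimal Deepwalk-style sketch with networkx and gensim: sample uniform random walks, treat them as sentences, and fit a skip-gram model. The walk counts, walk length, and toy graph are illustrative defaults, not the original Deepwalk settings.

```python
# Deepwalk-style sketch: random walks as sentences, then skip-gram.
import random
import networkx as nx
from gensim.models import Word2Vec

def sample_walks(graph, walks_per_node=10, walk_length=40):
    """Uniform random walks starting from every node."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(walks_per_node):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])  # stringify for gensim
    return walks

graph = nx.karate_club_graph()   # toy example graph
walks = sample_walks(graph)
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=0)
embedding_of_node_0 = model.wv["0"]
```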
LINE [21]
• Sample first- & second-order neighbors.
• Fit observed samples to embeddings.
• Observed probability between u & v: $p(u, v) = \frac{w_{uv}}{\sum_{(i,j) \in E} w_{ij}}$
• Predicted probability: $\hat{p}(u, v, \epsilon) = \sigma\left(\epsilon(u)^\top \epsilon(v)\right)$
• Minimize the KL-divergence between $p$ and $\hat{p}$.
37
Node2Vec [10]
• Combines structural and homophilic similarity.
• Blends breadth- and depth-first walks.
• Adds return- and out-parameters.
38
Heterogeneous Bipartite Graphs
• Contain two node types.
• $G_B = (V, E)$
• $V = A \cup B$
• A and B are disjoint.
• Neighborhood $\Gamma(i)$.
• If $i \in A$, then $\Gamma(i) \subseteq B$.
40
Proposed Methods
• Boolean Heterogeneous Bipartite Embedding
• Weights all samples equally.
• Samples direct and first-order relationships.
• Algebraic Heterogeneous Bipartite Embedding
• Weights sampling with algebraic distance.
• Samples direct, first-, and second-order relationships.
• Both methods:
• Enable type-specific latent features.
• Make only same-type comparisons.
41
Algorithm Outline
• Observed similarities:
• $S_A(i, j)$, $S_B(i, j)$, $S_{AB}(i, j)$
• Predicted similarities w.r.t. the embedding $\epsilon : V \to \mathbb{R}^k$:
• $\hat{S}_A(i, j, \epsilon)$, $\hat{S}_B(i, j, \epsilon)$, $\hat{S}_{AB}(i, j, \epsilon)$
• Optimize: minimize the difference between $S$ and $\hat{S}$.
42
Boolean Observations
• Observed cross-type relationships:
$$S_{AB}(i, j) = \begin{cases} 1 & i \in \Gamma(j) \\ 0 & \text{otherwise} \end{cases}$$
• Observed same-type relationships:
$$S_A(i, j) = S_B(i, j) = \begin{cases} 1 & \Gamma(i) \cap \Gamma(j) \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$
44
Boolean Predictions
• Predicted same-type relationships:
$$\hat{S}_A(i, j, \epsilon) = \sigma\left(\epsilon(i)^\top \epsilon(j)\right)$$
• Predicted cross-type relationships:
• Decompose into same-type relationships.
$$\hat{S}_{AB}(i, j, \epsilon) = \left[\mathop{\mathbb{E}}_{k \in \Gamma(j)} \hat{S}_A(i, k, \epsilon)\right] \left[\mathop{\mathbb{E}}_{k \in \Gamma(i)} \hat{S}_B(k, j, \epsilon)\right]$$
45
Boolean Optimization
• Loss for a particular similarity:
$$O_X = \sum_{i, j \in V} \hat{S}_X(i, j, \epsilon) \log\left(\frac{S_X(i, j)}{\hat{S}_X(i, j, \epsilon)}\right)$$
• Optimization: $\min_\epsilon \; O_A + O_{AB} + O_B$
46
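An illustrative numpy sketch of the Boolean observations, predictions, and the per-pair loss term as reconstructed above. It is not the released implementation; the neighbor sets and embedding vectors are assumed toy inputs.

```python
# Illustrative numpy sketch of the Boolean observations and predictions
# for one similarity type; not the released implementation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def observed_same_type(i, j, neighbors):
    """S_A(i, j) = 1 iff i and j share at least one cross-type neighbor."""
    return 1.0 if neighbors[i] & neighbors[j] else 0.0

def observed_cross_type(i, j, neighbors):
    """S_AB(i, j) = 1 iff i participates in j's neighborhood."""
    return 1.0 if i in neighbors[j] else 0.0

def predicted_same_type(emb_i, emb_j):
    """sigma(eps(i) . eps(j))"""
    return sigmoid(emb_i @ emb_j)

def loss_term(predicted, observed, eps=1e-9):
    """Per-pair term of O_X as written above; summed over sampled pairs."""
    return predicted * np.log((observed + eps) / (predicted + eps))
```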
Algebraic Distance I
• Stationary iterative relaxation.
• Algebraic distance for hypergraphs [20], adapted for bipartite graphs:
$$a_0 \sim \mathcal{U}[0, 1], \qquad a_{t+1}(i) = \lambda a_t(i) + (1 - \lambda) \frac{\sum_{j \in \Gamma(i)} a_t(j) |\Gamma(j)|^{-1}}{\sum_{j \in \Gamma(i)} |\Gamma(j)|^{-1}}$$
• Between each iteration, rescale to [0, 1].
48
Algebraic Distance II
• Run T = 20 algebraic-distance trials until stabilization.
• Summarize distances across all trials:
$$d(i, j) = \sqrt{\sum_{t'=1}^{T} \left(a^{(t')}_\infty(i) - a^{(t')}_\infty(j)\right)^2}$$
• Summarize similarity between nodes:
$$s(i, j) = \frac{\sqrt{T} - d(i, j)}{\sqrt{T}}$$
49
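A numpy sketch of one algebraic-distance trial and the resulting similarity, following the update rule above. The damping factor `lam` and the fixed iteration count stand in for "until stabilization" and are assumptions; T = 20 trials matches the slide.

```python
# Numpy sketch of the algebraic-distance iteration on a bipartite graph.
# `neighbors[i]` is the neighbor set of node i; lam = 0.5 is an assumed
# damping factor and 20 sweeps approximate "until stabilization".
import numpy as np

def algebraic_coordinates(neighbors, num_nodes, lam=0.5, iters=20):
    """One trial: stabilized coordinates in [0, 1] for every node."""
    a = np.random.rand(num_nodes)
    for _ in range(iters):
        new_a = np.empty_like(a)
        for i in range(num_nodes):
            if not neighbors[i]:
                new_a[i] = a[i]
                continue
            weights = np.array([1.0 / len(neighbors[j]) for j in neighbors[i]])
            values = np.array([a[j] for j in neighbors[i]])
            new_a[i] = lam * a[i] + (1 - lam) * (weights @ values) / weights.sum()
        # rescale to [0, 1] between iterations
        a = (new_a - new_a.min()) / (new_a.max() - new_a.min() + 1e-12)
    return a

def similarity(coords_per_trial, i, j):
    """s(i, j) = (sqrt(T) - d(i, j)) / sqrt(T) over T independent trials."""
    d = np.sqrt(sum((a[i] - a[j]) ** 2 for a in coords_per_trial))
    T = len(coords_per_trial)
    return (np.sqrt(T) - d) / np.sqrt(T)
```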
Algebraic Observations
• Same-type:
• Two A nodes are similar if any B node is highly similar to both.
$$S_A(i, j) = S_B(i, j) = \max_{k \in \Gamma(i) \cap \Gamma(j)} \min\left(s(i, k), s(k, j)\right)$$
• Cross-type: sample both direct and second-order neighbors.
• Decompose cross-type comparisons into same-typed neighborhoods.
$$S_{AB}(i, j) = \max\left(\max_{k \in \Gamma(j)} S_A(i, k), \; \max_{k \in \Gamma(i)} S_B(k, j)\right)$$
50
Algebraic Predictions
• Predicted same-type relationships:
$$\hat{S}_A(i, j, \epsilon) = \max\left(0, \epsilon(i)^\top \epsilon(j)\right)$$
• Predicted cross-type relationships:
$$\hat{S}_{AB}(i, j, \epsilon) = \left[\mathop{\mathbb{E}}_{k \in \Gamma(j)} \hat{S}_A(i, k, \epsilon)\right] \left[\mathop{\mathbb{E}}_{k \in \Gamma(i)} \hat{S}_B(k, j, \epsilon)\right]$$
51
Algebraic Optimization
• Loss for a particular similarity:
$$O_X = \mathop{\mathbb{E}}_{i, j \in V} \left(S_X(i, j) - \hat{S}_X(i, j, \epsilon)\right)^2$$
• Optimization: $\min_\epsilon \; O_A + O_{AB} + O_B$
52
New Embedding Methods
• Boolean Heterogeneous Bipartite Embedding (BHBE)
• Observes the existence of relationships.
• Predicts using $\sigma(\epsilon(i)^\top \epsilon(j))$.
• Minimizes KL-divergence.
• Algebraic Heterogeneous Bipartite Embedding (AHBE)
• Observes relationships weighted by algebraic distance.
• Predicts using $\max(0, \epsilon(i)^\top \epsilon(j))$.
• Minimizes mean-squared error.
53
Combination Embedding
• Learn a joint representation that combines AHBE & BHBE.
• Direct encoding predicts links through the joint embedding.
• Auto-regularized encoding also enforces that all latent features are captured.
54
Evaluation I
• Link-prediction task.
• Select a graph, delete a percentage of its edges.
• Embed the remaining graph.
• Use embeddings to recover the removed edges.
• Explore varying hold-out percentages.
• From 10% to 90% splits, in increments of 10%.
55
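A sketch of the hold-out protocol with networkx. The `embed` and `evaluate_link_prediction` calls in the commented driver loop are placeholders for whichever embedding method and downstream classifier are being compared.

```python
# Sketch of the link-prediction setup: hold out a fraction of edges, embed
# the remaining graph, then try to recover the held-out edges.
import random
import networkx as nx

def holdout_split(graph, holdout_fraction):
    edges = list(graph.edges())
    random.shuffle(edges)
    n_hold = int(len(edges) * holdout_fraction)
    held_out, kept = edges[:n_hold], edges[n_hold:]

    train_graph = nx.Graph()
    train_graph.add_nodes_from(graph.nodes())   # keep every node
    train_graph.add_edges_from(kept)
    return train_graph, held_out

# for fraction in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
#     train_graph, test_edges = holdout_split(bipartite_graph, fraction)
#     embeddings = embed(train_graph)                 # placeholder embedding step
#     evaluate_link_prediction(embeddings, test_edges)  # placeholder model
```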
Evaluation II
• Models:
• A- and B-Personalized:
• Train an SVM for each node in the training set.
• Each SVM detects an embedding region containing that node's neighbors.
• Unified:
• Train a shallow neural network.
• Predict links given one embedding of each type.
[Figure: the A-Personalized model.]
56
Benchmark

Graph        |A|/|B|        Γ(i ∈ A) md  Γ(i ∈ A) max  Γ(j ∈ B) md  Γ(j ∈ B) max    SR     LCP
Amazon       16,716/5,000        3            49             8           328        75.8    1.6
DBLP         93,432/5,000        1            12             8         7,556       174.7   81.7
Friendster   220,015/5,000       1            26           133         1,612        80.3   58.3
Livejournal  84,438/5,000        1            20            16         1,441       100.9   27.0
MadGrades    11,951/6,462        3            39             4           393        57.3   99.7
YouTube      39,841/5,000        1            54             4         2,217       113.3   80.6

Table: Graph summary. We report the median (md) and max degree for each node set, as well as the Spectral Radius (SR) and the percentage of nodes in the largest connected component (LCP).
57
MadGrades: UW Instructor-Course Network (|A| = 11,951, |B| = 6,462)
[Figure: link-prediction accuracy vs. training-test split, comparing BHBE, AHBE, the direct and auto-regularized combinations, Deepwalk, LINE, Node2Vec, BiNE, and Metapath2Vec++ under the A-Personalized, B-Personalized, and Unified models.]
58
Results
• Algebraic HBE
• Best detects trends among high-degree nodes.
• Stability issues on larger graphs.
• Boolean HBE
• Outperforms typical state-of-the-art methods; competitive with other heterogeneous methods.
• Robust across trials.
• Combinations
• BHBE and AHBE find different latent features.
• Combining them bootstraps performance above the state of the art.
59
Hypergraphs
• Generalization of graphs: hyperedges contain any number of nodes.
• $H = (V, E)$
• $V = \{v_1, v_2, \ldots, v_n\}$
• $E = \{e_1, e_2, \ldots, e_m\}$
• $e_i \subseteq V$
61
Hypergraph Star Expansion
• Map hyperedges to nodes.
62
Hypergraph Partitioning: Problem Description I
• Problem: split V into k disjoint sets...
• of approximately equal size,
• minimizing an objective over cut hyperedges.
63
Hypergraph Partitioning: Problem Description II
• Partition:
• $V = V_1 \cup V_2 \cup \cdots \cup V_k$
• $\forall\, i \neq j: \; V_i \cap V_j = \emptyset$
• $E_{\text{cut}} = \{e \in E : \nexists\, V_i \text{ such that } e \subseteq V_i\}$
• Metrics:
$$\lambda_{\text{cut}} := |E_{\text{cut}}|, \qquad \lambda_{k-1} := \sum_{e \in E_{\text{cut}}} \left(|\{V_i : V_i \cap e \neq \emptyset\}| - 1\right)$$
64
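A short sketch that computes both objectives for a given partition. Hyperedges are plain node sets and `assignment` maps nodes to block indices; these are assumed data structures chosen for illustration.

```python
# Sketch of the two cut objectives for a hypergraph partition.
def cut_metrics(hyperedges, assignment):
    lambda_cut = 0        # number of cut hyperedges
    lambda_k_minus_1 = 0  # connectivity objective
    for e in hyperedges:
        blocks = {assignment[v] for v in e}
        if len(blocks) > 1:          # hyperedge spans more than one block
            lambda_cut += 1
            lambda_k_minus_1 += len(blocks) - 1
    return lambda_cut, lambda_k_minus_1

# Example: one 3-node hyperedge kept whole, one 2-node hyperedge cut.
edges = [{0, 1, 2}, {2, 3}, {3, 4, 5}]
parts = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(cut_metrics(edges, parts))  # (1, 1)
```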
Hypergraph Partitioning: Problem Description III
• Hypergraph partitioning is NP-Hard...
• to solve [15].
• to approximate [5].
65
Multilevel Heuristic
• Steps:
• Coarsen.
• Find an initial solution.
• Uncoarsen: interpolate & local search.
• Paradigms:
• (log n)-level: at each level, pair almost all nodes.
• n-level: at each level, contract a single pair of nodes.
66
Coarsening
• Desired coarsening properties [12]:
• Reduce the number of nodes & hyperedges.
• Remain structurally similar to the original.
• Contribution: use hypergraph embeddings to better coarsen nodes.
67
Typical Coarsening
• Assign all nodes & hyperedges uniform weights: $w_i = 1$.
• Measure similarities (e.g., the hyperedge inner product):
$$S_E(u, v) = \sum_{e \in E \,:\, u, v \in e} \frac{w_e}{|e| - 1}$$
• Match nodes into pairs (u, v).
• Contract u and v into a coarse node x.
• $w_x = w_u + w_v$
• x participates in all hyperedges of u and v.
68
Embedding-Based Coarsening
• Quantify similarity within the embedding:
$$S_\epsilon(u, v) = \epsilon(u)^\top \epsilon(v)$$
• Prioritize nodes with highly similar neighbors.
• Measure node similarities:
$$S(u, v) = \frac{S_E(u, v) \, S_\epsilon(u, v)}{w_u w_v}$$
69
Embedding-Based Coarsening Algorithm
• Sort V by each node's highest neighbor similarity:
$$\text{SortingCriteria}_u = \max_{v \in \Gamma(u)} S_\epsilon(u, v)$$
• In sorted order, pair nodes:
$$\text{Partner}_u = \operatorname*{argmax}_{v \in \Gamma(u)} \frac{S_E(u, v) \, S_\epsilon(u, v)}{w_u w_v}$$
• Merge (u, v) into a coarse node x.
• Assign $\epsilon(x)$ to be the centroid of its contained nodes.
70
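A sketch of one embedding-based matching pass following the formulas above. The data structures (neighbor sets, incident hyperedges stored as frozensets, per-node weights, numpy embedding vectors) are illustrative and do not mirror the Zoltan or KaHyPar implementations.

```python
# Sketch of one embedding-based matching pass; data structures are
# illustrative.  Hyperedges are frozensets so they can key dictionaries.
import numpy as np

def hyperedge_inner_product(u, v, incident_edges, edge_weight):
    """S_E(u, v) = sum over shared hyperedges of w_e / (|e| - 1)."""
    shared = incident_edges[u] & incident_edges[v]
    return sum(edge_weight[e] / (len(e) - 1) for e in shared)

def match_nodes(nodes, neighbors, embedding, node_weight,
                incident_edges, edge_weight):
    emb_sim = lambda u, v: float(embedding[u] @ embedding[v])

    # Visit nodes in order of their best embedding similarity to a neighbor.
    order = sorted(
        nodes,
        key=lambda u: max((emb_sim(u, v) for v in neighbors[u]), default=0.0),
        reverse=True,
    )

    matched, pairs = set(), []
    for u in order:
        if u in matched:
            continue
        candidates = [v for v in neighbors[u] if v not in matched and v != u]
        if not candidates:
            continue
        # Partner_u maximizes S_E * S_eps / (w_u * w_v).
        partner = max(
            candidates,
            key=lambda v: hyperedge_inner_product(u, v, incident_edges, edge_weight)
                          * emb_sim(u, v) / (node_weight[u] * node_weight[v]),
        )
        matched.update((u, partner))
        pairs.append((u, partner))
    return pairs
```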
Evaluation Method I
• Proposed implementations:
• Zoltan: (log n)-level, fast and highly parallel.
• KaHyPar: n-level, high-quality partitioning.
• KaHyPar Flow: n-level, best known algorithm.
• Considered embeddings:
• MetaPath2Vec++
• Node2Vec
• AHBE, BHBE
• Combinations (AHBE+BHBE), (All)
71
Evaluation Method II
• Baseline algorithms:
• hMetis [14]
• PaToH [6]
• Zoltan [8]
• KaHyPar (with community-based coarsening) [13]
• KaHyPar Flow (with flow-based refinement) [12]
72
Evaluation Method III
• Benchmark graphs:
• 86 graphs from the SuiteSparse Matrix Collection.
• 10 graphs designed to interfere with typical coarsening.
• k values: 2, 4, 8, . . . , 128.
• 20 trials per combination.
• Imbalance tolerance of 3%.
73
Partitioning Summary
• Embeddings improve multilevel hypergraph partitioning.
• Latent features indicate relevant structural similarities.
• Embedding similarity prioritizes and matches nodes.
• Most important for small partition counts (k):
• Embeddings capture key clusters.
• Centroids computed during coarsening smooth some finer details.
• Latent features matter most for particular graph classes:
• Social networks.
• Synthetic graphs.
77
Future Directions
• Bias in Scientific Embeddings
• Detect confirmation bias.
• Normalize the effect of "group think."
• Hybrid Knowledge Graph Mining
• Train text and graph embeddings.
• Formulate hypothesis generation for deep learning.
79
Bias in Word Embeddings
• Recent work finds gender stereotypes in word embeddings [4].
• Biases exist in science:
• Confirmation bias [18].
• Over-interpreting noise [7].
• P-hacking [11].
• Example: P53
• An "[a]valanche of research" [22].
• Which connections should we focus on? Which are noise?
80
Hybrid Knowledge Graph Mining for Hypothesis Generation
• Knowledge graphs
• Triplets, similar to SemMedDB predicates.
• Typed relationships.
• Specialized techniques
• Edge2Vec [9].
• Using text to augment graphs [16].
• SciBERT [1].
81
Attention-Based Hypothesis Generation
• Attention mechanisms create interpretable results.
• Attention assigns weights to relevant input.
• Unified deep-learning model.
• Input: embeddings for text and network features.
• Potential outputs:
• Connection strength.
• Connection type.
• Automatic summary.
82
Timeline

Date            Accomplishment
April 2019      Dissertation proposal.
August 2019     Return from summer internship. Begin exploring bias in scientific embeddings.
November 2019   Complete analysis of bias in scientific embeddings. Begin exploring deep learning on knowledge graphs for hypothesis generation.
April 2020      Complete analysis of deep learning and knowledge graphs for hypothesis generation.
June 2020       Dissertation defense.

Table: Timeline of proposed work.
83
Acknowledgments
• NSF MRI #1725573
• NSF NRT-DESE #1633608
84
Bibliography I
[1] Beltagy, I., Cohan, A., and Lo, K. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676 (2019).
[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[3] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4] Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems (2016), pp. 4349–4357.
85
Bibliography II
[5] Bui, T. N., and Jones, C. Finding good approximate vertex and edge partitions is NP-hard. Information Processing Letters 42, 3 (1992), 153–159.
[6] Çatalyürek, Ü., and Aykanat, C. PaToH (Partitioning Tool for Hypergraphs). Encyclopedia of Parallel Computing (2011), 1479–1487.
[7] de Groot, A. D. The meaning of "significance" for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. Acta Psychologica 148 (2014), 188–194.
86
Bibliography III
[8] Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H., and Catalyurek, U. V. Parallel hypergraph partitioning for scientific computing. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006) (2006), IEEE, 10 pp.
[9] Gao, Z., Fu, G., Ouyang, C., Tsutsui, S., Liu, X., and Ding, Y. edge2vec: Learning node representation using edge semantics. arXiv preprint arXiv:1809.02269 (2018).
[10] Grover, A., and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), ACM, pp. 855–864.
87
Bibliography IV
[11] Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., and Jennions, M. D. The extent and consequences of p-hacking in science. PLoS Biology 13, 3 (2015), e1002106.
[12] Heuer, T., Sanders, P., and Schlag, S. Network flow-based refinement for multilevel hypergraph partitioning. In 17th International Symposium on Experimental Algorithms (SEA 2018) (2018), pp. 1:1–1:19.
[13] Heuer, T., and Schlag, S. Improving coarsening schemes for hypergraph partitioning by exploiting community structure. In 16th International Symposium on Experimental Algorithms (SEA 2017) (2017), pp. 21:1–21:19.
[14] Karypis, G. hMETIS 1.5: A hypergraph partitioning package. http://www.cs.umn.edu/~metis (1998).
88
Bibliography V
[15] Lengauer, T. Combinatorial Algorithms for Integrated Circuit Layout. Springer Science & Business Media, 2012.
[16] Li, C., Lai, Y.-Y., Neville, J., and Goldwasser, D. Joint embedding models for textual and social analysis. In The Workshop on Deep Structured Prediction (2017).
[17] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
89
Bibliography VI
[18] Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., Du Sert, N. P., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., and Ioannidis, J. P. A manifesto for reproducible science. Nature Human Behaviour 1, 1 (2017), 0021.
[19] Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014), ACM, pp. 701–710.
[20] Shaydulin, R., Chen, J., and Safro, I. Relaxation-based coarsening for multilevel hypergraph partitioning. Multiscale Modeling & Simulation 17, 1 (2019), 482–506.
90
Bibliography VII
[21] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (2015), International World Wide Web Conferences Steering Committee, pp. 1067–1077.
[22] Vogelstein, B., Lane, D., and Levine, A. J. Surfing the p53 network. Nature 408, 6810 (2000), 307.
91
Outline
Hypothesis Generation
Moliere: Automatic Biomedical Hypothesis Generation
Validation via Candidate Ranking
Are Abstracts Enough?
Graph Embedding
Heterogeneous Bipartite Graph Embedding
Partition Hypergraphs with Embeddings
Proposed Work
92