HalifaxNGGs
1. Overview and Framework Applications Conclusion
N-gram graphs and proximity graphs: bringing
summarization, machine learning and
bioinformatics to the same neighborhood
George Giannakopoulos (1,2)
(1) NCSR Demokritos, Greece
(2) SciFY NPC, Greece
April 2014
7. Overview and Framework Applications Conclusion
Introductions
NCSR Demokritos and IIT
NCSR “Demokritos”: Biggest science research center in
Greece
5 institutes covering Biology, Material Science, Nuclear
Physics, Nanotechnology, Informatics and
Telecommunications, ...
Institute of Informatics and Telecommunications (IIT)
Intelligent Information Systems (IIS) Domain
Telecommunication Systems (TELECOM) Domain
Services and Measurements (S&M) Domain
Visit: http://www.iit.demokritos.gr/
8. Overview and Framework Applications Conclusion
Introductions
SKEL Lab – Part of IIS Domain
Content Analysis and Knowledge Technologies Group (CAKT)
Complex Event Recognition Group (CER)
Creativity Research Unit (CRU)
Personalisation & Social Network Analysis Group (PerSoNA)
Robotics Activity Group (RoboSKEL)
Visit: http://www.iit.demokritos.gr/skel
12. Overview and Framework Applications Conclusion
Introductions
Why are we here?
Talk about a framework based on neighborhood/proximity
Discuss its strengths and weaknesses
Get a glimpse of its generic applicability
Foster potential collaborations
14. Overview and Framework Applications Conclusion
Introductions
Intuition — Text
People can read even when words are spelled wnorg
But order does play some role: not it does?
17. Overview and Framework Applications Conclusion
Introductions
First thoughts
We can deal with noise.
Exact sequence is useful, but not critical.
Proximity does play a role in many settings.
18. Overview and Framework Applications Conclusion
Introductions
An N-gram Graph
Figure: example n-gram graph with vertices _abc, _bcd, _cde and _def; every edge has weight 1.00. An edge indicates neighborhood.
Edges are important
Vertices are unique
Edge weights can have different semantics (usually: co-occurrence)
19. Overview and Framework Applications Conclusion
Introductions
Extraction Process (Natural Language Processing example)
Given a string...
Extract n-grams...
Determine neighborhood (window size D_win)...
Assign weights to edges. DONE!
Example
String: abcde
Char. n-grams (rank 3): abc, bcd, cde
Edges (window size 1): abc-bcd, bcd-cde
Weights (freq.): abc-bcd (1.0), bcd-cde (1.0)
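The extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the JINSECT implementation: the function names, the dictionary representation and the choice to point each edge from an n-gram to the n-grams that follow it are assumptions made for the example.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return the character n-grams of text, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_graph(text, n=3, d_win=1):
    """Build an n-gram graph as a {(source, target): weight} dictionary.

    Assumption: each n-gram is linked to the n-grams that start at most
    d_win positions after it, and the weight counts co-occurrences.
    """
    grams = char_ngrams(text, n)
    graph = defaultdict(float)
    for i, gram in enumerate(grams):
        for neighbor in grams[i + 1:i + 1 + d_win]:
            graph[(gram, neighbor)] += 1.0
    return dict(graph)


print(char_ngrams("abcde", 3))
print(ngram_graph("abcde", n=3, d_win=1))
```

With the string abcde, rank 3 and window size 1, the sketch yields the two edges abc-bcd and bcd-cde, each with weight 1.0, as in the example.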
23. Overview and Framework Applications Conclusion
Introductions
N-gram graph parameters
n: describes the length/order of “atoms”
Dwin: describes the limits of a neighborhood of atoms
24. Overview and Framework Applications Conclusion
Introductions
Window-based Extraction of Neighborhood — Examples
Figure: N-gram Window Types (top to bottom): non-symmetric,
symmetric and gauss-normalized symmetric.
25. Overview and Framework Applications Conclusion
Introductions
N-gram Graph — Representation Examples
Figure: Graphs Representing the String abcdef (from left to right):
non-symmetric, symmetric and gauss-normalized symmetric. N-Grams of
Rank 3. Dwin value 2.
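One possible reading of the three window types, sketched in Python. The slides name the windows but do not define them, so everything here is an assumption: the non-symmetric window is taken to cover the D_win n-grams preceding the current one, the symmetric window D_win/2 n-grams on each side, and the gauss-normalized window to weight every other n-gram by a Gaussian of its distance with standard deviation D_win/2. The function name `window_edges` is hypothetical.

```python
import math


def window_edges(grams, d_win, kind="non-symmetric"):
    """Return {(ngram, neighbor): weight} for one assumed window type (d_win >= 1)."""
    edges = {}
    for i, gram in enumerate(grams):
        for j, other in enumerate(grams):
            dist = abs(i - j)
            if dist == 0:
                continue
            if kind == "non-symmetric":
                # Assumed: only the d_win n-grams preceding the current one.
                inside, weight = (j < i and dist <= d_win), 1.0
            elif kind == "symmetric":
                # Assumed: half of the window on each side.
                inside, weight = dist <= d_win // 2, 1.0
            elif kind == "gauss":
                # Assumed: every n-gram counts, scaled by a Gaussian of distance.
                sigma = d_win / 2
                inside, weight = True, math.exp(-dist ** 2 / (2 * sigma ** 2))
            else:
                raise ValueError(f"unknown window type: {kind}")
            if inside:
                edges[(gram, other)] = edges.get((gram, other), 0.0) + weight
    return edges


text = "abcdef"
grams = [text[i:i + 3] for i in range(len(text) - 2)]
for kind in ("non-symmetric", "symmetric", "gauss"):
    print(kind, len(window_edges(grams, 2, kind)))
```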
26. Overview and Framework Applications Conclusion
Introductions
What is an N-gram Graph?
Description of
co-existence/proximity/collocation/neighborhood
Statistics over the neighborhood
Allows different analysis levels (character, word, ...)
Allows different distance measures (1D, 2D, tree-based, ...)
Lossy compression (one graph – many “texts”?)
Restrictions upon relative positioning
27. Overview and Framework Applications Conclusion
Introductions
What is Important regarding the N-Gram Graph
Co-occurrence information inherent
Arbitrary fuzziness (parameters)
Generic applicability (domain agnostic)
Paths provide more info on neighborhood
28. Overview and Framework Applications Conclusion
Introductions
N-gram Graph vs ...
Bag-of-words More information
Probabilistic sequential models Not probabilistic (by default), not
strictly sequential
Automata No explicit transitions
Neural networks No input/output; just representation
30. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
Nice things we would like to do...
Represent sets of items with a single graph (e.g. classification)
Compare things in the graph world (e.g. clustering)
Talk about graphs in the vector space
31. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
Merging or Union ∪ and Update U
Similarity function sim
32. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
Representing Sets of Graphs
A representative graph for a set
Represents an “average” case (like a centroid!)
Common edges: transferred and weights averaged to the
result graph
Non-common edges: simply copied to the result graph
(non-linearity)
But merging graphs can be tricky...
34. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
Merge vs. Update (2)
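Figure: merging two graphs that share the edge between _abc and _bcd, with weights 1.50 and 3.00, gives the same edge with weight 2.25.
$\frac{1.5 + 3}{2} = 2.25$, but what if we want $\frac{1 + 2 + 3}{3} = 2$?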
35. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
Merge vs. Update (3)
$$\mathit{updatedValue} = \mathit{oldValue} + l \times (\mathit{newValue} - \mathit{oldValue}) \tag{1}$$
where $0 \le l \le 1$ is the learning factor
Representative (or Centroid) Graph
Use the update operator with learning factor $l = \frac{1}{\mathit{instanceCount}}$, where instanceCount is the number of instances that will be described by the graph after the update.
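A minimal Python sketch of the update operator and of the representative graph built with it. Graphs are assumed to be plain `{edge: weight}` dictionaries, and the names `update` and `representative` are hypothetical. Edges missing from either side are simply copied, following the merging rule for non-common edges.

```python
def update(graph, new_graph, learning_factor):
    """Move the weights of graph toward new_graph by learning_factor (0 to 1)."""
    result = dict(graph)
    for edge, new_weight in new_graph.items():
        if edge in result:
            # Equation (1): old + l * (new - old)
            result[edge] += learning_factor * (new_weight - result[edge])
        else:
            # Non-common edges are copied as they are.
            result[edge] = new_weight
    return result


def representative(graphs):
    """Fold a sequence of graphs into one, using l = 1 / instanceCount."""
    rep = {}
    for instance_count, graph in enumerate(graphs, start=1):
        rep = update(rep, graph, 1.0 / instance_count)
    return rep


edge = ("_abc", "_bcd")
# A fixed learning factor of 0.5 reproduces the plain pairwise average.
print(update({edge: 1.5}, {edge: 3.0}, 0.5))
# The decreasing learning factor gives the running average of 1, 2 and 3.
print(representative([{edge: 1.0}, {edge: 2.0}, {edge: 3.0}]))
```

With l = 1/instanceCount every instance contributes equally, so the weights 1, 2 and 3 average to 2, which is the behavior asked for in the merge example.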
37. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
Claim
We can describe groups/classes/sets of things as one graph!
38. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
N-gram Graph – Similarity (1)
Size Similarity: Number of Edges
Co-occurrence Similarity: Existence of Edges
Value Similarity: Existence and Weight of Edges
Derived Measures: Normalized Value Similarity, Overall
similarity
44. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
N-gram Graph – Similarity (7)
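$|G|$: the size of a graph (edge count)
Size Similarity of $G_1$, $G_2$: $SS = \frac{\min(|G_1|,|G_2|)}{\max(|G_1|,|G_2|)}$
Containment Similarity: each common edge adds $\frac{1}{\min(|G_1|,|G_2|)}$ to a sum.
Value Similarity: using weights, every common edge $e$ adds $\frac{\min(w_e^i, w_e^j)}{\max(w_e^i, w_e^j)} \cdot \frac{1}{\max(|G_1|,|G_2|)}$ to a sum, where $w_e^i$ and $w_e^j$ are the weights of $e$ in the two graphs.
Normalized Value Similarity, SS factored out: $NVS = \frac{VS}{SS}$
Overall Similarity for $n \in [L_{\min}, L_{\mathrm{MAX}}]$: weighted sum of the per-rank similarities.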
45. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
N-gram Graph – Size Similarity
Example
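Figure: example graphs. $G_1$ has vertices a, b, c and two edges (weights 1.0 and 8.0); $G_2$ has vertices a, b, c, d, e and four edges (weights 1.0, 4.0, 1.0, 1.0).
$|G_1| = 2$, $|G_2| = 4$
Result: $\frac{2}{4} = 0.5$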
46. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
N-gram Graph – Containment Similarity
Example
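Figure: example graphs. $G_1$ has vertices a, b, c and two edges (weights 1.0 and 8.0); $G_2$ has vertices a, b, c, d, e and four edges (weights 1.0, 4.0, 1.0, 1.0).
Result: $\frac{1}{2} + \frac{1}{2} = 1.0$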
47. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
N-gram Graph – Value Similarity
Example
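Figure: example graphs. $G_1$ has vertices a, b, c and two edges (weights 1.0 and 8.0); $G_2$ has vertices a, b, c, d, e and four edges (weights 1.0, 4.0, 1.0, 1.0).
Result: $\frac{1.0/1.0}{4} + \frac{4.0/8.0}{4} = \frac{1}{4} + \frac{1}{8} = 0.375$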
48. Overview and Framework Applications Conclusion
N-Gram Graph Generic Operators
N-gram Graph – Normalized Value Similarity
Example
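Figure: example graphs. $G_1$ has vertices a, b, c and two edges (weights 1.0 and 8.0); $G_2$ has vertices a, b, c, d, e and four edges (weights 1.0, 4.0, 1.0, 1.0).
Result: $\frac{VS}{SS} = \frac{0.375}{0.5} = 0.75$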
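The four measures from the examples can be checked with a short Python sketch. Graphs are assumed to be non-empty `{edge: weight}` dictionaries with positive weights. The edge endpoints used below (a-b, a-c, a-d, a-e) are a guess at the figure's layout; only the edge counts and weights come from the example.

```python
def size_similarity(g1, g2):
    return min(len(g1), len(g2)) / max(len(g1), len(g2))


def containment_similarity(g1, g2):
    # Each common edge adds 1 / min(|G1|, |G2|).
    common = g1.keys() & g2.keys()
    return len(common) / min(len(g1), len(g2))


def value_similarity(g1, g2):
    # Each common edge adds its weight ratio, divided by max(|G1|, |G2|).
    common = g1.keys() & g2.keys()
    ratios = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return ratios / max(len(g1), len(g2))


def normalized_value_similarity(g1, g2):
    return value_similarity(g1, g2) / size_similarity(g1, g2)


# Assumed edge endpoints; weights as in the example graphs.
G1 = {("a", "b"): 1.0, ("a", "c"): 8.0}
G2 = {("a", "b"): 1.0, ("a", "c"): 4.0, ("a", "d"): 1.0, ("a", "e"): 1.0}

print(size_similarity(G1, G2))              # 0.5
print(containment_similarity(G1, G2))       # 1.0
print(value_similarity(G1, G2))             # 0.375
print(normalized_value_similarity(G1, G2))  # 0.75
```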
50. Overview and Framework Applications Conclusion
Summary evaluation
Evaluating summaries
Given a set of texts
...and a set of ideal summaries
...can we grade other summaries similarly to humans?
51. Overview and Framework Applications Conclusion
Summary evaluation
Main NGG-based approaches
AutoSummENG: compare the summary graph to each model summary individually [Giannakopoulos et al., 2008]
MeMoG: compare the summary graph to a combined model graph [Giannakopoulos and Karkaletsis, 2011]
N-POWER: use different levels of NGG analysis and train an estimation model [Giannakopoulos et al., 2014]
54. Overview and Framework Applications Conclusion
Summary evaluation
The power of combined n-gram graphs: N-POWER
Features: Base evaluation scores (“n-gram graph”-based)
Target: Human summary grade, Grammaticality,
Responsiveness, ...
Regression: Combines features to estimate the target
56. Overview and Framework Applications Conclusion
Summary evaluation
Improving
Setting: Set A

Year  Features              Pearson  Spearman  Kendall
2009  Baseline: ROUGE-SU4   0.33     0.30      0.22
2009  Baseline: BE          0.26     0.27      0.20
2009  Baseline comb.        0.34     0.29      0.21
2009  NPowER                0.60     0.42      0.32
2009  All                   0.83     0.80      0.65
2010  Baseline: ROUGE-SU4   0.61     0.61      0.47
2010  Baseline: BE          0.48     0.50      0.38
2010  Baseline comb.        0.61     0.60      0.47
2010  NPowER                0.72     0.68      0.54
2010  All                   0.75     0.72      0.58

Table: Per summary correlation to Responsiveness.
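For readers unfamiliar with the three correlation columns, here is a small pure-Python sketch of how they could be computed between automatic scores and human grades. The score lists are made-up illustration values, not data from the evaluation, and the rank and Kendall computations assume no tied values.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Linear correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank positions starting at 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def kendall(xs, ys):
    """Concordant minus discordant pairs, over all pairs (assumes no ties)."""
    score = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        product = (x1 - x2) * (y1 - y2)
        score += (product > 0) - (product < 0)
    return score / (len(xs) * (len(xs) - 1) / 2)


# Hypothetical automatic scores and human grades for four summaries.
auto_scores = [0.10, 0.40, 0.35, 0.80]
human_grades = [1, 3, 2, 5]
print(pearson(auto_scores, human_grades))
print(spearman(auto_scores, human_grades))
print(kendall(auto_scores, human_grades))
```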
57. Overview and Framework Applications Conclusion
Summary evaluation
Across sets and years
Target: Responsiveness

Train Year  Test Year  Pearson  Spearman  Kendall
2009        2010       0.72     0.65      0.52
2010        2009       0.61     0.47      0.35

Train Set   Test Set   Pearson  Spearman  Kendall
A           B          0.64     0.55      0.42
B           A          0.63     0.55      0.42

Table: Correlations between NPowER grades and Responsiveness across years (top half) and sets (bottom half)
59. Overview and Framework Applications Conclusion
Summarization
Cluster sentences: subtopics
60. Overview and Framework Applications Conclusion
Summarization
Union of subtopics: the essence
62. Overview and Framework Applications Conclusion
Summarization
And then?...
Similarity of a sentence to the essence: salience
Similarity between sentences: redundancy
Thus, NewSum [Giannakopoulos et al., 2014] was born
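A rough Python sketch of how salience and redundancy could drive sentence selection. It is not the NewSum implementation: the clustering into subtopics is skipped and the essence is approximated by merging all sentence graphs, the redundancy threshold of 0.5 is arbitrary, and every function name is hypothetical.

```python
def ngram_graph(text, n=3, d_win=3):
    """Character n-gram graph as {(ngram, following_ngram): count}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    graph = {}
    for i, gram in enumerate(grams):
        for neighbor in grams[i + 1:i + 1 + d_win]:
            graph[(gram, neighbor)] = graph.get((gram, neighbor), 0.0) + 1.0
    return graph


def nvs(g1, g2):
    """Normalized Value Similarity; 0.0 if either graph is empty."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    ratios = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    value_sim = ratios / max(len(g1), len(g2))
    size_sim = min(len(g1), len(g2)) / max(len(g1), len(g2))
    return value_sim / size_sim


def merge_all(graphs):
    """Running-average merge (update operator with l = 1 / count)."""
    merged = {}
    for count, graph in enumerate(graphs, start=1):
        for edge, weight in graph.items():
            if edge in merged:
                merged[edge] += (weight - merged[edge]) / count
            else:
                merged[edge] = weight
    return merged


def select_sentences(sentences, max_sentences=2, redundancy_limit=0.5):
    graphs = [ngram_graph(s) for s in sentences]
    essence = merge_all(graphs)  # simplification: no subtopic clustering
    by_salience = sorted(range(len(sentences)),
                         key=lambda i: nvs(graphs[i], essence), reverse=True)
    chosen = []
    for i in by_salience:
        # Skip sentences too similar to one that is already selected.
        if all(nvs(graphs[i], graphs[j]) < redundancy_limit for j in chosen):
            chosen.append(i)
        if len(chosen) == max_sentences:
            break
    return [sentences[i] for i in sorted(chosen)]


sentences = ["the cat sat on the mat",
             "the cat sat on the mat",
             "stock prices fell sharply today"]
print(select_sentences(sentences))
```

The duplicated sentence is selected only once, because its similarity to the already chosen copy exceeds the redundancy limit.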
63. Overview and Framework Applications Conclusion
Summarization
A little more on NewSum
Using n-gram graphs
Android Application
Website version (http://www.newsumontheweb.org)
Applied to English, Greek, German
Highly rated by users on quality of summaries
Born in SciFY (http://www.scify.org)
Awarded (Efarmogiada #3)
64. Overview and Framework Applications Conclusion
Personalization and Classification
N-gram graph information in personalization
User likes things based on
content (e.g. free-strings or images)
meta-data
recency
...and others
How can we create a common vector space, using NGGs?
65. Overview and Framework Applications Conclusion
Personalization and Classification
Combining multi-modal features
Figure: Embedding into vector space. A content string (e.g. abcdef) is represented as a content n-gram graph; its Normalized Value Similarity to the "Interesting", "Critical" and "Uninteresting" content model graphs supplies the content part of the feature vector, next to the type part, e.g. (1,0,1,0,...,0.42,0.27,0.8).
Used for adaptive entity subscription services
[Giannakopoulos and Palpanas, 2009]
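A minimal Python sketch of the embedding idea: the content graph is compared to one model graph per class, and the resulting similarities are appended to the other features. The class model texts, the type features and the function names are hypothetical, and each class model is built from a single string only to keep the example short.

```python
def ngram_graph(text, n=3, d_win=3):
    """Character n-gram graph as {(ngram, following_ngram): count}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    graph = {}
    for i, gram in enumerate(grams):
        for neighbor in grams[i + 1:i + 1 + d_win]:
            graph[(gram, neighbor)] = graph.get((gram, neighbor), 0.0) + 1.0
    return graph


def nvs(g1, g2):
    """Normalized Value Similarity; 0.0 if either graph is empty."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    ratios = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    value_sim = ratios / max(len(g1), len(g2))
    size_sim = min(len(g1), len(g2)) / max(len(g1), len(g2))
    return value_sim / size_sim


def feature_vector(type_features, content, class_models):
    """Type features followed by one NVS value per class model (sorted by name)."""
    content_graph = ngram_graph(content)
    similarities = [nvs(content_graph, class_models[name])
                    for name in sorted(class_models)]
    return list(type_features) + similarities


# Hypothetical class models, each built from a single example string.
class_models = {
    "interesting": ngram_graph("abcdef"),
    "uninteresting": ngram_graph("uvwxyz"),
}
print(feature_vector([1, 0], "abcdef", class_models))
```

The resulting vector can be passed to any standard vector-space classifier, which is what makes the graph features combinable with meta-data and other modalities.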
66. Overview and Framework Applications Conclusion
Personalization and Classification
Topic classification
Figure: Classification only with NGG features
67. Overview and Framework Applications Conclusion
Personalization and Classification
Results over different document types
Model             D_reuters  D_blogs  D_twitter
Vector Model      76.67%     49.81%   45.22%
Bigrams           56.02%     56.19%   53.94%
Trigrams          64.33%     61.53%   55.73%
Four-grams        71.19%     63.51%   51.42%
Bigram Graphs     50.81%     49.57%   41.08%
Trigram Graphs    90.71%     62.33%   62.94%
Four-gram Graphs  93.71%     64.94%   57.61%

Table: Precision of Naive Bayes Multinomial Classifier [Giannakopoulos et al., 2012]
68. Overview and Framework Applications Conclusion
Bioinformatics
Classification of Constrained DNA Elements
The setting
DNA sequences
Different classes (based on different criteria)
Question: Can we perform classification using NGGs?
69. Overview and Framework Applications Conclusion
Bioinformatics
Datasets
UltraConserved Noncoding Elements (UCNEs)
EU100 nonexonic CNEs (EU100nx CNEs)
Amniotic and Mammalian CNEs
Worm and Insect UCNEs
70. Overview and Framework Applications Conclusion
Bioinformatics
Results
Task Description                      NGG    GS
worm exons vs. surrogates             57.7   61.05
human exons vs. surrogates            74.3   63.41
insect exons vs. surrogates           52.36  59.45
worm UCNEs vs. surrogates             56.76  55.62
human UCNEs vs. surrogates            82.63  72.00
human EU100nx CNEs vs. surrogates     76.43  63.75
amniotic CNEs vs. surrogates          78.62  63.00
mammalian CNEs vs. surrogates         75.85  55.65
insect UCNEs vs. surrogates           64.15  62.65
Average                               68.76  61.84

Table: Constrained DNA sequences versus background surrogates [Polychronopoulos et al., 2014]
71. Overview and Framework Applications Conclusion
Bioinformatics
Still alive! Questions?
Next: Just the conclusion.
72. Overview and Framework Applications Conclusion
Strengths and weaknesses
Strengths
Language-neutral due to statistical nature
No preprocessing required
More than bag-of-words (subword level, sequence information
kept)
More than the vector space representation (graph paths,
arbitrary fuzziness, ...)
Different than statistical models (encodes restrictions on
neighborhood with arbitrary fuzziness, expresses possibility
more than probability in current operators, minimal data
needed)
Combinable with vector space
Makes use of uncommon events
73. Overview and Framework Applications Conclusion
Strengths and weaknesses
Weaknesses
Somewhat demanding in computing power (BUT fully
parallelizable)
Difficult to express generalization potential
Optimal parameters non-trivial to determine (but can be
done, e.g. [Giannakopoulos et al., 2008])
Can become noisy if a graph represents too many instances
74. Overview and Framework Applications Conclusion
Strengths and weaknesses
Proximity graphs
Generalization of n-gram graphs
Applicable in any similarity space
Vertices can be anything (e.g. vectors)
They allow nesting/hierarchy (graph of graphs)
Already applied to behavior recognition from video
75. Overview and Framework Applications Conclusion
Strengths and weaknesses
Where should I use n-gram graphs?
When neighborhood seems to contain information
When we have a measure of proximity
When a whole can be described by the neighborhood of its
parts
When uncommon co-occurrences (rare events) are of
importance
When I want to combine multi-modal concepts that co-occur
76. Overview and Framework Applications Conclusion
Strengths and weaknesses
Ongoing domains of research
Material science (surface grading)
Video/scene analysis
Personalization
Classification
e-Government (opinion mining, sentiment analysis)
N-gram graph theory
77. Overview and Framework Applications Conclusion
To remember...
Why n-gram graphs?
Generic framework for neighborhood/proximity patterns
Can express complex relations
Proven usefulness in a variety of domains
Already implemented in the JINSECT library
[Giannakopoulos, 2010]
78. Overview and Framework Applications Conclusion
To remember...
Thank you
N-gram graphs and proximity graphs: bringing
summarization, machine learning and
bioinformatics to the same neighborhood
George Giannakopoulos (1,2)
(1) NCSR Demokritos, Greece
(2) SciFY NPC, Greece
ggianna@iit.demokritos.gr
79. Overview and Framework Applications Conclusion
To remember...
Giannakopoulos, G. (2010).
JINSECT.
http://mloss.org/software/view/234/.
Giannakopoulos, G. and Karkaletsis, V. (2011).
AutoSummENG and MeMoG in evaluating guided summaries.
In TAC 2011 Workshop.
Giannakopoulos, G., Karkaletsis, V., Vouros, G., and Stamatopoulos, P. (2008).
Summarization system evaluation revisited: N-gram graphs.
ACM Trans. Speech Lang. Process., 5(3):1–39.
Giannakopoulos, G., Kiomourtzis, G., and Karkaletsis, V. (2014).
NewSum: “N-Gram Graph”-Based Summarization in the Real World.
IGI.
Giannakopoulos, G., Mavridi, P., Paliouras, G., Papadakis, G., and Tserpes, K. (2012).
Representation models for text classification: a comparative analysis over three web document types.
ICPS. ACM.
Giannakopoulos, G. and Palpanas, T. (2009).
Adaptivity in entity subscription services.
In Proceedings of ADAPTIVE2009.
Polychronopoulos, D., Krithara, A., Nikolaou, C., Paliouras, G., Almirantis, Y., and Giannakopoulos, G. (2014).
Analysis and classification of constrained DNA elements with n-gram graphs and genomic signatures.