This presentation proposes a new semantic relatedness measure based on representing words as co-occurrence networks instead of vectors. It addresses two key issues: 1) defining network operations to represent phrases and 2) measuring similarity between networks using a graph kernel. The approach is evaluated on tasks such as synonym finding, word sense disambiguation, and translation disambiguation, showing improved performance over vector-based baselines.
1. A Semantic Relatedness Measure Based on Co-occurrence Network and Graph Kernel
Tae-Gil Noh (노태길), Kyungpook National University, tailblues@me.com
January 20, 2011, Workshop Commemorating the ACL Hosting Bid
2. Overview
A new semantic relatedness measure:
- Built from co-occurrence observations on a raw corpus
- Works for both words and phrases
- Improves on the vector space model by using network representations
- Similarity is measured in a kernel space: co-occurrence observations are compared with a kernel (R-convolution kernel and graph kernel)
3. Introduction: Semantic Relatedness Measures
Measuring the semantic distance between two terms or phrases; also known as semantic similarity or semantic distance. A tool that can be used in various NLP situations.
Examples:
- Which is semantically closer to "orange juice": 음료수 (drinks) or 향신료 (spice)?
- Which sense describes the term better in context? 1) "Apple launched a new device ..." 2) "Apple is my favorite fruit, second only to ..."
  A) A company famous for its iPhone and iPod.
  B) A fruit with lots of Vitamin C, shiny red or green skin, ...
4. Semantic Relatedness with Lexical Resources
Based on lexical resources: WordNet, thesauri, ontologies, ...
Pros:
- Reliable data generated by lexicographers
- Detailed relationships between lexical entries
Cons:
- Generated by humans, so high cost
- Not readily available for minor languages
- There are always new or unlisted entries
5. Semantic Relatedness with Corpora
Corpus-based semantic relatedness:
- Based on observations on an unlabeled corpus
- Relatedness is measured as a numerical value
Various methods:
- Co-occurrence vectors
- Pointwise mutual information (PMI)
- Rank reduction (LSA, random projection, ESA)
- Topic models (PLSA, LDA, CTM)
6. Semantic Relatedness with Corpora
Corpus-based methods generally assume that two terms are "semantically close" if they:
- "occur in similar documents": share similar distributions among documents
- "co-occur with similar terms": share common co-occurring terms
Occurrences and co-occurrences are generally expressed as vectors. The vectors themselves are used as representations, or are refined by mathematical and statistical methods: weighting schemes, rank reduction, higher-order vectors, random projection, topic estimation, etc.
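As a rough sketch of the PMI weighting mentioned above (the words and counts below are invented for illustration, not taken from the talk):

```python
import math
from collections import Counter

# Toy co-occurrence counts: target word -> counts of its context words.
counts = {
    "disc": Counter({"orchestra": 2, "system": 2, "data": 1}),
    "cd":   Counter({"orchestra": 1, "music": 3}),
}

def pmi(word, context, counts):
    """Pointwise mutual information, PMI(w, c) = log P(w, c) / (P(w) P(c)),
    estimated directly from the raw co-occurrence counts."""
    total = sum(sum(c.values()) for c in counts.values())
    p_wc = counts[word][context] / total   # joint probability
    p_w = sum(counts[word].values()) / total
    p_c = sum(c[context] for c in counts.values()) / total
    return math.log(p_wc / (p_w * p_c))
```

A positive PMI means the pair co-occurs more often than independence would predict; in practice the counts would come from a large raw corpus rather than a toy table.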
8. Motivation A network has “more structure”: a network of terms can be seen as a relaxation of the “bag-of-words” (independence) assumption. Previous work showed that the structure of a co-occurrence network can be used to induce senses [Veronis 2005]. However, the network itself has never been used as a representation before. What if co-occurrence vectors are replaced by co-occurrence networks? What tools are needed? Can we gain some performance improvement?
9. An example of capturing co-occurrences: as a vector, or as a network. Five contexts of “disc”: “A data disc can contain anything; system files, ...” / “Eject the system disc by pressing ...” / “This is their best concert on disc.” / “On the double disc soundtrack, the orchestra have ...” / “Disc of the year & best orchestra winner is announced by ...” Co-occurrence sets: disc-{data, system, files}, disc-{system}, disc-{concert}, disc-{soundtrack, orchestra}, disc-{year, orchestra, winner}. Vector representation over (concert, data, files, orchestra, soundtrack, system, winner, year): (1, 1, 1, 2, 1, 2, 1, 1). [Figure: the same counts drawn as a co-occurrence network, with “disc” linked to each context term and edges between terms that co-occur in the same context.]
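The difference between the two representations on this slide can be sketched in a few lines. This is my own illustration, not the author's code: the vector only counts how often each term appears near “disc”, while the network additionally records which context terms appeared together.

```python
from collections import Counter
from itertools import combinations

# The five context sets of "disc" from the slide.
contexts = [
    {"data", "system", "files"},
    {"system"},
    {"concert"},
    {"soundtrack", "orchestra"},
    {"year", "orchestra", "winner"},
]

# Vector representation: count co-occurrences with the target term.
vector = Counter()
for ctx in contexts:
    vector.update(ctx)
# vector matches the slide: orchestra=2, system=2, all others 1.

# Network representation: also link terms that co-occur with EACH OTHER
# inside the same context, keeping structure the vector throws away.
edges = Counter()
for ctx in contexts:
    for u, v in combinations(sorted(ctx), 2):
        edges[(u, v)] += 1
# e.g. ("orchestra", "soundtrack") and ("orchestra", "year") are distinct
# edges in the network, but indistinguishable in the vector.
```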
10. Replacing co-occurrence vectors with co-occurrence networks Vector representation: the co-occurrence patterns of A and B become vectors, compared with similarity/distance functions of the vector domain: CosSim(A,B), L1dist(A,B), EucDist(A,B). Network representation: the same co-occurrence patterns become networks, compared with a network similarity measure: NetworkSim(A,B).
11. Using the co-occurrence network as a direct representation Evaluating gains on some NLP tasks Comparing performance with vector-based baselines and unsupervised state-of-the-art methods Tasks: Synonym finding (TOEFL synonym test set) Word sense disambiguation (general domain & biomedical domain) Annotation translation (automatic translation of Flickr tags)
12. Two basic issues of using network representations Expressing phrases How can an expression for a phrase be composed? In vector semantic spaces, vector summation/multiplication is used to represent phrases. Equivalent network operations must be defined. Comparing two networks Given two network representations, how can their similarity be calculated? A network similarity measure equivalent to cosine similarity is needed.
13. An example WSD setup A WSD setup from [Wilks, 1990] & [Schütze, 1998], sometimes called a modified Lesk algorithm. WordNet senses (synsets) of “disc”: sense-1 {disk, diskette, magnetic disc}, sense-2 {phonograph, record, sound recording}. Sense vectors: Vsense1 = vdisk + vdiskette + vmagnetic disc; Vsense2 = vphonograph + vrecord + vsound recording. Context vectors: Vcontext1 from “Microsoft will replace your disc, if it’s within ...” → {Microsoft, disc}; Vcontext2 from “Previn and the LSO on the front of any disc were ...” → {LSO, disc}. Each context vector is compared to each sense vector by the angle θ between them.
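The vector-space baseline described on this slide can be sketched as follows. All counts here are made up for illustration; a real run would use co-occurrence vectors estimated from a corpus.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical co-occurrence vectors (toy counts, not from a real corpus).
word_vec = {
    "diskette":   Counter({"computer": 3, "drive": 2}),
    "phonograph": Counter({"music": 3, "record": 2}),
    "Microsoft":  Counter({"computer": 4, "software": 2}),
    "LSO":        Counter({"music": 5, "concert": 3}),
}

def sense_vector(members):
    # Sense vector = sum of the synset members' co-occurrence vectors.
    total = Counter()
    for m in members:
        total += word_vec[m]
    return total

senses = {"storage": ["diskette"], "recording": ["phonograph"]}

# Context {Microsoft, disc} -> storage sense; {LSO, disc} -> recording sense.
best1 = max(senses, key=lambda s: cosine(word_vec["Microsoft"], sense_vector(senses[s])))
best2 = max(senses, key=lambda s: cosine(word_vec["LSO"], sense_vector(senses[s])))
```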
14. The two issues in the WSD setup (network case) The network of {Microsoft, disc} is built from the single-term networks of “Microsoft” and “disc” by (1) a network operation (+); likewise the network of {LSO, disc} from “LSO” and “disc”. Each context network is then compared by (2) a network similarity against the networks of the sense candidates: the network of {disk, diskette, magnetic disc} (disc sense-1) and the network of {phonograph, record, sound recording} (disc sense-2).
15. Issue #1: Network operators Generating context (multi-term) networks from single-term networks. Two network operations: Network union (equivalent to vector summation) Network intersection (similar in effect to vector multiplication) Defined as matrix operations, since networks are represented as adjacency matrices.
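The slide says the operators are defined as matrix operations but does not give their exact form. A plausible sketch, assuming the adjacency matrices are aligned over a shared vocabulary ordering, union is entrywise addition (mirroring vector summation), and intersection is the entrywise minimum (keeping only edges present in both networks, mirroring the effect of vector multiplication):

```python
import numpy as np

# Adjacency matrices over an assumed shared vocabulary ordering.
vocab = ["computer", "drive", "music", "system"]
A = np.array([[0, 1, 0, 2],
              [1, 0, 0, 0],
              [0, 0, 0, 0],
              [2, 0, 0, 0]])
B = np.array([[0, 2, 0, 1],
              [2, 0, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])

def net_union(a, b):
    # Guess at the slide's union: edge weights add, like vector summation.
    return a + b

def net_intersection(a, b):
    # Guess at the slide's intersection: an edge survives only if it is
    # present in both networks, with the smaller of the two weights.
    return np.minimum(a, b)

U = net_union(A, B)
I = net_intersection(A, B)
# U[0, 1] == 3: the computer-drive edge weights 1 and 2 add up.
# I[2, 3] == 0: the music-system edge exists only in B, so it is dropped.
```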
17. [Figures: the network of “disc”; the network of “disc & LSO”; the network of “disc & Microsoft”.]
18. Issue #2: Similarity measure for networks Cosine similarity is a normalized dot product. Graph kernel: a dot product of two graph structures. Graph kernels have been used in biomedical domains to compare proteins and genes. An R-convolution kernel is a way to systematically define kernels for structures. In language processing, tree kernels are the most widely used case of R-convolution kernels.
19. Random walk graph kernel The most widely used graph kernel. It compares two graphs by counting their common random walks. The result is a dot product value in an infinitely high-dimensional space, where each dimension is one possible random walk. It has a “tottering” issue, a well-known problem: kernel effectiveness is severely limited by counting cycles again and again. I have proposed an efficient acyclic version that can be used if all node labels are unique, as in co-occurrence networks.
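The walk-counting idea can be sketched with a truncated geometric series over a direct product graph. This is my own toy version for intuition, not the author's acyclic variant; it still suffers from the tottering described above. Since node labels are unique within each graph, the product graph simply has one node per label shared by both graphs.

```python
import numpy as np

def random_walk_kernel(g1, g2, lam=0.1, max_len=10):
    """Sum over common label-matching walks of length 1..max_len,
    each walk weighted by lam ** length.

    g1, g2: dicts mapping a node label to the set of its neighbour labels.
    """
    shared = sorted(set(g1) & set(g2))
    n = len(shared)
    if n == 0:
        return 0.0
    idx = {lab: i for i, lab in enumerate(shared)}
    # Product-graph adjacency: an edge exists iff it exists in BOTH graphs.
    W = np.zeros((n, n))
    for u in shared:
        for v in g1[u] & g2[u]:
            if v in idx:
                W[idx[u], idx[v]] = 1.0
    # Truncated geometric series over walk lengths (instead of a matrix inverse).
    k, P = 0.0, np.eye(n)
    for step in range(1, max_len + 1):
        P = P @ W                     # P[i, j] = number of common walks i -> j
        k += (lam ** step) * P.sum()  # of the current length
    return float(k)

g1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}   # path a-b-c
g2 = {"a": {"c"}, "c": {"a"}}                    # edge a-c only
k_self = random_walk_kernel(g1, g1)   # positive: the graph shares walks with itself
k_cross = random_walk_kernel(g1, g2)  # 0.0: no edge exists in both graphs
```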
22. Simplest possible sub-kernels Node kernel: delta kernel (exact match) Edge kernel: Brownian bridge kernel
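Assuming the node kernel compares node labels and the Brownian bridge kernel compares scalar edge weights with a hypothetical cutoff parameter c, the two sub-kernels can be written as:

```python
def node_kernel(l1, l2):
    # Delta kernel: 1 on an exact label match, 0 otherwise.
    return 1.0 if l1 == l2 else 0.0

def edge_kernel(w1, w2, c=2.0):
    # Brownian bridge kernel: largest when the edge weights agree,
    # decaying linearly to zero once they differ by c or more.
    # The cutoff c is a hypothetical choice; the slide gives no value.
    return max(0.0, c - abs(w1 - w2))
```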
23. The two issues solved: the previous WSD setup The terms t1, t2, ..., tn observed in the context of the target term are combined by (1) network operations (union) into the network of the context. This context network is then compared by (2) the network kernel (similarity function) against the networks of the candidate senses (built with union & intersection).
24. Synonym Test Finding the synonym among given candidates, e.g. grin: {exercise, rest, joke, smile}. Selecting the most similar candidate in terms of the normalized dot product. Test set: TOEFL synonym test set (Landauer 1997). Training corpus: British National Corpus (BNC-XML).
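Kernel values can be normalized the same way cosine similarity normalizes a dot product: k(a,b) / sqrt(k(a,a) * k(b,b)). A sketch with a plain dot product standing in for the graph kernel, and made-up co-occurrence counts for the candidates:

```python
import math

def normalized(kernel, a, b):
    # Kernel analogue of cosine similarity.
    denom = math.sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0

def dot(u, v):
    # Stand-in base kernel; a real run would use the graph kernel instead.
    return sum(u[t] * v.get(t, 0) for t in u)

# Hypothetical co-occurrence profiles for "grin" and its candidates.
target = {"face": 2, "happy": 3}
candidates = {
    "exercise": {"gym": 4},
    "smile": {"face": 1, "happy": 2},
    "joke": {"laugh": 2, "happy": 1},
}

# Pick the candidate with the highest normalized kernel value.
best = max(candidates, key=lambda w: normalized(dot, target, candidates[w]))
```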
26. Synonym Test Summary Under the same conditions and various parameters (same corpus, same sampling method), the network performs about 3 points better on average, but the difference is statistically insignificant: there are only 80 items in this test set. Network similarity is less sensitive to the context window size.
27. Word Sense Disambiguation WSD is sense disambiguation; example for the term “disc”: Phrase 1) “Previn and the LSO on the front of any disc was ...” Phrase 2) “Microsoft will replace your disc, if it’s within ...” Sense candidates: Sense 1) disc as “phonograph, record, recording” Sense 2) disc as “magnetic disc” The task is to assign a sense from the candidates: again, selecting the most similar sense candidate in terms of the kernel similarity value.
28. Word Sense Disambiguation: General Domain Test set: SensEval-3 lexical sample data Sense candidates: WordNet senses Corpus: BNC-XML Sense expressions: synset union/intersection Context expressions: union of phrase terms Result: a 4+ point performance gain, statistically significant. The network version is comparable to state-of-the-art unsupervised WSD. [Figure: accuracy compared against supervised and unsupervised systems.]
29. Word Sense Disambiguation: Biomedical Domain WSD accuracy on a biomedical WSD test set Test set: extended NLM dataset Corpus: PubMed open subset Same representation for senses and context Sense candidates from the UMLS Metathesaurus; the average number of senses was 2.4 Outperformed the baseline vector method by nearly 10 points.
30. Flickr tag translation Tag translation is translation disambiguation: finding the proper translation for a given term. Example: spring, field, flowers. {spring the season = (봄, Frühjahr), spring as a mechanical device = (스프링, Sprungfeder), hot/water springs = (샘, Brunnen), …} Experiments on the MIRFLICKR-25000 image collection: translating the English tags of images number 1 to 1000 from English to German. Baseline method (state-of-the-art): a coherence (mutual information) based method, which selects the translation candidates that co-occur most in the target-language corpus. {spring, field, flowers}
31. Tag translation wood: Holz (wood as a material), Wald (forest) desk: Schalter (a counter), Schreibtisch (a desk for reading/writing), Tisch (a table) (1) {wood, desk} expands to the candidate sets {Holz, Schalter}, {Holz, Schreibtisch}, {Holz, Tisch}, {Wald, Schalter}, {Wald, Schreibtisch}, {Wald, Tisch}. (2) Each candidate set is compared against {wood, desk}; e.g. {Holz, Schreibtisch} should score higher than {Wald, Schreibtisch} or {Holz, Tisch}.
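Step (1) above, enumerating every combination of translation candidates, can be sketched directly; step (2) would then score each candidate set with the graph kernel against its target-language network (not shown here).

```python
from itertools import product

# Dictionary translations for each English tag, as on the slide.
translations = {
    "wood": ["Holz", "Wald"],
    "desk": ["Schalter", "Schreibtisch", "Tisch"],
}

# Step (1): enumerate every combination of translation candidates.
tags = sorted(translations)
combos = [dict(zip(tags, choice))
          for choice in product(*(translations[t] for t in tags))]
# 2 translations for "wood" x 3 for "desk" = 6 candidate sets.
# Step (2), not shown: build the target-language co-occurrence network of
# each candidate set and rank the sets by graph-kernel similarity.
```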
32. Tag translation Candidates are not senses but target-language networks, which creates incompatible node labels: the target network nodes carry German labels. Solution: adopting a node kernel that consults a machine-readable dictionary.
33. Tag translation result Targets: 3696 tags listed in the dictionary, among 5899 unique tags; 965 of the 3696 had only a single translation. Outperformed the coherence-based translation by nearly 5%.
34. Summary Network as a semantic representation: a co-occurrence network to replace co-occurrence vectors. Performance gains: in several NLP tasks that need a semantic relatedness measure, the network-based representations consistently outperformed equivalent vector representations. The co-occurrence network and the associated kernel can be used in applications that use co-occurrence vectors and cosine similarity. Language resources can be adopted into the kernel with minimal impact, by modifying the sub-kernels. One notable shortcoming is that the kernel computation is much slower than the cosine similarity calculation.
35. Please remember this, even if you forget everything else! (Not true) “Data mined from a corpus should be represented as vectors.” There are well-established mathematical methods to compare data captured in the form of structures: the R-convolution kernel. (Not true) “Kernels are only for kernel machines.” A kernel is just a dot product in a higher-dimensional space, computed without explicitly generating that high dimension (the kernel trick). Kernels are essential in kernel machines (e.g. SVMs), but a kernel can be useful simply as a dot product itself.