
# Medium Information Quantity

## by Vitalie Scurtu, Natural Language Processing Engineer at YooDeal on Jul 27, 2013

## Medium Information Quantity: Presentation Transcript

• Medium Information Quantity (Vitalie Scurtu)
• History
  ● Shannon, "A Mathematical Theory of Communication" (1948)
  ● Lossless compression (Shannon-Fano, adaptive Huffman)
  ● Today it is used in cryptography, in the analysis of DNA sequences (http://pnylab.com/pny/papers/cdna/cdna/index.html), and in Natural Language Processing applications
• What is information quantity? The information quantity of a phenomenon depends on its frequency.
  ● Low information quantity
    ○ It rains in London
    ○ The economy is in crisis
    ○ Berlusconi went with escorts
    Low information quantity: I am telling you things you have already heard many times; nothing new.
  ● High information quantity
    ○ Today it snows in Rome
    ○ Dentists on strike
    High information quantity: I am telling you things you have never or rarely heard; much new information.
• Entropy and the medium information quantity
  ● The formulas of information quantity:
    ○ H(m) = −Σ p(m) · log p(m)
    ○ V(m) = (1/n) · Σ (1 − p(m))
  ■ p(m): probability that m will happen; H(m): entropy, or information quantity (IQ); V(m): medium (mean) information quantity (MIQ)
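The two formulas above can be sketched in Python. This is a minimal sketch under stated assumptions: p(m) is taken as the relative frequency of token m in the text, H(m) sums over distinct tokens, and V(m) averages (1 − p) over the n token occurrences, which would keep V in the 0..1 range described on the later slides.

```python
from collections import Counter
from math import log2

def entropy(tokens):
    # H(m) = -sum over distinct tokens of p(m) * log2 p(m)
    n = len(tokens)
    probs = [c / n for c in Counter(tokens).values()]
    return -sum(p * log2(p) for p in probs)

def mean_information_quantity(tokens):
    # V(m) = (1/n) * sum over the n token occurrences of (1 - p(m))
    # (assumption: the sum runs over occurrences, not distinct types)
    n = len(tokens)
    counts = Counter(tokens)
    return sum(1 - counts[t] / n for t in tokens) / n

# Tiny illustration on a hypothetical token list:
tokens = ["a", "a", "b", "c"]
print(entropy(tokens))                    # 1.5 bits
print(mean_information_quantity(tokens))  # 0.625
```

Under this reading, V equals 1 minus the sum of squared token probabilities, which is why it stays bounded by 1.0 regardless of text length.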
• Probability coefficients [chart: y = p(x) for x = 1..n; p(x) → 1]
• Logarithm coefficients [chart: y = log(x) for x in 0..1; log(x) → −∞ as x → 0]
• Coefficients of Shannon entropy [chart: y = −x · log(x) for x = p(1..n)]
  ● Very likely words: in, the, is, has, of
  ● Very unlikely words: APT
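The curve on this slide is the per-word entropy contribution −p · log p: it is small both for very likely words (p near 1) and for very unlikely words (p near 0), and peaks at p = 1/e. A quick numeric check (a sketch; base-2 logarithm assumed):

```python
from math import log2

def entropy_term(p):
    # per-word contribution to Shannon entropy: -p * log2(p)
    return -p * log2(p)

# Contribution is largest near p = 1/e, small at both extremes
for p in (0.9, 0.5, 1 / 2.718281828459045, 0.01):
    print(f"p={p:.3f}  -p*log2(p)={entropy_term(p):.4f}")
```

This is why both stop words and very rare words contribute little to a document's entropy.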
• Distribution of documents by entropy
  ● Zipf distribution (long tail): few documents at the upper extreme, many at the lower extreme
  ● MIN = 0, MAX = 1700 (no upper limit)
  [chart: x = doc(1..n), y = H(doc(1..n))]
• Distribution of documents by medium information quantity
  ● Gaussian distribution: few documents at the extremes, the majority around the middle values
  ● MIN = 0, MAX = 1.0
  [chart: x = doc(1..n), y = V(doc(1..n))]
• Distribution of documents by medium information quantity [chart: x = doc(1..n), y = V(doc(1..n))]
• Entropy depends on the text length
  ● Correlation with text length: 0.99, the highest correlation; entropy and text length are almost identical signals
• Conclusions
  ● MIQ correlates very weakly with text length: 0.05, versus 0.985 for entropy
  ● Correlation of MIQ vs. IQ: 0.57
  ● Entropy depends on text length; MIQ does not, and therefore it can detect anomalies
  ● MIQ carries information about text style
  ● MIQ complements IQ
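The length-dependence claim in the conclusions can be illustrated with a small simulation. This is a sketch under assumptions: synthetic Zipf-like text over a hypothetical vocabulary w1..w5000, with entropy and MIQ defined as on the formula slide (relative-frequency probabilities, MIQ averaged over token occurrences).

```python
import random
from collections import Counter
from math import log2

def entropy(tokens):
    # H = -sum over distinct tokens of p * log2(p)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in Counter(tokens).values())

def miq(tokens):
    # V = (1/n) * sum over occurrences of (1 - p)
    n = len(tokens)
    counts = Counter(tokens)
    return sum(1 - counts[t] / n for t in tokens) / n

random.seed(0)
vocab = [f"w{k}" for k in range(1, 5001)]          # hypothetical vocabulary
weights = [1 / k for k in range(1, 5001)]          # Zipf-like frequencies
text = random.choices(vocab, weights=weights, k=20000)

short, long_ = text[:1000], text[:20000]
print(entropy(short), entropy(long_))  # entropy grows with text length
print(miq(short), miq(long_))          # MIQ stays roughly stable
```

Longer samples keep accumulating new vocabulary, so entropy keeps rising with length, while MIQ hovers near the same value for both sample sizes.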
• The End. Questions? Email scurtu19@gmail.com