Presented at PyData London 2015: http://london.pydata.org/schedule/presentation/43/
(Version with animations/transitions on Google Slides: https://docs.google.com/presentation/d/1fVMYTXcWD40aKo6_Z4iTpX2IKLZcAO13U7d5hcUN2EU/edit )
(An older version was presented on 12 May 2015 at The London Big-O Meetup: http://www.meetup.com/big-o-london/events/222028048/ )
3. Define Surprise!
surprise
[countable] an event, a piece of news, etc. that is unexpected or that happens suddenly
SYNONYMS: shock, … , eye-opener
[uncountable, countable] a feeling caused by something happening suddenly or unexpectedly
SYNONYMS: astonishment, ...
(Oxford Advanced Learner's Dictionary)
9. Quantify Complexity
Complexity can measure any content type. Note: complex is not random!
Measures of complexity:
1. Subjective rating
2. #Distinct elements
3. #Dimensions
4. #Control parameters
5. Minimal description
6. Information content
7. Minimal generator
8. Minimum energy
Abdallah, S., & Plumbley, M. (2009). Information dynamics: patterns of expectation and surprise in the perception of music. Connection Science, 21(2-3), 89-117.
10. Surprise Quants in academia
Neuro/Cognitive Science: How do we perceive information?
vs
Machine Learning: How to measure differences?
11. “... machine that constantly tells you what you already know is just irritating. So software alerts users only to surprises...”
Horvitz, E., Apacible, J., Sarin, R., & Liao, L. Prediction, Expectation, and Surprise: Methods, Designs, and Study of a Deployed Traffic Forecasting Service.
Friston, K. (2010). The free-energy principle: a unified brain theory?. Nature Reviews Neuroscience, 11(2), 127-138.
Surprise Quants in academia
Neuro/Cognitive Science: How do we perceive information?
Machine Learning: How to measure differences?
12. Surprise Quants in academia
Machine Learning, Neuro/Cognitive Science
Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).
13. Surprise Quants in academia
Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).
(figure: frames rated meh / wow / meh)
14. Typical ML applications
Unsupervised Learning:
1. Decision trees (inf. gain)
2. MaxEnt principle
3. ...
Specifically after ‘surprise’:
4. One-class classification
5. Anomaly detection
6. Novelty measure
Pimentel, M. A., Clifton, D. A., Clifton, L., & Tarassenko, L. (2014). A review of novelty detection. Signal Processing, 99, 215-249.
15. Model of a cat
(diagram) Data (stream) → Element (attention window) → Surprising? (interesting, new) → wow (act) / meh (ignore); Update → Data Model (expectations)
16. Model of a cat’s surprise
(diagram focus) Surprising? (interesting, new)
17. Quantify surprisal /self-information/
The surprise /information/ I(p) in observing the occurrence of an event having probability p.
Axioms:
1. I(p) ≥ 0
2. I(1) = 0
3. p1 ≤ p2 ⟹ I(p1) ≥ I(p2)
4. I(p1 ∗ p2) = I(p1) + I(p2) for independent events
Derive: additivity over independent events forces a logarithm.
Surprisal /self-information/:
I(p) = −log2(p), measured in bits (or wows)
Flipping a fair coin provides 1 bit of new information: I(1/2) = −log2(1/2) = 1 bit.
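In Python the formula is a one-liner (a throwaway illustration, not code from the deck):

import math

def surprisal(p):
    """Self-information of an event with probability p, in bits (or wows)."""
    return -math.log2(p)

surprisal(0.5)    # fair coin flip -> 1.0 bit
surprisal(0.25)   # 1-in-4 event  -> 2.0 bits
surprisal(1.0)    # certain event -> 0.0 bits: no surprise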
19. Model of a cat
(diagram) Data (stream) → Element (attention window) → Surprising? (interesting, new) → wow (act) / meh (ignore); Update → Data Model (expectations)
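Putting slides 15-19 together, a minimal Python sketch of this loop might look as follows (my illustration, not code from the deck; the crude smoothed-frequency model stands in for the cat's expectations):

import math
from collections import Counter

def cat(stream, threshold_bits=5.0):
    """Model of a cat: score each element against expectations,
    act on surprises, and update the model either way."""
    model, total = Counter(), 0                  # Data Model (expectations)
    for element in stream:                       # Element (attention window)
        p = (model[element] + 1) / (total + 2)   # smoothed probability estimate
        if -math.log2(p) > threshold_bits:       # Surprising?
            print("wow:", element)               # act
        # else: meh, ignore
        model[element] += 1                      # Update expectations
        total += 1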
20. Model of a cat’s knowledge
(diagram focus) Data Model (expectations)
21. Quantify ‘knowledge’ /entropy/
The Shannon entropy is the expected value of the self-information:
H(X) = −Σ p(x) log2 p(x)
Notes:
1. The maximum entropy distribution is the least informative (max: log2(n) for n equiprobable outcomes).
2. Statistical-mechanics entropy and information entropy are principally the same.
(figure) Entropy of a Bernoulli trial, X ∈ {0,1}
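A direct Python rendering of the definition (an illustrative sketch):

import math

def entropy(probs):
    """Shannon entropy in bits: the expected value of the self-information."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropy([0.5, 0.5])     # fair Bernoulli trial   -> 1.0 bit (the curve's maximum)
entropy([0.9, 0.1])     # biased coin            -> ~0.47 bits
entropy([0.25] * 4)     # 4 equiprobable outcomes -> 2.0 bits = log2(4)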
22. Entropy applications
(figure) Analysis of a binary of a GeoIP ISP database.
Analyzing unknown binary files using information entropy:
http://yurichev.com/blog/entropy/
23. Entropy applications
(figure) Visualizing the OSX ksh binary (see binvis.io), annotations 1, 2: cryptic signature.
Visualizing entropy in binary files:
http://corte.si/posts/visualisation/entropy/index.html
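Both analyses boil down to sliding-window byte entropy, roughly like this (a sketch under my own assumptions; the file path is just an example):

import math
from collections import Counter

def window_entropy(data, window=256, step=256):
    """Shannon entropy in bits/byte over sliding windows of a byte string.
    Text and padding score low; compressed or encrypted regions approach 8."""
    scores = []
    for i in range(0, max(len(data) - window, 0) + 1, step):
        counts = Counter(data[i:i + window]).values()
        n = sum(counts)
        if n:
            scores.append(-sum(c / n * math.log2(c / n) for c in counts))
    return scores

with open("/bin/ls", "rb") as f:    # any binary file will do
    print(window_entropy(f.read())[:8])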
24. Model of a cat’s discovery
(diagram) Data (stream) → Element (attention window) → Surprising? (interesting, new) → wow (act) / meh (ignore); Data Model (expectations): what has changed?
25. Quantify ‘discovery’ /information gain/
The Kullback–Leibler divergence /relative entropy, information gain/:
DKL(P‖Q) = Σ P(x) log2(P(x) / Q(x))
It is a measure of the information lost when Q is used to approximate P (the expected number of extra bits required to recode).
(figure: "KL-Gauss-Example", T. Nathan Mundhenk)
Not a true distance: asymmetric, DKL(P‖Q) ≠ DKL(Q‖P).
26. Quantify ‘discovery surprise’
Symmetric KL distances: all result in the same performance.
Pinto, D., Benedí, J. M., & Rosso, P. (2007). Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing.
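In Python, the definition and one simple symmetrisation look like this (a sketch; the unsmoothed version assumes the distributions share support):

import math

def kl(p, q):
    """D_KL(P‖Q) in bits: information lost when Q approximates P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """One common symmetric variant: D_KL(P‖Q) + D_KL(Q‖P)."""
    return kl(p, q) + kl(q, p)

p, q = [0.5, 0.5], [0.9, 0.1]
kl(p, q)             # ~0.74 bits
kl(q, p)             # ~0.53 bits -> asymmetric, hence not a true distance
symmetric_kl(p, q)   # ~1.27 bits, same in either order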
38. Simplistic topic modeling
- tweets are super short
+ important events are widely discussed
+ events change vocabulary
- timeslot aggregation favors the predominant event
Document is a timeslot.
Model:
- bag of words
- freq. threshold > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize
+ a few touches
39. Simplistic topic modeling
Document is a time slot.
Model:
- bag of words
- freq. threshold > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize
+ a few touches
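A bare-bones version of this model in Python (illustrative only; the deck's pipeline used the tweetokenize tokenizer plus "a few touches", so the whitespace split here is a stand-in):

from collections import Counter

def timeslot_vocabulary(tweets, min_tweets=200):
    """Document = one timeslot. Naive bag of words: keep terms
    that occur in more than `min_tweets` tweets in the slot."""
    doc_freq = Counter()
    for text in tweets:
        doc_freq.update(set(text.lower().split()))  # stand-in tokenizer
    return {term: n for term, n in doc_freq.items() if n > min_tweets}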
41. Test a domain-specific hack
Vocabulary: catastrophe
…
42. Vocabulary slots: KLD
How surpriseful the vocabulary of each hour is against the whole dataset.
Beware: on this scale individual hours are small, but events are plentiful.
Higher KLD on sparse data; lower KLD on dense data.
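One way to compute that per-hour score (a sketch; the add-alpha smoothing is my assumption, to keep terms unseen in one side finite, and it is also why sparse hours tend to score higher):

import math
from collections import Counter

def hour_kld(hour_counts, corpus_counts, alpha=1.0):
    """D_KL(hour ‖ whole dataset) over the combined vocabulary,
    with add-alpha smoothing for unseen terms."""
    vocab = set(hour_counts) | set(corpus_counts)
    hour_total = sum(hour_counts.values()) + alpha * len(vocab)
    corpus_total = sum(corpus_counts.values()) + alpha * len(vocab)
    kld = 0.0
    for term in vocab:
        p = (hour_counts[term] + alpha) / hour_total
        q = (corpus_counts[term] + alpha) / corpus_total
        kld += p * math.log2(p / q)
    return kld

hour = Counter({"flood": 50, "storm": 30, "lol": 20})
whole = Counter({"lol": 5000, "news": 3000, "flood": 100, "storm": 80})
print(hour_kld(hour, whole))   # high KLD -> a surpriseful hour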
61. To improve in the Tweets app
1. Benchmark: ‘hot’ events from media
2. Fight bots
a. spam (repetitions, bots)
b. ‘forced’ opinions
c. filter low quality
3. Topic model
a. not just Term Frequency
b. split topics (!)