Social media analysis is a scientific field that is rapidly gaining ground, owing to its numerous research challenges and practical applications as well as the unprecedented availability of data in real time. Several of these applications, such as journalism, crisis management, and advertising, have significant social and economic impact. However, two issues have to be confronted. The first is financial cost: despite the abundance of information, it typically comes at a premium price, and only a fraction is provided free of charge. For example, Twitter, a predominant social media service, grants researchers and practitioners free access to only a small proportion (1%) of its publicly available stream. The second issue is computational cost: even when the full stream is available, off-the-shelf approaches are unable to operate in such settings because of the real-time computational demands. Consequently, real-world applications as well as research efforts that exploit such information are limited to using only a subset of the available data. In this paper, we evaluate the extent to which analytical processes are affected by this limitation. In particular, we apply a range of analysis tasks to two subsets of Twitter public data, obtained through the service's sampling APIs. The first is the default 1% sample, whereas the second is the Gardenhose sample that our research group has access to, which returns 10% of all public data. We extensively evaluate their relative performance in numerous scenarios.
Talk given at TAPP'16 (Theory and Practice of Provenance), June 2016. The paper is available at:
https://arxiv.org/abs/1604.06412
Abstract:
The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors:
low cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms.
One observation that is often overlooked, however, is that none of these elements is immutable; rather, they all evolve over time.
As those datasets change over time, the value of their derivative knowledge may decay, unless it is preserved by reacting to those changes. Our broad research goal is to develop models, methods, and tools for selectively reacting to changes by balancing costs and benefits, i.e. through complete or partial re-computation of some of the underlying processes.
In this paper we present an initial model for reasoning about change and re-computations, and show how analysis of detailed provenance of derived knowledge informs re-computation decisions.
We illustrate the main ideas through a real-world case study in genomics, namely on the interpretation of human variants in support of genetic diagnosis.
Your data won’t stay smart forever: exploring the temporal dimension of (big ... - Paolo Missier
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership” of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and to assess the costs and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by giving a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, in which we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
Our vision for the selective re-computation of genomics pipelines in reaction to changes to tools and reference datasets.
How do you prioritise patients for re-analysis on a given budget?
ISEC 2014 (International Statistical Ecology Conference) - Olga Lyashevska
The effect of grid spacing on spatial prediction of species abundances was estimated. Data on counts of intertidal macrofauna (M. balthica) were collected in the Dutch Wadden Sea over a grid of 500 × 500 m. The first step in the procedure was modelling the zero-inflated data without taking spatial dependency into account. The problem of excess zeros was addressed through a mixture model (Lambert, 1992), which allowed us to distinguish the point mass at zero through a Bernoulli process and the count component through a Poisson process. In the second step, spatial correlation in both processes was accounted for through a generalised linear geostatistical model (GLSM) (Diggle et al., 1998; Christensen, 2004). Using MCMC simulations from the conditional distribution, a Monte Carlo approximation to the likelihood function was obtained. In the third step, the two calibrated GLSMs were used to generate 100 pseudo-realities by conditional simulation from the original grid to the nodes of a fine prediction grid (100 × 100 m), supplemented with 1000 randomly selected validation points. The simulated pseudo-realities of the Bernoulli variable and the Poisson variable were combined into 100 pseudo-realities of a zero-inflated Poisson variable. In the fourth step, each simulated pseudo-reality was repeatedly sampled by grid sampling with varying spacing. Each sample was used to predict the study variable at the validation points by inverse-distance-weighted interpolation and to estimate the Mean Squared Error (MSE). By averaging the MSEs over the pseudo-realities, an estimate of the model expectation of the MSE was obtained. The results showed that decreasing the resolution of the sampling grid (upscaling) had a clear effect on the precision of the predictions. This has direct implications for decisions about sampling density in ecological monitoring programmes.
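To make the fourth step more concrete, here is a minimal Python sketch (not the authors' code) of grid-sampling a zero-inflated Poisson pseudo-reality at several spacings, predicting at validation points by inverse-distance weighting, and computing the MSE. The unconditional ZIP field, domain size, and all parameter values are invented placeholders; the study instead uses conditional simulation from the two calibrated geostatistical models.

```python
# Sketch of the grid-sampling / IDW / MSE step, with invented parameters.
import numpy as np

rng = np.random.default_rng(42)

# Fine 100 x 100 m prediction grid over a hypothetical 5 km x 5 km domain.
xs = np.arange(0, 5000, 100.0)
ys = np.arange(0, 5000, 100.0)
gx, gy = np.meshgrid(xs, ys)

# Zero-inflated Poisson pseudo-reality: Bernoulli presence x Poisson counts.
presence = rng.random(gx.shape) < 0.4          # hypothetical presence probability
counts = rng.poisson(lam=5.0, size=gx.shape)   # hypothetical mean count
field = np.where(presence, counts, 0).astype(float)

# Validation points drawn at random grid nodes.
vidx = rng.choice(field.size, size=1000, replace=False)
vx, vy, vtrue = gx.ravel()[vidx], gy.ravel()[vidx], field.ravel()[vidx]

def idw_predict(sx, sy, sv, px, py, power=2.0):
    """Inverse-distance-weighted prediction at points (px, py) from samples."""
    d = np.hypot(px[:, None] - sx[None, :], py[:, None] - sy[None, :])
    w = 1.0 / np.maximum(d, 1e-9) ** power
    return (w * sv[None, :]).sum(axis=1) / w.sum(axis=1)

for spacing in (500, 1000, 2000):              # candidate grid spacings in metres
    step = int(spacing // 100)
    sx, sy = gx[::step, ::step].ravel(), gy[::step, ::step].ravel()
    sv = field[::step, ::step].ravel()
    pred = idw_predict(sx, sy, sv, vx, vy)
    mse = np.mean((pred - vtrue) ** 2)
    print(f"spacing {spacing:5d} m  ->  MSE {mse:.2f}")
```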
With the tremendous growth of social networks, there has been a corresponding growth in the amount of new data created every minute on these sites. The notion of community in this social networking world has attracted a lot of attention. Studying Twitter is useful for understanding how people use new communication technologies to form social connections and maintain existing ones. We analysed how geo-tagged tweets in Twitter can be used to identify useful user features and behaviour, as well as to identify landmarks and places of interest. We also analysed several clustering algorithms and proposed different similarity measures to detect communities.
Smart metering technologies allow for gathering high resolution water demand data in the residential sector, opening up new opportunities for the development of models describing water consumers’ behaviors. Yet, gathering such accurate water demand data at the end-use level is limited by metering intrusiveness, costs, and privacy issues. In this paper, we contribute a stochastic simulation model for synthetically generating high-resolution time series of water use at the end-use level. Each water end-use fixture in our model is characterized by its signature (i.e., its typical single-use pattern), as well as frequency distributions of its number of uses per day, single use duration, time of use during the day, and contribution to the total household water demand. The model relies on statistical data from a real-world metering campaign across 9 cities in the US. Showcasing our model outputs, we demonstrate the potential usability of this model for characterizing the water end-use demands of different communities, as well as for analyzing the major components of peak demand and performing scenario analysis.
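As a rough illustration of the kind of generator the abstract describes, here is a hedged Python sketch: a single household's one-day demand series assembled from per-fixture signatures and use-frequency, duration, and time-of-use distributions. The fixture names, parameter values, and trapezoidal signature shape are all invented placeholders, not the paper's fitted distributions.

```python
# Illustrative sketch only: one synthetic day of household water demand at
# 10-second resolution, built from invented per-fixture parameters.
import numpy as np

rng = np.random.default_rng(0)
DT = 10                                  # time step [s]
N = 24 * 3600 // DT                      # samples in one day

fixtures = {
    # name: (mean uses/day, mean duration [s], peak flow [L/min], preferred hours)
    "toilet": (5.0, 60, 6.0, range(0, 24)),
    "shower": (1.5, 480, 8.0, (7, 8, 20, 21)),
    "faucet": (15.0, 30, 4.0, range(6, 23)),
}

def signature(duration_steps, peak):
    """Trapezoidal single-use pattern: ramp up, plateau, ramp down."""
    ramp = max(1, duration_steps // 5)
    plateau = max(0, duration_steps - 2 * ramp)
    return np.concatenate([np.linspace(0, peak, ramp),
                           np.full(plateau, peak),
                           np.linspace(peak, 0, ramp)])

demand = np.zeros(N)
for name, (uses, dur, peak, hours) in fixtures.items():
    for _ in range(rng.poisson(uses)):                   # number of uses today
        hour = rng.choice(list(hours))                    # time-of-use distribution
        start = (hour * 3600 + rng.integers(0, 3600)) // DT
        d = max(2, int(rng.exponential(dur) // DT))       # single-use duration
        pulse = signature(d, peak)
        end = min(N, start + len(pulse))
        demand[start:end] += pulse[: end - start]         # superpose the signature

print(f"total daily use: {demand.sum() * DT / 60:.0f} L, peak: {demand.max():.1f} L/min")
```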
The 14th Summer Environmental Health Sciences Institute took place in Houston, TX the week of 7/14/2014. This workshop on climate change comes from educational designers at the National Center for Atmospheric Research. While you may not have been able to join us, you can still review content and download all the activities at our website: https://scied.ucar.edu/events/clone-climate-change-connections-2014
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ... - Werner Leyh
Abstract. The aim of this work is to explore the opportunities offered by semantic standardization to interlink primary “spatial data” (GI) from “OpenStreetMap” (OSM) with repositories of the “Linked Open Data Cloud” (LOD). Research in the natural sciences can generate vast amounts of spatial data, and Wikidata could be considered the central hub between more detailed natural-science hubs on the spatial semantic web. Wikidata is a world-readable and world-writable, community-driven knowledge base. It offers the opportunity to collaboratively construct an open-access knowledge graph that spans biology, medicine, and all other domains of knowledge. In this study, we discuss the opportunities and challenges of exploring Wikidata as a central integration facility by interlinking it with OSM, a popular, community-driven collection of free geographic data. This is empowered by the reuse of terms and properties from commonly understood controlled vocabularies that represent their respective well-identified knowledge domains.
URL: https://www.springerprofessional.de/en/interlinking-standardized-openstreetmap-data-and-citizen-science/13302088
DOI: https://doi.org/10.1007/978-3-319-60366-7_9
Werner Leyh, Homero Fonseca Filho
University of São Paulo (USP), São Paulo, Brazil
WernerLeyh@yahoo.com
Scott Edmunds talk at AIST: Overcoming the Reproducibility Crisis: and why I ... - GigaScience, BGI Hong Kong
Scott Edmunds talk at the AIST Computational Biology Research Center in Tokyo: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods), July 1st 2014
PHY 241 Lab 7 - Momentum is Conserved.docx - oswald1horne84988
PHY 241 Fall 2018
PHY 241 Lab 7 - Momentum is Conserved
Introduction:
Momentum is a vector quantity, measured by taking the product of an object's mass and its velocity,

    p = m v.    (1)

(Here p and v are vectors.) Much like energy, the concept of momentum is useful because we have a law which guarantees that the momentum of an appropriate system is conserved:

“The total amount of momentum in a system is constant unless momentum is transferred through the system boundary by an impulse.”

where an impulse is an external force which acts on the system over time,

    I = ∫ F_ext dt.
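A quick numeric illustration of Eq. (1) and the impulse integral (not part of the original handout; all values are made up), assuming a constant external force:

```python
# Not part of the original handout: a tiny numeric check of Eq. (1) and the
# impulse integral for a constant force, with made-up values.
mass, velocity = 0.5, 0.8        # kg, m/s
p = mass * velocity              # Eq. (1): momentum, kg*m/s
force, dt = 2.0, 0.1             # constant external force [N] applied for dt [s]
impulse = force * dt             # I = F_ext * dt when the force is constant
v_after = (p + impulse) / mass   # momentum change equals the impulse
print(p, impulse, v_after)       # 0.4 kg*m/s, 0.2 N*s, 1.2 m/s
```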
Equipment:
Two CBR 2 units, connected directly to a computer using USB cables
Various collision carts
Mass blocks for carts
2 m track
Bubble level
Computer with Logger Pro or Logger Lite and Excel.
Triple beam balance scale.
Procedure:
1) Design a procedure to collect the information you need to measure the momentum of two
carts simultaneously. WARNING: Occasionally, the clicks from your two different CBRs will
interfere with each other and give incorrect data. Your group should develop criteria to
determine when data is invalid, and decide how to respond when it is.
2) Generate a plot of the momentum of each cart as well as the total momentum similar to
“Carts’ Momenta.” Notice you must correct for the fact that the two different CBRs are
using different coordinate systems.
3) Similarly, generate a plot of the kinetic energy of each cart as well as the total kinetic
energy.
4) This should allow you to make a single plot containing both the Kinetic Energy and the
Momenta for the same collision. Notice you will need to let Excel know that Energy needs
to be plotted on a “Secondary Axis” because these two quantities have different units.
[Example plots from the handout: “Energy and Momentum” (Total Momentum and Total Kinetic Energy vs. Time (s)); “Carts' Momenta” (Cart 1, Cart 2, and Total Momentum vs. Time (s)); “Carts' Energies” (Cart 1, Cart 2, and Total Kinetic Energy vs. Time (s)).]
5) At this point there are a few questions that arise from the Energy and Momentum
graph above.
A) DA - Is the behavior of the Energy and Momentum graph unique to the specific details of
the collision? Collect energy and momentum data for at least four different collisions
(magnet/spring/Velcro, different mass carts, etc.) and find a way to visualize all this data
so you can qualitatively compare and contrast features you see in the data.
B) Researcher- Choose a single trial to investigate momentum carefully. Is momentum
conserved? Measure the Impulse generated by force(s) on your system and see if you
can account for any changes in momentum you observed. Be as quantitative as
possible.
C) PI- Choose a single trial to investigate energy carefully. Energy appears to
Data Curation and Debugging for Data Centric AI - Paul Groth
It is increasingly recognized that data is a central challenge for AI systems - whether training an entirely new model, discovering data for a model, or applying an existing model to new data. Given this centrality of data, there is a need for new tools that help data teams create, curate, and debug datasets in the context of complex machine learning pipelines. In this talk, I outline the underlying challenges for data debugging and curation in these environments. I then discuss our recent research, which both takes advantage of ML to improve datasets and uses core database techniques for debugging in such complex ML pipelines.
Presented at DBML 2022 at ICDE - https://www.wis.ewi.tudelft.nl/dbml2022
My presentation to the World Nuclear Association Symposium 2015. In this presentation I discussed updated findings of my review of the 100% renewable energy system literature.
SCALGO software allows for efficient handling of massive terrain data on standard workstations. The software is provably efficient on all input data sets and always delivers fully specified output. It eliminates the need for accuracy-decreasing data thinning and for cumbersome workflows such as those introduced by data tiling.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... - University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
This PDF is about schizophrenia.
For more details, visit the SELF-EXPLANATORY channel on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks!
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... - Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their ability to enable complex behavior composed of discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Nutraceutical market, scope and growth: Herbal drug technology - Lokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is growing quickly. Product formulation innovations and the use of cutting-edge technology for customized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Mining Twitter Data with Resource Constraints - IEEE/ACM Conference on Web Intelligence 2014
1. Mining Twitter Data with Resource Constraints
George Valkanas, Ioannis Katakis,
Dimitrios Gunopulos, Anthony Stefanidis
August 12, 2015
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 1 / 18
2. Research Question
Is the 1% sample provided by the Twitter API sufficient for
spatio-temporal analysis tasks? ... and for which tasks?
We compare it with the 10% sample (Garden Hose).
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 2 / 18
3. Outline
1 Problem and Motivation
2 Data Collection
3 Experiments in Various Tasks
Geo-location Coverage
Sentiment Analysis
Popular Topic Detection
Graph Evolution
4 Conclusions
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 3 / 18
4. Introduction
Twitter Samples
Two ways to access the stream
Public Stream: 1% Sample
Garden Hose: 10% Sample
... in both cases, we don't know details about the sampling method.
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 4 / 18
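For orientation, a minimal sketch of how the 1% public sample was typically consumed at the time, using the tweepy library's Stream class (v4-style API) against the old v1.1 statuses/sample endpoint; the credentials are placeholders and this free endpoint has since been retired, so treat it purely as an illustration:

```python
# Illustration only (placeholder credentials; the v1.1 streaming endpoint used
# in this study has since been retired). Assumes tweepy >= 4.
import tweepy

class SampleListener(tweepy.Stream):
    def on_status(self, status):
        # Each status is one tweet from the ~1% public sample.
        print(status.id, status.lang, status.text[:60])

stream = SampleListener("CONSUMER_KEY", "CONSUMER_SECRET",
                        "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream.sample()   # connects to statuses/sample, the free 1% stream
```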
5. Introduction
Constraints
Financial cost
Licences for larger samples are costly and difficult to obtain.
Computational cost
7 gigabytes per minute
Off-the-shelf approaches are unable to operate in such settings
In practice: those who engage in social media analytical tasks have
practically no choice but to resort to the downsized information. However,
being only a small fraction of the entire stream, it is unclear how reliable
this information is for each type of application.
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 5 / 18
6. Introduction
A more concrete example
The INSIGHT Project: Improve understanding, prediction and warning of
emergencies through real-time processing of data streams including social
data.
(a) Floods in Germany (2013) (b) Control Center in Dublin CC
How much data is sufficient for our task?
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 6 / 18
7. Introduction
Tasks we look into...
Sentiment Analysis
Geo-located information
Popular tweets
Social Graph Evolution
Linguistic Analysis
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 7 / 18
8. Data
The data
[Figure: Comparing default and Gardenhose samples for volume over time - (c) all tweets (tweet count per hour), (d) GPS-tagged tweets (GPS tweet count per hour).]
Four-day period, November 2013
The two samples differ by an order of magnitude
They exhibit the same temporal pattern
Geotagged tweets are between 1-2% of their respective sampled data
Geotagged counts are more flattened out
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 8 / 18
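The hourly volume curves behind these panels can be reproduced by simple time-bucketing; a small pandas sketch, assuming a hypothetical per-tweet table with a timestamp and a GPS flag:

```python
# Sketch under an assumed data layout: hourly tweet volume and GPS share.
import pandas as pd

# Hypothetical input: one row per collected tweet.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2013-11-01 00:00:05",
                                  "2013-11-01 00:40:00",
                                  "2013-11-01 01:10:30"]),
    "has_gps": [False, True, False],
})

hourly_all = df.set_index("created_at").resample("1H").size()
hourly_gps = df[df.has_gps].set_index("created_at").resample("1H").size()
print(hourly_all)
print("GPS share per hour:")
print((hourly_gps / hourly_all).fillna(0.0))
```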
9. Experiments
Geo-location coverage - Experiment 1
Bounding Box
Twitter also allows its users to ask for geotagged information.
The user provides a bounding box, by specifying 4 coordinates in the form [(lat_min, lon_min), (lat_max, lon_max)], and Twitter returns tweets that fall within this region.
[Map of the returned geotagged tweets, plotted by longitude and latitude.]
In this particular case, where geotagged tweets are requested instead of a general sample, the volume of the returned results is the same for the two samples.
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 9 / 18
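A minimal sketch of the bounding-box membership test implied by this experiment; the helper function, example tweets, and the rough London box are all illustrative, not taken from the paper:

```python
# Keep only tweets whose GPS coordinates fall inside
# [(lat_min, lon_min), (lat_max, lon_max)]. All data below is invented.
def in_bbox(lat, lon, lat_min, lon_min, lat_max, lon_max):
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

tweets = [
    {"id": 1, "coords": (51.50, -0.12)},   # central London (hypothetical)
    {"id": 2, "coords": (48.85, 2.35)},    # Paris, outside the box
]
# Rough London bounding box, for illustration only.
london = (51.28, -0.52, 51.70, 0.33)

inside = [t["id"] for t in tweets if in_bbox(*t["coords"], *london)]
print(inside)   # -> [1]
```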
10. Experiments
Geo-location coverage - Experiment 2
Four different crawls in the London area
[Plot: tweet counts per half-hour interval for the four crawl locations (Loc1, Loc2, Loc3, Loc4).]
As the overlap between the bounding boxes increases, so does the similarity between two different crawls.
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 10 / 18
11. Experiments
Sentiment Analysis
[Plots: positive and negative sentiment ratio per hour, comparing the 1% and 10% samples; a third plot shows the Pos 1%, Neg 1%, Pos 10%, and Neg 10% ratios together.]
- Dictionary-based sentiment analysis
- The ratio of sentiment-bearing tweets is the same in both samples
- Ratios in geo-tagged tweets are lower, meaning that geo-tagged tweets offer less sentiment-oriented information
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 11 / 18
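A small sketch of the dictionary-based sentiment ratio computed here: the fraction of tweets per hour that contain at least one positive (or negative) lexicon term. The tiny lexicons and example tweets are placeholders, not the resources used in the study:

```python
# Dictionary-based sentiment ratios per hour, with toy lexicons and tweets.
import re
from collections import defaultdict

POS = {"love", "great", "happy", "awesome"}   # placeholder positive lexicon
NEG = {"hate", "awful", "sad", "terrible"}    # placeholder negative lexicon

tweets = [
    (0, "I love this game, awesome!"),
    (0, "terrible traffic again"),
    (1, "nothing special happened today"),
]  # (hour of collection, text); invented examples

counts = defaultdict(lambda: [0, 0, 0])       # hour -> [total, positive, negative]
for hour, text in tweets:
    words = set(re.findall(r"[a-z']+", text.lower()))
    counts[hour][0] += 1
    counts[hour][1] += bool(words & POS)
    counts[hour][2] += bool(words & NEG)

for hour, (total, pos, neg) in sorted(counts.items()):
    print(f"hour {hour}: positive ratio {pos/total:.2f}, negative ratio {neg/total:.2f}")
```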
12. Experiments
Popular Topic Detection - Experiment
1 Extract the top-k most retweeted posts that appear in our data (both samples).
2 Compare the two lists (Kendall correlation).
3 Compare the two lists with the ground truth (the actual retweet count included in each tweet).
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 12 / 18
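Step 2 can be sketched as follows with scipy's kendalltau, restricting the comparison to the ids present in both top-k lists (which also yields the common-items percentage reported in the results); the example lists are invented:

```python
# Kendall correlation between two top-k "most retweeted" lists (toy data).
from scipy.stats import kendalltau

top_1pct  = [101, 204, 87, 55, 302, 19, 77]    # ids ranked by retweets in the 1% sample
top_10pct = [204, 101, 55, 87, 19, 400, 302]   # ids ranked by retweets in the 10% sample

common = [i for i in top_1pct if i in top_10pct]
rank_a = [top_1pct.index(i) for i in common]    # ranks in the 1% list
rank_b = [top_10pct.index(i) for i in common]   # ranks in the 10% list

tau, p_value = kendalltau(rank_a, rank_b)
overlap = len(common) / len(top_1pct)
print(f"common items: {overlap:.0%}, Kendall tau: {tau:.3f} (p={p_value:.3f})")
```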
13. Experiments
Popular Topic Detection - Results
[Figure: Comparing the top-N most retweeted items - (a) Kendall correlation vs. number of list items, (b) common items (%) vs. number of list items, (c) Kendall correlation of the 1% and 10% samples against the ground truth; compared list pairs: S1-S10, S1-S10P1, S1-S10P2, S10P1-S10P2, S1-S1P1.]
Conclusions
For up to 10 items, the 1% sample is adequate. That is not, however, the case for lists with more than 1000 items.
Comparison with the ground truth: the 10% sample has higher correlation.
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 13 / 18
14. Experiments
Graph Evolution Study - Experiment
Study the re-tweet graph (directed)
Edges are weighted (more re-tweets ! larger weight) and decay over
time
Edges are removed when their weight drops below a certain threshold
Method 1: Iter At each time interval extract a new graph
Method 2: Glb At each time interval aggregate the new nodes to the
current graph
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 14 / 18
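A hedged sketch of such a weighted, decaying retweet graph in the Glb style, using networkx; the decay factor, pruning threshold, and example retweets are illustrative choices, not the paper's settings:

```python
# Weighted, decaying retweet graph ("Glb" style: new retweets merged into one
# running graph). Decay and threshold values are illustrative only.
import networkx as nx

DECAY = 0.9        # multiplicative decay applied at every time step
THRESHOLD = 0.5    # edges lighter than this are dropped

G = nx.DiGraph()

def step(graph, retweets):
    """One time step: decay existing edges, add new retweet edges, prune."""
    for u, v, data in list(graph.edges(data=True)):
        data["weight"] *= DECAY
    for retweeter, author in retweets:            # retweeter -> original author
        w = graph.get_edge_data(retweeter, author, {"weight": 0.0})["weight"]
        graph.add_edge(retweeter, author, weight=w + 1.0)
    graph.remove_edges_from([(u, v) for u, v, d in graph.edges(data=True)
                             if d["weight"] < THRESHOLD])
    graph.remove_nodes_from(list(nx.isolates(graph)))

step(G, [("alice", "bob"), ("carol", "bob")])
step(G, [("alice", "bob")])
print(G.number_of_nodes(), G.number_of_edges())
print(nx.get_edge_attributes(G, "weight"))
```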
15. Experiments
Results
[Figure: Statistical properties of the extracted retweet graph over time - (a) size, (b) largest connected component size, (c) clustering coefficient; series: Iter 1%, Glb 1%, Iter 10%, Glb 10%.]
Conclusions
No significant differences between the two samples
The LCC does not follow the 24-hour pattern
Clustering coefficient of the 10% sample is similar to 100%
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 15 / 18
17. Experiments
More on the paper...
Retweet Burstiness
The rate at which users retweet information plays an important role
in capturing trending topics
We investigate whether there is a difference between the rates of
receiving retweets in the two samples
Linguistic Analysis
Is there a correlation between the spoken languages in Twitter, and
the ground truth obtained from studies in the physical world?
What are the differences between the two samples in this context?
We use language detection tools and ground truth information from
Wikipedia.
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 16 / 18
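For the linguistic analysis, one possible off-the-shelf approach (the slides do not name the tool actually used) is per-tweet language detection followed by a simple language distribution; a sketch with the langdetect package and invented example tweets:

```python
# Per-tweet language detection and a simple language distribution (toy data).
from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0          # make langdetect deterministic

tweets = [
    "Good morning everyone, lovely day in London",
    "Bonjour tout le monde, quelle belle journée",
    "Buenos días a todos desde Madrid",
]

langs = Counter(detect(t) for t in tweets)
total = sum(langs.values())
for lang, n in langs.most_common():
    print(f"{lang}: {n / total:.0%}")
```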
18. Summary and Conclusions
Conclusions
Research question: Is the default sample sufficient? For which tasks?
Focused on spatio-temporal tasks
We compared 1% with 10% sample
The samples have quite similar properties
However, when you get into the details (e.g. less popular re-tweets), the
bigger sample is better
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 17 / 18
19. Summary and Conclusions
The End...
Thank You!
Contact: @iokat // ioannis.katakis@gmail.com // www.katakis.eu
Acknowledgement
This work has been co-financed by EU and Greek national funds through the Operational Program Education and Lifelong Learning of the National Strategic Reference Framework (NSRF) - Research Funding Programs: Heraclitus II fellowship, THALIS - GeomComp, THALIS - DISFER, ARISTEIA - MMD, and the EU-funded project INSIGHT.
Valkanas et al (UoA and GMU) Mining Twitter with Resource Constraints August 12, 2015 18 / 18