This document describes a study that aimed to characterize HIV-vulnerable populations on Twitter by analyzing user sentiments and extracting risk-related information. Researchers collected Twitter data using different APIs and classified tweets based on predefined HIV risk words. They modeled the data as a property graph in Neo4j, with nodes for users, tweets, and hashtags, and edges representing relationships. Queries were run to find conversations between users mentioning drug- and sex-related terms, the most-mentioned users, topics discussed by followers of high-risk users, and the proximity of drug and homosexual users in the social graph. The study demonstrated how social network analysis and graph databases can help identify at-risk groups for public health interventions.
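The property-graph modeling and query step described above can be sketched in plain Python (the study itself used Neo4j; the node/edge shapes and risk-word lists below are illustrative assumptions, not the study's schema):

```python
# Sketch of the study's property-graph query logic in plain Python.
# The original work used Neo4j; the tweet records and the risk-word
# lists here are invented for illustration.

DRUG_TERMS = {"meth", "heroin"}   # hypothetical risk-word list
SEX_TERMS = {"hookup"}            # hypothetical risk-word list

# Nodes: users and tweets; edges: the reply_to field models a
# REPLIED_TO relationship between tweets.
tweets = [
    {"id": 1, "user": "alice", "text": "looking for a hookup tonight", "reply_to": None},
    {"id": 2, "user": "bob", "text": "got meth if you want", "reply_to": 1},
    {"id": 3, "user": "carol", "text": "nice weather today", "reply_to": None},
]

def risky_conversations(tweets):
    """Return user pairs whose reply threads mention both drug and sex terms."""
    by_id = {t["id"]: t for t in tweets}
    pairs = []
    for t in tweets:
        if t["reply_to"] is None:
            continue
        parent = by_id[t["reply_to"]]
        words = set((t["text"] + " " + parent["text"]).lower().split())
        if DRUG_TERMS & words and SEX_TERMS & words:
            pairs.append((parent["user"], t["user"]))
    return pairs

print(risky_conversations(tweets))  # → [('alice', 'bob')]
```

In Neo4j the same question would be a single Cypher `MATCH` over `(:User)-[:POSTED]->(:Tweet)-[:REPLIED_TO]->(:Tweet)` patterns; the dictionary walk above just makes the traversal explicit.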
Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
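The use of link symmetry to validate cross-references can be illustrated with a minimal sketch (the dataset names and links below are invented; the paper's actual link graphs come from Bio2RDF):

```python
# Illustrative sketch: using link symmetry to flag suspect
# cross-references. Identifiers are invented for illustration.

links = {
    ("drugbank:DB00316", "kegg:D00217"),
    ("kegg:D00217", "drugbank:DB00316"),   # symmetric pair: mutually confirmed
    ("drugbank:DB00945", "chebi:15365"),   # one-directional: unconfirmed
}

def asymmetric_links(links):
    """Return links whose reverse direction is absent (candidates for review)."""
    return {(a, b) for (a, b) in links if (b, a) not in links}

print(asymmetric_links(links))
# → {('drugbank:DB00945', 'chebi:15365')}
```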
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING - ijaia
In this paper we present and compare two methodologies for rapidly inducing multiple subject-specific
taxonomies from crawled data. The first method involves a sentence-level word co-occurrence frequency
method for building the taxonomy, while the second involves the bootstrapping of a Word2Vec based
algorithm with a directed crawler. We exploit DMOZ, the multilingual open-content
directory of the World Wide Web, to seed the crawl, and the domain name to direct
the crawl. This domain corpus is then input
to our algorithm that can automatically induce taxonomies. The induced taxonomies provide hierarchical
semantic dimensions for the purposes of faceted browsing. As part of an ongoing personal semantics
project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook,
Instagram, Flickr) with an objective of enhancing an individual’s exploration of their personal information
through faceted searching. We also perform a comprehensive corpus-based evaluation of the algorithms
based on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies) and show that
the induced taxonomies are of high quality.
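The sentence-level co-occurrence method can be sketched roughly as follows, assuming a standard subsumption heuristic (a term x is taken as a parent of y if y nearly always co-occurs with x, but not vice versa); the paper's exact scoring may differ:

```python
from collections import Counter
from itertools import combinations

# Minimal sketch of sentence-level co-occurrence taxonomy induction.
# The subsumption rule used here is a common heuristic assumed for
# illustration, and stop-word filtering is omitted for brevity.

sentences = [
    "diabetes is a metabolic disease",
    "type 2 diabetes is a common disease",
    "asthma is a respiratory disease",
]

def induce_parents(sentences, threshold=0.8):
    """Return (parent, child) edges from sentence-level co-occurrence counts."""
    term_freq = Counter()
    pair_freq = Counter()
    for s in sentences:
        terms = set(s.split())
        term_freq.update(terms)
        pair_freq.update(combinations(sorted(terms), 2))
    edges = []
    for (x, y), n in pair_freq.items():
        # x subsumes y when P(x|y) is high but P(y|x) is not.
        if n / term_freq[y] >= threshold and n / term_freq[x] < threshold:
            edges.append((x, y))
        elif n / term_freq[x] >= threshold and n / term_freq[y] < threshold:
            edges.append((y, x))
    return edges

edges = induce_parents(sentences)
# "disease" emerges as a parent of both "diabetes" and "asthma".
```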
With its focus on investigating the basis for the sustained existence
of living systems, modern biology has always been a fertile, if not
challenging, domain for formal knowledge representation and automated
reasoning. With thousands of databases and hundreds of ontologies now
available, there is a salient opportunity to integrate these for
discovery. In this talk, I will discuss our efforts to build a rich
foundational network of ontology-annotated linked data, develop
methods to intelligently retrieve content of interest, uncover
significant biological associations, and pursue new avenues for drug
discovery. As the portfolio of Semantic Web technologies continues to
mature in terms of functionality, scalability, and an understanding of
how to maximize their value, researchers will be strategically poised
to pursue increasingly sophisticated KR projects aimed at improving
our overall understanding of human health and disease.
bio: Dr. Michel Dumontier is an Associate Professor of Medicine
(Biomedical Informatics) at Stanford University. His research aims to
find new treatments for rare and complex diseases. His research
interests lie in the publication, integration, and discovery of
scientific knowledge. Dr. Dumontier serves as a co-chair for the World
Wide Web Consortium Semantic Web in Health Care and Life Sciences
Interest Group (W3C HCLSIG) and is the Scientific Director for
Bio2RDF, a widely used open-source project to create and provide
linked data for life sciences.
Powering Scientific Discovery with the Semantic Web (VanBUG 2014) - Michel Dumontier
In the quest to translate the results of biomedical research into effective clinical applications, many are now trying to make sense of the large and rapidly growing amount of public biomedical data. However, substantial challenges exist in traversing the currently fragmented data landscape. In this talk, I will discuss our efforts to use Semantic Web technologies to facilitate biomedical research through the formulation, publication, integration, and exploration of facts, expert knowledge, and web services.
Bio2RDF is an open-source project that offers a large and
connected knowledge graph of Life Science Linked Data. Each dataset is expressed using its own vocabulary, which hinders integrating, searching, querying, and browsing across similar or identical types of data. With growth and content changes in the source data, manually maintaining mappings has proven untenable. The aim of this work is to develop a (semi-)automated procedure to generate high-quality mappings
between Bio2RDF and SIO using BioPortal ontologies. Our preliminary results demonstrate that the approach is promising: it can find new mappings using a transitive closure over ontology mappings. Further development of the methodology, coupled with improvements in
the ontology, will offer a better-integrated view of the Life Science Linked Data.
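The transitive-closure step can be illustrated with a small sketch; the term identifiers below are invented for illustration, not actual Bio2RDF or SIO mappings:

```python
# Sketch of finding new mappings via transitive closure over ontology
# mappings. Identifiers are invented; real mappings come from BioPortal.

mappings = {
    ("bio2rdf:Gene", "obo:SO_0000704"),
    ("obo:SO_0000704", "sio:SIO_010035"),
}

def transitive_closure(pairs):
    """Repeatedly chain (a, b) and (b, c) into (a, c) until stable."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        new = {(a, d) for (a, b) in closure for (c, d) in closure
               if b == c and (a, d) not in closure}
        if new:
            closure |= new
            changed = True
    return closure

inferred = transitive_closure(mappings) - mappings
print(inferred)  # → {('bio2rdf:Gene', 'sio:SIO_010035')}
```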
IBC FAIR Data Prototype Implementation slideshow - Mark Wilkinson
Discussion of ways of achieving FAIRness of both metadata and data. Brute-force approaches and more elegant "projection" approaches are shown.
Relevant papers are at:
doi: 10.7717/peerj-cs.110 (https://peerj.com/articles/cs-110/)
doi: 10.3389/fpls.2016.00641 (https://doi.org/10.3389/fpls.2016.00641)
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
smartAPIs: EUDAT Semantic Working Group Presentation @ RDA 9th Plenary - Mark Wilkinson
smartAPIs are an approach to the incremental, machine-aided, semantic annotation of Web APIs. Starting from existing, popular standards, we will provide enhanced tools for authoring ever-richer metadata, guided by global community knowledge encapsulated in ontologies, and aided by "smart suggestions" based on mining the metadata from previous API specifications.
The project is led by Michel Dumontier (Maastricht University). This presentation was given on his behalf by Mark Wilkinson (UPM, Madrid; Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R)
Building a Network of Interoperable and Independently Produced Linked and Ope... - Michel Dumontier
Over 15 years ago, Sir Tim Berners-Lee proclaimed the founding of an exciting new future involving intelligent agents operating over smarter data in order to perform complex tasks at the behest of their human controllers. At the heart of this vision lies an uneasy alliance between tedious formal knowledge representations and powerful analytics over big, but often messy, data. Bio2RDF, our decade-old open-source project to create Linked Data for the life sciences, has woven emergent Semantic Web technologies such as ontologies and Linked Data together to generate FAIR - Findable, Accessible, Interoperable, and Reusable - data in the form of billions of machine-accessible statements for use in downstream biomedical discovery.
This revolution in data publication has been strengthened by action from global bioinformatics institutions such as the NCBI, NCBO, EBI, and DBCLS. Notably, NCBI's PubChem has successfully coupled large-scale data integration with community-based standards to offer a remarkable biochemical knowledge resource amenable to data-hungry discovery tools. Yet, in the face of increasing pressure from researchers, funders, and publishers, will these approaches be sufficient for growing and maintaining a comprehensive knowledge graph that is inclusive of all biomedical research?
A presentation to the New Year's Event for Maastricht University's Knowledge Engineering @ Work Program. https://www.maastrichtuniversity.nl/news/kework-first-10-students-academic-workstudy-track-graduate
Scholarly Communication for Bioinformatics Students - Philip Bourne
Presentation made to the incoming bioinformatics and systems biology students at UCSD on how they could get involved in changing scholarly communication. Given February 28, 2011
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati... - Mark Wilkinson
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015 - Mark Wilkinson
The primary slide deck for the SADI tutorial. We explain the motivation, simple SADI services, more complex SADI services, and then do a detailed walk-through of building a service, including the Perl service code and examples of service invocation at the command line, and using the SHARE client. You will want to look at the sample data/queries in this slide deck: http://www.slideshare.net/markmoby/sample-data-and-other-ur-ls-55737183 and the example service code in this slide deck: http://www.slideshare.net/markmoby/example-code-for-the-sadi-bmi-calculator-web-service?related=1
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ... - Werner Leyh
Abstract. The aim of this work is to explore the opportunities offered by
semantic standardization to interlink primary "spatial data" (GI) from
"OpenStreetMap" (OSM) with repositories of the "Linked Open Data Cloud" (LOD).
Research in the natural sciences can generate vast amounts of spatial data,
where Wikidata could be considered the central hub between more detailed
natural science hubs on the spatial semantic web. Wikidata is a world-readable
and world-writable community-driven knowledge base. It offers the opportunity
to collaboratively construct an open-access knowledge graph that spans biology,
medicine, and all other domains of knowledge. In this study, we discuss
the opportunities and challenges of exploring Wikidata as a central
integration facility by interlinking it with OSM, a popular, community-driven
collection of free geographic data. This is empowered by the reuse of terms
and properties from commonly understood controlled vocabularies that
represent their respective well-identified knowledge domains.
URL: https://www.springerprofessional.de/en/interlinking-standardized-openstreetmap-data-and-citizen-science/13302088
DOI: https://doi.org/10.1007/978-3-319-60366-7_9
Werner Leyh, Homero Fonseca Filho
University of São Paulo (USP), São Paulo, Brazil
WernerLeyh@yahoo.com
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets - Kemele M. Endris
Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential to improve citizens' quality of life. However, effective data-centric applications demand data management techniques able to process large volumes of data, which may include sensitive data, e.g., financial transactions, medical procedures, or personal data. Managing sensitive data requires the enforcement of privacy and access-control regulations, particularly during the execution of queries against datasets that include both sensitive and non-sensitive data. In this paper, we tackle the problem of enforcing privacy regulations during query processing and propose BOUNCER, a privacy-aware query engine over federations of RDF datasets. BOUNCER allows for the description of RDF datasets in terms of RDF molecule templates, i.e., abstract descriptions of the properties of the entities in an RDF dataset and their privacy regulations. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over RDF datasets that not only contain the entities relevant to answer a query, but are also regulated by policies that allow those entities to be accessed. We empirically evaluate the effectiveness of BOUNCER's privacy-aware techniques over state-of-the-art benchmarks of RDF datasets. The observed results suggest that BOUNCER can effectively enforce access-control regulations at different granularities without impacting the performance of query processing.
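The source-selection idea behind BOUNCER can be illustrated with a minimal sketch; the policy shapes, dataset names, and predicates below are invented, not BOUNCER's actual RDF molecule templates:

```python
# Illustrative sketch of privacy-aware source selection: a query's
# triple-pattern predicates are matched only against datasets whose
# access policy permits them. All names here are invented.

datasets = {
    "clinical": {"predicates": {"diagnosis", "treatment"}, "public": {"treatment"}},
    "open_drugs": {"predicates": {"interactsWith"}, "public": {"interactsWith"}},
}

def select_sources(query_predicates, datasets, authorized=False):
    """Pick datasets that both hold a predicate and allow access to it."""
    plan = {}
    for name, ds in datasets.items():
        allowed = ds["predicates"] if authorized else ds["public"]
        hits = query_predicates & allowed
        if hits:
            plan[name] = hits
    return plan

# An unauthorized user cannot route the "diagnosis" pattern anywhere.
print(select_sources({"diagnosis", "interactsWith"}, datasets))
# → {'open_drugs': {'interactsWith'}}
```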
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U... - Paolo Missier
talk for paper published at ICWE2019:
Primo F, Missier P, Romanovsky A, Mickael F, Cacho N. A customisable pipeline for continuously harvesting socially-minded Twitter users. In: Procs. ICWE'19. Daejeon, Korea; 2019.
In June 2013, the Alfred P. Sloan Foundation awarded NISO a grant to undertake a two-phase initiative to explore, identify, and advance standards and/or best practices related to a new suite of potential metrics in the community. The NISO Altmetrics Project has successfully moved to Phase Two, the formation of three working groups, A, B, & C. Working Group B, led by Kristi Holmes, PhD, Director, Galter Health Sciences Library at Northwestern University, and Mike Taylor, Senior Product Manager, Informetrics at Elsevier, is focused on the Output Types & Identifiers within the alternative metrics landscape.
Microblogging has become a popular communication tool among Internet users, with millions of users sharing opinions on different aspects of life every day. As a result, microblogging sites are rich sources of data for opinion mining and sentiment analysis. Because microblogging appeared relatively recently, only a few research works have been devoted to this topic. In our paper, we focus on using Twitter, the most popular microblogging platform, for the task of sentiment analysis. We show how to automatically collect a corpus for sentiment analysis and opinion mining purposes. We perform a linguistic analysis of the collected corpus and explain the discovered phenomena. Using the corpus, we build a sentiment classifier that can determine positive, negative, and neutral sentiments for a document. Experimental evaluations show that our proposed techniques are effective and perform better than previously proposed methods. In our evaluation we worked with English, but the proposed technique can be used with any other language. Krunal Dhardev | Dr. Kamalraj R, "Twitter Sentiment Analysis", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume 5, Issue 4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42385.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/42385/twitter-sentiment-analysis/krunal-dhardev
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection - IJERA Editor
This paper presents two approaches for finding trending topics in social networks: a keyword-based approach and a link-based approach. Conventional keyword-based approaches to topic detection focus mainly on the frequencies of (textual) words. We propose a link-based approach that focuses instead on the mentioning behavior of hundreds of users, as reflected in their posts. Anomaly detection in the Twitter data set is carried out by sequentially retrieving trending topics from Twitter via an API, together with the corresponding users for training; a computed anomaly score is then aggregated across users. The aggregated anomaly score is fed into change-point analysis or burst detection in order to pinpoint emerging topics. Because we use real-time Twitter accounts, results vary with current tweet trends. Experiments show that the proposed link-based approach performs even better than the keyword-based approach.
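The aggregation and burst-detection step can be sketched as follows, assuming a simple moving-average threshold in place of the paper's actual change-point model:

```python
# Sketch of the burst-detection step: per-user anomaly scores are
# aggregated per time window, and a window whose aggregate exceeds a
# multiple of the running mean is flagged as an emerging topic.
# The threshold rule is a simplifying assumption for illustration.

def detect_bursts(window_scores, factor=2.0, warmup=3):
    """Flag window indices whose score exceeds factor * running mean."""
    bursts, total = [], 0.0
    for i, score in enumerate(window_scores):
        if i >= warmup and score > factor * (total / i):
            bursts.append(i)
        total += score
    return bursts

# Aggregated anomaly scores per window; a spike appears at index 5.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 6.5, 1.0]
print(detect_bursts(scores))  # → [5]
```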
Mining academic social networks is becoming increasingly necessary as the amount of data grows, and it
is a favorite research topic for many researchers. Data mining techniques are used for the mining of
academic social networks. In this paper, we present an efficient frequent item-set mining technique
for academic social networks. The proposed framework first processes the research documents, then applies
enhanced frequent item-set mining to find the strength of relationships between researchers.
The proposed method is faster than older algorithms and requires less main memory
for computation.
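The frequent item-set idea can be sketched by treating each document's author list as a transaction; the names and support threshold below are illustrative, not the paper's enhanced algorithm:

```python
from collections import Counter
from itertools import combinations

# Sketch of frequent item-set mining over an academic network: each
# document's author list is a transaction, and frequently co-occurring
# author pairs indicate stronger relationships. Names are invented.

documents = [
    {"Ana", "Ben", "Chen"},
    {"Ana", "Ben"},
    {"Ana", "Chen"},
    {"Ben", "Dana"},
]

def frequent_pairs(documents, min_support=2):
    """Return author pairs appearing together in at least min_support documents."""
    counts = Counter()
    for authors in documents:
        counts.update(combinations(sorted(authors), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(documents))
# → {('Ana', 'Ben'): 2, ('Ana', 'Chen'): 2}
```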
Crowdsourcing platforms are revolutionizing research by providing a way to collect clinical and behavioral data with unprecedented speed and efficiency. This seminar explores another digital platform called TurkPrime that is designed to support research participant recruitment. TurkPrime is a relatively new panel service that allows researchers to target specific demographic groups. If you watched our previous webinar on Amazon’s Mechanical Turk, also known as MTurk, you may find it interesting that TurkPrime offers a proportional matching sampling approach rather than MTurk’s opt-in, convenience sampling approach. Tasks that can be implemented with TurkPrime include: excluding participants on the basis of previous participation, longitudinal studies, making changes to a study while it is running, automating the approval process, increasing the speed of data collection, sending bulk e-mails and bonuses, enhancing communication with participants, monitoring dropout and engagement rates, providing enhanced sampling options, and many others.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article
Carole Goble, "Better Software, Better Research", IEEE Internet Computing, vol. 18, no. 5, Sept.-Oct. 2014, pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
LIBER Webinar: 23 Things About Research Data Management - LIBER Europe
These are the slides for the LIBER Webinar "23 Things About Research Data Management", held on 23 February 2017. A recording of the webinar is available here: https://www.youtube.com/watch?v=HGH6fVHrnKQ
FAIR Data Knowledge Graphs – from Theory to Practice - Tom Plasterer
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen. Our processes enable simple creation of dataset records and linking to source data, providing a seamless federated knowledge graph for novice and advanced users alike.
Presented May 7th, 2019 at the Knowledge Graph Conference, Columbia University.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Current Relevance of HIV
● 33.4 million cases.
● A second growth phase of HIV has already been reported in some countries.
● Need to intensify HIV prevention efforts - this is difficult.
3. How can technology help?
● Philosophical question: Can social networks help in identifying users at high risk of HIV infection?
● Goal of project: Characterize HIV-vulnerable populations by extracting user sentiments from social networks like Twitter.
4. History & Related work
● Epidemiology - Hippocrates, 400 B.C. -> Digital Epidemiology - Marcel Salathe et al., 2012.
● Unraveling Abstinence and Relapse: Smoking Cessation Reflected in Social Media - Dr. Elizabeth Murnane, CHI 2014.
● Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes - Sean D. Young et al., Elsevier Preventive Medicine, 2014.
5. Data source
● 210 notable social networks - 43 Things to Zooppa.
● Twitter was chosen because of the results published in earlier studies.
● Programmatic access to tweets using Streaming API.
○ Sample Hose (~4200 tweets/min)
○ Filter Hose (~40 tweets/min)
○ Fire Hose (~420000 tweets/min)
6. Data collection
● Streaming API
● MongoDB
○ Tweets
○ HIV Corpus
○ HIV Corpus cleaned
○ Related tweets/users
● Neo4j
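As an illustration of the collection pipeline's last step (not the project's actual loader; field names follow Twitter's v1.1 tweet JSON), a tweet document pulled from MongoDB can be flattened into the property-graph nodes and edges that Neo4j stores:

```python
def extract_graph(tweet):
    """Turn one Twitter v1.1-style tweet dict into property-graph
    nodes and edges (USER/TWEET nodes; TWEETED, MENTIONED_IN and
    IS_REPLY_FOR relationships, matching the deck's edge names)."""
    nodes, edges = [], []
    user = tweet["user"]["screen_name"]
    nodes.append(("USER", user))
    nodes.append(("TWEET", tweet["id"]))
    edges.append((user, "TWEETED", tweet["id"]))
    # @-mentions become MENTIONED_IN edges
    for m in tweet.get("entities", {}).get("user_mentions", []):
        nodes.append(("USER", m["screen_name"]))
        edges.append((m["screen_name"], "MENTIONED_IN", tweet["id"]))
    # replies become IS_REPLY_FOR edges between tweets
    if tweet.get("in_reply_to_status_id") is not None:
        edges.append((tweet["id"], "IS_REPLY_FOR",
                      tweet["in_reply_to_status_id"]))
    return nodes, edges
```

The actual import would emit these pairs as Cypher MERGE statements or a batch insert.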
7. Data classification & cleaning
● Classification
○ Filter tweets based on a pre-defined set of HIV risk words.
○ Five risk buckets: Drug, SexVenues, STI, Sex, Homosexual.
● Cleaning
○ Keep or discard tweets based on co-occurring words.
○ Manually reviewed the classified tweets, with Dr. Nella Green’s help, to curate the lists.
○ Exception and inclusion lists for every HIV risk word.
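The keep-or-discard rule can be sketched in Python. This is a minimal illustration, not the project's cleaner; the risk words, exception words and inclusion words below are invented placeholders standing in for the hand-curated lists:

```python
# Hypothetical, tiny stand-ins for the hand-curated lists.
RISK_WORDS = {"coke": "DrugBucket", "dope": "DrugBucket"}
EXCEPTIONS = {"coke": {"diet", "cola", "zero"}}   # co-occurring words that veto a match
INCLUSIONS = {"dope": {"smoke", "score"}}         # co-occurring words required for a match

def classify(tweet_text):
    """Return the set of risk buckets a tweet falls into, honouring
    the per-word exception and inclusion lists."""
    words = set(tweet_text.lower().split())
    buckets = set()
    for word, bucket in RISK_WORDS.items():
        if word not in words:
            continue
        if words & EXCEPTIONS.get(word, set()):
            continue                      # an exception word co-occurs: discard
        required = INCLUSIONS.get(word)
        if required and not (words & required):
            continue                      # no inclusion word co-occurs: discard
        buckets.add(bucket)
    return buckets
```

For example, "diet coke please" is discarded while "lines of coke" lands in the Drug bucket.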
8. Why Graph DB?
● Twitter’s deeply associative data can be easily modeled.
● Most use cases correspond to analyzing sub-structures and connectedness; queries on a graph are much faster than join bombs in relational data models.
● We use Neo4j - a mature and scalable native graph store with good support.
24. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p=(
(n:ONTOLOGY_BUCKET{id: 'DrugBucket'})-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET))
where not
(t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t)
Queries
Output:
8 (1692 ms)
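The `not (t)-[:IS_REPLY_FOR]->(:TWEET)` clause anchors each match at a conversation's root tweet, so every thread is counted exactly once. The same root-finding idea in plain Python over a set of reply edges (illustrative only; it ignores the `*2..` minimum-depth constraint of the Cypher pattern):

```python
def conversation_roots(reply_edges):
    """reply_edges: iterable of (child_tweet, parent_tweet) pairs,
    meaning child IS_REPLY_FOR parent. A root is a tweet that has
    replies but is itself not a reply to anything."""
    children = {c for c, _ in reply_edges}
    parents = {p for _, p in reply_edges}
    return parents - children

# A thread 3 -> 2 -> 1 plus a separate thread 5 -> 4
# has exactly two roots: tweets 1 and 4.
edges = [(3, 2), (2, 1), (5, 4)]
```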
25. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p=((n:ONTOLOGY_BUCKET)-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET))
where n.id in ["HomosexualTermsBucket","STIBucket","SexBucket","SexVenues"]
and not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t);
Output:
20 (2350 ms)
26. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p1=((n:ONTOLOGY_BUCKET)-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET)
-[r3]-
(o:ONTOLOGY_INSTANCE)-[r4]-(p:ONTOLOGY_BUCKET {id: 'DrugBucket'}))
where n.id in ["HomosexualTermsBucket","STIBucket","SexBucket","SexVenues"]
and not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t);
Output:
2 (207952 ms)
27. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p1=((n:ONTOLOGY_BUCKET {id: 'DrugBucket'})-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET)
-[r3]-
(o:ONTOLOGY_INSTANCE)-[r4]-(p:ONTOLOGY_BUCKET))
where p.id in ["HomosexualTermsBucket","STIBucket","SexBucket","SexVenues"]
and not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t);
Output:
1 (234202 ms)
28. Finding most referred users..
“List users in the descending order of referral counts”
MATCH p=((u:USER)-[r:MENTIONED_IN]->() )
RETURN u.name,count(p) AS num_mentions
ORDER BY num_mentions DESC limit 5;
Output:
+--------------------------------------+
| u.name | num_mentions |
+--------------------------------------+
| "cc7764343d" | 261 |
| "972b1707f7" | 256 |
| "9be7e77265" | 235 |
| "8dc5aaf21a" | 232 |
| "e1095646aa" | 220 |
+--------------------------------------+
(172 ms)
29. Finding most referred users..
“List users in the descending order of referral counts”
MATCH p=((u:USER)-[r:MENTIONED_IN]->(t) )
where not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN u.name,count(p) AS num_mentions
ORDER BY num_mentions DESC limit 5;
Output:
+----------------------------------+
| u.name | num_mentions |
+----------------------------------+
| "00f4edeac2" | 28 |
| "8987f033aa" | 16 |
| "e6e67c5cef" | 10 |
| "fdf2ce82fd" | 6 |
| "86609dbd6e" | 5 |
+----------------------------------+
(198 ms)
Forbidden substructure
30. Topics of interest around a hub..
“What are the main topics in the discussions among people who are at a one-hop following distance from their sub-graph’s hubs?”
MATCH (n:USER)<-[r:FOLLOWS*1..]-(m)
OPTIONAL MATCH (m)-[r1:TWEETED]->(t:TWEET)-[o]->(p:ONTOLOGY_INSTANCE)-[q]->(s:ONTOLOGY_BUCKET {id:"DrugBucket"})
WITH COUNT(t) as count, n as hub
WHERE count >= 2
MATCH (o:ONTOLOGY_BUCKET)<-[r2*2..2]-(t1:TWEET)<-[:TWEETED]-(neighbour:USER)-[r3:FOLLOWS]-(hub)
return o.id, hub.name, count(t1)
ORDER BY count(t1) DESC limit 5
Output:
+-------------------------------------+
| o.id | hub.name | count(t1) |
+-------------------------------------+
| "SexBucket" | "b4f30295f9" | 1 |
| "DrugBucket" | "b4f30295f9" | 1 |
+-------------------------------------+
(589 ms)
31. Two most consulted drug users..
“The real world data tells us that lots of homosexual (MSM) people consume
drugs or psycho-stimulants. Identify two drug bucket users who are most
consulted by homosexual people on Twitter”
MATCH (o:ONTOLOGY_BUCKET {id:"DrugBucket"})
<-[ri1:INSTANCE_OF]-(oi1:ONTOLOGY_INSTANCE)
<-[rhr1:HAS_RISK_WORD]-(t1:TWEET)
<-[rt1:TWEETED]-(drug:USER)-[:MENTIONED_IN]->
(t:TWEET)<-[rt2:TWEETED]-(homosex:USER)
-[rt3:TWEETED]->(t2:TWEET)-[rhr2:HAS_RISK_WORD]
->(oi2:ONTOLOGY_INSTANCE)-[ri2:INSTANCE_OF]
->(o1:ONTOLOGY_BUCKET {id:"HomosexualTermsBucket"})
RETURN drug.name, count(DISTINCT t)
ORDER BY count(DISTINCT t) DESC
LIMIT 2
Output:
+------------------------------------------+
| drug.name | count(DISTINCT t) |
+------------------------------------------+
| "748d9dc913" | 26 |
| "5a74f759b8" | 13 |
+------------------------------------------+
(13825 ms)
32. Proximity of drug bucket users..
“How close are drug bucket users to other homosexual bucket users in terms
of proximity in the social graph?”
MATCH p =
(o1:ONTOLOGY_BUCKET {id:"HomosexualTermsBucket"})<-[ri1:INSTANCE_OF]-(oi1:ONTOLOGY_INSTANCE)<-[rrw1:HAS_RISK_WORD]-(t1:TWEET)
<-[rt1:TWEETED]-(u1:USER)-[r:FOLLOWS*1..3]->(u2:USER)-[rt2:TWEETED]->(t2:TWEET)
-[rrw2:HAS_RISK_WORD]->(oi2:ONTOLOGY_INSTANCE)-[ri2:INSTANCE_OF]->(o2:ONTOLOGY_BUCKET {id:"DrugBucket"})
return u1.name, length(p), count(u2)
ORDER BY length(p)
Output:
+-------------------------------------+
| u1.name | length(p) | count(u2) |
+-------------------------------------+
| "1b0056b07a"| 7 | 4 |
| "0c384be19a"| 7 | 2 |
+-------------------------------------+
(260 ms)
34. Shortest paths vs. diameter between users
● Finding user-connected components
○ Perform BFS traversal and add a property ‘subgraph’ for each node
○ Forbidden substructure - users can be connected via ontology buckets or ontology instances
● Neo4j Java Traversal Framework API Code Snippet
// Label connected components: BFS over the social relationships only,
// visiting each node exactly once.
Traverser traverser = db.traversalDescription()
    .breadthFirst()                              // BFS, not depth-first
    .relationships(RelTypes.TWEETED)
    .relationships(RelTypes.FOLLOWS)
    .relationships(RelTypes.IS_REPLY_FOR)
    .relationships(RelTypes.MENTIONED_IN)
    .evaluator(Evaluators.excludeStartPosition())
    .uniqueness(Uniqueness.NODE_GLOBAL)          // visit each node once
    .traverse(n);
(Restricting traversal to these relationship types eliminates the forbidden substructure.)
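The same component labelling can be expressed without Neo4j; a stdlib-only Python sketch (illustrative, not the project's code) over an undirected adjacency list:

```python
from collections import deque

def label_subgraphs(adj):
    """adj: {node: [neighbours]}. Returns {node: subgraph_id},
    assigning one id per connected component via BFS, mirroring the
    Neo4j traversal that sets a 'subgraph' property on each node."""
    label, next_id = {}, 0
    for start in adj:
        if start in label:
            continue                 # already reached from an earlier BFS
        label[start] = next_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in label:
                    label[nb] = next_id
                    queue.append(nb)
        next_id += 1
    return label
```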
35. Shortest paths vs. diameter between users
Find the average shortest path between any 2 users in a connected component and compare it to the diameter of that component.
match (n:USER) WITH n.subgraph as subGraphNum, count(n) as c
WHERE c >= 7
WITH collect(subGraphNum) as collectionSG
MATCH p=shortestPath((s:USER)-[:FOLLOWS|MENTIONED_IN|TWEETED|IS_REPLY_FOR*..]-(d:USER))
WHERE s.subgraph=d.subgraph and s.subgraph in collectionSG and length(p)>1
RETURN s.subgraph, sum(length(p))/count(p), max(length(p)), ((sum(length(p))/count(p))*1.0)
/max(length(p))
ORDER BY ((sum(length(p))/count(p))*1.0)/max(length(p)) DESC
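The metric computed by the Cypher above, average shortest path over a component's diameter, can also be computed directly with BFS. A stdlib-only sketch assuming a single connected, undirected component (illustrative):

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def avg_path_and_diameter(adj):
    """Average pairwise shortest-path length and diameter of one
    connected component given as {node: [neighbours]}."""
    lengths = [bfs_dist(adj, s)[d] for s, d in combinations(adj, 2)]
    return sum(lengths) / len(lengths), max(lengths)
```

On a path graph a-b-c this gives an average of 4/3 and a diameter of 2; their ratio is the value the slide's query ranks components by.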
42. Commonly discussed topics around Sex Venues
● Some tweets are geotagged
○ Neo4j Spatial plugin to create spatial index on tweets
● Find tweets tweeted near a specific Sex Venue
○ Perform a withinDistance query for the coordinates of the sex venue
● Are these tweets talking about specific topics?
○ Topic Modeling - LDA (Gensim) on tweets
43. Commonly discussed topics around Sex Venues
Find what topic HIV risk users are talking about the most around a particular Sex Venue.
REST API Code Snippet
import json
import requests

# Ask the Neo4j Spatial plugin for tweets within 2 km of the venue's
# coordinates (the point below is in downtown San Diego).
headers = {'content-type': 'application/json'}
url = "http://localhost:7474/db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance"
payload = {
    "layer" : "geom",
    "pointX" : -117.161324,
    "pointY" : 32.710671,
    "distanceInKm" : 2
}
r = requests.post(url, data=json.dumps(payload), headers=headers)
44. LDA on Tweets found around Sex Venues
● Cleaning tweets - remove mentions, URLs
● Stop word list - NLTK library
● Gensim - corpora & LDA modules
● Free parameters
○ Number of topics - 2, 3, 4
○ Distance radius for ‘withinDistance’ query - 2, 5, 10 km
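The cleaning step before LDA can be sketched as follows; mentions and URLs are stripped with regexes, then stop words are filtered. The stop-word set here is a tiny invented placeholder, not NLTK's list:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "at", "to", "and"}  # placeholder list

def clean_tweet(text):
    """Strip @-mentions and URLs, lowercase, drop stop words;
    returns the token list that would be fed to topic modelling."""
    text = re.sub(r"@\w+", " ", text)           # remove mentions
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token lists become the Gensim corpus on which the LDA model is trained.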
45. LDA on Colocated Tweets - Results
Topic #1 gay, san, diego, queen, flicks, glass, amp, dont, coke, get
Topic #2 gay, san, diego, ca, glass, amp, cheers, flicks, bourbon, happy
Drug & Homosexual bucket: coke, glass, dope, gay, queen, ...
Sex Venues Bucket: groovy laid back amp nasty, cheers, flicks, bourbon street, club san diego, pecs, ...
46. Interesting patterns..
“What is the longest conversation thread among any set of users?”
MATCH p = (n:TWEET)<-[r:IS_REPLY_FOR*]-(m:TWEET) RETURN p ORDER BY length(p) DESC LIMIT 1
205 nodes
(157179 ms)
48. Challenges
● Data collection
○ Sampled data (1%)
○ Twitter APIs call rate limit per user - 15 calls/15 mins.
○ Collecting users who have favorited a tweet.
○ Extracting conversations/retweet chains associated with a tweet.
● Data classification and cleaning
○ Working with microblogs.
○ Iterative process.
● Restricted visualization for Neo4j
○ Hard to decipher patterns in graph.
49. Future
● More representative dataset - Firehose API
● Innovative Data Visualizations to visualize evolving graphs
● Machine Learning for better HIV risk tweets classification.
○ Mechanical Turk for labeling
○ Logistic Regression for classification
● SD Primary Infection Cohort - overlaying a real-world HIV infection graph on top of an enriched social network
50. Conclusion
● A structured approach to model social networks and derive insights from networks like Twitter; best practices for collecting and managing Twitter data for social network analysis.
● Current results - graph queries that derive intuitions on factors influencing HIV risk behaviour.
● Vision for the future.