Slides for the first meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
The document discusses different types of analysis for automated content analysis, including sentiment analysis and stopword removal. It covers bag-of-words approaches to sentiment analysis, which involve comparing words in a text to lists of positive and negative words. More advanced approaches are mentioned that take the structure of text into account, such as identifying sentence structure using linguistic concepts.
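The bag-of-words sentiment approach the summary describes can be sketched in a few lines of Python. The word lists and stopwords below are invented for illustration; a real analysis would use a validated sentiment lexicon and a proper stopword list.

```python
# Minimal bag-of-words sentiment scorer with stopword removal.
# POSITIVE/NEGATIVE/STOPWORDS are toy lists, not validated lexicons.
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "terrible", "awful", "sad", "hate"}
STOPWORDS = {"the", "a", "is", "this", "and"}

def sentiment_score(text):
    """Return (#positive - #negative) / #tokens after stopword removal."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(sentiment_score("this is a great and excellent movie"))  # positive score
print(sentiment_score("this is a terrible movie"))             # negative score
```

Note that this is exactly the kind of approach that ignores text structure; negation ("not good") flips the true sentiment but not the score, which is why the more advanced, syntax-aware approaches are mentioned.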
This document provides a summary of a meeting on machine learning. It recaps unsupervised and supervised machine learning techniques. Unsupervised techniques discussed include principal component analysis (PCA) and latent Dirichlet allocation (LDA). PCA is used to find how words co-occur in documents. LDA can be implemented in Python using gensim to infer topics in a collection of documents. Supervised machine learning techniques the audience has previously used are regression models. The document concludes by noting models will only use a portion of available data for training and validation.
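The closing point about using only a portion of the data for training can be illustrated with a simple holdout split; the 80/20 fraction and fixed seed below are arbitrary choices for the sketch.

```python
import random

def train_test_split(docs, test_fraction=0.2, seed=42):
    """Shuffle a copy of the documents and split off a held-out test set."""
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

docs = [f"document {i}" for i in range(100)]
train, test = train_test_split(docs)
print(len(train), len(test))  # 80 20
```

A model is then fitted on the training portion only, and its performance is reported on the untouched test portion, so the evaluation is not flattered by memorization.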
The document discusses techniques for scaling up automated content analysis projects. It begins by looking back at the workflow and techniques covered in previous sessions, such as developing components separately, writing functions, and making the code robust. It then looks forward by discussing additional techniques that were not covered, such as using Selenium for dynamic web scraping, databases for storing large datasets, word embeddings, and more advanced natural language processing and machine learning models. The document also introduces the INCA project, which aims to scale up content analysis by collecting data in a way that allows for reuse across multiple projects, using a database backend and reusable preprocessing and analysis code. The goal is to make automated content analysis usable with minimal Python knowledge.
This document provides an overview of trends in automated content analysis (ACA) with a focus on supervised and unsupervised machine learning techniques. It discusses how supervised machine learning can be used to code frames and topics by training a classifier on a small set of manually coded data. Unsupervised machine learning techniques like latent Dirichlet allocation (LDA) are also covered, with LDA described as a technique that can infer topics in a document based on word frequencies without manual labeling. The document provides examples of implementing supervised naive Bayes classification and LDA topic modeling in Python using scikit-learn and Gensim libraries.
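The slides implement naive Bayes with scikit-learn; as a library-free stand-in, here is a minimal multinomial naive Bayes with Laplace smoothing. The spam/ham labels and tokens are invented toy data, and this sketch omits the feature extraction a real pipeline would need.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Train a multinomial naive Bayes model from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)  # per-label word frequencies
    label_counts = Counter()            # class priors (as counts)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior (Laplace-smoothed)."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        lp = math.log(label_counts[label] / n_docs)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [(["cheap", "pills", "buy"], "spam"),
         (["buy", "now", "cheap"], "spam"),
         (["meeting", "agenda", "notes"], "ham"),
         (["project", "meeting", "tomorrow"], "ham")]
model = train_nb(train)
print(predict_nb(model, ["cheap", "buy"]))      # spam
print(predict_nb(model, ["meeting", "notes"]))  # ham
```

In practice the scikit-learn version (`MultinomialNB` on vectorized text) does the same computation with far less code and better numerics.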
Christoph Trattner gave a presentation on applying cognitive models to recommender systems. He discussed how ACT-R, an influential cognitive architecture, has been used to model temporal tagging behavior with power functions performing better than exponential functions. Experiments applying this approach achieved state-of-the-art results in predicting tag reuse and recommending tags. A two-step collaborative filtering approach that integrated tags and time (CIRTT) also outperformed baselines in recommending items. The presentation concluded with details on an open-source code framework for this research.
From Search to Predictions in Tagged Information Spaces (Christoph Trattner)
Tagging has gained tremendously in popularity over the past few years. When looking into the literature on tagging we find a lot of work regarding people's tagging motivation, their behavior, models that describe the folksonomy generation process, emergent semantic structures, etc., but interestingly we find quite little research showing the value of tags for searching an overloaded information space. Furthermore, there is a lot of literature on the tag or item prediction problem, but interestingly almost all of it looks at the issue from a data-driven perspective. To bridge this gap in the literature, we have conducted several in-depth studies in the past showing the value of tags for lookup and exploratory search. We looked at the problem from a network-theoretic and interface perspective and we will show how useful tags are for searching. Furthermore, we reviewed literature on memory processes from cognitive science and have invented a number of novel recommender algorithms based on the ACT-R and MINERVA2 theories. We will show that these approaches can not only predict tags and items extremely well, but also reveal how these models can help in explaining the recommendation processes better than current approaches.
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ... (Rich Heimann)
Big Social Data Analysis: Using location & Twitter to explore the tragic aftermath of the Sandy Hook Elementary School shooting.
The growth of social media over the last decade has revolutionized the way individuals interact and industries conduct business. Individuals produce data at an unprecedented rate by interacting, sharing, and consuming content through social media. However, analyzing this ever-growing pile of data is quite tricky and, if done erroneously, could lead to wrong inferences.
In this webinar you will gain, by example, insights into mining social media data and exposing underlying latent structures relating to ideology and sentiment as well as space and time.
Recommending Tags with a Model of Human Categorization (Christoph Trattner)
Social tagging involves complex processes of human categorization that have been the topic of much research in the cognitive sciences. In this paper we present a recommender approach for social tags whose principles are derived from some of the more prominent and empirically well-founded models from this research tradition. The basic architecture is a simple three-layer connectionist model. The input layer encodes patterns of semantic features of a user-specific resource, which are either latent topics elicited through Latent Dirichlet Allocation (LDA) or available external categories. The hidden layer categorizes the resource by matching the encoded pattern against already learned exemplar patterns. The latter are composed of unique feature patterns and associated tag distributions. Finally, the output layer samples tags from the associated tag distributions to verbalize the preceding categorization process. We have evaluated this approach on a real-world folksonomy gathered from Wikipedia bookmarks in Delicious. In the experiment our approach outperformed LDA, a well-established algorithm. We attribute this to the fact that our approach processes semantic information (either latent topics or external categories) across the three different layers, and this substantially enhances the recommendation performance. With this paper, we demonstrate that a theoretically guided design of algorithms not only holds potential for improving existing recommendation mechanisms, but it also allows us to derive more generalizable insights about how human information interaction on the Web is determined by both semantic and verbal processes.
The document provides information about L-3 Data Tactics' data science team and services. The 11-person team consists of graduates from top universities with advanced degrees in various technical fields. They have expertise in both standard data science techniques ("horizontals" like regression and text analysis) as well as specialized vertical domains. The team takes a practical approach to data science focused on finding patterns in data. They provide data science training and services to government customers.
The document describes data workflows and data integration systems. It defines a data integration system as IS=<O,S,M> where O is a global schema, S is a set of data sources, and M are mappings between them. It discusses different views of data workflows including ETL processes, Linked Data workflows, and the data science process. Key steps in data workflows include extraction, integration, cleansing, enrichment, etc. Tools to support different steps are also listed. The document introduces global-as-view (GAV) and local-as-view (LAV) approaches to specifying the mappings M between the global and local schemas using conjunctive rules.
This document summarizes a seminar on machine learning using big data. It discusses the history of data storage and traditional databases. It then introduces machine learning and the types of learning, including supervised and unsupervised learning. Specific algorithms for each type are covered such as k-means clustering for unsupervised and naive Bayes for supervised. Case studies on applications like Amazon product recommendations are presented. The document concludes by discussing tools for machine learning and future applications as more connected devices generate extensive data.
Recommending Items in Social Tagging Systems Using Tag and Time Information (Christoph Trattner)
In this work we present a novel item recommendation approach that aims at improving Collaborative Filtering (CF) in social tagging systems using the information about tags and time. Our algorithm follows a two-step approach, where in the first step a potentially interesting candidate item-set is found using user-based CF and in the second step this candidate item-set is ranked using item-based CF. Within this ranking step we integrate the information of tag usage and time using the Base-Level Learning (BLL) equation coming from human memory theory, which is used to determine the reuse probability of words and tags using a power-law forgetting function.
As the results of our extensive evaluation conducted on datasets gathered from three social tagging systems (BibSonomy, CiteULike and MovieLens) show, the usage of tag-based and time information via the BLL equation also helps to improve the ranking and recommendation process of items and thus can be used to realize an effective item recommender that outperforms two alternative algorithms which also exploit time and tag-based information.
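The BLL equation mentioned in the abstract can be sketched in a few lines. This is the generic ACT-R base-level formulation, not the authors' exact implementation; the decay parameter d = 0.5 and hour-based time units are assumptions here.

```python
import math

def base_level_activation(times_since_use, d=0.5):
    """Base-Level Learning (BLL) equation from ACT-R:
    B = ln(sum over past uses j of t_j ** -d),
    where t_j is the time elapsed since the j-th use of a tag
    and d is the power-law forgetting exponent (commonly d = 0.5)."""
    return math.log(sum(t ** -d for t in times_since_use))

# A tag used recently gets a higher activation (higher reuse probability)
recent = base_level_activation([1, 2, 3])     # hours since each past use
stale = base_level_activation([50, 60, 70])
print(recent > stale)  # True
```

The power-law form means recent uses dominate the sum while old uses decay slowly but never vanish, which is the forgetting behavior the abstract alludes to.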
This document provides an introduction to data science, including:
- Why data science has gained popularity due to advances in AI research and commoditized hardware.
- Examples of where data science is applied, such as e-commerce, healthcare, and marketing.
- Definitions of data science, data scientists, and their roles.
- Overviews of machine learning techniques like supervised learning, unsupervised learning, deep learning and examples of their applications.
- How data science can be used by businesses to understand customers, create personalized experiences, and optimize processes.
Grounded theory meets big data: One way to marry ethnography and digital methods (Citizens in the Making)
This document discusses how grounded theory can be applied to analyze large datasets from social media platforms like Twitter. It proposes combining both qualitative and computational methods like machine learning. Specifically, it suggests using machine learning like latent Dirichlet allocation to identify topic clusters in Twitter data, which can then inform the qualitative coding categories used to analyze content, profiles and other metadata. The document advocates an emergent approach to coding to build conceptual knowledge from Twitter data, and emphasizes the importance of being reflexive and considering multiple perspectives.
The Web of Data: do we actually understand what we built? (Frank van Harmelen)
Despite its obvious success (largest knowledge base ever built, used in practice by companies and governments alike), we actually understand very little of the structure of the Web of Data. Its formal meaning is specified in logic, but with its scale, context dependency and dynamics, the Web of Data has outgrown its traditional model-theoretic semantics.
Is the meaning of a logical statement (an edge in the graph) dependent on the cluster ("context") in which it appears? Does a more densely connected concept (node) contain more information? Is the path length between two nodes related to their semantic distance?
Properties such as clustering, connectivity and path length are not described, much less explained by model-theoretic semantics. Do such properties contribute to the meaning of a knowledge graph?
To properly understand the structure and meaning of knowledge graphs, we should no longer treat knowledge graphs as (only) a set of logical statements, but treat them properly as a graph. But how to do this is far from clear.
In this talk, I report on some of our early results on some of these questions, but I ask many more questions for which we don't have answers yet.
This document summarizes Ted Dunning's approach to recommendations based on his 1993 paper. The approach involves:
1. Analyzing user data to determine which items are statistically significant co-occurrences
2. Indexing items in a search engine with "indicator" fields containing IDs of significantly co-occurring items
3. Providing recommendations by searching the indicator fields for a user's liked items
The approach is demonstrated in a simple web application using the MovieLens dataset. Further work could optimize and expand on the approach.
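Step 1 above rests on Dunning's 1993 log-likelihood ratio (G²) test over a 2x2 co-occurrence table. A stdlib-only sketch follows; the k11/k12/k21/k22 naming is the convention commonly used for this test, not notation taken from this document.

```python
import math

def xlogx(x):
    """x * ln(x), with the limit value 0 at x = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy term used by the G^2 computation."""
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 co-occurrence table:
    k11 = both items occur, k12 = item A only, k21 = item B only,
    k22 = neither. Zero under independence, large under strong association."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

print(llr(25, 25, 25, 25))  # ~0: the items are independent
print(llr(10, 0, 0, 10))    # large: the items always co-occur
```

Item pairs whose G² exceeds a threshold become the "indicators" of step 2, and the search engine's scoring then does the recommendation work of step 3.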
This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources. It explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It discusses issues in machine learning like overfitting and underfitting and the importance of testing algorithms. The document concludes that machine learning has vast potential but that this potential is very difficult to realize, as it requires strong mathematics skills.
Module 9: Natural Language Processing Part 2 (Sara Hooker)
This document provides an overview of natural language processing techniques for gathering and analyzing text data, including web scraping, topic modeling, and clustering. It discusses gathering text data through APIs or web scraping using tools like Beautiful Soup. It also covers representing text numerically using bag-of-words and TF-IDF, visualizing documents in multi-dimensional spaces based on word frequencies, and using k-means clustering to group similar documents together based on cosine or Euclidean distances between their vectors. The document uses examples of Netflix movie descriptions to illustrate these NLP techniques.
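The TF-IDF and distance computations described here fit in a short stdlib-only sketch; the toy movie-description tokens below merely stand in for the Netflix examples the slides use, and a real pipeline would use a library vectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: c * idf[w] for w, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

docs = [["space", "alien", "ship"],
        ["alien", "invasion", "ship"],
        ["romance", "paris"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

k-means clustering then groups documents whose vectors lie close together under exactly this kind of distance, which is how the similar-movie clusters in the slides arise.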
This workshop was presented in Riyadh, Saudi Arabia on 21-22 January 2019, in collaboration with the Riyadh Data Geeks group.
To learn more about the workshop please see this website:
http://bit.ly/2Ucjmm5
In this talk we outline some of the key challenges in text analytics, describe some of Endeca's current research work in this area, examine the current state of the text analytics market and explore some of the prospects for the future.
Our daily life is strongly influenced by decision-making processes based on large amounts of data, of which both the data values and the meaningful (semantic) relationships can be included in knowledge graphs.
Given their automatic processing, knowledge graphs must be of high quality on both these fronts.
This thesis focuses both on improving the data quality and on assessing the semantic quality of knowledge graphs.
On the one hand, it describes a framework to generate knowledge graphs with extensible data transformations that can clean data ("RML + FnO"), extended so that data transformations can be performed automatically and implementation-independently ("FnO.io").
On the other hand, it describes a validation approach built on a rule-based reasoning solution ("Validatrr"). This approach takes the semantics into account and enables targeted improvements to a knowledge graph thanks to detailed root-cause explanations of quality problems.
Thanks to these contributions, data values are cleaned up while a knowledge graph is being generated, and existing knowledge graphs can be completed using automatic data transformations. Our validation approach makes it possible to accurately assess the quality of semantic relationships in knowledge graphs.
The combined work makes it easier to improve data quality and assess semantic quality for knowledge graphs, which ensures that knowledge graphs can be used correctly in decision-making processes.
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha... (Christoph Trattner)
This presentation will be a live exchange of ideas and arguments between a representative of a startup working on agricultural information management and discovery, and a representative of academia who has recently completed his PhD and is now leading a young and promising research team.
The two presenters will focus on the case of a recommendation service that is going to be part of a web portal for organic agriculture researchers and educators (called Organic.Edunet), which will help users find relevant educational material and bibliography. They currently develop this as part of an EU-funded initiative but would both be interested in finding a way to further sustain this work: the startup by including it in the bundle of services that it offers to the users of its information discovery packages, and the research team by attracting more funding to further explore recommendation technologies.
The startup representative will describe his ongoing, helpless and aimless efforts to include a research activity on recommender systems within the R&D strategy of the company, for the sake of the good-old-PhD-times. And he will explain why this failed.
The academia representative will describe the great things that his research can do to boost the performance of recommendation services in such portals. And why this does-not-work-yet-operationally: he cannot find real usage data that can validate his amazing algorithm beyond what can be proven in offline lab experiments using datasets from other domains (like MovieLens and CiteULike).
Both will explain how they started working together in order to design, experimentally test, and deploy the Organic.Edunet recommendation service. And will describe their expectations from this academic-industry collaboration. Then, they will reflect on the challenges they see in such partnerships and how (if) they plan to overcome them.
This document provides an introduction to text analytics using IBM SPSS Modeler. It defines key terms related to text analytics and outlines the main steps in the text analytics process: extraction, categorization, and visualization. It then provides a tutorial on using IBM SPSS Modeler to perform text analytics, including sourcing text, extracting concepts and relationships, categorizing records, and visualizing results. Templates and resources are described that can be used to start an interactive workbench session in Modeler for exploring text analytics.
ODSC East 2017: Data Science Models For Good (Karry Lu)
Abstract: The rise of data science has been largely fueled by the promise of changing the business landscape: enhancing one's competitive advantage, increasing business optimization and efficiency, and ultimately delivering a better bottom line. This promise reaches across sectors as machine learning methods are getting better, data access continues to grow, and computation power is easily accessible. However, because the practice of doing data science can be expensive, there is a danger that this so-called promise of data science may only be available to the most well-resourced organizations with sophisticated data capabilities and staff. For the past five years, DataKind has been working to ensure that social change organizations too have access to data science, teaming them up with data scientists to build machine learning and artificial intelligence solutions that aim to reduce human suffering. In doing so, DataKind has learned what it takes to apply data science in the social sector and the many applications it has for creating positive change in the world. This session presents DataKind projects showcasing the wide range of applications of ML/AI for social good: using satellite imagery and remote sensing techniques to detect wheat farm boundaries and protect livelihoods in Ethiopia, leveraging NLP to automate the time-consuming process of synthesizing findings from academic studies to inform conservation efforts, classifying text records to better understand human rights conditions across the world, and using machine learning to reduce traffic fatalities in U.S. cities. Learn about some of the latest breakthroughs and findings in the data science for social good space and how you can get involved.
The document describes Farrar's rule on honey production in beehives. Farrar observed that as the number of bees in a hive increases, the proportion of forager bees and the individual productivity of each bee also increase. This is because large colonies have a higher proportion of forager bees than small ones. The table shows the directly proportional relationship between the weight of the bee population and the hive's honey production.
From Search to Predictions in Tagged Information SpacesChristoph Trattner
Tagging gained tremendously in popularity over past few years. When looking into the literature of tagging we find a lot of work regarding people's tagging motivation, their behavior, models that describe the folksonomy generation process, emergent semantic structures, etc., but interestingly we find quite little research showing the value of tags for searching an overloaded information space. Furthermore, there is lot of literature on the tag or item prediction problem, but interestingly almost all of them lookat the issue from a data-driven perspective. To bridge this gap in the literature, we have conducted several in-depth studies in the past showing the value of tags for lookup and exploratory search. We looked at the problem from a network theoretic and interface perspective and we will show how useful tags are for searching. Furthermore, we reviewed literature on memory processes from cognitive science and have invented a number of novel recommender algorithms based on the ACT-R and MINERVA2 theory. We will show that these approaches can not only predict tags and items extremely well, but also reveal how these models can help in explaining the recommendation processes better than current approaches.
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...Rich Heimann
Big Social Data Analysis: Using location & Twitter to explore the tragic aftermath of the Sandy Hook Elementary School shooting.
The growth of social media over the last decade has revolutionized the way individuals interact and industries conduct business. Individuals produce data at an unprecedented rate by interacting, sharing, and consuming content through social media. However, analyzing this ever-growing pile of data is quite tricky and, if done erroneously, could lead to wrong inferences.
In this webinar you will gain, by example insights to mining social media data and exposing underlying latent structures relating to ideology and sentiment as well as space and time.
Recommending Tags with a Model of Human CategorizationChristoph Trattner
Social tagging involves complex processes of human categorization that have been the topic of much research in the cognitive sciences. In this paper we present a recommender approach for social tags whose principles are derived from some of the more prominent and empirically well-founded models from this research tradition. The basic architecture is a simple three-layers connectionist model. The input layer encodes patterns of semantic features of a user-specific re- source, which are either latent topics elicited through Latent Dirichlet Allocation (LDA) or available external categories. The hidden layer categorizes the resource by matching the encoded pattern against already learned exemplar patterns. The latter are composed of unique feature patterns and associated tag distributions. Finally, the output layer samples tags from the associated tag distributions to verbalize the preceding categorization process. We have evaluated this approach on a real-world folksonomy gathered from Wikipedia bookmarks in Delicious. In the experiment our approach outperformed LDA, a well-established algorithm. We at- tribute this to the fact that our approach processes seman- tic information (either latent topics or external categories) across the three different layers, and this substantially enhances the recommendation performance. With this paper, we demonstrate that a theoretically guided design of algorithms not only holds potential for improving existing recommendation mechanisms, but it also allows us to derive more generalizable insights about how human information interaction on the Web is determined by both semantic and verbal processes.
The document provides information about L-3 Data Tactics' data science team and services. The 11-person team consists of graduates from top universities with advanced degrees in various technical fields. They have expertise in both standard data science techniques ("horizontals" like regression and text analysis) as well as specialized vertical domains. The team takes a practical approach to data science focused on finding patterns in data. They provide data science training and services to government customers.
The document describes data workflows and data integration systems. It defines a data integration system as IS=<O,S,M> where O is a global schema, S is a set of data sources, and M are mappings between them. It discusses different views of data workflows including ETL processes, Linked Data workflows, and the data science process. Key steps in data workflows include extraction, integration, cleansing, enrichment, etc. Tools to support different steps are also listed. The document introduces global-as-view (GAV) and local-as-view (LAV) approaches to specifying the mappings M between the global and local schemas using conjunctive rules.
This document summarizes a seminar on machine learning using big data. It discusses the history of data storage and traditional databases. It then introduces machine learning and the types of learning, including supervised and unsupervised learning. Specific algorithms for each type are covered such as k-means clustering for unsupervised and naive Bayes for supervised. Case studies on applications like Amazon product recommendations are presented. The document concludes by discussing tools for machine learning and future applications as more connected devices generate extensive data.
Recommending Items in Social Tagging Systems Using Tag and Time InformationChristoph Trattner
In this work we present a novel item recommendation ap- proach that aims at improving Collaborative Filtering (CF) in social tagging systems using the information about tags and time. Our algorithm follows a two-step approach, where in the first step a potentially interesting candidate item-set is found using user-based CF and in the second step this can- didate item-set is ranked using item-based CF. Within this ranking step we integrate the information of tag usage and time using the Base-Level Learning (BLL) equation com- ing from human memory theory that is used to determine the reuse-probability of words and tags using a power-law forgetting function.
As the results of our extensive evaluation conducted on data- sets gathered from three social tagging systems (BibSonomy, CiteULike and MovieLens) show, the usage of tag-based and time information via the BLL equation also helps to improve the ranking and recommendation process of items and thus, can be used to realize an effective item recommender that outperforms two alternative algorithms which also exploit time and tag-based information.
This document provides an introduction to data science, including:
- Why data science has gained popularity due to advances in AI research and commoditized hardware.
- Examples of where data science is applied, such as e-commerce, healthcare, and marketing.
- Definitions of data science, data scientists, and their roles.
- Overviews of machine learning techniques like supervised learning, unsupervised learning, deep learning and examples of their applications.
- How data science can be used by businesses to understand customers, create personalized experiences, and optimize processes.
Grounded theory meets big data: One way to marry ethnography and digital methods, by Citizens in the Making
This document discusses how grounded theory can be applied to analyze large datasets from social media platforms like Twitter. It proposes combining both qualitative and computational methods like machine learning. Specifically, it suggests using machine learning like latent Dirichlet allocation to identify topic clusters in Twitter data, which can then inform the qualitative coding categories used to analyze content, profiles and other metadata. The document advocates an emergent approach to coding to build conceptual knowledge from Twitter data, and emphasizes the importance of being reflexive and considering multiple perspectives.
The Web of Data: do we actually understand what we built?, by Frank van Harmelen
Despite its obvious success (largest knowledge base ever built, used in practice by companies and governments alike), we actually understand very little of the structure of the Web of Data. Its formal meaning is specified in logic, but with its scale, context dependency and dynamics, the Web of Data has outgrown its traditional model-theoretic semantics.
Is the meaning of a logical statement (an edge in the graph) dependent on the cluster ("context") in which it appears? Does a more densely connected concept (node) contain more information? Is the path length between two nodes related to their semantic distance?
Properties such as clustering, connectivity and path length are not described, much less explained by model-theoretic semantics. Do such properties contribute to the meaning of a knowledge graph?
To properly understand the structure and meaning of knowledge graphs, we should no longer treat knowledge graphs as (only) a set of logical statements, but treat them properly as a graph. But how to do this is far from clear.
In this talk, I report on some of our early results on some of these questions, but I ask many more questions for which we don't have answers yet.
This document summarizes Ted Dunning's approach to recommendations based on his 1993 paper. The approach involves:
1. Analyzing user data to determine which items are statistically significant co-occurrences
2. Indexing items in a search engine with "indicator" fields containing IDs of significantly co-occurring items
3. Providing recommendations by searching the indicator fields for a user's liked items
The approach is demonstrated in a simple web application using the MovieLens dataset. Further work could optimize and expand on the approach.
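The statistical test behind step 1 is the log-likelihood ratio (G-test) from Dunning's 1993 paper. A minimal sketch of it in Python (the summary does not give the implementation, so the function below is an illustration of the standard 2x2 formulation, not code from the presentation):

```python
from math import log

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 co-occurrence table.

    k11: users who liked both items; k12: only item A;
    k21: only item B; k22: neither.
    Large values mean the co-occurrence is unlikely to be chance.
    """
    def h(*counts):
        # sum of k * log(k / total) over the non-zero cells
        total = sum(counts)
        return sum(k * log(k / total) for k in counts if k > 0)

    return 2 * (h(k11, k12, k21, k22)        # whole table
                - h(k11 + k12, k21 + k22)    # row totals
                - h(k11 + k21, k12 + k22))   # column totals

independent = llr_2x2(10, 10, 10, 10)   # ~0: no association
linked = llr_2x2(100, 5, 5, 100)        # large: strong co-occurrence
```

Items whose score exceeds a chosen threshold would then be written into the "indicator" field of the search index.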
This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources. It explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It discusses issues in machine learning like overfitting and underfitting data and the importance of testing algorithms. The document concludes that machine learning has vast potential, but that realizing this potential is very difficult, as it requires strong mathematics skills.
Module 9: Natural Language Processing Part 2, by Sara Hooker
This document provides an overview of natural language processing techniques for gathering and analyzing text data, including web scraping, topic modeling, and clustering. It discusses gathering text data through APIs or web scraping using tools like Beautiful Soup. It also covers representing text numerically using bag-of-words and TF-IDF, visualizing documents in multi-dimensional spaces based on word frequencies, and using k-means clustering to group similar documents together based on cosine or Euclidean distances between their vectors. The document uses examples of Netflix movie descriptions to illustrate these NLP techniques.
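The pipeline described (texts, then TF-IDF vectors, then k-means) can be sketched with scikit-learn; the movie descriptions below are made up for illustration, not the Netflix data from the slides:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["a heist thriller full of suspense",
        "a gritty crime thriller",
        "a romantic comedy about a wedding",
        "a feel-good comedy set at a wedding"]

# represent each document as a TF-IDF weighted vector
vectors = TfidfVectorizer().fit_transform(docs)

# group similar documents; thrillers and comedies share vocabulary
# within their genre, so they should land in different clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
labels = km.labels_
```

Swapping the Euclidean distance of k-means for cosine distance (e.g. by L2-normalizing the vectors first, which `TfidfVectorizer` already does by default) matches the cosine-based grouping the slides describe.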
This workshop was presented in Riyadh, Saudi Arabia, on 21-22 January 2019, in collaboration with the Riyadh Data Geeks group.
To learn more about the workshop please see this website:
http://bit.ly/2Ucjmm5
In this talk we outline some of the key challenges in text analytics, describe some of Endeca's current research work in this area, examine the current state of the text analytics market and explore some of the prospects for the future.
Our daily lives are strongly influenced by decision-making processes based on large amounts of data, of which both the data values and the meaningful (semantic) relationships can be captured in knowledge graphs.
Because knowledge graphs are processed automatically, they must be of high quality on both of these fronts.
This thesis focuses both on improving the data quality and on assessing the semantic quality of knowledge graphs.
On the one hand, it describes a framework to generate knowledge graphs with extensible data transformations that can clean data ("RML + FnO"), extended so that data transformations can be performed automatically and independently of any implementation ("FnO.io").
On the other hand, it describes a validation approach built on a rule-based reasoning solution ("Validatrr"). This approach takes the semantics used into account and enables targeted improvements to knowledge graphs through detailed root-cause explanations of quality problems.
Thanks to these contributions, data values are cleaned up while knowledge graphs are generated, and existing knowledge graphs can be completed using automatic data transformations. Our validation approach makes it possible to accurately assess the quality of semantic relationships in knowledge graphs.
Together, this work makes it easier to improve the data quality and assess the semantic quality of knowledge graphs, which ensures that knowledge graphs can be used correctly in decision-making processes.
Je t’aime… moi non plus: reporting on the opportunities, expectations and cha..., by Christoph Trattner
This presentation will be a live exchange of ideas and arguments between a representative of a start-up working on agricultural information management and discovery, and a representative of academia who has recently completed his PhD and is now leading a young and promising research team.
The two presenters will focus on the case of a recommendation service that is going to be part of a web portal for organic agriculture researchers and educators (called Organic.Edunet), which will help users find relevant educational material and bibliography. They currently develop this as part of an EU-funded initiative but would both be interested to find a way to further sustain this work: the start up by including this to the bundle of services that it offers to the users of its information discovery packages, and the research team by attracting more funding to further explore recommendation technologies.
The start-up representative will describe his ongoing, helpless and aimless efforts to include a research activity on recommender systems within the R&D strategy of the company, for the sake of the good old PhD times. And he will explain why this failed.
The academia representative will describe the great things that his research can do to boost the performance of recommendation services in such portals. And why this does-not-work-yet-operationally because he cannot find real usage data that can prove his amazing algorithm outside what can be proven in offline lab experiments using datasets from other domains (like MovieLens and CiteULike).
Both will explain how they started working together in order to design, experimentally test, and deploy the Organic.Edunet recommendation service. And will describe their expectations from this academic-industry collaboration. Then, they will reflect on the challenges they see in such partnerships and how (if) they plan to overcome them.
This document provides an introduction to text analytics using IBM SPSS Modeler. It defines key terms related to text analytics and outlines the main steps in the text analytics process: extraction, categorization, and visualization. It then provides a tutorial on using IBM SPSS Modeler to perform text analytics, including sourcing text, extracting concepts and relationships, categorizing records, and visualizing results. Templates and resources are described that can be used to start an interactive workbench session in Modeler for exploring text analytics.
ODSC East 2017: Data Science Models For Good, by Karry Lu
Abstract: The rise of data science has been largely fueled by the promise of changing the business landscape: enhancing one's competitive advantage, increasing business optimization and efficiency, and ultimately delivering a better bottom line. This promise reaches across sectors as machine learning methods get better, data access continues to grow, and computation power is easily accessible. However, because the practice of doing data science can be expensive, there is a danger that this promise of data science may only be available to the most well-resourced organizations with sophisticated data capabilities and staff. For the past five years, DataKind has been working to ensure social change organizations too have access to data science, teaming them up with data scientists to build machine learning and artificial intelligence solutions that aim to reduce human suffering. In doing so, DataKind has learned what it takes to apply data science in the social sector and the many applications it has for creating positive change in the world. This session presents DataKind projects showcasing the wide range of applications of ML/AI for social good: using satellite imagery and remote sensing techniques to detect wheat farm boundaries and protect livelihoods in Ethiopia, leveraging NLP to automate the time-consuming process of synthesizing findings from academic studies to inform conservation efforts, classifying text records to better understand human rights conditions across the world, and using machine learning to reduce traffic fatalities in U.S. cities. Learn about some of the latest breakthroughs and findings in the data science for social good space, and how you can get involved.
The document describes Farrar's rule on honey production in beehives. Farrar observed that as the number of bees in a colony increases, the proportion of forager bees and the individual productivity of each bee also increase. This is because large colonies have a higher proportion of forager bees than small ones. The table shows the directly proportional relationship between the weight of the bee population and the colony's honey production.
Slides for the second meeting of the course 'Big Data and Automated Content Analysis' at the Department of Communication Science, University of Amsterdam
This document summarizes a presentation on analyzing word co-occurrences in text data using network analysis techniques. It discusses counting the frequency of word combinations, representing the co-occurrence data as a network with nodes for words and edges for co-occurrences, and visualizing the network in Gephi. It also provides an example analysis of tweets about a political debate, examining which topics were emphasized by each candidate based on word associations on Twitter.
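The counting step described above can be sketched in plain Python: treat every unordered pair of distinct words in a document (here, a tweet) as one co-occurrence, and the resulting counts become weighted edges that can be exported to a tool such as Gephi. The tweets below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

tweets = [["debate", "economy", "tax"],
          ["debate", "immigration"],
          ["economy", "tax", "jobs"]]   # already tokenized

cooc = Counter()
for tokens in tweets:
    # sorted() makes (a, b) and (b, a) count as the same undirected edge
    for pair in combinations(sorted(set(tokens)), 2):
        cooc[pair] += 1

# each entry is a network edge: two word nodes plus a weight
edges = [(a, b, weight) for (a, b), weight in cooc.items()]
```

Writing `edges` to a CSV file with columns Source, Target, Weight gives an edge list that Gephi can import directly.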
This document summarizes research on visualizing the melting behavior of palmitic acid inside a spherical container. It provides background on phase change materials (PCMs) and their applications in energy storage. The experimental setup used a glass spherical container filled with palmitic acid and thermocouples to observe melting over time. Results showed conduction heat transfer initially, with natural convection strengthening melting at the bottom later. The design and properties of the PCM and container impact energy storage optimization.
The document contains tips for spelling 100 commonly misspelled words in English. For each word, it provides a mnemonic or rule to help remember the correct spelling. For example, it notes that "accommodate" can accommodate both double c and double m, and that "apparent" must pay the rent. The overall document aims to help improve spelling through associative memory techniques.
The document describes the properties and types of colors. It explains that pigment colors have a real physical form and can be mixed as color samples, whereas light colors are produced by mixing red, green and blue light. It then defines the terms hue, saturation and brightness to describe the qualities of colors, and mentions some types of color harmony such as complementary, warm and cool, and chiaroscuro.
ViennaPharmacia is an Austrian joint venture company that produces and sells food supplements, drugs, and other pharmaceutical products. It uses tested recipes, often from old sources, to develop natural and effective products. The company manages the entire production process from organizing recipes to packaging and delivery. It offers private label production and exports products internationally through a network of partners in many countries. All products are manufactured according to GMP and HACCP quality standards. Two highlighted products are Peaceful Sleep, which contains valerian and other extracts to promote sleep, and Ginger Kids, a ginger extract supplement proven safe and effective for children ages 3 and up.
This document proposes conceptualizing and measuring news exposure as a network of users and news items. It outlines some common assumptions about news consumption that are outdated, such as people using a fixed set of news outlets. The document then presents a model where news items and users are represented as nodes, and their connections as edges. For example, an edge between a user and news item would indicate the user has read that item. Implementing this model involves collecting data on users, news items, and their connections to build a graph database that can be used to analyze news diffusion and exposure. This network approach is presented as an improved way to measure individual-level exposure to specific news items in today's unbundled media environment.
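A minimal sketch of such a user-item network in Python, with hypothetical node names (a real implementation would use a graph database, as the document notes):

```python
# edges of a bipartite "has read" network: (user node, news-item node)
exposure = {
    ("user:anna", "item:article_17"),
    ("user:anna", "item:article_42"),
    ("user:ben",  "item:article_42"),
}

# exposure to a specific item: who read article_42?
readers = {user for (user, item) in exposure if item == "item:article_42"}

# individual-level exposure: how many items did anna read?
anna_reach = sum(1 for (user, item) in exposure if user == "user:anna")
```

Because both questions are just edge queries, the same data supports item-centric analyses (diffusion of one story) and user-centric analyses (a person's news diet).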
This was a presentation I did for Saudi Aramco during an interview on people, culture, and perceptions in today's market. It gives an insight into where our world is heading when it comes to the importance of diversity in the workplace.
The Roman Republic was founded in 509 BC after the overthrow of the last Roman king. It established a balanced system of government with two consuls as executive magistrates, a senate of wealthy landowners, and popular assemblies for citizens. The republic expanded Roman control through military conquests including the Punic Wars against Carthage, led by generals like Scipio Africanus. These victories established Rome as the dominant power in the Mediterranean world.
This document announces and provides information about the 2nd Andean Summit on Anti-Corruption Compliance & Enforcement taking place from November 16-18, 2016 in Bogota, Colombia. The summit will address the increasing importance of anti-corruption compliance for businesses operating in the Andean region. It will feature practical panels and discussions on successfully conducting business in compliance with international and local laws. Topics will include beneficial ownership, regional anti-corruption legislation and enforcement, OECD membership requirements, compliance programs, infrastructure projects, third party risks, economic sanctions, political landscapes, and more. The event is aimed at legal, compliance and ethics professionals and will provide CLE/CPE credits.
This document presents conversion exercises between different number systems such as binary, decimal, octal and hexadecimal. It includes exercises on addition, subtraction, multiplication and complement-based subtraction with numbers in these systems. It also briefly explains the ASCII code and its character table.
Prokaryotic chromosomes consist of a single circular strand of DNA attached to the cell membrane, whereas eukaryotic chromosomes contain several linear strands of DNA packaged with proteins such as histones inside the cell nucleus. Prokaryotic cells may also contain additional DNA in the form of plasmids.
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python. First Meeting.
This document provides an overview of a course on Big Data and Automated Content Analysis. It introduces the instructor, Damian Trilling, and a PhD student, Joanna Strycharz. It then discusses definitions of Big Data, implications and criticisms, and whether the techniques used in the course constitute Big Data research. Next, it outlines methods that will be covered, including data collection, analysis techniques, and the programming language Python. Finally, it discusses reasons for building one's own tools rather than using commercial software.
Quantitative data refers to numerical data that can be analyzed statistically. This document discusses various types of quantitative data like counts, measurements, and projections. It also describes common methods for analyzing quantitative data such as surveys, cross-tabulation, trend analysis, and gap analysis. The advantages of quantitative data include conducting in-depth research with minimum bias and accurate results. However, quantitative data also has limitations like providing restricted information and results depending on the question types used to collect the data.
This document discusses using R for Twitter data analytics. It outlines the basics of Twitter data analytics using R, including collecting real-time Twitter data, text mining techniques for Twitter data, and sentiment analysis. Some key steps involved are exploring the Twitter corpus, preprocessing the text by removing stopwords and stemming words, creating a document-term matrix, and calculating TF-IDF weights. Cosine similarity is used to measure similarity between text documents. The goal is to extract useful patterns and insights from large amounts of Twitter data in real-time.
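The deck works in R, but the TF-IDF weighting and cosine similarity it describes are tool-independent. A minimal pure-Python sketch, using toy tokenized tweets rather than real Twitter data:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: raw term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    def norm(vec):
        return math.sqrt(sum(w * w for w in vec.values()))
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return dot / (norm(u) * norm(v))

tweets = [["tax", "cuts", "economy"],
          ["economy", "tax", "policy"],
          ["football", "match", "tonight"]]
vecs = tfidf_vectors(tweets)
# tweets 0 and 1 share terms, so they are more similar than tweets 0 and 2
```

In practice one would first apply the preprocessing the deck covers (stopword removal, stemming) before building the document-term matrix.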
Data analysis is the process of identifying trends, patterns, and correlations in vast amounts of raw data in order to make data-informed decisions. It applies well-known statistical techniques, such as clustering and regression, to larger datasets with the assistance of modern tools.
Diamonds in the Rough (Sentiment(al) Analysis), by Scott K. Wilder
Gary Angel and Scott K. Wilder presented on sentiment analysis. Gary is the president of Semphonic, a web analytics consultancy, and Scott is a digital strategist. They discussed how sentiment analysis works, its limitations, and best practices for using it. Specifically, they noted that sentiment analysis provides anecdotal, not primary, insights and that the most accurate approach combines automated tools with manual review of verbatim comments.
Slides for the course Big Data and Automated Content Analysis, in which students of the social sciences (communication science) learn how to conduct analyses using Python.
Social CI: A Work method and a tool for Competitive Intelligence Networking, by Comintelli
This presentation is from a webinar hosted by Comintelli and is a part of a project called CIBAS: a collaboration with the Department of Media Technology at Södertörn University in Stockholm, Sweden. A new notion called social CI is introduced, meaning competitive intelligence (CI) for the networking organization. A conceptual framework for social CI is presented that is based on a socio-technical perspective combining both social and technical aspects. The presented framework is related to notions such as Enterprise 2.0 and wikinomics. A research design prototype of a tool for collaborative CI, CoCI, is also demonstrated. CoCI is a tool that has been developed using the Social CI framework that demonstrates how CI methods and CI tools can be developed together using a socio-technical approach.
This document provides an overview of sentiment analysis on Twitter. It discusses how sentiment analysis can be used to determine sentiment behind texts and social media updates. The document outlines the methodology used, including data collection from Twitter, preprocessing, feature extraction, and using classifiers like Naive Bayes to predict sentiment. It also discusses applications of sentiment analysis and future areas of improvement, such as using different algorithms and temporal analysis. The goal is to more accurately analyze human sentiment from social media data.
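The classification step (features from text, then a Naive Bayes classifier) can be sketched with scikit-learn; the training tweets and labels below are invented stand-ins for the labeled Twitter data the methodology assumes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny hand-labeled training set (illustrative only)
train_texts = ["love this phone", "great service today", "what a wonderful day",
               "hate the new update", "terrible battery life", "worst app ever"]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# bag-of-words feature extraction + Naive Bayes in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

prediction = model.predict(["what a great day"])[0]
```

A real application would train on thousands of labeled tweets and hold out part of the data to estimate accuracy, as the document's methodology section describes.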
This document presents a study on detecting suicidal ideation using machine learning techniques. It discusses related work on sentiment analysis and suicide detection. The objectives are to design an efficient suicidal analysis method and create a Bangla dataset. Unsupervised learning techniques like k-means clustering, hierarchical clustering, and principal component analysis are proposed for the analysis. The procedure involves data preprocessing, applying algorithms, and calculating accuracy. The work will analyze existing datasets, develop a new Bangla dataset, and design methods to effectively detect suicidal thoughts.
Citi Global T4I Accelerator Data and Analytics Presentation, by Marquis Cabrera
Presented on data and analytics for the Citi T4I Global Social Good Accelerator, which is an open innovation initiative seeking to source tech solutions that promote integrity around the world.
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
Juliette Melton - Mobile User Experience Research, by Web Directions
Most user experience research takes place sitting behind a computer. And yet these days, most networked experiences are happening on mobile devices. Some common user experience research methods work well in a mobile environment — others don’t. In this talk, Juliette Melton will guide you through how to use some great existing research methods in a mobile context, how to incorporate some new (and fun!) methods into your arsenal, and propose next generation tools and services to make mobile user experience research even better.
Juliette has ten years of experience building, managing, and researching digital environments and is a human factors researcher based at IDEO in San Francisco. She’s deeply interested in the intersections between digital culture, learning, and communication. Her work has spanned a broad range of industries including social media, casual gaming, education administration, electronic publishing, corporate banking, computer hardware, and public health.
Community education — through workshops, lectures, and writing — is an important part of her work. Remote user experience methods, agile project management, and research program planning are frequent topics.
Juliette holds an MEd from the Technology, Innovation, and Education program at the Harvard Graduate School of Education where she focused on developing models for innovative networked learning applications. She also has a BA in Comparative Literature from Haverford College.
Follow Juliette on Twitter: @j
Intellexy social media analysis solutions d2011, by Maya Marashlian
The document discusses the importance of social media monitoring and analysis for businesses. It notes that consumers now generate more brand content than companies and that companies no longer control what is said about their brands online. It advocates listening to social media conversations to better understand consumer perceptions and influence discussions. The company described, Intellexy, provides social media monitoring and analysis services to help clients understand conversations, sentiment, trends and more to improve their brands.
Intellexy Social Media Monitoring and Analysis Solutions D2011, by MayaMar
The document discusses the importance of social media monitoring and analysis for businesses. It notes that consumers now generate more brand content than companies and that companies no longer control what is said about their brands online. It argues that marketing's role is to change conversations by having more people speak positively or fewer speak negatively about a brand. The document introduces Intellexy as a company that provides social media monitoring and analysis solutions to help companies listen to conversations and extract insights from social media data through human analysis.
This document provides an introduction to recommender systems. It discusses how recommender systems can help users filter through large amounts of information and options in an era of information overload. It describes different types of recommender systems, including content-based, collaborative filtering, and context-based recommender systems. The document also discusses challenges like sparsity in data and scaling to large datasets, and how modeling approaches can help address these challenges.
This document provides an overview of using statistics in Python with Pandas. It discusses general considerations for using Python for statistics rather than exporting data to another program. Useful Python packages for statistics like NumPy, SciPy, statsmodels, and matplotlib are introduced. The document demonstrates how to work with Pandas dataframes, including descriptive statistics, plotting, and linear regression. An upcoming exercise will provide hands-on practice of these skills.
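The Pandas workflow described (dataframe, descriptive statistics, a simple bivariate statistic) can be sketched as follows; the survey-style variables are hypothetical:

```python
import pandas as pd

# hypothetical survey data: respondents' age and minutes of news use per day
df = pd.DataFrame({"age":          [21, 35, 48, 52, 29],
                   "news_minutes": [10, 25, 40, 45, 15]})

print(df.describe())                    # descriptive statistics per column
r = df["age"].corr(df["news_minutes"])  # Pearson correlation
```

For a full regression table, `statsmodels.formula.api.ols("news_minutes ~ age", data=df).fit().summary()` would be the usual next step, which is the kind of analysis the statsmodels part of the slides covers.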
This document discusses last week's coding exercise on data harvesting and storage. It provides step-by-step explanations of code used to extract and analyze data from a JSON file. Examples include printing video titles, calculating average tags per video, determining the most commented on porn category, and finding the most frequently used words. The document also covers APIs, scrapers, file formats like JSON and CSV, and how to store extracted data.
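The JSON-processing steps walked through in the exercise can be sketched like this; the field names below are hypothetical, not necessarily those of the actual exercise file:

```python
import json

# a stand-in for the downloaded JSON file: a list of video records
raw = '''[{"title": "Video A", "tags": ["news", "politics"], "comments": 12},
          {"title": "Video B", "tags": ["sports"],           "comments": 4}]'''

videos = json.loads(raw)

titles = [v["title"] for v in videos]                         # print every title
avg_tags = sum(len(v["tags"]) for v in videos) / len(videos)  # average tags per video
top = max(videos, key=lambda v: v["comments"])["title"]       # most-commented video
```

Counting the most frequent words would follow the same pattern, feeding the tokenized texts into a `collections.Counter`.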
This document provides an introduction to basic Python programming concepts like datatypes, functions, modifying lists and dictionaries, and indentation. It explains that Python uses indentation through spaces or tabs to structure code blocks that are executed repeatedly or under certain conditions. Examples are given for defining functions, appending and merging lists, adding keys to dictionaries, and using indentation with for, if/elif/else, and try/except blocks.
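A small example pulling those basics together: a function, indentation-based `for` and `if` blocks, appending to a list, and adding a key to a dictionary (the variable names are illustrative):

```python
def count_long_words(words, minlen=8):
    """Return how many words have at least `minlen` characters."""
    n = 0
    for word in words:            # the indented block runs once per word
        if len(word) >= minlen:   # deeper indentation: runs only when True
            n += 1
    return n

topics = ["data", "analysis"]
topics.append("communication")              # modifying a list in place
results = {}
results["long"] = count_long_words(topics)  # adding a key to a dict
```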
This document outlines an introductory course on big data and automated content analysis. It covers using the Linux command line, writing and running Python code, and announces upcoming meetings. The course will introduce tools like the Linux terminal and Python, explain why they are useful for big data tasks, and provide exercises for students to practice these skills, such as writing simple Python programs. Upcoming meetings are scheduled for weeks 2 and 3 to continue lectures and lab sessions on using Python.
This document summarizes a presentation on unsupervised and supervised machine learning techniques for automated content analysis. It recaps types of automated content analysis, describes unsupervised techniques like principal component analysis (PCA) and latent Dirichlet allocation (LDA), and supervised machine learning techniques like regression. It provides examples of applying these techniques to cluster Facebook messages and predict newspaper reading. The document concludes by noting the presenter will use a portion of labeled data to estimate models and check predictions against the remaining labeled data.
The document discusses web scraping and outlines a step-by-step process for scraping comments from a Dutch website called GeenStijl. It begins with using regular expressions to scrape the comments, but notes that existing parsers can make the process more elegant, especially for complex websites. It then demonstrates using the lxml module and XPath to scrape reviews from another site in a more structured way. The document provides remarks on regular expressions and XPath, and encourages exploring different scraping techniques.
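The lxml/XPath step can be sketched as follows. The HTML snippet is a tiny stand-in for a downloaded page (its class names are invented); in practice you would first fetch the source, e.g. with `requests.get(url).text`:

```python
from lxml import html

page = """<html><body>
  <div class="review"><p class="text">Great read</p></div>
  <div class="review"><p class="text">Not my taste</p></div>
</body></html>"""

tree = html.fromstring(page)
# XPath: the text of every <p class="text"> inside a <div class="review">
reviews = tree.xpath('//div[@class="review"]/p[@class="text"]/text()')
```

Compared with regular expressions on the raw HTML, the parser handles nesting and attribute order for you, which is the "more elegant" point the document makes.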
This document provides an overview of a presentation on automated content analysis using regular expressions and natural language processing. The presentation covers topics like bottom-up vs top-down analysis, what regular expressions are and how they can be used in Python, stemming, parsing sentences, and combining techniques like stemming and stopword removal. Examples are given on using regular expressions to count actors in articles and check the number of a document from LexisNexis. The takeaway message is about an upcoming take-home exam and future meetings.
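Counting actor mentions with regular expressions, as the example in the slides does, can be sketched like this (the article text and actor names are invented for illustration):

```python
import re

article = "Minister Jansen criticised the plan; Jansen and Pietersen replied."

actors = ["Jansen", "Pietersen", "De Vries"]  # hypothetical actor list
# \b word boundaries stop "Jansen" from matching inside a longer word
counts = {name: len(re.findall(r"\b" + re.escape(name) + r"\b", article))
          for name in actors}
```

`re.escape()` matters once actor names contain characters with a special meaning in regular expressions, such as dots.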
This document discusses a lecture on data harvesting and storage. It covers APIs, RSS feeds, scraping and crawling as methods for collecting data from various sources. It also discusses storing data in formats like CSV, JSON, and XML. The document provides code examples for working with JSON data and discusses tools for long-term data collection like DMI-TCAT.
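The storage trade-off the lecture discusses (nested JSON versus flat CSV) can be sketched with the standard library; the records are invented examples:

```python
import csv
import io
import json

comments = [{"author": "user1", "text": "first comment"},
            {"author": "user2", "text": "second comment"}]

as_json = json.dumps(comments)   # JSON preserves nesting and types

buf = io.StringIO()              # CSV is flat, but opens in Excel/SPSS
writer = csv.DictWriter(buf, fieldnames=["author", "text"])
writer.writeheader()
writer.writerows(comments)
as_csv = buf.getvalue()
```

With real files you would replace the `StringIO` buffer with `open("comments.csv", "w", newline="")`.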
BDACA1516s2 - Lecture4
1. Different types of analysis Sentiment analysis Stopword removal Next meetings
Big Data and Automated Content Analysis
Week 4 – Monday
»Sentiment Analysis«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
18 April 2016
Big Data and Automated Content Analysis Damian Trilling
2.
Today
1 Different types of analysis
What can we do?
Systematizing analytical approaches
2 Data analysis 1: Sentiment analysis
What is it?
Bag-of-words approaches
Advanced approaches
A sentiment analysis tailored to your needs!
3 NLP-preview: Stopword removal
Natural language processing
A simple algorithm
4 Take-home message, next meetings, & exam
4.
What we already can do
with regard to data collection:
• query a (JSON-based) API (GoogleBooks, Twitter)
• handle CSV files
• handle JSON files
with regard to analysis:
Not much. We counted some frequencies and calculated some
averages.
6.
Data analysis 0: Overview
What can we do?
What do you think? What are interesting methods to analyze large data sets (e.g., social media data)? What questions can they answer?
7.
What else can we do?
For example
• sentiment analysis
• automated coding with regular expressions
• natural language processing
• supervised and unsupervised machine learning
• network analysis
8.
What else can we do?
Or ideally. . .
. . . a combination of these techniques.
9.
Overview
Systematizing analytical approaches
11.
Systematizing analytical approaches
Taking the example of Twitter:
Analyzing the structure
• Number of Tweets over time
• singleton/retweet ratio
• Distribution of number of Tweets per user
• Interaction networks
⇒ Focus on the amount of content and on the question of who interacts with whom, not on what is said
Bruns, A., & Stieglitz, S. (2013). Toward more systematic Twitter analysis: metrics for tweeting activities.
International Journal of Social Research Methodology. doi:10.1080/13645579.2012.756095
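Several of these structural metrics can be computed with a few lines of standard Python. A minimal sketch, assuming a hypothetical list of (user, is_retweet) records rather than any real dataset:

```python
from collections import Counter

# hypothetical records of (user, is_retweet) -- for illustration only
tweets = [("anna", False), ("bob", True), ("anna", False),
          ("carl", True), ("anna", True), ("bob", False)]

# distribution of the number of tweets per user
per_user = Counter(user for user, _ in tweets)

# singleton/retweet ratio
retweets = sum(1 for _, is_rt in tweets if is_rt)
singletons = len(tweets) - retweets

print(per_user.most_common())   # who tweets how often
print(singletons / retweets)    # singleton/retweet ratio
```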
13.
Systematizing analytical approaches
Taking the example of Twitter:
Analyzing the content
• Sentiment analysis
• Word frequencies, searchstrings
• Co-word analysis (⇒frames)
⇒ Focus on what is said
14.
⇒ It depends on your research question which approach is more interesting!
15.
Data analysis 1:
Sentiment analysis
21.
What is sentiment analysis?
Extracting subjective information from texts
• the author’s attitude towards the topic of the text
• polarity: negative—positive
• subjectivity: neutral—subjective *
• advanced approaches: different emotions
* Less sophisticated approaches do not see this as a separate dimension but simply calculate objectivity = 1 − (negativity + positivity)
22.
Applications
Who uses it?
• Companies
• especially for Web Analytics
• Social Scientists
• applications in data journalism, politics, . . .
Many references to examples in Mostafa (2013).
⇒ Cases in which you have a huge amount of data or real-time
data and you want to get an idea of the tone.
Mostafa, M. M. (2013). More than words: Social networks’ text mining for consumer brand sentiments. Expert
Systems with Applications, 40(10), 4241– 4251. doi:10.1016/j.eswa.2013.01.019
23.
Example
>>> sentiment("Great service by @NSHighspeed")
(0.8, 0.75)
>>> sentiment("Bad service by @NSHighspeed")
(-0.6166666666666667, 0.6666666666666666)
(polarity, subjectivity) with
−1 ≤ polarity ≤ +1
0 ≤ subjectivity ≤ +1
This is the module pattern.nl, available for Python 2 only.
De Smedt, T., & Daelemans W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2063-2067.
24.
Data analysis 1: Sentiment analysis
Bag-of-words approaches
28.
Bag-of-words approaches
How does it work?
• We take each word of a text and check whether it is positive or negative.
• Most simple way: compare it with a list of negative words and with a list of positive words (that's what Mostafa (2013) did)
• More advanced: look up a subjectivity score from a table
• e.g., add up the scores and average them.
29.
How to do this
If you were to run an analysis like the one by Mostafa (2013), how could you do this?
30.
How to do this
(given a string tekst that you want to analyze and two lists of strings with negative
and positive words, lijstpos=["great","fantastic",...,"perfect"] and
lijstneg)
sentiment=0
for woord in tekst.split():
    if woord in lijstpos:
        sentiment=sentiment+1   # same as sentiment+=1
    elif woord in lijstneg:
        sentiment=sentiment-1   # same as sentiment-=1
print(sentiment)
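A slightly more robust sketch of the same idea (not part of the original exercise): lowercasing tokens and stripping surrounding punctuation before the lookup, so that "Great" and "great," are counted as well:

```python
import string

def sentiment_score(tekst, lijstpos, lijstneg):
    """Count positive minus negative words in a text (simple bag of words)."""
    score = 0
    for woord in tekst.split():
        # normalize: lowercase and strip surrounding punctuation
        woord = woord.lower().strip(string.punctuation)
        if woord in lijstpos:
            score += 1
        elif woord in lijstneg:
            score -= 1
    return score

print(sentiment_score("Great service, really great!", ["great"], ["bad"]))
```

With real word lists, normalizing the tokens this way typically recovers many matches that the plain version misses.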
31.
Do we need to have the lists in our program itself?
No.
You could have them in a separate text file, one word per line, and then read that file directly into a list.
poslijst=open("filewithonepositivewordperline.txt").read().splitlines()
neglijst=open("filewithonenegativewordperline.txt").read().splitlines()
32.
33.
More advanced versions
• CSV files or similar tables with weights
• Or some kind of dict?
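The dict idea can be sketched as follows; the weights below are invented for illustration and do not come from any published lexicon:

```python
# hypothetical weights for illustration, not from a real sentiment lexicon
weights = {"great": 0.8, "good": 0.5, "bad": -0.6, "terrible": -0.9}

def weighted_sentiment(tekst):
    """Average the scores of all words that appear in the weight table."""
    scores = [weights[w.lower()] for w in tekst.split() if w.lower() in weights]
    if not scores:
        return 0.0   # no known words: treat as neutral
    return sum(scores) / len(scores)

print(weighted_sentiment("good but terrible service"))   # (0.5 + -0.9) / 2
```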
34.
35.
38.
Mostafa (2013): Interpreting the output
Your thoughts?
• each word counts equally (1)
• many tweets contain no words from the list. What does this mean?
• Ways to improve BOW approaches?
40.
Bag-of-words approaches
pro
• easy to implement
• easy to modify:
• add or remove words
• make new lists for other languages, other categories (than
positive/negative), . . .
• easy to understand (transparency, reproducibility)
e.g., Schut, L. (2013). Verenigde Staten vs. Verenigd Koningrijk: Een automatische inhoudsanalyse naar
verklarende factoren voor het gebruik van positive campaigning en negative campaigning door vooraanstaande
politici en politieke partijen op Twitter. Bachelor Thesis, Universiteit van Amsterdam.
41.
Bag-of-words approaches
con
• simplistic assumptions
• e.g., intensifiers cannot be interpreted ("really" in "really
good" or "really bad")
• or, even more importantly, negations.
42.
Data analysis 1: Sentiment analysis
Advanced approaches
43.
Improving the BOW approach
Example: The SentiStrength algorithm
• scores from −5 to −1 and from +1 to +5
• spelling correction
• "booster word list" for strengthening/weakening the effect of
the following word
• interpreting repeated letters ("baaaaaad"), CAPITALS and !!!
• idioms
• negation
• . . .
Thelwall, M., Buckley, K., & Paltoglou, G. (2012). Sentiment strength detection for the social Web. Journal of the
American Society for Information Science and Technology, 63(1), 163-173.
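Some of these ideas (booster words, negation) can be sketched on top of the simple word-count approach. The mini-lexicons below are invented placeholders; the actual SentiStrength algorithm is considerably more elaborate:

```python
# invented mini-lexicons for illustration only
positive = {"good": 2, "great": 3}
negative = {"bad": -2, "awful": -4}
boosters = {"really": 1, "very": 1}   # strengthen the following sentiment word
negations = {"not", "never"}          # flip the following sentiment word

def mini_sentistrength(text):
    """Toy scorer: boosters strengthen the next sentiment word, negations flip it."""
    score = 0
    boost = 0
    negate = False
    for w in text.lower().split():
        if w in boosters:
            boost += boosters[w]
            continue
        if w in negations:
            negate = True
            continue
        if w in positive or w in negative:
            s = positive.get(w, 0) + negative.get(w, 0)
            s += boost if s > 0 else -boost   # boosting pushes away from zero
            if negate:
                s = -s
            score += s
        boost, negate = 0, False   # boosters/negations only affect the next word
    return score

print(mini_sentistrength("really good"))   # boosted: 2 becomes 3
print(mini_sentistrength("not good"))      # negated: 2 becomes -2
```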
44.
Advanced approaches
Take the structure of a text into account
• Try to apply linguistic concepts to identify sentence structure
• can identify negations
• can interpret intensifiers
45.
Example
>>> from pattern.nl import sentiment
>>> sentiment("Great service by @NSHighspeed")
(0.8, 0.75)
>>> sentiment("Really")
(0.0, 1.0)
>>> sentiment("Really Great service by @NSHighspeed")
(1.0, 1.0)
(polarity, subjectivity) with
−1 ≤ polarity ≤ +1
0 ≤ subjectivity ≤ +1
Unlike in pure bag-of-words approaches, here, the overall sentiment is not just
the sum or the average of its parts!
De Smedt, T., & Daelemans W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2063-2067.
48.
Advanced approaches
pro
• understand intensifiers or negation
• thus: higher accuracy
con
• Black box? Or do we understand the algorithm?
• Difficult to adapt to own needs
• do they really give much better results?
49.
Data analysis 1: Sentiment analysis
A sentiment analysis tailored to your needs!
50.
Identifying suicidal texts
• Bag-of-words-approach with very specific dictionary
• added negation
• added regular expression search for key phrases
• Very specific design requirements: false positives are OK, false negatives are not!
Huang, Y.-P., Goh, T., & Liew, C.L. (2007). Hunting suicide notes in web 2.0 – preliminary findings. Ninth IEEE
International Symposium on Multimedia. Retrieved from
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4476021
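Such a design can be sketched as a dictionary lookup with a negation check, plus a regular-expression pass for key phrases. The word list and patterns below are invented placeholders, not those used by Huang et al. (2007):

```python
import re

# invented placeholder lexicon and patterns -- not those from the study
keywords = {"hopeless", "goodbye"}
negations = {"not", "never", "no"}
keyphrases = [re.compile(r"\bcannot go on\b"), re.compile(r"\bend it all\b")]

def flag_text(text):
    """Flag a text if it contains a non-negated keyword or a key phrase.
    Deliberately over-triggers: false positives are acceptable by design."""
    words = text.lower().split()
    for i, w in enumerate(words):
        if w in keywords:
            if i > 0 and words[i - 1] in negations:
                continue   # keyword is negated, skip it
            return True
    return any(p.search(text.lower()) for p in keyphrases)

print(flag_text("I feel hopeless"))    # True
print(flag_text("I am not hopeless"))  # False
print(flag_text("I cannot go on"))     # True
```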
52.
This still relatively simple approach already seems to work satisfactorily, but if 106 scientists from 24 competing teams (!) work on it, they can
group suicide notes by these characteristics:
• swear
• family
• friend
• positive emotion
• negative emotion
• anxiety
• anger
• sad
• cognitive process
• biology
• sexual
• ingestion
• religion
• death
Pestian, J.P.; Matykiewicz, P., Linn-Gust, M., South, B., Uzuner, O., Wiebe, J., Cohen, K.B., Hurdle, J., & Brew,
C. (2012). Sentiment analysis of suicide notes: A shared task. Biomedical Informatics Insights, 5(1), p. 3-16.
Retrieved from http://europepmc.org/articles/PMC3299408?pdf=render
53.
Sidenote
An alternative state-of-the-art approach:
Use supervised machine learning
• Instead of defining rules, hand-code (“annotate”) the
sentiment of some tweets manually and let the computer find
out which words or characters (“features”) predict sentiment
• Then use this model to predict sentiment for other tweets
• Essentially the same as what you have known since the second year of your Bachelor: regression analysis (but now with sentiment as the DV and word occurrences as the IVs)
• ⇒ week 8
Gonzalez-Bailon, S., & Paltoglou, G. (2015). Signals of public opinion in online communication: A comparison of
methods and data sources. The ANNALS of the American Academy of Political and Social Science, 659(1),
95-–107.
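The regression analogy can be sketched in plain Python: a tiny logistic regression with word occurrences as the independent variables and hand-coded sentiment as the dependent variable. This is toy data and a toy optimizer; real work would use an established library and far more annotated examples:

```python
import math

# hand-coded ("annotated") toy training data
docs = ["great service", "really great", "bad service", "really bad"]
labels = [1, 1, 0, 0]   # 1 = positive sentiment, 0 = negative

# word occurrences as features (the "IVs")
vocab = sorted({w for d in docs for w in d.split()})
X = [[d.split().count(w) for w in vocab] for d in docs]

# fit a logistic regression by plain gradient descent
weights = [0.0] * len(vocab)
bias = 0.0
for _ in range(500):
    for x, y in zip(X, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        p = 1 / (1 + math.exp(-z))   # predicted probability of "positive"
        err = y - p
        weights = [w + 0.5 * err * xi for w, xi in zip(weights, x)]
        bias += 0.5 * err

def predict(text):
    x = [text.split().count(w) for w in vocab]
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

print(predict("great service"))   # classified as positive
print(predict("bad service"))     # classified as negative
```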
54.
Natural Language Processing preview:
Stopword removal
55.
Why now? Because the logic of the algorithm is very much related to that of our first simple sentiment analysis
56.
Stopword removal: What and why?
Why remove stopwords?
• If we want to identify key terms (e.g., by means of a word
count), we are not interested in them
• If we want to calculate document similarity, it might be
inflated
• If we want to make a word co-occurrence graph, irrelevant information will dominate the picture
57.
Stopword removal: How
testo='He gives her a beer and a cigarette.'
testonuovo=""
stopwords=['and','the','a','or','he','she','him','her']
for verbo in testo.split():
    if verbo not in stopwords:
        testonuovo=testonuovo+verbo+" "
What do we get if we do:
print(testonuovo)
Can you explain the algorithm?
58.
We get:
>>> print(testonuovo)
He gives beer cigarette.
Why is "He" still in there?
How can we fix this?
59.
Stopword removal
testo='He gives her a beer and a cigarette.'
testonuovo=""
stopwords=['and','the','a','or','he','she','him','her']
for verbo in testo.split():
    if verbo.lower() not in stopwords:
        testonuovo=testonuovo+verbo+" "
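One remaining issue: "cigarette." would not match a stopword entry "cigarette" because of the attached period. A sketch of a version that also strips punctuation (a hypothetical extension, not part of the exercise):

```python
import string

testo = 'He gives her a beer and a cigarette.'
stopwords = ['and', 'the', 'a', 'or', 'he', 'she', 'him', 'her']

def remove_stopwords(text, stopwords):
    """Return the text without stopwords, ignoring case and punctuation."""
    kept = []
    for word in text.split():
        if word.lower().strip(string.punctuation) not in stopwords:
            kept.append(word)
    return " ".join(kept)

print(remove_stopwords(testo, stopwords))   # gives beer cigarette.
```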
60.
Take-home message
Mid-term take-home exam
Next meetings
61.
Take-home messages
What you should be familiar with:
• You should have completely understood last week’s exercise. Re-read it if necessary.
• Approaches to the analysis (e.g., structure vs. content)
• Types of sentiment analysis, application areas, pros and cons
62.
Mid-term take-home exam
Week 5: Monday, 25 April, to Sunday, 1 May
• You get the exam on Monday at the end of the meeting
• Answers have to be handed in no later than Sunday, 23.59
• 30% of final grade
• 3 questions:
1 Literature question: Different methods (“Explain how. . . is
done”) and/or epistemological or theoretical implications
(“What does this mean for social-scientific research?”)
2 Empirical question (conceptual)
3 Empirical question (actual programming task)
If you fully understood all exercises until now, it shouldn’t be
difficult and won’t take too long. But give yourself a lot of
buffer time!!!
63.
Next meetings
Wednesday, 20 April
We work together on the examples in Chapter 5. You can bring
your own data (you will probably learn more!), but then already
think about how to write some script to read the data (as we did
last week and as described in Chapter 4).