The document describes a scalable topic-specific influence analysis model called Followship-LDA (FLDA) for analyzing microblog networks. FLDA extends Latent Dirichlet Allocation to model both the textual content and social links of each user, and identifies influential users for different topics. The document outlines FLDA's generative process, describes a distributed Gibbs sampling algorithm for efficient inference, and presents a search framework that takes topic queries as input and outputs influential users. Experimental results on a Twitter dataset show FLDA can identify relevant influential users for various topics.
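The FLDA model itself adds per-user link variables, but its inference core is the same collapsed Gibbs sampling used for plain LDA. As a rough illustration of that core (plain LDA only, toy corpus, all names and hyperparameter values are illustrative, not taken from the paper):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA; FLDA layers link modeling on top."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})                 # vocabulary size
    ndk = [[0] * n_topics for _ in docs]                  # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]     # topic-word counts
    nk = [0] * n_topics                                   # topic totals
    z = []                                                # topic of each token
    for d, doc in enumerate(docs):                        # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k); ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                               # remove token's count
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z = t | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)           # sample a new topic
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k                               # add counts back
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

The distributed version described in the paper partitions tokens across workers and periodically synchronizes the global topic-word counts; the sketch above is the single-machine baseline.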
This document summarizes tag-based recommenders and social tagging systems. It discusses:
1) Social tagging systems allow users to collaboratively tag and categorize content. Popular social tagging sites include Delicious, Flickr, YouTube, etc. Tagging systems have features like tag sharing and selection.
2) Tag recommenders aim to encourage tagging and reuse of common tags. Recommender techniques discussed include most popular, collaborative filtering, tensor factorization, and graph-based methods.
3) The document presents the speaker's work on tag-based collaborative filtering which improves neighbor selection by considering tag semantic similarity between users. Their IUI 2008 paper shows their tag-based approach improves recommendation performance over traditional collaborative filtering.
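The key idea in that line of work, replacing rating-overlap neighbor selection with tag-profile similarity, can be sketched as follows (a minimal illustration; the profile data and function names are hypothetical, not from the paper):

```python
import math
from collections import Counter

def tag_cosine(u_tags, v_tags):
    """Cosine similarity between two users' tag-frequency profiles."""
    cu, cv = Counter(u_tags), Counter(v_tags)
    dot = sum(cu[t] * cv[t] for t in cu)
    nu = math.sqrt(sum(c * c for c in cu.values()))
    nv = math.sqrt(sum(c * c for c in cv.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tag_based_neighbors(user, profiles, k=2):
    """Rank candidate neighbors by tag similarity instead of rating overlap."""
    scores = [(v, tag_cosine(profiles[user], tags))
              for v, tags in profiles.items() if v != user]
    return sorted(scores, key=lambda p: -p[1])[:k]
```

Once neighbors are chosen this way, the usual collaborative-filtering prediction step (a similarity-weighted average of neighbors' ratings) proceeds unchanged.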
The document provides an overview of question answering systems, including their evolution from information retrieval, common evaluation benchmarks like TREC and CLEF, and examples of major QA projects like Watson. It also discusses the movement towards leveraging semantic technologies and linked open data to power next generation QA systems, as seen in projects like SINA which transform natural language queries into formal queries over structured knowledge bases.
Tutorial on Relationship Mining In Online Social Networks (pjing2)
This document provides a tutorial on relationship mining in online social networks. It begins with introductions to basic concepts like defining the relationship mining task and relationship concepts from sociology. It then discusses how text mining can help with relationship mining by extracting features from text data. It outlines several sub-fields for relationship mining, including data acquisition/storage, different relationship mining approaches, and associating user attributes with relationships. The document concludes by discussing specific relationship mining systems.
The document summarizes collaborative filtering in cloud computing. It discusses key concepts like cloud computing, collaborative filtering, and Hadoop. It then covers different types of collaborative filtering like user-based, item-based, memory-based, and model-based approaches. Specific algorithms like nearest neighbor and top-N recommendations are explained. Challenges like cold start, as well as ratings, interfaces, and metrics used to evaluate performance, are also summarized.
This document provides an overview of tag-based social recommender systems. It discusses different types of recommender systems including content-based, collaborative filtering, and hybrid recommender systems. It also describes how recommender systems work using cosine similarity, Pearson correlation, and TF-IDF models. Popular datasets like MovieLens and Flickr, used to implement and evaluate recommender system algorithms, are also summarized.
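The two similarity measures mentioned are closely related: Pearson correlation is cosine similarity applied to mean-centered vectors. A minimal sketch on made-up rating vectors (the data is illustrative only):

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def pearson_sim(a, b):
    """Pearson correlation: cosine similarity after subtracting each user's mean."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine_sim([x - ma for x in a], [y - mb for y in b])
```

Mean-centering is why Pearson is often preferred for ratings: two users who rank items identically but use different parts of the rating scale get a perfect Pearson score, while plain cosine penalizes the offset.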
Slides of the presentation given at the 22nd International Conference on the World Wide Web.
URL: http://www2013.org/program/561-reactive-crowdsourcing/
More information on the Crowdsearcher project available at crowdsearcher.search-computing.com
Introduction to question answering for linked data & big data (Andre Freitas)
This document discusses question answering (QA) systems in the context of big data and heterogeneous data scenarios. It outlines the motivation and challenges for developing natural language interfaces for databases. The document covers the basic concepts and taxonomy of QA systems, including question types, answer types, data sources, and domains. It also discusses the anatomy and components of a typical QA system.
Answering Search Queries with CrowdSearcher: a crowdsourcing and social netwo... (Marco Brambilla)
Web users increasingly rely on social interaction to complete and validate the results of their search activities. While search systems are superior machines for gathering world-wide information, the opinions collected from friends and expert or local communities can ultimately determine our decisions: human curiosity and creativity are often capable of going far beyond the capabilities of search systems in scouting "interesting" results or suggesting new, unexpected search directions. Such personalized interaction usually occurs outside the search systems and processes, possibly instrumented and mediated by a social network; when the interaction is complete and users return to search systems, they do so through new queries, loosely related to the previous search or to the social interaction.
In this paper we propose CrowdSearcher, a novel search paradigm that treats crowds as first-class sources in the information-seeking process. CrowdSearcher aims at filling the gap between generalized search systems, which operate upon world-wide information (facts and recommendations as crawled and indexed by computerized systems), and social systems, capable of interacting with real people, in real time, to capture their opinions, suggestions, and emotions. The technical contribution of this paper is a model and architecture for integrating computerized search with human interaction, showing how search systems can drive and encapsulate social systems. In particular, we show how social platforms such as Facebook, LinkedIn, and Twitter can be used for crowdsourcing search-related tasks; we demonstrate our approach with several prototypes and report on experiments with real user communities.
Choosing the right crowd: expert finding in social networks. EDBT 2013 (Marco Brambilla)
The document discusses using social networks and Q&A websites as platforms for crowd-searching in addition to traditional crowdsourcing platforms. It proposes a model for crowd-searching that utilizes social interactions on these platforms to find experts and get feedback on queries. The model involves initially searching for information, then promoting queries on social platforms to find friends and experts, and aggregating the responses. It provides examples of how this process may work for a job search query. Experimental results showed that questions posted on social networks received more responses than random questions, and that engagement depended on the difficulty and type of task.
Profile-based Dataset Recommendation for RDF Data Linking (Mohamed BEN ELLEFI)
This document summarizes Mohamed Ben Ellefi's PhD thesis defense on profile-based dataset recommendation for RDF data linking. The thesis proposes two approaches: a topic profile-based approach and an intensional profile-based approach. The topic profile-based approach models datasets as topics and recommends target datasets based on similarity between source and target topic profiles, achieving an average recall of 81% and reducing the search space by 86%. The approach shows better performance than baselines but needs improvement on precision.
Developing the korean_internet_network_miner_change (Han Woo PARK)
The document describes the development of an e-research tool, the Korean Internet Network Miner (KINM), to analyze social networks on Korean blogs. It modified existing social network analysis tools to process Korean text and extract networks from blog comments. It tested the tool on over 900 comments from a blog, finding both correctly and incorrectly identified names, and evaluated techniques to improve name disambiguation. The goal is to advance tools for automated social network discovery and analysis of online Korean communities.
13 An Introduction to Stochastic Actor-Oriented Models (aka SIENA) (dnac)
This document provides an introduction to Stochastic Actor-Oriented Models (SAOMs), also known as SIENA models. It discusses when SAOMs are appropriate to use, provides an overview of the general SAOM form, and covers key components like the network and behavior objective functions and rate functions. The presentation also outlines how SAOMs are estimated and fitted to data, provides an empirical example, and discusses extensions. SAOMs model how networks and behaviors change over time as actors make micro-level decisions to maximize their objective functions.
A QA system takes in a natural language question, analyzes it to understand the type of question and information sought, searches structured and unstructured data sources for relevant information, and generates a natural language answer. It consists of modules for question analysis, information retrieval from knowledge bases and documents, answer generation, and response formatting. The goal is to delegate more interpretation work to machines so users can get direct answers to complex questions over heterogeneous data.
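The module chain described above can be sketched as a toy pipeline. This is a deliberately crude illustration of the architecture, not any particular system's implementation: the answer typing, retrieval, and entity data are all hypothetical.

```python
def analyze_question(q):
    """Question analysis: map the wh-word to an expected answer type."""
    first = q.lower().split()[0]
    return {"who": "PERSON", "where": "LOCATION", "when": "DATE"}.get(first, "OTHER")

def retrieve(q, documents):
    """Toy retrieval: rank documents by word overlap with the question."""
    qw = set(q.lower().strip("?").split())
    return max(documents, key=lambda d: len(qw & set(d.lower().split())))

def answer(q, documents, entities):
    """Answer generation: pick an entity of the expected type from the best document."""
    atype = analyze_question(q)
    doc = retrieve(q, documents)
    for ent, etype in entities:
        if etype == atype and ent in doc:
            return ent
    return None
```

Real systems replace each stage with far richer machinery (parsers, knowledge-base lookups, learned rankers), but the question-analysis, retrieval, answer-extraction shape is the same.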
Open domain Question Answering System - Research project in NLP (GVS Chaitanya)
Using a computer to answer questions has been a human dream since the beginning of the digital era. A first step towards such an ambitious goal is to deal with natural language, enabling the computer to understand what its user asks. The discipline that studies the connection between natural language and the representation of its meaning via computational models is computational linguistics. According to that discipline, question answering can be defined as the task that, given a question formulated in natural language, aims at finding one or more concise answers. Improvements in technology and the explosive demand for better information access have reignited interest in QA systems. The wealth of information on the web makes it an attractive resource for seeking quick answers to factual questions such as "Who was the first American in space?" or "What is the second-tallest mountain in the world?", yet today's most advanced web search systems (Bing, Google, Yahoo) make it surprisingly tedious to locate the answers. QA systems aim to develop techniques that go beyond retrieval of relevant documents in order to return exact answers to natural language factoid questions.
The document discusses the emergence of the social web and the relationship between Web 2.0 and the Semantic Web. It describes how blogs, wikis, and social networks enabled new forms of user-generated content and social interaction online in the early 2000s. The document also explains how Semantic Web technologies could enhance Web 2.0 by enabling the standardized exchange and combination of user data and services.
10 More than a Pretty Picture: Visual Thinking in Network Studies (dnac)
Visualization has been important in network science since its beginnings to make invisible structures visible. While metrics can describe networks, visualizations allow researchers to see relationships and patterns across multiple dimensions that numbers alone cannot reveal. Effective network visualizations communicate insights that would be difficult to understand otherwise, by depicting global patterns and local details simultaneously in a way that builds intuition about the network's structure and generating processes. However, challenges include lack of consistent display frameworks, integrating too much multidimensional information, and issues of scale for large and dynamic networks.
From TREC to Watson: is open domain question answering a solved problem? (Constantin Orasan)
The document summarizes a presentation on question answering systems. It begins by providing context on information overload and defining question answering. It then discusses the evolution of QA systems from early databases to today's open-domain systems. The presentation focuses on IBM's Watson system, providing an overview of its unprecedented ability to answer open-domain questions as well as the massive resources required for its development. It concludes by arguing that open-domain QA remains unsolved and that closed-domain, interactive QA may be more practical for real-world applications.
This document discusses considerations for collecting social network data through surveys. It addresses research design elements like defining the relevant population boundaries and sampling approaches. For surveys specifically, it covers informed consent, name generator questions to identify social ties, response formats, and balancing depth of network detail collected versus sample size. The key challenges are defining the theoretical population of interest, collecting a sufficiently large and representative network sample, and designing survey questions that accurately capture social ties within time and resource constraints.
Recommender systems are knowledge-based systems which support human decision-making. In an era of overwhelming choice, they help us decide which products, services, and information to consume. The focus of attention in recommender systems research and development has been on making recommendations to individual consumers. This focus addresses the easier case, but ignores the fact that it is as common, if not more common, for us to consume items in groups such as couples, families, and parties of friends. The choice of a date movie, a family holiday destination, or a restaurant for a celebration meal all require balancing the preferences of multiple consumers.
01 Introduction to Networks Methods and Measures (dnac)
This document provides an introduction to social network analysis. It discusses how networks matter through two fundamental mechanisms: connections and positions. Connections refer to the flow of things through networks, viewing networks as pipes. Positions refer to relational patterns and networks capturing role behavior, viewing networks as roles. The document also covers basic network data structures including nodes, edges, directed/undirected ties, binary/valued ties, and different levels of analysis such as ego networks and complete networks. It provides examples of one-mode and two-mode network data.
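The data structures listed there are small enough to show concretely. A minimal sketch (the example ties are made up) of turning an edge list into an adjacency list, with the directed/undirected distinction the slides describe:

```python
from collections import defaultdict

def adjacency(edges, directed=False):
    """Build an adjacency list from an edge list; mirror each tie if undirected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)
    return adj

def degree(adj):
    """Number of neighbors per node (out-degree in the directed case)."""
    return {n: len(nbrs) for n, nbrs in adj.items()}
```

Valued ties would replace the neighbor sets with weight dictionaries, and a two-mode network would simply keep the two node types on opposite sides of each edge.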
Social media recommendation based on people and tags (final) (es712)
1) The document proposes methods to generate personalized recommendations in social media platforms based on people relationships and tags.
2) An evaluation of three recommendation approaches that utilize direct tags, indirect tags through related items, and incoming tags from other users found that a combination of direct tags and incoming tags most accurately represented a user's interests.
3) A user study tested five recommendation approaches and found that combining people relationships and tags into a user profile achieved the highest ratings for interesting recommendations and lowest for non-interesting items.
Temporal and semantic analysis of richly typed social networks from user-gene... (Zide Meng)
We propose an approach to detect topics, overlapping communities of interest, expertise, trends and activities in user-generated content sites and in particular in question-answering forums such as StackOverflow. We first describe QASM (Question & Answer Social Media), a system based on social network analysis to manage the two main resources in question-answering sites: users and content. We also introduce the QASM vocabulary used to formalize both the level of interest and the expertise of users on topics. We then propose an efficient approach to detect communities of interest. It relies on another method to enrich questions with a more general tag when needed. We compared three detection methods on a dataset extracted from the popular Q&A site StackOverflow. Our method based on topic modeling and user membership assignment is shown to be much simpler and faster while preserving the quality of detection. We then propose an additional method to automatically generate a label for a detected topic by analyzing the meaning and links of its bag of words. We conduct a user study to compare different algorithms to choose a label. Finally we extend our probabilistic graphical model to jointly model topics, expertise, activities and trends. We performed experiments with real-world data to confirm the effectiveness of our joint model, studying user behaviors and topic dynamics.
http://www-sop.inria.fr/members/Zide.Meng/
IEEE ISM 2008: Kalman Graffi: A Distributed Platform for Multimedia Communities
Online community platforms and multimedia content delivery have been merging in recent years. Current platforms like Facebook and YouTube are client-server based, which results in high administration costs for the provider. In contrast, peer-to-peer systems offer scalability and low costs, but are limited in their functionality. In this paper we present a framework for peer-to-peer based multimedia online communities. We identified the key challenges for this new application of the peer-to-peer paradigm and built a plugin-based, easily extensible and multifunctional framework. Further, we identified distributed linked lists as a valuable data structure for implementing user profiles, friend lists, groups, photo albums and more. Our framework aims at providing the functionality of common online community platforms combined with the multimedia delivery capabilities of modern peer-to-peer systems, e.g. direct multimedia delivery and access to a distributed multimedia pool.
This is a walk-through and discussion of Cyclopath, an open-source geo-wiki (a user-editable map) that has been up and running in Minneapolis-St Paul for several years. It's interesting because anyone can add data (points, tags, ratings, notes, even streets) to the map, and the routing algorithm can immediately take the user-added data into account. Cyclopath supports cyclists, but the technology itself is very general and has numerous uses.
http://www-sop.inria.fr/members/Zide.Meng/
IEEE ISM 2008: Kalman Graffi: A Distributed Platform for Multimedia CommunitiesKalman Graffi
Online community platforms and multimedia content delivery have been merging in recent years. Current platforms like Facebook and YouTube are client-server based, which results in high administration costs for the provider. In contrast, peer-to-peer systems offer scalability and low costs, but are limited in their functionality. In this paper we present a framework for peer-to-peer based multimedia online communities. We identified the key challenges for this new application of the peer-to-peer paradigm and built a plugin-based, easily extendible and multifunctional framework. Further, we identified distributed linked lists as a valuable data structure to implement user profiles, friend lists, groups, photo albums and more. Our framework aims at providing the functionality of common online community platforms combined with the multimedia delivery capabilities of modern peer-to-peer systems, e.g. direct multimedia delivery and access to a distributed multimedia pool.
This is a walk-through and discussion of Cyclopath, an open source geo-wiki, a user editable map, that has been up and running in Minneapolis-St Paul for several years. It's interesting because anyone can add data -- points, tags, ratings, notes, even streets -- to the map, and the routing algorithm can immediately take the user-added data into account. Cyclopath supports cyclists, but the technology itself is very general and has numerous uses.
This document provides an introduction to nonparametric methods in machine learning. Nonparametric methods, unlike parametric methods, do not assume a fixed global model and instead allow the model to change locally based on the training data. Common nonparametric techniques described include kernel density estimation, k-nearest neighbors classification and regression, and condensed nearest neighbor algorithms. The document also discusses how to select parameters like k in k-NN and h in kernel density estimation using cross-validation.
Intro to XPages for Administrators (DanNotes, November 28, 2012)Per Henrik Lausten
This document introduces XPages for administrators. It discusses:
- What XPages are and examples of XPages applications
- The administrator's important role in the application lifecycle in helping developers and users
- Tips for maximizing performance such as hardware configuration, server settings, caching, and preloading applications
- Application development best practices including supported Dojo and OneUI versions
- Configuring and administering Domino Directory, Internet sites, and security settings
- Tools for troubleshooting, monitoring, and impressing developers like the Extension Library and demo app
Impact of loyalty programs in retailing business in India for creating long t...Love Suryavanshi
Impact of loyalty programs in retailing business in India for creating long term relationships. Various topics on customer loyalty and loyalty programs have been covered.
Notes from 2016 bay area deep learning school Niketan Pansare
Slide-deck for the lunch talk at IBM Almaden Research Center on Oct 11, 2016.
Abstract: In this lunch talk, I will give a high-level summary of bay area deep learning school which was held at Stanford on Sept 24 and 25. The videos and slides of the lectures are available online at http://www.bayareadlschool.org/. I will also give a very brief introduction of deep learning.
Cognitive Work Assistants - Vision and Open ChallengesHamid Motahari
Cognitive assistants aim to augment human intelligence by performing administrative tasks and providing guidance, advice, and assistance to humans. Key challenges for cognitive assistants include building extensive domain knowledge, adapting to new domains, evaluating system performance, addressing user privacy and trust, and enabling natural language interaction. Developing cognitive assistants that can understand tasks, take proactive actions, and interact contextually remains an important area for future research.
An Optimal Iterative Algorithm for Extracting MUCs in a Black-box Constraint ...Philippe Laborie
We present a non-intrusive iterative algorithm for extracting Minimal Unsatisfiable Cores in black-box constraint networks. The problem can be generalized as the one of finding a minimal subset satisfying an upward-closed property P. If performance is measured as the number of infeasibility property checks, we show that the proposed algorithm, ADEL, is optimal both for small and for large MUCs and that it consistently outperforms existing approaches in between those two extremal cases.
Accelerating the Development of Efficient CP Optimizer ModelsPhilippe Laborie
The IBM Constraint Programming optimization system CP Optimizer was designed to provide automatic search and a simple modeling of discrete optimization problems, with a particular focus on scheduling applications. It is used in industry for solving operational planning and scheduling problems. We will give an overview of CP Optimizer and then describe in further detail a set of features such as input/output file format, warm-start or conflict refinement that help accelerate the development of efficient models.
Deep learning is the fastest growing field in artificial intelligence. It has the potential to transform industries like electricity did 100 years ago. The document highlights five stories of how AI and deep learning are accelerating innovation: 1) Baidu open sourced its deep learning platform PaddlePaddle to attract talent, 2) Machine learning will push data science to increase relevance, 3) UC Berkeley created artificial intelligence graders to cut grading time by 75%, 4) A startup developed an algorithm to recognize objects in photos and link them to items for sale, 5) Deep learning could help football coaches with strategic insights.
This presentation introduces CP Optimizer a model & run optimization engine for solving discrete combinatorial problems with a particular focus on scheduling problems.
SSO - single sign on solution for banks and financial organizationsMohammad Shahnewaz
The document discusses biometric secure single sign-on (SSO) software that can eliminate passwords and increase security for banks and financial services. It allows centralized password management and single sign-on access to applications while protecting data from unauthorized access. The software provides strong authentication through biometrics like fingerprints and smart cards to replace insecure passwords. This reduces help desk calls and protects organizations from costly data breaches.
The document provides information about IBM Business Process Management and Royal Cyber's expertise in this area. It discusses what BPM is and why it is important, as well as Royal Cyber's capabilities in areas like business monitoring, operational decision management, process automation and integrity, and process discovery and design using IBM BPM tools. The document also outlines Royal Cyber's BPM implementation plan, services offerings, support workflow, trainings, and success stories with clients.
Best Practices for Enterprise Social Media Management by the Social Media Dre...Sprinklr
Let’s say you are working at one of the world’s 5000 largest businesses.
You know you need to “be social.”
You are faced with the key question of:
How do we scale social across our entire enterprise in a manageable, measurable, and effective way?
Meet The Social Media Dream Team.
Packaged up in one FREE PDF for you to download, Sprinklr has engaged the Enterprise Social Media Dream Team to help.
Who's on it?
Chris Brogan, Jason Falls, Joseph Jaffe, David Meerman Scott, David Armano, Rohit Bhargava, Mitch Joel, Peter Shankman, Mack Collier, Michael Brito, Jay Baer, Edward Boches, Ann Handley, Nilofer Merchant, Ted Coine, David Weinberger, Shelly Palmer, Mark Earls, Renee Blodget, Augie Ray, Brett Petersel, Ted Rubin, Sarah Evans, Jeff Bullas, Amy Vernon, Matt Dickman, Thomas Baekdal, Venkatesh Rao, Richard Stacy, Hugh MacLeod and Doc Searls.
And what will you learn in this eBook?
The Dream Team covers topics such as:
-Branding in a Social@Scale World
-Content & Conversation to be Social@Scale
-Social@Scale Organizational Models
-Tools and Tactics to be Social@Scale
-How to Think Social@Scale
The document describes a proposed approach for inferring implicit topical interests of users on Twitter. It discusses related work on detecting user interests from social media using bag-of-words, topic modeling, and bag-of-concepts approaches. The proposed approach models user interests as a graph-based link prediction problem over a heterogeneous graph incorporating user followerships, explicit interests, and topic relatedness. It evaluates different variants of the model and finds semantic relatedness of topics to be most effective for identifying implicit user interests.
Klout as an Example Application of Topics-oriented NLP APIsTyler Singletary
Klout in its iterations is a prime example of leveraging large scale NLP data science with topical assignment. Klout makes this available through its website, http://klout.com, and also through its developer API, http://developers.klout.com
Lyft developed Amundsen, an internal metadata and data discovery platform, to help their data scientists and engineers find data more efficiently. Amundsen provides search-based and lineage-based discovery of Lyft's data resources. It uses a graph database and Elasticsearch to index metadata from various sources. While initially built using a pull model with crawlers, Amundsen is moving toward a push model where systems publish metadata to a message queue. The tool has increased data team productivity by over 30% and will soon be open sourced for other organizations to use.
A Flexible Recommendation System for Cable TVFrancisco Couto
1. The document proposes a flexible recommendation system for cable TV to address issues like information overflow and dissatisfaction from users.
2. It describes extracting implicit feedback from users and engineering contextual features to create a large-scale dataset for learning recommendations.
3. An evaluation of the recommendation system shows that a learning to rank approach with contextual information outperforms other methods in accuracy while maintaining diversity and novelty, though recommending new programs requires more investigation.
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...Sri Ambati
This document discusses how Quora uses machine learning to improve user experience. It summarizes that Quora aims to share and grow the world's knowledge using machine learning algorithms for tasks like answer ranking, feed ranking, topic and user recommendations, related question detection, and spam detection. It describes how Quora uses features about users, content, and their interactions to build models like logistic regression, decision trees and neural networks to complete these tasks at scale and through experimentation.
(1) Amundsen is a data discovery platform developed by Lyft to help users find, understand, and use data.
(2) The platform addresses challenges around data discovery such as lack of understanding about what data exists and where to find it.
(3) Amundsen provides searchable metadata about data resources, previews of data, and usage statistics to help data scientists and others explore and understand data.
The document summarizes key topics from a recommender systems conference, including:
1. Many major companies like Netflix, Quora, and Amazon consider recommendations to be a core part of their user experience.
2. Adaptive and interactive recommendations were discussed, including how Netflix personalizes content rows based on a user's predicted mood.
3. Text modeling algorithms like word2vec were discussed for generating recommendations from content like tweets, search queries, or product descriptions.
Relationships Matter: Using Connected Data for Better Machine LearningNeo4j
Relationships are highly predictive of behavior, yet most data science models overlook this information because it's difficult to extract network structure for use in machine learning (ML).
With graphs, relationships are embedded in the data itself, making it practical to add these predictive capabilities to your existing practices.
That’s why we’re presenting and demoing the use of graph-native ML to make breakthrough predictions. This will cover:
- Different approaches to graph feature engineering, from queries and algorithms to embeddings
- How ML techniques leverage everything from classical network science to deep learning and graph convolutional neural networks
- How to generate representations of your graph using graph embeddings, create ML models for link prediction or node classification, and apply these models to add missing information to an existing graph/incoming data
- Why no-code visualization and prototyping is important
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleXavier Amatriain
The document summarizes Netflix's approach to machine learning and recommender systems. It discusses how Netflix uses algorithms like SVD and Restricted Boltzmann Machines on a massive scale to power highly personalized recommendations. Over 75% of what people watch on Netflix comes from recommendations. Netflix collects a huge amount of data from over 40 million subscribers and uses both offline, online, and nearline computation across cloud services to train models and power recommendations in real-time at scale. The key is combining more data, smarter models, accurate metrics, and optimized system architectures.
[CS570] Machine Learning Team Project (I know what items really are)Kunwoo Park
This document summarizes a team's approach to predicting which items users might be interested in using a recommendation system. It describes extracting features from user and item metadata to train an SVM model, but this was too computationally expensive. Instead, the team used logistic regression with stochastic gradient descent. They tested features like age, gender and network similarities. Their combined model outperformed random prediction baselines on the KDD Cup 2012 Track 1 dataset.
Cikm 2013 - Beyond Data From User Information to Business ValueXavier Amatriain
- The document discusses Netflix's approach to using data and algorithms to provide personalized recommendations to users. It summarizes Netflix's transition from simple ranking algorithms to personalized recommendations based on user behavior data.
- Netflix runs hundreds of A/B tests on algorithms and designs simultaneously to evaluate how changes impact user engagement and retention. Both online and offline testing is used to evaluate recommendations before and after deployment.
- A variety of algorithms are used for recommendations, including matrix factorization, restricted Boltzmann machines, and learning to rank approaches. Feature engineering and algorithm development are ongoing areas of research at Netflix to improve diversity, novelty, and accuracy of recommendations.
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationTamikaTannis
Tamika Tannis gave a presentation on Lyft's open source data discovery tool called Amundsen. She discussed Lyft's data ecosystem and the challenges of data discovery. Amundsen addresses these challenges through search, metadata, and visualization capabilities powered by a graph database backend. The tool has been hugely successful at Lyft and an active open source community is contributing to its ongoing development and new features.
This document summarizes a study that analyzed meme diffusion on Twitter and developed an agent-based model to simulate meme competition for limited user attention. The study collected Twitter retweet data, analyzed statistical properties of meme diffusion networks and user behavior, and found that users' limited attention and social network structure were sufficient to reproduce the observed patterns. The model demonstrated that strong or weak user attention failed to match real-world data, indicating attention level affects meme popularity distributions.
Immersive Recommendation incorporates cross-platform and diverse personal digital traces into recommendations. Our context-aware topic modeling algorithm systematically profiles users' interests based on their traces from different contexts, and our hybrid recommendation algorithm makes high-quality recommendations by fusing users' personal profiles, item profiles, and existing ratings. The proposed model showed significant improvement over the state-of-the-art algorithms, suggesting the value of using this new user-centric recommendation model to improve recommendation quality, including in cold-start situations.
Amundsen is a metadata-driven application developed by Lyft to solve data discovery challenges. It provides a search-based UI and uses a distributed architecture with various microservices to index and serve metadata from multiple sources. Key components include a metadata service using Neo4j, a search service using Elasticsearch, and a frontend. The tool has been hugely successful at Lyft and is now open source. Future work includes expanding metadata coverage and integrating with other tools.
2017 10-10 (netflix ml platform meetup) learning item and user representation...Ed Chi
1) Learning user and item representations is challenging due to sparse data and shifting preferences in recommender systems.
2) The presentation outlines research at Google to address sparsity through two approaches: focused learning, which develops specialized models for subsets of data like genres or cold-start items, and factorized deep retrieval, which jointly embeds items and their features to predict preferences for fresh items.
3) The techniques have improved overall viewership and nomination of candidates, demonstrating their effectiveness in production recommender systems.
The document discusses Lyft's data discovery tool called Amundsen. It provides an overview of Amundsen's architecture including its use of a graph database and Elasticsearch for metadata storage and search. It describes the challenges of data discovery that Amundsen addresses like time spent searching for data. The document outlines Amundsen's key components like its databuilder, metadata and search services. It discusses Amundsen's impact and popularity at Lyft and its open source community. Future roadmap plans include additional metadata types and deeper integrations with other tools.
Summary of a Recommender Systems Survey paperChangsung Moon
This is the summary of the following paper:
J. Bobadilla, F. Ortega, A. Hernando and A. Gutierrez, “Recommender Systems Survey,” Knowledge Based Systems, Vol. 26, 2013, pp. 109-132.
Similar to Scalable Topic-Specific Influence Analysis on Microblogs (20)
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI Model Garden powered experiences and learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We are going to cover how to use the API to: execute prompts in text and chat; cover multimodal use cases with image prompts; fine-tune and distill to improve knowledge domains; and run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using generative AI industry trends.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
I am Yuanyuan Tian from IBM Research. Today, in this talk I will focus on one recent project on scalable topic-specific influence analysis. I chose this project to talk about because it touches on both the analytics and the systems aspects of research, so it is a good representative of the kind of work we are doing in the Information Management group at IBM Almaden.
Microblogging services, such as Twitter (twitter.com), have gained tremendous popularity in recent years, and a large amount of microblog data has been accumulated over time. According to a March 2012 report, Twitter had over 500 million users creating over 340 million tweets daily. Note that the data contains both textual content, reflected in the tweets, and social relationship information, reflected in the follower and followee relationships. The rich text and social information in microblogs has become a popular resource for marketing campaigns to monitor the opinions of consumers on particular products and to launch viral advertising. Identifying key influencers in microblogs is required for such marketing activities.
Although a lot of work has been done on social influence analysis, most of these studies infer influence only from the network structure, while ignoring the valuable text content that the users created. One of the most well-known studies is the influence maximization work by Jon Kleinberg et al. As a result, the learned influence of each user is only global, with no way to assess the influence in a particular aspect of life (topic). For example, no one can deny that President Obama is a key influencer in general, but if you want to advertise a database product, he is unlikely to be influential, whereas a database expert, who would not be identified as a general key influencer, probably has more say on this subject. So, what we want is topic-specific influence analysis that can differentiate influence in different aspects of life, or different topics. In order to do that, we need to analyze not just the network structure but also the valuable textual content.
A number of PageRank-based methods, such as Topic-Sensitive PageRank [13] and TwitterRank [25], are able to compute per-topic rank scores, but they require a separate process to create the topics from the text, and then, for each topic, apply the influence analysis using the network structure. As content and links are related to each other in a microblog network, this separation between the analysis of content and the analysis of network structure usually leads to inferior performance. A few existing works can analyze content and network together, such as Link-LDA. However, they were all designed for citation and hyperlink networks, and assume that links are solely caused by content. This assumption clearly does not apply to microblogs, since it is prevalent for a user to follow celebrities simply because of their fame and stardom, with nothing to do with what he/she actually tweets about.
The goal of this work is to support searching for topic-specific key influencers on microblogs. Ultimately, we want to provide a search framework where a user can simply type in keywords to express his/her topic of interest, or a combination of topics, and the search engine returns a ranked list of users who are influential in the corresponding subjects. In order to do that, we need to first correctly model topic-specific influence in microblog networks, and then learn the influence efficiently. These two tasks are the focus of this talk. I will also briefly talk about how to put everything together in a search framework for topic-specific influencers.
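The keyword-to-influencer search the talk describes could be sketched as follows. This is an illustrative sketch, not the paper's actual framework: the aggregation rule, the `word_topic` and `influence` tables, the function name, and all numbers are assumptions. The idea is to map the query keywords to topic weights via a trained topic model, then score each user by their expected topic-specific influence.

```python
def rank_influencers(query_words, word_topic, influence, top_n=3):
    """Rank users for a keyword query (hypothetical framework sketch).

    word_topic[w]    : list of p(topic t | word w) from a trained topic model
    influence[t][u]  : probability that user u is followed for topic t
    """
    n_topics = len(influence)
    n_users = len(influence[0])
    # Map the query keywords onto aggregate topic weights
    topic_weight = [0.0] * n_topics
    for w in query_words:
        probs = word_topic.get(w)
        if probs is None:
            continue  # an unseen keyword contributes nothing
        for t in range(n_topics):
            topic_weight[t] += probs[t]
    # Score each user by expected topic-specific influence under the query
    scores = [sum(topic_weight[t] * influence[t][u] for t in range(n_topics))
              for u in range(n_users)]
    return sorted(range(n_users), key=lambda u: -scores[u])[:top_n]

# Toy example: 2 topics (tech, food), 3 users; user 2 dominates the tech topic
word_topic = {"web": [0.9, 0.1], "cloud": [0.8, 0.2], "recipe": [0.05, 0.95]}
influence = [[0.1, 0.2, 0.7],   # tech followee distribution over users 0..2
             [0.6, 0.3, 0.1]]   # food followee distribution over users 0..2
print(rank_influencers(["web", "cloud"], word_topic, influence))  # → [2, 1, 0]
```

A tech-flavored query ranks user 2 first, while a query on "recipe" would surface user 0, who dominates the food topic.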
To meet the computational challenge posed by rapidly growing microblog data, we propose a distributed Gibbs sampling algorithm for the FLDA model. Then we incorporate our proposed method into a general search framework for topic-specific influencers. After that, I will present some experimental results. Now, let's first talk about the new FLDA model.
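As background for the inference step, here is a minimal collapsed Gibbs sampler for plain LDA, not FLDA itself; the toy corpus, priors, and dimensions are illustrative assumptions, and the full FLDA sampler would additionally resample the link-related latent variables. Each sweep removes a token's current topic assignment from the count tables and resamples it from its full conditional.

```python
import random

random.seed(0)

K, ALPHA, BETA = 2, 0.5, 0.5   # number of topics and symmetric priors (illustrative)

# Tiny toy corpus (hypothetical words)
docs = [["web", "cloud", "web", "data"],
        ["food", "recipe", "food", "cook"],
        ["cloud", "data", "web", "cloud"]]
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Count tables for the collapsed sampler, plus random initial assignments
ndk = [[0] * K for _ in docs]          # doc-topic counts
nkw = [[0] * V for _ in range(K)]      # topic-word counts
nk = [0] * K                           # per-topic token totals
z = []                                 # current topic assignment per token
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        t = random.randrange(K)
        z[d].append(t)
        ndk[d][t] += 1
        nkw[t][vocab.index(w)] += 1
        nk[t] += 1

def sweep():
    """One Gibbs sweep: resample every token's topic from its full conditional."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            wi, t = vocab.index(w), z[d][i]
            ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1   # remove the token
            weights = [(ndk[d][k] + ALPHA) * (nkw[k][wi] + BETA) / (nk[k] + V * BETA)
                       for k in range(K)]
            t = random.choices(range(K), weights=weights, k=1)[0]
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1   # add it back

for _ in range(50):
    sweep()
```

Distributing such a sampler typically means partitioning users across workers and periodically synchronizing the global count tables, which is the kind of engineering the talk alludes to.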
Before I go into the details of the topic-specific model, let me first provide the intuition behind the model. In a microblog network, each user has both content, which is a bag of words, and link structure, which is a set of followees. A user tweets on multiple topics. Based on the content of Alice, it looks like she likes to tweet about technology and food. A topic, in turn, can be viewed as a mixture of words. E.g., the words web and cloud are likely to appear in the technology topic. As for the relationships among the users in a microblog network, there are different reasons why a user follows another. Sometimes, a followship is content-based, because the followee tweets on similar topics. Other times, it is completely content-independent, because it is very prevalent for a user to follow a celebrity just because of the fame and stardom, with nothing to do with what the follower tweets about. Finally, each topic has a mixture of followees. In other words, given a topic, some users are more likely to be followed than others. For example, Mark Zuckerberg is more likely to be followed for the technology topic. We measure topic-specific influence by the probability of a user u being followed for a given topic. So, if a user has a higher probability of being followed given a topic t, he/she is also more influential in that topic.
To correctly model topic-specific influence on microblogs, we propose a new Bayesian model, called Followship-LDA (FLDA). It is a Bayesian generative model that extends Latent Dirichlet Allocation (LDA). We call it a generative model because it specifies a probabilistic procedure, based on the intuition just described, by which the content and links of each user in a microblog network are generated. To explain the creation of content and links, we introduce some hidden structure, or latent variables, in the generative process, including the topics, the reasons for followships, and the topic-specific influence. Finally, given the model and the observed data, the goal is to reverse the generative process and find out what hidden structure is most likely to have generated the observed data.
Now let's look at what hidden structures we have introduced in the generative model. Each user tweets on different topics, so each user has a topic distribution indicating how likely he/she is to tweet on each topic. Suppose there are 3 latent topics: tech, food and politics. For example, user Alice tweets about tech 80% of the time, food 19% of the time, and politics the remaining 1% of the time. Next, each topic is a mixture of words, so each topic has a per-topic word distribution, indicating how likely different words are to be used in this topic. For example, for the tech topic, the word web will be used 30% of the time, cookie 10% of the time, etc. As I mentioned before, there are different reasons why a user follows another, so each user has a followship preference. For example, Alice follows for content 75% of the time, and the other 25% of the time she follows for popularity. Since some followships are content-based, for each topic we have a followee distribution, indicating the probability of a user being followed for a given topic. For example, for the tech topic, Mark Zuckerberg will be followed 70% of the time. Each number in this table is the probability of a user being followed by someone given a topic. This is exactly the topic-specific influence score we are after: if a user has a higher probability of being followed given a topic, then this user has a higher influence on this topic. Finally, we also have a global followee distribution, indicating the probability of a user being followed for content-independent reasons. For example, if a followship is totally content-independent, then 50% of the time Obama will be the followee. It measures the global popularity of each user.
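To make the hidden structures concrete, here is a minimal sketch of them as NumPy arrays. The variable names (theta, phi, pi, sigma, tau) and the sizes are my own illustrative choices, not notation from the paper.

```python
import numpy as np

# Illustrative shapes only: M users, K topics, V vocabulary words.
M, K, V = 1000, 3, 5000

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(K), size=M)  # per-user topic distribution (M x K)
phi   = rng.dirichlet(np.ones(V), size=K)  # per-topic word distribution (K x V)
pi    = rng.dirichlet(np.ones(2), size=M)  # per-user followship preference
                                           # (content-based vs. popularity)
sigma = rng.dirichlet(np.ones(M), size=K)  # per-topic followee distribution (K x M)
tau   = rng.dirichlet(np.ones(M))          # global followee distribution (M,)

# sigma[t, u] is the probability that user u is followed for topic t --
# the topic-specific influence score; tau[u] is global popularity.
assert np.allclose(theta.sum(axis=1), 1.0)
```

Each row of each matrix is a probability distribution, which is why every row sums to 1.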
Now let's describe the generative process of FLDA. This figure shows the plate notation of the FLDA model. (The boxes represent replication: the outer box represents repetition over the users, the inner right box represents repeated generation of words, and the left inner box represents repeated creation of links.) Don't worry if you don't know plate notation; you can still understand the generative process. The process first repeats for each user. For the m-th user, say Alice, it first picks a per-user topic distribution from a Dirichlet prior. For example, …. In addition, it picks a per-user followship preference for this user. We then generate the content of this user. To generate each word, we first choose a topic based on the topic distribution, say Tech, then pick a word to represent this topic from the per-topic word distribution. In our example, web is chosen. The process continues for the remaining words. After the content generation, we generate the followees. For each followee, we first choose the reason for the followship based on the per-user followship preference. For example, this followship is content-based. Then we pick a topic based on the same topic distribution as in the content generation, followed by picking a followee who well addresses the picked topic from the per-topic followee distribution. For a different link, the generative process may decide that the followship is content-independent. In this case, a followee is chosen based on the global popularity distribution. So, this is how content and links are created in the generative process of FLDA.
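The steps above can be sketched in a few lines of code. This is an illustrative simulation of the generative story for one user, assuming the distributions named on the slide (per-user topic distribution theta, per-topic word distribution phi, followship preference pi, per-topic followee distribution sigma, global popularity tau); all names, shapes and hyperparameters are mine.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, V = 100, 3, 50                          # users, topics, vocabulary (toy sizes)
theta = rng.dirichlet(np.ones(K), size=M)     # per-user topic distribution
phi   = rng.dirichlet(np.ones(V), size=K)     # per-topic word distribution
pi    = rng.dirichlet(np.ones(2), size=M)     # P(content-based), P(popularity-based)
sigma = rng.dirichlet(np.ones(M), size=K)     # per-topic followee distribution
tau   = rng.dirichlet(np.ones(M))             # global followee distribution

def generate_user(m, n_words=20, n_links=5):
    words, links = [], []
    for _ in range(n_words):
        t = rng.choice(K, p=theta[m])             # choose a topic for this word
        words.append(rng.choice(V, p=phi[t]))     # choose a word from that topic
    for _ in range(n_links):
        if rng.random() < pi[m, 0]:               # content-based followship
            t = rng.choice(K, p=theta[m])         # same topic distribution as content
            links.append(rng.choice(M, p=sigma[t]))  # followee who addresses topic t
        else:                                     # content-independent followship
            links.append(rng.choice(M, p=tau))    # followee chosen by global popularity
    return words, links

words, links = generate_user(0)
```

The key modeling point this sketch shows is that content-based links reuse the user's own topic distribution, tying the text and the link structure together.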
Now that we have the topic-specific influence model, let's look at how to learn the model based on the observed data.
FLDA specifies a probabilistic procedure, with introduced latent variables, to generate the content and links of each user in a microblog network. Now, given the observed text and links, we want to find out the various distributions of the latent variables. We use Gibbs sampling for the inference because it is the most widely used approach for approximating the distribution of latent variables, especially for high-dimensional data.
To learn the various distributions in the FLDA model, we use Gibbs sampling. Gibbs sampling is a Markov chain Monte Carlo algorithm that approximates the distributions of latent variables based on the observed data. The process usually starts with some initial value for each variable, then iteratively samples each variable conditioned on the current values of the remaining variables, and updates the variable with its new value. This repeats for hundreds of iterations, and at the end, the produced samples are used to approximate the distributions of the latent variables. The key and also the most challenging part of Gibbs sampling is to derive the conditional distribution of each latent variable. Here are the derived conditional probabilities for the FLDA model. They look very complicated, but essentially, what we derived are…
Gibbs sampling is a widely used approach to Bayesian inference. We also use it here to learn the various distributions.
Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations approximately drawn from a specified multivariate probability distribution. This sequence can be used to approximate the joint distribution and the marginal distributions of the latent variables. The Gibbs sampling algorithm generates an instance from the distribution of each variable in turn, conditional on the current values of the other variables. We begin with some initial value for each variable; then, for each sample, we sample each variable from its conditional distribution, i.e., from the distribution of that variable conditioned on all other variables, making use of the most recent values and updating each variable with its new value as soon as it has been sampled. The samples then approximate the joint distribution of all variables. Observed data is incorporated into the sampling process by creating a separate variable for each piece of observed data and fixing that variable to its observed value, rather than sampling it. The distribution of the remaining variables is then effectively a posterior distribution conditioned on the observed data. A collapsed Gibbs sampler integrates out (marginalizes over) one or more variables when sampling for some other variable.
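The "sample each variable given the others" loop is easiest to see on a toy target. Here is a minimal, self-contained Gibbs sampler for a standard bivariate normal with correlation rho, where both full conditionals are known in closed form; this illustrates the generic mechanism only, not FLDA itself.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                     # target correlation of the bivariate normal
x, y = 0.0, 0.0               # arbitrary initial values
samples = []
for it in range(5000):
    # Full conditionals of a standard bivariate normal:
    # x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    if it >= 1000:            # discard burn-in iterations
        samples.append((x, y))

samples = np.array(samples)
corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]  # should be close to rho
```

The same skeleton, with the FLDA conditional probabilities plugged in for the normal conditionals, is the sequential inference algorithm described on this slide.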
Now I will give an overview of the Gibbs sampling for FLDA. After initializing all the latent variables, the algorithm executes in iterations. In each iteration, the algorithm makes one pass over the data. During each pass, for each user, we first sample on all the words: for the n-th word of the m-th user, we have the observed value of the word, and we sample a new topic assignment t'. After all words, we sample on the followees. For each followee, we first sample a new followship preference; if the preference is content-based, we then sample a topic for the link. During the sampling process, we keep track of a number of counters, because they are used in the definitions of the conditional distributions for the sampling. After the sampling process, we again use the counters to estimate the posterior distributions of the latent variables, such as the per-user topic distributions.
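To show how the counters drive both the sampling and the final estimates, here is a sketch of one collapsed-Gibbs word update in the LDA style (the word side of the pass described above). The counter names, hyperparameters alpha and beta, and the single-user setup are illustrative assumptions, not the paper's exact derivation.

```python
import numpy as np

K, V = 3, 50                  # topics, vocabulary size (toy)
alpha, beta = 0.1, 0.01       # assumed Dirichlet hyperparameters
rng = np.random.default_rng(0)

n_mk = np.zeros(K)            # counter: words of user m assigned to each topic
n_kv = np.zeros((K, V))       # counter: word v assigned to topic k, over all users
n_k  = np.zeros(K)            # counter: total words assigned to topic k

def sample_topic(v, old_t=None):
    """Resample the topic assignment of one occurrence of word v."""
    if old_t is not None:     # remove the word's current assignment from counters
        n_mk[old_t] -= 1; n_kv[old_t, v] -= 1; n_k[old_t] -= 1
    # Conditional distribution over topics, defined entirely by the counters.
    p = (n_mk + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
    t = rng.choice(K, p=p / p.sum())
    n_mk[t] += 1; n_kv[t, v] += 1; n_k[t] += 1   # record the new assignment
    return t

t = sample_topic(v=7)
# After sampling, the same counters yield the posterior estimates,
# e.g. the per-user topic distribution:
theta_m = (n_mk + alpha) / (n_mk.sum() + K * alpha)
```

The link side of the pass works analogously, with counters over followship preferences and per-topic followee assignments.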
The Gibbs sampling process we have described so far is inherently sequential: each sampling step relies on the most recent values of all the other variables. A sequential algorithm does not scale. For example, the sequential algorithm would run for 21 days on a high-end server (192 GB RAM, 3.2 GHz processor) for a Twitter data set with 1.8M users, 2.4B words and 183M links. So, we definitely need to parallelize the computation. But how can we parallelize an inherently sequential process? We notice that for our problem there are a huge number of words and a large number of links, so the dependency between different variables is relatively weak. We therefore relax the sequential constraint and propose a distributed Gibbs sampling method. We implemented our distributed algorithm on top of Spark, a distributed processing framework for iterative machine learning workloads developed by Berkeley AMPLab.
Here is an overview of how it works. In a Spark cluster, there is a master and a number of workers. We partition the set of users onto the different workers. For each user, we hold the last topic assignment of each word, and the last preference assignment and last topic assignment for each link. Finally, each user holds user-local counters, such as the number of times a word is assigned to a topic for this user. The master is responsible for keeping track of the global counters, e.g. the number of times a word is assigned to a topic across all users. The algorithm executes in a number of iterations. At the beginning of each iteration, the master first broadcasts the global counters to all workers. Then each worker uses the global counters to sample and update the data of all its users. As the process goes on, the global counters become out of date. So, at the end of the iteration, the workers send their new local counters to the master, and the master uses them to update the global counters. After that, the next iteration begins.
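The broadcast/update cycle can be sketched schematically. This in-process simulation stands in for the cluster: on Spark, the broadcast and the collection of worker counters would be distributed operations, and `worker_pass` would run the real per-user sampling; here it just produces dummy count deltas. All names are illustrative.

```python
import numpy as np

K, V, n_workers = 3, 50, 4
global_nkv = np.zeros((K, V))        # master's global word-topic counters

def worker_pass(global_counts, seed):
    """One worker's pass over its user partition (dummy sampling stand-in)."""
    local_rng = np.random.default_rng(seed)
    delta = np.zeros_like(global_counts)
    for _ in range(100):             # stand-in for sampling this partition's words
        t = local_rng.integers(K)    # a real worker would use global_counts here
        v = local_rng.integers(V)    # to evaluate the conditional distributions
        delta[t, v] += 1
    return delta

for iteration in range(5):
    broadcast = global_nkv.copy()    # master -> workers: broadcast global counters
    deltas = [worker_pass(broadcast, seed=(iteration, w)) for w in range(n_workers)]
    global_nkv += sum(deltas)        # workers -> master: fold local updates back in
```

Within an iteration every worker samples against the same (slightly stale) global counters, which is exactly the relaxation of the sequential constraint mentioned on the previous slide.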
The distributed algorithm seems pretty simple, but there are a number of issues that we need to handle with special care. First of all, we need to ensure the fault tolerance of the algorithm. Because Gibbs sampling requires hundreds of iterations, we don't want to restart whenever a failure occurs. At the time we developed the distributed algorithm, Spark relied on lineage for fault tolerance, so if a worker failed, the computation had to restart from the very beginning. This is not desirable for an algorithm that needs to run hundreds of iterations, so we implemented a checkpointing mechanism in Spark. (We found that the latest release of Spark has since introduced its own checkpointing mechanism.) The second issue is the frequency of global synchronization. As I mentioned before, we synchronize all the global counters in each iteration. We also tried various other frequencies and found that even when synchronizing every 10 iterations, the quality of the result is not affected. Another more subtle issue is the random number generation. Because workers are spawned at roughly the same time, if we just used the Java random number generators, we would have correlations between the pseudo-random numbers generated across the workers. This would jeopardize the quality of the returned results. To guarantee the correctness of the distributed Monte Carlo simulation, we need provably independent multiple streams of uniform numbers. We used a long-period, jump-ahead random number generator. The last issue deals with efficient local computation: we took extra care to take advantage of the local memory hierarchy and to avoid random memory access by sampling in a particular order.
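To illustrate the per-worker stream issue: modern libraries provide independent streams directly. The sketch below uses NumPy's `SeedSequence.spawn`, which derives non-overlapping child streams from one root seed; it plays the same role as the jump-ahead generator mentioned above, though by a different mechanism (this is my substitution for illustration, not the generator the work actually used).

```python
import numpy as np

n_workers = 4
root = np.random.SeedSequence(12345)
# spawn() derives statistically independent child seed sequences, one per worker,
# avoiding the correlated streams you get from naive time-based or equal seeding.
worker_rngs = [np.random.default_rng(s) for s in root.spawn(n_workers)]

draws = [rng.random(3) for rng in worker_rngs]
# With naive identical seeding, every worker would produce the same numbers;
# spawned streams differ.
assert not np.allclose(draws[0], draws[1])
```

Each worker keeps its own generator for the lifetime of the job, so its draws never collide with another worker's stream.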
Finally, to put everything together, we incorporate our FLDA model into a search framework for topic-specific influencers, called SKIT. Using SKIT, a user simply enters a set of keywords to express his/her interest, and SKIT returns a list of key influencers that satisfy the user's intent. SKIT is a general search framework: besides FLDA, it can also plug in other key-influencer methods, such as Link-LDA, topic-sensitive PageRank, and TwitterRank. Here, I will only focus on how it is implemented using FLDA. To support the search, SKIT first needs to derive the topics of interest from the keywords. In FLDA, we can simply treat the keywords as the content of a new user and use the "folding-in" approach to quickly compute the topic distribution of this new user. Each value indicates the probability that the query is on a particular topic. From FLDA, we also get the per-topic influence score for each user. So, to compute the influence score of a user for the keyword query, we simply compute the weighted sum across all topics. At the end, we sort the users by their influence scores and return the top influencers.
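The scoring step reduces to one matrix-vector product. In this sketch, `sigma` stands in for the learned per-topic influence scores and `q_topics` for the query's folded-in topic distribution; both are illustrative random placeholders rather than outputs of a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 100                                # topics, users (toy sizes)
sigma = rng.dirichlet(np.ones(M), size=K)    # sigma[t, u]: influence of user u on topic t
q_topics = np.array([0.7, 0.2, 0.1])         # query topic distribution from folding-in

# Influence of each user on the query: weighted sum over topics.
scores = q_topics @ sigma                    # shape (M,)
top_influencers = np.argsort(scores)[::-1][:10]  # ranked list of top-10 user ids
```

Because the per-topic scores are precomputed at training time, answering a query only costs the folding-in step plus this weighted sum and a sort.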
Now I will present some performance numbers.
We first check whether the top influencers returned by our method make sense. Here, we use a Twitter dataset crawled in 2010. It contains …. In this table, we show some example topics with their top keywords and top influencers produced by FLDA. (We named the topics for better presentation.) Intuitively, it is clear that the influencers are very relevant to the corresponding topics. For example, one would expect O'Reilly publishers, Gartner research, and popular software bloggers to be influential for an IT-related topic, and pro-cycling athletes, a pro-cycling team and a team director to be influential for a topic related to cycling and running. FLDA separated the "globally" popular users from the content-specific influencers, and estimated that 15% of all links were content-independent. In other words, 15% of the time, these popular users were followed regardless of what people tweet about. The top-5 globally popular users are singers, actors and talk show hosts.
Anecdotal evidence is very hard to generalize and quantify. Luckily, the 2012 KDD Cup provided us with the data needed to objectively measure the quality of FLDA and other approaches. This Tencent Weibo dataset contains ….. A very nice feature of the Weibo dataset is the set of provided VIP users. These VIP users are organized in categories; one example category is …. The VIP users are manually labelled "key influencers" in their corresponding category, so we can use them as "ground truth" to evaluate the quality of the results. Note that the categories do not necessarily align with the topics we detected in FLDA; each represents a combination of topics. If we use all the content of a VIP user as the search input, then we can check how many of the top-k results are other VIP users in the same category and use the percentage as the precision. Overall, we compute the mean average precision over all the VIP users across all categories. This chart compares FLDA with existing approaches: Link-LDA, TSPR and TwitterRank. As we can see, FLDA produces significantly better results than the existing approaches: it is over 2x better than TSPR and TwitterRank, and 1.6x better than Link-LDA. We also compared the result of the sequential algorithm against the distributed algorithm; they produce comparable results, which confirms our intuition.
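For completeness, here is a minimal sketch of the evaluation metric: average precision of one ranked result list against the same-category VIP set, with mean average precision being the mean over all VIP queries. The function and the toy inputs are illustrative.

```python
def average_precision(ranked, relevant):
    """AP of a ranked list: mean of precision@i at each relevant hit."""
    hits, score = 0, 0.0
    for i, user in enumerate(ranked, start=1):
        if user in relevant:
            hits += 1
            score += hits / i          # precision at this rank
    return score / len(relevant) if relevant else 0.0

# Toy check: relevant users at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
ap = average_precision(["a", "b", "c"], {"a", "c"})

# In practice, mean AP averages this over every VIP user used as a query:
mean_ap = sum([ap]) / 1
```

Rewarding hits near the top of the list is what makes this metric suitable for a search task, where early ranks matter most.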
Now that we know the sequential and the distributed algorithms produce comparable results, we next compare their execution times. The sequential algorithm was run on a high-end server … and the distributed algorithm was run on a Spark cluster with 27 servers. Again, we ran on the Weibo and Twitter datasets. Note that the Twitter dataset is the larger one: although it has fewer users, it has significantly more words and links. This table shows the execution times for the sequential and distributed FLDA algorithms running 500 iterations. For the Weibo dataset, the distributed algorithm reduces the runtime from 4.6 days to 8 hours, whereas for the Twitter dataset, it reduces the runtime from 21 days to 1.5 days, more than an order of magnitude faster.
We now evaluate the scalability of the distributed algorithm along three dimensions: data size, number of topics, and number of concurrent workers. We explore a wide range of data sizes (from 12.5% all the way up to 100%), numbers of topics (from 25 to 200) and numbers of workers (from 25 to 200). The figure shows that distributed FLDA scales well along all dimensions.
To summarize, we proposed a novel FLDA model for topic-specific influence analysis. The model combines content and link structure in the same generative process and is able to differentiate the different reasons why one user follows another. To apply FLDA to a web-scale microblog network, we designed a distributed Gibbs sampling algorithm for FLDA. Finally, the FLDA model is incorporated into a proposed general search framework for topic-specific key influencers. Through experiments on two real-world microblog datasets, we demonstrate that FLDA significantly outperforms state-of-the-art methods in terms of precision. Furthermore, the distributed Gibbs sampling algorithm for FLDA provides excellent speed-up to hundreds of workers.