We extracted a high volume of discussion from an online help forum and were able to significantly predict behavior based on the types of topics users were discussing. We are currently applying this to reduce churn and increase ARPU at a major telco provider.
This presentation includes an introduction to semantic analysis using Radim Řehůřek's Gensim and D3. We will discuss the statistical principles of LDA, its application using IPython Notebook, and interrogation of the results using the D3 framework.
BIO: Conor Duke is the Insights manager at Fabrikatyr Analytics. He likes coffee, doing stuff for social good and the outdoors. @conr / ie.linkedin.com/in/conorduke
Predicting Consumer Behaviour via Hadoop - Skillspeed
This Hadoop Tutorial will unravel the complete Introduction to Big Data and Hadoop, HDFS, Predictive Analytics & Applications. Additionally, we will also extensively cover MapReduce & Usage.
At the end, you'll have strong knowledge regarding Predicting Consumer Behaviour via Hadoop.
PPT Agenda
✓ Introduction to Big Data & Hadoop
✓ Hadoop Characteristics
✓ Hadoop Ecosystem
✓ Predictive Analysis
✓ Applications of Predictive Analysis
✓ MapReduce Scenarios
✓ Traditional vs MapReduce Solutions
✓ Advantages of MapReduce
----------
What is Hadoop?
Hadoop is an open source Java-based programming framework that supports the processing of large data sets across clusters of distributed commodity servers. It enables you to store, process and gain insight from big data at low cost and huge scale.
----------
The Hadoop ecosystem includes the following components:
1. MapReduce
2. The Hadoop Distributed File System (HDFS)
3. Apache Hive
4. HBase
5. Zookeeper
----------
Applications of Predictive Analysis
1. Analytical Customer Relationship Management (CRM)
2. Decision support systems
3. Customer satisfaction & retention
4. Direct marketing
5. Fraud detection
6. Risk management & assessment
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
New Directions 2015 – Changes in Content Best Practices - dclsocialmedia
The Center for Information-Development Management (CIDM) and Data Conversion Laboratories (DCL) announce the results of our 2015 Industry Trends Survey. Comparisons with these surveys in previous years provide you with a comprehensive view of what is the same and what is changing in technical information best practices.
Content Recommendation using factorisation machines; PyCon Ireland 2016 - Conor Duke
Short talk on deploying factorisation machines using Dato / Turi to recommend content across a social network. Compares FM with other recommendation algorithms.
This panel event will be about understanding the do's and don'ts of inbound, to ensure you convert your leads rather than just build your web traffic.
The Consequences of a Faceless Social Media Campaign
Heidi Grimwood - Campaign specialist Social Media Skills Club
The journey to modern marketing
Peter Reynolds - Marketing Cloud Alliances for Western Europe ; Oracle
Inbound Marketing - Secrets to Success
Alan Lynam - International Account Manager at HubSpot
Using predictive analytics to increase consumer response rate - PyCon Irelan... - Conor Duke
We have recently taken millions of inbound and outbound text messages and used predictive techniques to increase the user response rate to outbound messaging.
The talk will include data wrangling, stack set-up, feature creation and extraction, and various machine learning techniques and their application to the business problem.
Topic Modelling: Tutorial on Usage and Applications - Ayush Jain
This is a tutorial on topic modelling techniques - that informs the reader about the basic ingredients of all topic models, and allows them to develop a new model in the end.
Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Pe... - Alexis Perrier
In this presentation I show how to apply topic modeling techniques to a Twitter feed using gensim and Python, comparing several algorithms: LSA, LSA ...
Paper presentation for the final course Advanced Concept in Machine Learning.
The paper is "Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data"
http://jmlr.org/proceedings/papers/v32/chenf14.pdf
Topic Modelling on the Enron Email Corpus @ ODSC 13 Apr 2016 - Jonathan Sedar
Slides from my presentation at ODSC and Python Quants meetup in London on 13 Apr 2016. Very lightly covering a demo project on topic modelling and network analysis of the Enron Email Corpus.
Avito recsys-challenge-2016 - RecSys Challenge 2016: Job Recommendation Based on... - Vasily Leksin
These slides describe our solution for the RecSys Challenge 2016. In the challenge, several datasets were provided from XING, a social network for business. The goal of the competition was to use these data to predict job postings that a user will interact positively with (click, bookmark or reply). Our solution to this problem includes three different types of models: Factorization Machine, item-based collaborative filtering, and content-based topic model on tags. Thus, we combined collaborative and content-based approaches in our solution.
Our best submission, which was a blend of ten models, achieved 7th place in the challenge's final leaderboard with a score of 1677898.52. The approaches presented in this paper are general and scalable. Therefore they can be applied to another problem of this type.
Big Data and Marketing: Data Activation and Management - Conor Duke
Data Management and Activation
Crevan O’Malley – Evangelist, Oracle Marketing Cloud
Modern Marketers rely on data-driven marketing solutions to deliver more personalised customer experiences across every channel—helping attract and retain the ideal customers who become brand advocates. Discover how to aggregate, enrich, and analyze all your customer data on a single data management platform.
Why Marketers need to know about Data
Tara Grehan - Managing Director at Datalytics
Despite starting out as a qualitative researcher, roles and projects frequently brought me back to data. And so I decided to tackle it and have developed some interesting insights into data management along the way.
Having worked in Marketing both agency and client side for fifteen years now in a variety of roles from Market Research and Customer Insights to Change Management, being comfortable with data has made all the difference and this evening I’ll tell you why.
Using Big Data to Grow on a Budget
Michael Waldron - Marketing and Sales Manager at AYLIEN
AYLIEN is an Artificial Intelligence content analysis startup and Mike will be speaking on their growth journey over the past 6 months, with a focus on how they have delivered growth by optimising their budget, focusing on the data points that matter and which points to obsess over through the marketing funnel.
Predictive analytics is touching more and more lives every day. Machine Learning lets you predict and change the future. Do you know that Microsoft products like Xbox and Bing integrate some machine learning capabilities in their workflows? Come to the session and take a look at the new cloud-based machine learning platform called AzureML from a BI architect perspective, without all the data scientist knowledge.
Delivered at Pittsburgh Tech Fest - 6/10/2017
Knowledge is power, but is it if you're not using it? What if the application you delivered to your customers was extremely intelligent? It could retrieve, analyze and use the massive amounts of data that businesses are generating at an astronomical rate.
It could analyze business deals, predict potential issues, proactively recommend business decisions and estimate profit, loss and risks.
Those things provide direct benefits to your company. Churning through that data by hand doesn't. Enter Azure Machine Learning.
In this session you will learn how to integrate Azure Machine Learning into your existing applications and workflows with REST services. You will learn how to deliver a modular, maintainable solution to your customers that allows them to analyze their data.
You will learn:
* Numerous ways to abstract business rules, workflows, AI (Machine Learning) and more into your applications
* How to integrate Azure Machine Learning into your existing applications and processes
* How to create Azure Machine Learning experiments
* How to retrieve the score from an Azure Machine Learning experiment and integrate it into your applications and processes
* How to integrate numerous Machine Learning experiments from the Azure Machine Learning Marketplace into your existing applications and processes
* Various concepts for abstracting and managing services and APIs
Building machine learning muscle in your team and transitioning them to doing machine learning at scale. We also discuss Spark and other relevant technologies.
10 Limitations of Large Language Models and Mitigation Options - Mihai Criveti
10 Limitations of Large Language Models and ways to overcome them. Dealing with hallucinations, performance,
costs, stale training data, injecting private data, token limits and contextual memory, text conversion, lack of
transparency, ethical concerns and training costs.
The Heart of Data Modeling: The Best Data Modeler is a Lazy Data Modeler - DATAVERSITY
We're under pressure to do more with fewer resources. And organizations are often short on experienced data modelers. So why should we spend time doing things that can be done by robots? Well, not robots, but automation.
In this month's webinar, Karen demonstrates the types of automation techniques available in leading data modeling tools such as ERwin, ER/Studio and PowerDesigner. She will also leave you with 10 tips on being more lazy. What webinar last promised that, anyway?
Doing Analytics Right - Building the Analytics Environment - Tasktop
Implementing analytics for development processes is challenging. As discussed in the previous webinars, the right analytics are determined by the goals of the organization, not by the available data. So implementing your analytics solutions will require an efficient analytics and data architecture, including the ability to combine and stage data from heterogeneous sources. An architecture that excludes the ability to gain access to the necessary data will create a barrier to deploying your newly designed analytics program, and will force you back into the “light is brighter here” anti-pattern.
This webinar will describe the technical considerations of implementing the data architecture for your analytics program, and explain how Tasktop can help.
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C... - Michael Mortenson
This presentation covers recent research into definitions of analytics through analysis of related job adverts. The results help us identify a new categorisation of analytics methodologies, and we discuss the implications for the operational research community.
Text Analytics can be used in business for various purposes. Business managers, and students, should have a clear idea of the use cases and a sound general understanding of the technical basics to be competent for business innovation and development. This set of slides (excerpts) is my approach to teach the subject. Comments welcome.
Artificial intelligence and data stream mining - Albert Bifet
Big Data and Artificial Intelligence have the potential to
fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from data streams has been recognized
as one of the most exciting and key opportunities for both academia
and industry. Advanced analysis of big data streams from sensors and
devices is bound to become a key area of artificial intelligence
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an overview
of data stream mining, industrial applications, open source tools, and
current challenges of data stream mining.
1. Text mining – Text mining or text data mining is a process to e.docx - stilliegeorgiana
1. Text mining – Text mining or text data mining is a process to extract high-quality information from the text. It is done through patterns and trends devised using statistical pattern learning. Firstly, the input data is structured. After structuring, patterns are derived from this structured data and finally, the output is evaluated and interpreted. The main applications of text mining include competitive intelligence, E-Discovery, National Security, and social media monitoring. It is a trending topic for the thesis in data mining.
Some research needs
Problem definition – In the first phase problem definition is listed i.e. business aims and objectives are determined taking into consideration certain factors like the current background and future prospective.
Data exploration – Required data is collected and explored using various statistical methods along with identification of underlying problems.
Data preparation – The data is prepared for modeling by cleansing and formatting the raw data in the desired way. The meaning of data is not changed while preparing.
Modeling – In this phase the data model is created by applying certain mathematical functions and modeling techniques. After the model is created it goes through validation and verification.
Evaluation – After the model is created, it is evaluated by a team of experts to check whether it satisfies business objectives or not.
Deployment – After evaluation, the model is deployed and further plans are made for its maintenance. A properly organized report is prepared with the summary of the work done.
Research paper Policy
· APA format
. https://apastyle.apa.org/
. https://owl.purdue.edu/owl/research_and_citation/apa_style/apa_formatting_and_style_guide/general_format.html
· Min number of pages are 15 pages
· Must have
. Contents with page numbers
. Abstract
. Introduction
. The problem
4. Are there any sub-problems?
4. Is there any issue need to be present concerning the problem?
. The solutions
5. Steps of the solutions
. Compare the solution to other solution
. Any suggestion to improve the solution
. Conclusion
. References
· Missing one of the above will result -5/30 of the research paper
· Paper does not stick to the APA will result in 0 in the research paper
Spring 2020 Name: ______________________________
MATH 175 – Test 2 (Show Your Work )
7. Given $\cos 2\theta = -\frac{5}{18}$ and $180^\circ < \theta < 270^\circ$, find values of $\sin\theta$ and $\cos\theta$.
8. Verify that each of the following is a trigonometric identity.
$$\frac{1-\sin\theta}{1+\sin\theta} = \sec^2\theta - 2\sec\theta\tan\theta + \tan^2\theta$$
9. Give the exact value of $\cos\left(2\arctan\frac{4}{3}\right)$ without using a calculator.
10. Solve
2cos2cos2
qq
=
for all exact solutions in degrees.
Information Systems for Business and Beyond (2019)
Information System.
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th... - Michael Mortenson
This presentation covers recent research into definitions of analytics through analysis of related job adverts. The results help us identify a new categorisation of analytics methodologies, and we discuss the implications for the operational research community.
Understanding voice of the member via text mining - Chi-Yi Kuan
The 14th Text Analytics Summit - June 15, 2015 in New York
Today, businesses around the world are collecting tremendous amounts of unstructured data in the form of text – from multiple channels such as product reviews, market research, customer care conversations, and social media. In this talk, we will share how LinkedIn has built a text-mining platform to derive insights and create value for our members from the massive amount of data we have within our ecosystem. We will cover the following topics in our talk:
1) Topic modeling
2) Text categorization using NLP features
3) Topic-based sentiment analysis and attribution
The talk will be appropriate for business leaders, researchers and practitioners.
Delivered @ MusicCityCode 6/2/2017
Similar to Topic Modelling to identify behavioral trends in online communities (20)
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
3. Explanation of Topic Modelling
A BRIEF INTRODUCTION TO THE SEMANTIC WEB
08 Apr 2015 - TOPIC MODELLING FOR HUMANS - SUBJECTIVE MODELLING OF ONLINE DISCUSSION
4. Why is it Important?
• Discover topics in large groups of documents
• Use these labels to understand the body of text and documents more effectively
What is Semantic Analysis?
Some use cases:
• Consumer Insight
• Recommender
• Social Media Monitoring
5. What is Topic Modelling?
Grouping documents based on the probability of words occurring in each document
http://people.cs.umass.edu/~wallach/talks/priors.pdf
6. Transforming raw data to insight for a particular audience is not about algorithms alone
Data
Insight
Good Data Science makes ‘The Gap’ as small as possible
7. Finding the most suitable application of Topic modelling for ‘discussion’ is critical
Topic modelling
• Semantic: Subject matter corpus, General Corpus
• Statistical: Word probability, Paragraph structure, Word distance
• Mixture of all?
Analysing political debate discourse has the following issues:
• Few ‘training’ texts
• Highly variable sentence length
• Distinct word distributions
Statistical word probability has readily available implementations and can resolve these challenges.
8. What is Gensim?
Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim aims at processing raw, unstructured digital texts (“plain text”).
• Offers more precise modelling options than ‘topicModels’ in R or MALLET
• Wider function set
• Somewhat complex to optimise
• Dependencies: numpy and scipy
Radim Řehůřek
9. Application using Gensim
HOW TO USE GENSIM TO UNDERSTAND LARGE VOLUMES OF TEXT EFFECTIVELY
10. Preparing the data
• No data set is ever ready to operate on ‘out-of-the-box’
• Challenges included:
  • Character encoding
  • Multiple fields in a column
  • Timestamps
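The three preparation challenges listed on this slide can be illustrated with a small standard-library sketch. The export layout below (a Latin-1 encoded CSV, a `user_device` column packing two fields with `|`, and day/month/year timestamps) is entirely hypothetical and is not the format used in the talk; `load_posts` is an illustrative helper name.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export, invented for illustration only.
RAW_BYTES = (
    "post_id,user_device,posted_at,text\n"
    "1,anna|iPhone,08/04/2015 09:30,Café wifi not working\n"
    "2,brian|BlackBerry,09/04/2015 14:05,How do I keep my old number?\n"
).encode("latin-1")


def load_posts(raw_bytes, encoding="latin-1"):
    """Decode the export, split packed fields and parse timestamps."""
    text = raw_bytes.decode(encoding)  # challenge 1: character encoding
    posts = []
    for row in csv.DictReader(io.StringIO(text)):
        # challenge 2: two fields packed into one column
        user, device = row["user_device"].split("|", 1)
        posts.append({
            "post_id": int(row["post_id"]),
            "user": user,
            "device": device,
            # challenge 3: timestamps stored as free text
            "posted_at": datetime.strptime(row["posted_at"], "%d/%m/%Y %H:%M"),
            "text": row["text"],
        })
    return posts


posts = load_posts(RAW_BYTES)
```

Decoding explicitly, rather than relying on the platform default, is what keeps accented characters such as the "é" above intact before any tokenising is done.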
11. What is a Text Corpus and a ‘Bag-of-Words’?
Bag-of-words (BOW) converts each response into a set of unordered single words
This method:
• does not parse sentences,
• does not care about word order, and
• does not "understand" grammar or syntax
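A minimal sketch of the bag-of-words idea in plain Python follows. The stopword list, the helper names and the two sample responses are made up for illustration. Gensim's `corpora.Dictionary` and its `doc2bow` method produce the same kind of sparse `(token_id, count)` pairs, but this sketch deliberately avoids the library so that it stays self-contained.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real project would use a fuller one.
STOPWORDS = {"the", "a", "to", "is", "my", "i", "on", "how", "do"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_vocabulary(token_lists):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return a sparse bag-of-words vector: sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Hypothetical forum responses.
responses = [
    "My phone is roaming, roaming data is on",
    "How do I transfer my old number to the new sim",
]
token_lists = [tokenize(r) for r in responses]
vocab = build_vocabulary(token_lists)
corpus = [to_bow(tokens, vocab) for tokens in token_lists]
```

Word order is discarded: only how often each word id occurs in a response survives, which is all that the word-probability models discussed in this deck need.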
12. The optimum number of topics can be selected by calculating the model with the smallest measure of Chaos / Entropy
• Harmonic mean: “sum of lowest average probability” for each topic distribution
• AIC: balance of “harmonic mean” against model complexity
• Entropy: least amount of disorder in the topics
Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’
Local minima provide a chance to explore the trade-off between granularity and consistency
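The slide does not say which distributions are compared for each candidate number of topics, so the sketch below only covers the two generic pieces: a Kullback–Leibler divergence between two discrete distributions, and a scan for local minima over per-model scores. The `scores` values are invented, and the sketch assumes that `q` is non-zero wherever `p` is non-zero.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised divergence, so the order of the arguments does not matter."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def local_minima(scores):
    """Return the topic counts whose score is lower than both neighbours."""
    ks = sorted(scores)
    return [
        k for prev, k, nxt in zip(ks, ks[1:], ks[2:])
        if scores[k] < scores[prev] and scores[k] < scores[nxt]
    ]


# Hypothetical divergence score per candidate number of topics (made-up values).
scores = {5: 0.92, 10: 0.61, 15: 0.70, 20: 0.48, 25: 0.55, 30: 0.58}
candidates = local_minima(scores)
```

With these made-up scores both 10 and 20 topics are local minima, which is the situation the slide describes: the analyst then chooses between them according to how many topics can sensibly be named.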
13. Latent Dirichlet Allocation
LDA repeatedly examines the probability of the words in each response and establishes ‘common sets’ (topics)
14. The words associated with each topic can be extracted
Each comment is assigned to a single topic
lda.print_topic extracts the words in each topic
np.max gets the most likely topic for each comment
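Assigning each comment to its single most likely topic can be sketched as follows. The per-comment topic distributions are hypothetical, and the helper is written in plain Python. With NumPy, `np.argmax(..., axis=1)` would give the topic index and `np.max(..., axis=1)` the corresponding probability.

```python
def most_likely_topic(topic_probs):
    """Return (topic_index, probability) for the highest-probability topic."""
    best = max(range(len(topic_probs)), key=lambda i: topic_probs[i])
    return best, topic_probs[best]


# Hypothetical per-comment topic distributions; each row sums to 1.
doc_topics = [
    [0.10, 0.75, 0.15],
    [0.60, 0.25, 0.15],
    [0.20, 0.20, 0.60],
]
assignments = [most_likely_topic(row) for row in doc_topics]
```

Keeping the probability alongside the index makes it possible to flag comments whose best topic is only weakly supported.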
15. Sample Results
INTERROGATING A COMMUNITY FORUM DISCUSSION
16. How to use it?
There are 7 key stages to model topics effectively
1. Collate text
2. Create corpus
3. Create ‘bag of words’
4. Optimum topics
5. Establish keyword groupings
6. Name topics
7. Visualise

1. Get data
2. Create corpus
3. Feature review
4. Optimum topics
5. Review
6. Name & visualise
7. Deliver insight
17. Sample set: 11.3K posts to a Telco help forum
Corpus
• 5,000 questions
• 3,000 users
• 3 years of data
Data features
• Kudos
• Device
• Thread size
• User age
• Views
• Maximum user posts
18. Classifying users will help identify admin versus users
19. We then use ‘Regression Forest’ to further identify post features which drive ‘Views’
20. After removing the ‘Admin’ outlier, ‘Kudos’ seems to be the driving feature
[Chart features: Kudos, Response no, User Age, Thread Size]
21. Optimum topic number across the different user segments ensures our grouping assumptions are reasonable
Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to ‘name’
Local minima provide a chance to explore the trade-off between granularity and consistency
22. Number of posts in each topic and length of post
We examine the structure of the corpus and the lengths of the posts to validate our model
[Chart: response count and length of post for topics T1 to T18]
23. Word probability distributions, corpus and domain knowledge allow for topics to be named
Topic  Topic name                 Word tokens and probability
11     Internet setting           internet 30%, setting 27%, data 31%, phone 12%, work 5%
12     Number Transfer            number 36%, 48 35%, sim 14%, support 6%, old 3%
13     General new account query  phone 26%, sim 29%, go 24%, solution 9%, solved 6%
14     Roaming                    text 23%, roaming 25%, call 22%, send 8%, eu 5%
15     General chat               im 13%, like 12%, think 10%, good 4%, dont 2%
16     Referral Bonus             press 27%, key 32%, navi 31%, highlight 15%, select 9%
17     Network Issues             network 12%, phone 11%, problem 13%, im 5%, internet 3%
18     Blackberry Problems        problem 11%, blackberry 11%, mine 13%, get 5%, thanks 2%
24. Posts get ‘views’ for any number of reasons; we need to identify which topics are important
Using Random Forest for predicting ‘Views’
Topic name                 Topic number
Internet setting           11
Number Transfer            12
General new account query  13
Referral Bonus             16
Network Issues             17
Blackberry Problems        18
Only 5 topics drive views
This suggests these topics get ‘repeat’ visits
These are NOT the most ‘viewed’ topics, but the ones which people refer to
[Chart: topics shown in the order 16, 18, 13, 12, 17, 11]
25. We then compare posts on key topics over time to understand the patterns
26. Using ‘Named Entity Recognition’ (NER), Topic Modelling can be used to understand how consumers are interacting with brands
Brand mentions occur in only 2% of the entire corpus, making any assignment of topics trivial
27. Conclusion
THINGS TO THINK ABOUT
28. 2nd Generation of ‘Listening’ tools will be less metric and more Qualitative
29. Context is Key
Blind application of complex modelling will yield results which deliver incorrect classification
The final deliverable and key features must be defined before embarking on the analysis
30. There is an infinite amount of data; harvesting it is the key
32. Comparison of LDA implementations
The ‘Number of Topics’ is the key parameter; however, a few other parameters are also important:
• Learning rate (decay): to ‘bootstrap’ small bodies of text
• ‘Passes’ of the Bayesian sampling function can also affect the model
• Priors matter: a function of document count and length
• Gensim in Python currently has the most extensive set of parameters; however, topicmodels in R has some good visualisation examples
• ‘Online’ LDA implementations are crucial for ‘social listening’ to evolving political commentary
‘Honourable mention’ implementations:
• Vowpal Wabbit: machine learning
• Mallet: focus on text modelling
• Stanford: great resource
33. The Model still needs to be visualised
By visualising the word distributions in each topic we understand them better
Again we use Kullback-Leibler divergence to map the topics against each other. Each word has a measure of saliency.
Saliency is a compromise between a word's overall frequency and its distinctiveness. A word's distinctiveness is a measure of that word's distribution over topics.
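The saliency measure described on this slide can be sketched in a few lines. The slide gives no formula, so this assumes one commonly used definition: distinctiveness is the Kullback–Leibler divergence between P(topic | word) and the overall P(topic), and saliency is P(word) multiplied by that distinctiveness. The two-topic model and the three words are hypothetical, and every word is assumed to have non-zero overall probability.

```python
import math


def saliency(word_given_topic, topic_prob):
    """Saliency per word from rows of P(word | topic) and a list of P(topic)."""
    n_words = len(word_given_topic[0])
    result = []
    for w in range(n_words):
        # Overall word probability: P(w) = sum over topics of P(w | t) * P(t)
        p_w = sum(row[w] * p_t for row, p_t in zip(word_given_topic, topic_prob))
        # Posterior P(t | w) by Bayes' rule
        p_t_given_w = [row[w] * p_t / p_w
                       for row, p_t in zip(word_given_topic, topic_prob)]
        # Distinctiveness: KL divergence between P(t | w) and P(t)
        distinct = sum(
            ptw * math.log(ptw / p_t)
            for ptw, p_t in zip(p_t_given_w, topic_prob) if ptw > 0
        )
        result.append(p_w * distinct)
    return result


# Hypothetical two-topic model over three words: "phone", "roaming", "sim".
word_given_topic = [
    [0.5, 0.4, 0.1],  # topic 0
    [0.5, 0.1, 0.4],  # topic 1
]
topic_prob = [0.5, 0.5]
scores = saliency(word_given_topic, topic_prob)
```

In this toy model "phone" is the most frequent word but is equally likely under both topics, so its saliency is zero, while "roaming" and "sim" are rarer but each points clearly at one topic.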
34. Why Priors Matter!
Careful thinking about priors can yield new insights
– e.g., priors and stopword handling are related
For LDA the choice of prior is surprisingly important:
– Asymmetric prior for document-specific topic distributions
– Symmetric prior for topic-specific word distributions
Almost all work on LDA uses symmetric Dirichlet priors
– Two scalar concentration parameters: α and β
● Concentration parameters are usually set heuristically
● Some recent work on inferring optimal concentration parameter values from data (Asuncion et al., 2009)
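To make the difference between a symmetric and an asymmetric Dirichlet prior concrete, the sketch below draws document-level topic distributions from both, using only the standard library. The concentration values are made up for illustration. In Gensim the corresponding `LdaModel` arguments are `alpha` (document-topic prior) and `eta` (topic-word prior), but nothing here depends on the library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_topics = 5

# Symmetric prior: every topic is equally likely a priori.
symmetric_alpha = [0.1] * n_topics
# Asymmetric prior (hypothetical values): some topics are expected more often.
asymmetric_alpha = [1.0, 0.5, 0.2, 0.1, 0.05]

sym_doc = sample_dirichlet(symmetric_alpha, rng)
asym_docs = [sample_dirichlet(asymmetric_alpha, rng) for _ in range(2000)]

# Average share of each topic across the documents drawn from the asymmetric prior.
mean_share = [sum(d[k] for d in asym_docs) / len(asym_docs) for k in range(n_topics)]
```

Under the asymmetric prior the first topic takes a much larger average share than the last, which is how such a prior lets a few general-purpose topics absorb common words across many documents.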
Editor's Notes
In gensim, a corpus is an iterable that returns its documents as sparse vectors. (A sparse vector is just a compact way of storing large vectors that are mostly zeroes.)
For LDA the choice of prior is surprisingly important:
– Asymmetric prior for document-specific topic distributions
– Symmetric prior for topic-specific word distributions