This document discusses using machine learning and data analysis techniques to improve community management and content moderation on an online question and answer site. It describes building models to automatically rate answers, challenges faced such as lack of training data and complex user behavior, and efforts taken to address these such as collecting more data, using more complex neural network models, and developing tools for moderators. It also covers expanding these techniques to related tasks like question rating, building a taxonomy of tags, and using insights from data and models to further improve the user experience.
18. Learnings
● Kafka is crazy robust
● Parquet accelerates reads
● Spark faster than Hadoop (development and production)
● Backpressure needed for MySQL and Elasticsearch
● Reliable Analytics and A/B Testing pipeline
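The backpressure point above can be illustrated with a bounded queue: when the sink is slower than the producer, the producer blocks instead of piling up writes. This is a minimal sketch under stated assumptions, not the pipeline from the talk. `slow_sink_write` is a hypothetical stand-in for a MySQL or Elasticsearch bulk write, and the queue and batch sizes are arbitrary.

```python
import queue
import threading
import time


def slow_sink_write(batch, written):
    """Hypothetical stand-in for a slow MySQL/Elasticsearch bulk write."""
    time.sleep(0.001)
    written.extend(batch)


def run_pipeline(events, max_pending=4, batch_size=10):
    """Push events through a bounded queue so the producer blocks
    whenever the sink falls behind (backpressure)."""
    pending = queue.Queue(maxsize=max_pending)  # bounded: put() blocks when full
    written = []

    def consumer():
        while True:
            batch = pending.get()
            if batch is None:  # sentinel: the producer is done
                break
            slow_sink_write(batch, written)

    worker = threading.Thread(target=consumer)
    worker.start()
    for start in range(0, len(events), batch_size):
        # Blocks here while max_pending batches are still waiting.
        pending.put(events[start:start + batch_size])
    pending.put(None)
    worker.join()
    return written
```

The bounded `maxsize` is the whole mechanism: an unbounded queue would accept every batch immediately and only move the overload into memory.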
23. Learnings
● Tableau easy to learn...
● … but hard to master
● Spark Thrift Server crashes all the time (has anybody had the same experience?)
● Non-dev departments love it
26. Goal: Automatically rate answers
● Sort answers for each question
● Hide bad answers
● Report really bad answers
-> Algorithm to rate answers
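Once an algorithm produces a score per answer, the three goals above reduce to a sort and two thresholds. The sketch below assumes scores in [0, 1]; the function name and both threshold values are illustrative assumptions, not values from the talk.

```python
def triage_answers(scored_answers, hide_below=0.3, report_below=0.1):
    """Sort answers by score and bucket the low-scoring ones.

    scored_answers: list of (answer_id, score) pairs, score in [0, 1].
    hide_below / report_below are made-up thresholds for illustration.
    """
    ranked = sorted(scored_answers, key=lambda pair: pair[1], reverse=True)
    visible, hidden, reported = [], [], []
    for answer_id, score in ranked:
        if score < report_below:
            reported.append(answer_id)  # really bad: send to moderators
        elif score < hide_below:
            hidden.append(answer_id)    # bad: collapse by default
        else:
            visible.append(answer_id)   # shown, best first
    return visible, hidden, reported
```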
27. Solution: Logistic Regression
● Supervised Machine Learning algorithm
● 2000 training & test examples
● Trained first and simple model
● Brought it to production
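For readers unfamiliar with the method, a first logistic regression model of this kind can be sketched in plain Python. This is an assumption-laden illustration, not the production model: the two features, the toy data, and the hyperparameters are all made up.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, x):
    """Probability that an answer with feature vector x is good."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_samples = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Made-up toy data: [normalized length, normalized up-votes] -> good (1) / bad (0)
toy_features = [[0.1, 0.0], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7], [0.15, 0.2], [0.7, 0.8]]
toy_labels = [0, 0, 1, 1, 0, 1]
weights, bias = train_logistic_regression(toy_features, toy_labels)
```

Because the model is a single linear boundary over its inputs, it cannot separate answers whose features look alike but whose quality differs.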
29. Problems
● Model not complex enough
● Similar inputs, different outputs
● Not enough training data
● Missing definition of a good answer
-> Collect more data
-> Use more features
-> More complex model
35. Project Angmar
● Tried a lot of Supervised Learning Methods
● Feature Engineering: Most crucial part
● Analyse the domain, chart everything
36. Features
● Content
○ length
○ syntactic complexity
○ number of links
○ probability of deletion
● Social
○ votes
○ most helpful answer
○ number of comments
○ answered by expert
● Author
○ gained votes
○ credibility score
○ role
○ ratio of deleted answers
○ number of answers
○ number of comments
○ ratio of reported answers
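The three feature groups on this slide can be held as typed records and flattened into one numeric vector. The sketch below is only one possible layout: the field types are guesses, and `role` is left out because the slide does not say how that categorical value was encoded.

```python
from dataclasses import dataclass, astuple


@dataclass
class ContentFeatures:
    length: int
    syntactic_complexity: float
    number_of_links: int
    probability_of_deletion: float


@dataclass
class SocialFeatures:
    votes: int
    most_helpful_answer: bool
    number_of_comments: int
    answered_by_expert: bool


@dataclass
class AuthorFeatures:
    # `role` is omitted: its encoding is not given in the source.
    gained_votes: int
    credibility_score: float
    ratio_of_deleted_answers: float
    number_of_answers: int
    number_of_comments: int
    ratio_of_reported_answers: float


def to_vector(content, social, author):
    """Flatten the three groups into one list of floats (booleans become 0.0/1.0)."""
    return [float(value) for group in (content, social, author) for value in astuple(group)]
```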
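37. The winner: A simple Neural Net
● Answer vector (AV): raw features such as wordCount, voteUp, voteDown (example values: 21, 3, 1, 2)
● Normalized before entering the network (example values: 0.2, 0.4, 0.1, 0.8)
● Input layer: n nodes
● Hidden layer: 2n nodes
● Output layer: a single score (example: 0.2)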
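The shape of the network on this slide (n normalized inputs, a hidden layer of 2n nodes, one score as output) can be sketched as a forward pass in plain Python. The sigmoid activation, the min-max normalization, and the random untrained weights are assumptions made for illustration; the slide does not state which were used.

```python
import math
import random


def min_max_normalize(value, low, high):
    """Scale a raw feature into [0, 1]; values outside [low, high] are clipped."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, seed=0):
    """Random weights for an n -> 2n -> 1 network (untrained, illustration only).

    The last entry of every weight row is the bias.
    """
    rng = random.Random(seed)
    n_hidden = 2 * n_inputs
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output


def score(network, inputs):
    """Forward pass: normalized answer vector in, score in (0, 1) out."""
    hidden, output = network
    activations = [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in hidden
    ]
    return sigmoid(sum(w * a for w, a in zip(output[:-1], activations)) + output[-1])
```

A call such as `score(init_network(3), [0.2, 0.4, 0.1])` returns a value between 0 and 1; with untrained weights that value carries no meaning until the network is fitted to labeled answers.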
54. Goal: Directed Acyclic Graph
● Model hierarchy
● Use our content
○ Co-occurrence with top tags
○ Repeat with those tags
○ Manual refinement
● Use Cases:
○ Recommender
○ Answer Score
○ Search
○ Experts
[Diagram: example tag graph with the nodes root, Computer, Sport, Games, Fußball, Fifa 17]
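The co-occurrence approach on this slide (start from top tags, attach the tags that most often appear with them, repeat with those tags) can be sketched as a level-by-level expansion. Everything here is an assumption for illustration: the function name, the parameters, and the toy questions are made up, and the edges it produces are not taken from the slide's diagram. Edges only ever point from one level to the next, which keeps the result acyclic while still letting a tag have several parents.

```python
from collections import Counter


def build_tag_graph(tagged_questions, top_tags, depth=2, children_per_tag=1):
    """Return {parent: [children]} built from tag co-occurrence counts."""
    edges = {}
    placed = set(top_tags)      # tags that already have a level in the graph
    frontier = list(top_tags)
    for _ in range(depth):
        new_tags = []
        for parent in frontier:
            counts = Counter()
            for tags in tagged_questions:
                if parent in tags:
                    # Count only tags that are not yet placed at an earlier level.
                    counts.update(t for t in tags if t not in placed)
            children = [tag for tag, _ in counts.most_common(children_per_tag)]
            if children:
                edges[parent] = children
            for child in children:
                if child not in new_tags:
                    new_tags.append(child)
        placed.update(new_tags)
        frontier = new_tags
    return edges


# Made-up questions, each represented by its set of tags.
toy_questions = [
    {"Computer", "Games"},
    {"Computer", "Games", "Fifa 17"},
    {"Sport", "Fußball"},
    {"Sport", "Fußball", "Fifa 17"},
    {"Games", "Fifa 17"},
    {"Fußball", "Fifa 17"},
]
example_edges = build_tag_graph(toy_questions, top_tags=["Computer", "Sport"])
```

On this toy data "Fifa 17" ends up under both "Games" and "Fußball", which is why a directed acyclic graph fits better than a tree. The output would still need the manual refinement step listed above.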