12. Iterating on a title
Original (50.4%): Gmail API Pull out plain text email body
Suggested (63.9%): How can I extract plain text from an email sent to
me from a specific source, without forwarding the email to myself?
Edited (74.0%): How can I extract plain text from an email sent to me?
Hi, I’m Tennesse Joyce, and I built a Chrome extension called TitleWave that will improve the titles of your Stack Overflow questions.
Stack Overflow is ubiquitous in the programming world as a place where people can ask questions to a community of other programmers. In 2019, over 5,000 questions a day were asked, but only 70% of those were answered. To increase your chance of getting an answer, it’s really important to have a compelling title so that people actually click on your question. But this can be tough, especially for new users who aren’t familiar with the conventions on the website.
To solve this problem, I built TitleWave, a Chrome extension that integrates directly into the Stack Overflow website and helps improve your title. Let’s see how it works.
So this is the webpage on Stack Overflow where you can submit a new question, and I’ve just copied one that someone asked last week as an example. My Chrome extension adds the two buttons right here. When I press ‘Evaluate title’, it tells me the probability that my question will get answered, just by looking at the title. And when I press ‘Suggest a title’, it reads through the text down here and summarizes it into what it thinks the title should be. That takes about a minute on my laptop, so I’m just going to paste in the output. When I press ‘Evaluate title’ again, we see that the suggested title has about a 14 percentage point higher chance of getting answered.
So how does this work on the backend? First, I needed to collect a bunch of previous questions, which are available on the Stack Exchange Data Explorer, and you can just download that as a big XML file.
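As a rough illustration of that first step, here is a minimal sketch of streaming questions out of a Stack Exchange style `Posts.xml` file with Python's standard library. The attribute names (`PostTypeId`, `Title`, `Body`, `AnswerCount`, `AcceptedAnswerId`) follow the public dump format; the function name `iter_questions`, the output keys, and the sample rows are made up for this example and are not taken from the TitleWave code.

```python
import io
import xml.etree.ElementTree as ET


def iter_questions(source):
    """Yield one dict per question row from a Posts.xml-style stream."""
    # iterparse reads the file incrementally instead of loading it all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":  # 1 = question
            yield {
                "id": int(elem.get("Id")),
                "title": elem.get("Title", ""),
                "body": elem.get("Body", ""),
                "answered": int(elem.get("AnswerCount", "0")) > 0,
                "accepted": elem.get("AcceptedAnswerId") is not None,
            }
        elem.clear()  # drop the element's contents once it has been handled


# Toy stand-in for the real dump; the rows below are invented for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="Gmail API Pull out plain text email body" Body="&lt;p&gt;How do I get the body?&lt;/p&gt;" AnswerCount="2" AcceptedAnswerId="5" />
  <row Id="5" PostTypeId="2" Body="&lt;p&gt;Use the payload parts.&lt;/p&gt;" />
  <row Id="7" PostTypeId="1" Title="Why is my loop slow" Body="&lt;p&gt;It is slow.&lt;/p&gt;" AnswerCount="0" />
</posts>"""

questions = list(iter_questions(io.BytesIO(SAMPLE)))
```

For the real file you would pass a path or an open file object instead of the in-memory sample, and the resulting dicts could be fed to a dataframe constructor.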
I process that into a pandas DataFrame, and then I do some data cleaning with regular expressions to remove HTML tags and code.
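The cleaning step might look something like the sketch below. The exact patterns used in TitleWave are not given, so the regular expressions and the name `clean_body` are assumptions: code inside `<pre>` or `<code>` elements is dropped entirely, remaining tags are stripped, and HTML entities are decoded.

```python
import html
import re

# Assumed patterns: remove whole code blocks first, then any leftover tags.
CODE_BLOCK = re.compile(r"<pre\b.*?</pre>|<code\b.*?</code>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def clean_body(raw_html):
    """Return the plain-text prose of a question body, without code or tags."""
    text = CODE_BLOCK.sub(" ", raw_html)  # drop code, including its contents
    text = TAG.sub(" ", text)             # strip remaining tags, keep their text
    text = html.unescape(text)            # turn &amp; etc. back into characters
    return WHITESPACE.sub(" ", text).strip()


example = "<p>How do I read <code>msg.payload</code> as text?</p><pre><code>x = 1\n</code></pre>"
cleaned = clean_body(example)
```

With a dataframe, the same function could be applied to a body column (for instance with `Series.map`); the column name would depend on how the data was loaded.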
Now since this is such a huge dataset of almost 20 million questions, we can actually train a deep neural network like Google’s BERT to do feature extraction for us. This is a good choice because it doesn’t just look for keywords, it also considers the phrasing of the title and how the different words fit together. I then put those features into a logistic regression to predict if each question gets answered or not, and lastly I backpropagate the error using PyTorch to fine-tune the neural network.
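To make the classifier step a little more concrete, here is a deliberately simplified sketch of just the logistic regression head, written in plain Python. The two-dimensional feature vectors are hand-made stand-ins for BERT's output, and the function names, learning rate, and epoch count are all illustrative assumptions. In the full model the same error term would be propagated back into the encoder by PyTorch; this sketch only updates the head.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that a title with feature vector x gets answered."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_head(features, labels, epochs=200, lr=0.5, seed=0):
    """Fit weights and bias by stochastic gradient descent on cross-entropy loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            error = predict_proba(weights, bias, x) - y  # d(loss)/d(logit)
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy features standing in for title embeddings; 1 = answered, 0 = unanswered.
features = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 0.9]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_head(features, labels)
```

The `error` term is the quantity that fine-tuning pushes further back through the network, which is why the encoder and the head can be trained together.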
Let’s see how that classifier performs on a test set. This plot shows the distribution of predicted probabilities for the two classes, answered and unanswered questions, and there are actually two main clusters. The good titles have about an 80% chance of getting answered, whereas the bad titles are only 50-50. Also, the proportion of answered questions increases from left to right, so that tells us the model is working. If you can move your title from bad to good with the help of this tool, that gives you a significant boost in your chance of getting an answer.
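The "increases from left to right" check can be sketched as a simple binning of the predicted probabilities. The function name `answered_rate_by_bin` and the numbers below are invented for illustration; they are not the test-set results from the plot.

```python
def answered_rate_by_bin(probabilities, answered, n_bins=5):
    """Fraction of answered questions in each equal-width probability bin.

    Empty bins are reported as None.
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(probabilities, answered):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        totals[index] += 1
        hits[index] += int(y)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Made-up predictions forming a low cluster and a high cluster.
probabilities = [0.42, 0.45, 0.47, 0.49, 0.78, 0.80, 0.82, 0.85]
answered = [1, 0, 1, 0, 1, 1, 1, 0]
rates = answered_rate_by_bin(probabilities, answered, n_bins=2)
# rates == [0.5, 0.75]: the higher-probability bin has the higher answer rate
```

If the model is working, the observed answer rate should rise from the lowest bin to the highest, as it does in this toy data.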
For the “Suggest a title” button, I can’t use BERT because it doesn’t output text; it just encodes text into numbers. T5 is another model by Google that has both an encoder and a decoder, which decodes those numbers back into text at the end, so it’s a good choice for this task. I also fine-tune T5, but this time only on questions that have an accepted answer.
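Preparing the fine-tuning data for a text-to-text model could look roughly like this. T5 is usually given a task prefix on its input, and `"summarize: "` is the one commonly used for summarization, but the prefix actually used here is not stated, so treat it as an assumption along with the name `build_t5_pairs`, the dict keys, and the character cap.

```python
def build_t5_pairs(questions, prefix="summarize: ", max_body_chars=2000):
    """Turn cleaned questions into (input_text, target_text) training pairs."""
    pairs = []
    for q in questions:
        if not q["accepted"]:  # keep only questions with an accepted answer
            continue
        # Crude length cap; a real pipeline would truncate by tokens instead.
        body = q["body"][:max_body_chars]
        pairs.append((prefix + body, q["title"]))
    return pairs


# Invented sample records, shaped like the output of an earlier cleaning step.
sample_questions = [
    {"title": "How can I extract plain text from an email sent to me?",
     "body": "I want the plain text body of a message.", "accepted": True},
    {"title": "Why is my loop slow", "body": "It is slow.", "accepted": False},
]
pairs = build_t5_pairs(sample_questions)
```

Each pair maps a question body to its title, so the model learns to produce a title-like summary of the body text.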
The result is that about a third of the time the suggested title scores better than the original title, measured according to the BERT model. That means maybe around a third of the users on Stack Overflow could benefit from this tool.
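That comparison amounts to scoring both titles with the classifier and counting how often the suggestion wins. A minimal sketch, with a hypothetical function name and mostly invented scores (only the first pair comes from the demo example):

```python
def fraction_improved(original_scores, suggested_scores):
    """Share of questions whose suggested title scores strictly higher."""
    if len(original_scores) != len(suggested_scores):
        raise ValueError("score lists must be the same length")
    if not original_scores:
        return 0.0
    wins = sum(s > o for o, s in zip(original_scores, suggested_scores))
    return wins / len(original_scores)


original_scores = [0.504, 0.80, 0.62]   # classifier score of each original title
suggested_scores = [0.639, 0.75, 0.60]  # classifier score of each suggested title
improved = fraction_improved(original_scores, suggested_scores)
```

Note that both scores come from the same classifier, so this measures improvement according to the model rather than actual answer rates.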
Putting it all together, this is the example title I used before in the demo. It only has a 50% chance of getting answered, so not great.
The title suggested by T5 is already a big improvement, and often you can actually make it even better by editing it yourself.
For example, the second part feels extraneous to me, and if I take that out, it actually increases the probability by another 10 percentage points. So next time you find yourself asking a question on Stack Overflow, consider using this Chrome extension to take a more data-driven approach to choosing a title.
My name is Tennesse Joyce, and I’m finishing my PhD in laser physics at CU Boulder. I do quantum simulations of light-matter interaction, and I’m very familiar with using Python for data analysis and visualization of those simulations. I’m looking to apply those skills to data science problems where I can have a potentially bigger impact.