
Josh Wills - Data Labeling as Religious Experience


Data Labeling as Religious Experience

One of the most common places to deploy a production machine learning system is as a replacement for a legacy rules-based system that is having a hard time keeping up with new edge cases and requirements. I'll be walking through the process and tooling we used to help us design, train, and deploy a model to replace a set of static rules we had for handling invite spam at Slack, talk about what we learned, and discuss some problems to solve in order to make these migrations easier for everyone.


  1. Data Labeling As Religious Experience
  2. About Me
     ● Google Engineer (2007-11)
     ● Cloudera’s Director of Data Science (2011-15)
     ● Slack’s Director of Data Engineering (2015-2017)
     ● Slack Engineer (2017-
  3. How Does It Feel?
  4. What’s Next?
  5. Talk Outline
  6. My Personal Life
  7. Let’s Talk About My Startup
  8. “”
  9. “”
  10. Remembrance of Things Past
  11. Search Problems: A Comparison
      Google search:
      1. Corpus/queries are public.
      2. Lots of head queries.
      3. Web pages want to be found.
      Slack search:
      1. Corpus/queries are private.
      2. Almost no head queries.
      3. Messages don’t care about being found.
  12. The Social Answer
  13. How Did Google Make Search Good?
  14. The Elephant In The Room
  15. Feedback Is Everything.
  16. Invite Spam
  17. How Do We Get Good Labeled Data?
  18. Snorkel And The Rise of Weak Supervision
  19. From Snorkel to Snuba
  20. But But But BERT!
  21. Focus.

Editor's Notes

  • A bit about me. I am presently unemployed.
  • Oh, it feels okay.
  • There is no what’s next, although consulting the trusty Silicon Valley hierarchy of needs chart, I see a number of Medium Thinkpieces in my not too distant future.
  • So what we’re talking about today:
    My personal life
    My startup
    And, at the after party, I will be happy to give you my unique and contrarian take on WeWork.
  • The highly cliched (but essentially accurate) desire of people who leave successful companies is to start another company that implements that one key feature that they thought would make the company but could never actually convince the company to invest in before they left. Because, let’s be honest, if they had convinced the company to do it, they would still be working there.

    Unfortunately, I have several such ideas. And I sort of need to get them out of my system, because the point of this time off is to clear my head and get myself ready for what’s next. And so that’s what we’re going to talk a bit about today.
  • If you don’t know what Slack is, this is Slack. There are these things called channels, and people can subscribe to them and then send messages to one another. It’s sort of like Kafka, but for people.

    My first terrible idea: Slack, but for Jupyter notebooks.
  • My other startup ideas are all related to search, and a lot of that is because I spent a good solid year rebuilding Slack search, which you can see my colleague John Gallagher and me talking about here: https://www.youtube.com/watch?v=EQ336PTZfhU

    The good news is that there are already a number of startups in this space, and I know this because many of them have tried to hire me. So this talk is my way of giving them all the exact same advice about how I think they should approach the hardest part of doing a really good job of enterprise search, especially the ones who are coming from a large-scale search background at, say, Google or a large e-commerce company.
  • Slack search is only really good at one thing: finding something when a) you know it already exists (possibly because you wrote it yourself), and b) you have a pretty good memory of what terms were involved, what channel it was in, etc. This is often very useful when it is paired with a devops culture that involves posting pretty much any ad hoc command you run into a channel so that the knowledge of the magic can be distributed far and wide.

    But no one should mistake this for Google search, or think that the relevance problem in enterprise search is even remotely solved like it is for the web.
  • A bit about why Slack search is hard and why Google actually has it pretty easy.
  • The blessing and the curse of Slack search: you can always ask someone who knows.
  • The problem we have now is that Google’s position on Maslow’s hierarchy of needs is so far removed from the reality of an enterprise search startup that it leads us to think that the bells and whistles are what matter and we no longer see what all of the infrastructure is built on: high-quality click data.

    Spelling correction, learn-to-rank algorithms, synonym detection, etc., etc. are all based on the strong signal of the core mechanism of the query-click pairing.
  • And this foundation is easy to take for granted; we rarely actually talk about it because all of our sophisticated machinery is predicated on its existence. It’s the elephant in the room, the water that fish swim in, the air we breathe.
  • If for no other reason than it gives agency to our users.
  • I get it: labeling data is terrible. You don’t want to do it. You even feel bad conning your interns into doing it for you. Good for you! It shows you have a conscience.
  • https://blog.acolyer.org/2018/08/22/snorkel-rapid-training-data-creation-with-weak-supervision/
  • https://blog.acolyer.org/2019/08/26/snuba/
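
The Snorkel post linked above describes weak supervision: instead of hand-labeling every example, you write small heuristic "labeling functions" and combine their noisy votes into training labels. Here is a minimal sketch of that idea for invite spam, in plain Python. Everything in it is an assumption made for illustration: the invite fields (batch_size, sender_domain, recipient_domain), the threshold, the domain list, and the three rules are hypothetical and are not Slack's actual rules. The votes are combined with a simple majority vote, which is a stand-in for the generative label model that Snorkel learns, not a reimplementation of it.

```python
from collections import Counter

# Label values: each labeling function votes SPAM, NOT_SPAM, or abstains.
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

# Hypothetical list of throwaway email domains (illustration only).
DISPOSABLE_DOMAINS = {"mailinator.com", "trashmail.example"}


def lf_large_batch(invite):
    # Hypothetical rule: a very large batch of invites looks spammy.
    return SPAM if invite["batch_size"] > 50 else ABSTAIN


def lf_same_domain(invite):
    # Hypothetical rule: inviting someone on your own email domain looks legitimate.
    same = invite["sender_domain"] == invite["recipient_domain"]
    return NOT_SPAM if same else ABSTAIN


def lf_disposable_recipient(invite):
    # Hypothetical rule: invites to throwaway domains look spammy.
    return SPAM if invite["recipient_domain"] in DISPOSABLE_DOMAINS else ABSTAIN


LABELING_FUNCTIONS = [lf_large_batch, lf_same_domain, lf_disposable_recipient]


def weak_label(invite):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(invite) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the rules disagree evenly, so emit no label
    return ranked[0][0]


if __name__ == "__main__":
    invite = {
        "batch_size": 200,
        "sender_domain": "example.com",
        "recipient_domain": "mailinator.com",
    }
    print(weak_label(invite))
```

Invites that end up with a SPAM or NOT_SPAM label can go into a training set for a downstream classifier, while abstentions are simply left out. That lets the legacy rules act as noisy labelers rather than as the final decision.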
