
idalab seminar #11 - Jan Saputra Müller - "SELECT * FROM … natural language: databases, we need to talk!"



Do you know the feeling? All you want is a break-down of last year's sales numbers, and suddenly you find yourself typing tedious heaps of SQL statements, clicking through complicated dashboards or hunting for the right number in an Excel sheet. The vast majority of decision makers have better things to do with their time. That's why business intelligence software was originally born. But only 20% of users actually get along with the solutions BI software provides. Instead, the BI team is spammed with ad-hoc data requests. What if everyone had a personal virtual data analyst at hand?
2018: Now it's time to rethink query languages! Computers are getting better at processing natural language, and the young company AskBy has devoted itself to one mission: using AI and machine learning to let everyone query their data themselves, in plain English. The goal: empower decision makers and free up business intelligence teams to have more meaningful impact beyond the legwork of writing mundane SQL.

Published in: Data & Analytics


  1. Jan Saputra Müller | Agency for Data Science: machine learning & AI, mathematical modelling, data strategy | "SELECT * FROM … natural language: databases, we need to talk!" | idalab seminar #11 | June 22nd, 2018
  2. The current problem with traditional Self-Service BI: only 20% of employees have access to Self-Service BI, and due to its complexity, only 17% of those with access actually use it. Thousands of data questions remain unanswered.
  3. Analytics will be available for everyone. BI 1.0 – Reports (1980s–present), wait time: weeks. BI 2.0 – Visualization (1990s–present), wait time: days. BI 3.0 – Natural Language (today), wait time: seconds. "Last week's best keywords by visits?"
  4. The AskBy Machine Learning problem: "How much revenue did we have on each day in the last month?" → SELECT created_at::date, SUM(revenue) FROM sales WHERE created_at::date >= '2018-05-01' AND created_at::date <= '2018-05-31' GROUP BY created_at::date; ● The input is a string of 14 tokens ● The context of words in the sentence matters (e.g. "each day" ≠ "last day") ● The output is a string with over twice as many tokens ● The output language has a very strict syntax
  5. Let's bake our own (toy) AskBy! Ingredients: "AskBy, show me the ingredients of AskBy"
  6. Ingredient 1: A proper tokenizer. "How much revenue did we have on each day in the last month?" → How | much | revenue | did | we | have | on | each | day | in | the | last | month | ? But the inputs of all ML models are vectors, not strings! Also, we usually have only a single fixed-length vector as the input of our model. What to do?
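A minimal sketch of such a tokenizer, using a simple regex-based splitter (AskBy's actual tokenizer is not described in the slides):

```python
import re

def tokenize(question: str) -> list[str]:
    """Lowercase a question and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[?!.,]", question.lower())

tokens = tokenize("How much revenue did we have on each day in the last month?")
# 13 word tokens plus the trailing "?" -> 14 tokens, matching the slide
```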
  7. How to turn words into vectors? Let's write down what we ideally want: ● Vectors shouldn't be too high-dimensional ● Words with similar meaning should have similar vectorial representations ● The vectors should capture as much semantic information as possible => "Bag-of-words" is not the ideal answer!
  8. Ingredient 2: Word2vec! ● Word2vec* embeddings are computed such that words appearing in similar contexts have similar representations ● Interestingly, it turns out that these vectors also capture further semantic as well as syntactic relations between words, which can be explored using vector arithmetic (e.g. king − man + woman ≈ queen, along a "royalty" direction) ● There are numerous word2vec algorithms out there now! * Mikolov et al. 2013: "Efficient estimation of word representations in vector space."
  9. We actually tried that out! King, man, queen and woman do line up along "royalty" and "femininity" directions.
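The king − man + woman ≈ queen analogy can be checked with plain vector arithmetic. A toy sketch with made-up 3-dimensional embeddings (real word2vec vectors are learned and have hundreds of dimensions):

```python
import numpy as np

# Toy 3-d embeddings (hypothetical values for illustration only).
# Dimensions loosely encode "royalty", "masculinity", "femininity".
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
target = vec["king"] - vec["man"] + vec["woman"]
best = max(vec, key=lambda w: cosine(vec[w], target))
# best == "queen"
```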
  10. Well, now we have sequences of vectors. And then? Our sequences still have arbitrary length! Classical ML models expect a fixed-length vector as their input. How can we handle the following questions in the same model? "Traffic yesterday" – "from 01.11.2017 to 30.11.2017 show me the visits by entry page from browser chrome ordered by visits desc"
  11. Ingredient 3: Recurrent Neural Networks! ● Recurrent Neural Networks (RNNs) are neural networks with a recurrence in at least one of their layers ● This recurrence can be interpreted as a simple memory that allows the network to remember things while iterating over time ● RNNs are Turing complete! Isn't that somewhat like finite automata vs. Turing machines? (Diagram: x → h → y, unfolded in time as x1, x2, x3, … → h1, h2, h3, … → y1, y2, y3, …)
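A minimal Elman-style RNN cell in NumPy illustrates the recurrence. The weights here are random for the sake of the sketch; a real model would learn them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are illustrative: 4-d inputs, 3-d hidden state.
d_in, d_h = 4, 3
W_xh = rng.normal(size=(d_h, d_in)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(d_h, d_h)) * 0.1    # hidden -> hidden (the recurrence)
b_h = np.zeros(d_h)

def rnn_forward(xs):
    """Iterate over a sequence of input vectors, carrying the hidden state
    forward as a simple memory."""
    h = np.zeros(d_h)
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states

sequence = [rng.normal(size=d_in) for _ in range(5)]  # any length works
states = rnn_forward(sequence)
```

Because the same weights are applied at every step, the network handles sequences of arbitrary length, which is exactly what the variable-length questions above require.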
  12. But unfortunately, translating token by token can't be the answer… "How" maps to "SELECT"? "much" maps to "revenue"? How can we know it's the revenue? We didn't read it yet! And eventually we run out of input tokens, but our query is not done yet!
  13. Ingredient 4: Sequence-to-sequence models* ● Let's stack two RNNs together! ● The first one reads the input sequence completely (encoding) ● Afterwards, the other one writes the output sequence until it emits an end-of-sequence (EOS) token (decoding) * Sutskever et al. 2014: "Sequence to sequence learning with neural networks"
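The encode-then-decode dataflow can be sketched with two toy recurrences. The weights are random, so the emitted tokens are meaningless, but the two-phase structure matches the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, vocab = 3, 6   # hidden size and output vocabulary size (illustrative)
EOS = 0             # token index 0 acts as the end-of-sequence marker

enc_W = rng.normal(size=(d_h, d_h)) * 0.1
dec_W = rng.normal(size=(d_h, d_h)) * 0.5
out_W = rng.normal(size=(vocab, d_h))

def encode(xs):
    """Read the whole input sequence into one fixed-length vector."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(enc_W @ h + x)
    return h

def decode(h, max_len=10):
    """Emit output tokens step by step until EOS (or a length cap)."""
    tokens = []
    while len(tokens) < max_len:
        h = np.tanh(dec_W @ h)
        tok = int(np.argmax(out_W @ h))
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens

question_vecs = [rng.normal(size=d_h) for _ in range(4)]
query_tokens = decode(encode(question_vecs))
```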
  14. Sequence-to-sequence models are really powerful… ● Google Translate switched to a purely sequence-to-sequence based translation model in 2016, only two years after the first paper on the topic, and it performs extremely well ● But translating into a formal language is even more challenging: there is absolutely no tolerance for syntactic or semantic errors, whereas humans are much more tolerant > SELECT revenue FROM FROM sales; SQL Error: syntax error at or near "FROM"
  15. Ingredient 5: Nefisto* ● Idea: combine the decoder RNN of a sequence-to-sequence model with a finite state machine that restricts our output sequence ● Nefisto: "Neural finite state output" layer on top of the decoder * Nefisto: unpublished; internally developed at AskBy in 2017
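Since Nefisto itself is unpublished, the following is only a hypothetical sketch of the general idea: a finite state machine over a tiny SQL grammar masks the decoder's token scores, so only syntactically valid continuations can ever be chosen. The grammar, vocabulary and random scores below are invented for illustration:

```python
import numpy as np

# Tiny invented grammar: state -> {allowed token -> next state}
FSM = {
    "start": {"SELECT": "cols"},
    "cols":  {"revenue": "from", "visits": "from"},
    "from":  {"FROM": "table"},
    "table": {"sales": "done"},
}
VOCAB = ["SELECT", "FROM", "revenue", "visits", "sales"]

def masked_argmax(scores, state):
    """Pick the highest-scoring token among those the FSM allows here."""
    allowed = FSM[state]
    best = max(allowed, key=lambda t: scores[VOCAB.index(t)])
    return best, allowed[best]

state, query = "start", []
while state != "done":
    # Stand-in for the decoder RNN's output scores at this step.
    scores = np.random.default_rng(len(query)).normal(size=len(VOCAB))
    tok, state = masked_argmax(scores, state)
    query.append(tok)
# query is always a syntactically valid SELECT ... FROM sales
```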
  16. We finally have a model! What about the training data? ● The prediction space grows approximately like the factorial function n!, where n is the number of selectable quantities; that is faster than exponential! ● From learning theory we know that we need a training sample set that covers our prediction space well.
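To see why factorial growth is a problem, compare n! with an exponential for small n:

```python
from math import factorial

# (n, 2**n, n!) for n = 1..8: the factorial overtakes 2**n at n = 4
# and then pulls away faster than any fixed exponential base could.
growth = [(n, 2 ** n, factorial(n)) for n in range(1, 9)]
```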
  17. Ingredient 6: Larala* ● Larala: "Language randomization language" ● Idea: build a probabilistic model of natural language queries such that we can sample training data from it ● To make everything reusable, we developed a whole programming language and are currently building up a standard library for it * Larala: internally developed programming language at AskBy in 2017; unpublished
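Larala is unpublished, so here is only a hypothetical sketch of the underlying idea: sample paired (question, SQL) training examples from a small probabilistic template grammar. All slot names and templates are invented:

```python
import random

# Invented slots: natural-language phrase -> SQL fragment
METRICS = {"revenue": "revenue", "visits": "visits"}
PERIODS = {"last month": "date_trunc('month', now()) - interval '1 month'"}

def sample_example(rng):
    """Draw one (question, SQL) training pair from the template grammar."""
    metric = rng.choice(list(METRICS))
    period = rng.choice(list(PERIODS))
    question = f"How much {metric} did we have in the {period}?"
    sql = (f"SELECT SUM({METRICS[metric]}) FROM sales "
           f"WHERE created_at >= {PERIODS[period]};")
    return question, sql

rng = random.Random(42)
pairs = [sample_example(rng) for _ in range(3)]
```

A richer grammar would randomize sentence structure and synonyms as well, so that the sampled questions cover the prediction space the previous slide worries about.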
  18. Done! ;) Well, no! Honestly, there is still so much more to do to become an … But at least, that's a good start! :) For more recipes contact …