
Neural Information Retrieval with AquilaDB

Live presentation video: https://www.youtube.com/watch?v=-VYpjpLXU5Q

AquilaDB is a document and vector database solution for data scientists and Machine Learning Engineers. It is the muscle memory for your Machine Learning applications. Deploy in minutes and start prototyping your idea right away.

Github: https://github.com/a-mma/AquilaDB
Documentation: https://github.com/a-mma/AquilaDB/wiki
Website: http://aquiladb.xyz



  1. Neural Information Retrieval with AquilaDB
     The muscle memory for your Machine Learning applications
     Slides: https://bit.ly/aquiladb-slides
  2. Session plan
     ● Introduction to Machine Learning: 45%
     ● Introduction to AquilaDB & demo: 45%
     ● Questions: 10%
  3. Quick introduction to Machine Learning (actually, deep learning) 1.0
  4. AI, ML and DL
     ● AI (AGI) is still an unsolved problem
     ● Deep learning is one of many techniques in practice and is popular today
     ● Deep learning makes use of the computer hardware we have today for parallel processing
     ● Deep learning models are black-box algorithms that have pushed, and are still pushing, the capabilities of software in the industry
     ● Computers are dumb
  5. The quest for a universal equation
  6. Solve an equation from data
  7. Solve an equation from data: a neural network is one such algorithm
  8. Deep Learning
     ● Very deep neural networks
     ● High-end machines are required for low-latency predictions
     Examples of some standard deep learning architectures for computer vision: http://josephpcohen.com/w/visualizing-cnn-architectures-side-by-side-with-mxnet/
  9. What is a vector (matrix)?
     ● A vector is an ordered set of numbers
       ○ [1, 4, 8, 3, 9, 4, 0, 2]
     ● A vector can represent any kind of data (with some developer tricks; see the toy example below)
       ○ a position in space
       ○ a house
       ○ your favourite movies
       ○ a sentence like “hello madam nice to meet you”
       ○ an image, audio or video
     ● Vectors can be multidimensional
     ● Computers are good at processing (transforming) vectors with a GPU
     ● Linear algebra is central to almost all areas of mathematics, and it operates on vectors
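     As a toy illustration of representing data as a vector (the "house" features here are invented for the example, not from the talk):

         import numpy as np

         # Hypothetical encoding of a house as a fixed-order feature vector:
         # [bedrooms, bathrooms, area_sqft, year_built, has_garden]
         house = np.array([3, 2, 1450, 1998, 1], dtype=np.float32)
         print(house.shape)  # (5,) -- a 5-dimensional vector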
  10. Linear transformation
      X · M = Y
  11. Linear transformation
      [x1, x2] · M = [y1, y2]
  12. Linear transformation
      ● If we look closely and rewrite the previous equation, we get the equation of a line (hyperplane): y = mx + b, i.e. [y vector] = [m transformation matrix] · [x vector] + b, with b = 0
      ● For example, with b = 0, we can transform a line about the origin
      ● And if you start changing the value of b as well, you can move the transformed line around
  13. Linear transformation
      ● Transform a line about the origin (change m)
      ● Also move a line around (change both b and m)
      (a numpy sketch of both follows below)
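     A minimal numpy sketch of these transformations (the rotation matrix and bias values are invented for illustration):

         import numpy as np

         # Points on a line through the origin
         X = np.array([[1.0, 1.0],
                       [2.0, 2.0],
                       [3.0, 3.0]])

         # M: rotate each point about the origin (change m)
         theta = np.pi / 4
         M = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta),  np.cos(theta)]])

         # b: move the transformed line around (change b as well)
         b = np.array([0.5, -1.0])

         Y = X @ M + b   # X · M + b = Y
         print(Y)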
  14. How to fit a line to data? (Ref: ml-cheatsheet.readthedocs.io)
      y = mx + b
      ● x can be a vector of any length (dimension)
      ● y = mx + b then defines a hyperplane in that dimension
  15. ● By adjusting the values of m and b in the equation of a line, we can fit a line to a dataset (a gradient-descent sketch follows below)
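     A minimal gradient-descent sketch of this idea (the synthetic data, learning rate and iteration count are assumptions for illustration, in the spirit of the ml-cheatsheet reference):

         import numpy as np

         # Synthetic data roughly on the line y = 2x + 1
         rng = np.random.default_rng(0)
         x = rng.uniform(0, 10, 100)
         y = 2 * x + 1 + rng.normal(0, 0.5, 100)

         m, b = 0.0, 0.0   # start with a flat line
         lr = 0.01         # learning rate (assumed)

         for _ in range(1000):
             error = (m * x + b) - y
             # Gradients of mean squared error with respect to m and b
             m -= lr * 2 * np.mean(error * x)
             b -= lr * 2 * np.mean(error)

         print(m, b)       # approaches m = 2, b = 1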
  16. How does a neural network work? (Ref: ml-cheatsheet.readthedocs.io)
  17. How does a neural network work? (Ref: ml-cheatsheet.readthedocs.io)
  18. How does a neural network work?
      ● It does a series of linear transformations of a hyperplane
        ○ It is able to do linear fits / separations of data through these transformations
      ● An activation function (non-linearity) is added between each transformation
        ○ This introduces curve fitting instead of line fitting; in other words, a line is being turned into a curve
      [Figure: a neural network without activation on the left; a neural network with activation on the right]
      Ref: https://www.spindox.it/en/blog/machine-learning-neural-networks-demystified/
  19. How does a neural network work? (Ref: ml-cheatsheet.readthedocs.io)
  20. Let’s be practical
     ● Use Keras, TensorFlow or PyTorch for Machine Learning. They have automatic differentiation and will do backpropagation for you.
     ● If you are an absolute beginner, try Keras first, then move to TensorFlow 2.0, which supports the Keras API (a minimal sketch follows below)
     ● Use Jupyter notebooks for prototyping your idea
     ● Use pre-trained models from TFHub and perform transfer learning whenever possible. If you are limited on data, you must do this.
     ● Get hands-on. Then you will learn how to choose learning rates, handle over/underfitting, etc.
     ● Most importantly, understand the problem and think about the minimum resources with which it can be solved
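     A minimal Keras sketch of that beginner workflow (the layer sizes and the random dataset are placeholders, not from the talk):

         import numpy as np
         from tensorflow import keras

         # Placeholder data: 100 samples, 8 features, binary labels
         x = np.random.rand(100, 8).astype("float32")
         y = np.random.randint(0, 2, size=(100, 1))

         # A tiny fully connected network; Keras handles backprop for us
         model = keras.Sequential([
             keras.layers.Input(shape=(8,)),
             keras.layers.Dense(16, activation="relu"),
             keras.layers.Dense(1, activation="sigmoid"),
         ])
         model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
         model.fit(x, y, epochs=5, batch_size=16)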
  21. Cosine similarity between two vectors
     ● It gives a number between 0 and 1 that represents how similar two vectors are
     ● If the number is very close to 1, the two vectors point in nearly the same direction (are parallel) and are similar
     ● If the number is very close to 0, the two vectors are nearly perpendicular and are very dissimilar
     (a numpy implementation follows below)
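     A minimal numpy implementation, cos(θ) = (a · b) / (‖a‖ ‖b‖):

         import numpy as np

         def cosine_similarity(a, b):
             # cos(theta) = (a . b) / (||a|| * ||b||)
             return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

         a = np.array([1.0, 2.0, 3.0])
         print(cosine_similarity(a, a))                           # 1.0 -> same direction, similar
         print(cosine_similarity(a, np.array([-2.0, 1.0, 0.0])))  # 0.0 -> perpendicular, dissimilar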
  22. Let’s be practical
     ● In practical situations you might use cosine similarity in a lot of places, so we need to be clever and computationally efficient (latency is an evil in many cases)
     ● One such example is finding the most similar vectors in a huge data dump, given an input vector. Of course, we need to compare each vector in the dump with the input vector efficiently.
     ● If the elements of the vectors are drawn from a known range, we can eliminate the computationally expensive denominator from the previous equation. That means you can use the dot product alone to generate the distance value, which is very efficient with libraries like numpy.
     ● Just sort the index based on the result (distance) and choose the first ‘n’ vectors from the sorted index (see the sketch below)
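     A sketch of that trick (the corpus is random placeholder data): normalize the dump once up front, then a single matrix-vector dot product plus argsort yields the top-n neighbours.

         import numpy as np

         # Placeholder dump: 10,000 vectors of dimension 128, normalized once
         dump = np.random.rand(10000, 128).astype("float32")
         dump /= np.linalg.norm(dump, axis=1, keepdims=True)

         query = np.random.rand(128).astype("float32")
         query /= np.linalg.norm(query)

         # One dot product against the whole dump; no per-pair denominator needed
         scores = dump @ query

         n = 5
         top_n = np.argsort(scores)[::-1][:n]  # indices of the n most similar vectors
         print(top_n, scores[top_n])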
  23. Quick introduction to Machine Learning (actually, deep learning) 2.0
  24. A neural network as a sequence of dot products
  25. A neural network as a sequence of dot products
  26. A neural network as a sequence of dot products
  27. A neural network as a sequence of dot products
      Live demo: https://cs.stanford.edu/people/karpathy/convnetjs/
  28. A neural network as a sequence of dot products
     ● At each layer, the network checks how much the input matches a learned representation (the weights) and filters it, keeping the relevant pattern from the input to produce the next hidden layer
     ● This filtering continues at each layer. As we move forward, high-level but only essential features of the original input are preserved.
     ● So, if we cut a pre-trained neural network in two at any layer and examine it, we get a generalized representation of any input to the network. This is very useful because the network has removed unwanted features from the input and gives a generalized representation of it. (A sketch of such a forward pass follows below.)
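     A minimal numpy sketch of a forward pass as a sequence of dot products (random weights stand in for trained ones); “cutting” the network at h1 would yield the intermediate representation the slide describes.

         import numpy as np

         def relu(z):
             return np.maximum(0.0, z)

         x = np.random.rand(8)                        # input vector

         # Random weights stand in for trained ones
         W1, b1 = np.random.rand(8, 16), np.zeros(16)
         W2, b2 = np.random.rand(16, 4), np.zeros(4)

         h1 = relu(x @ W1 + b1)                       # first dot product + activation
         out = h1 @ W2 + b2                           # second dot product

         print(h1.shape, out.shape)                   # (16,) (4,)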
  29. Transfer learning
     ● Because we can obtain a generalized representation from any pre-trained neural network just by tearing it apart, we can do incredible things with it
       ○ Are you short on training data? Just use a pre-trained model and do transfer learning.
       ○ Do you want to access unstructured data based on its general properties? Just use a pre-trained model and index the representation in a vector database along with the unstructured data (we will get into this later).
  30. Let’s be practical
     ● Transfer learning is one of the default things you should do today when you develop your applications, or at least during prototyping, because you can get started with development even if you are short on data
     ● Language models are one such useful thing when you deal with NLP projects (an example follows below)
       ○ Embeddings - cut the pre-trained model at a lower layer
         ■ word2vec, GloVe, fastText, StarSpace
       ○ Encodings - cut the pre-trained model at a higher layer
         ■ BERT, GPT, XLNet, RoBERTa
     ● Also applicable to other data types like images, audio, graphs, etc.
     ● There are different repositories for getting these pre-trained models
       ○ TFHub, Hugging Face, PyTorch Hub
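     For instance, pulling a pre-trained sentence encoder from TFHub takes a few lines (the Universal Sentence Encoder is used here as an illustration; it is one of the modules published on tfhub.dev):

         import tensorflow_hub as hub

         # Load a pre-trained sentence encoder from TFHub
         encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

         sentences = ["hello madam nice to meet you", "good morning madam"]
         embeddings = encoder(sentences)   # one 512-dimensional vector per sentence
         print(embeddings.shape)           # (2, 512)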
  31. Let’s be practical
     ● High-level steps for transfer learning (sketched in Keras below)
       ○ load a tailless pre-trained model
       ○ attach a fake tail
       ○ freeze the body and train only the attached tail on part of your data
       ○ unfreeze the body and train the whole new model on the rest of your data
       ○ the difference between the pre-training dataset and your dataset doesn’t matter much
     ● The fast.ai courses explain transfer learning techniques with good examples, if you are interested: http://nlp.fast.ai/
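     A Keras sketch of those steps (MobileNetV2, the input size and the 10-class head are placeholder choices, not from the talk):

         from tensorflow import keras

         # 1. Load a "tailless" pre-trained model (no classification head)
         body = keras.applications.MobileNetV2(
             input_shape=(224, 224, 3), include_top=False, pooling="avg")

         # 2. Attach a fake tail for our own task (say, 10 classes)
         model = keras.Sequential([
             body,
             keras.layers.Dense(10, activation="softmax"),
         ])

         # 3. Freeze the body and train only the tail
         body.trainable = False
         model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
         # model.fit(first_part_of_your_data, ...)

         # 4. Unfreeze the body and fine-tune everything at a low learning rate
         body.trainable = True
         model.compile(optimizer=keras.optimizers.Adam(1e-5),
                       loss="sparse_categorical_crossentropy")
         # model.fit(rest_of_your_data, ...)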
  32. Information retrieval
  33. What is Information Retrieval?
     ● Information retrieval is everywhere in the industry, and you are all familiar with it. It ranges from your PC’s file search, SQL and NoSQL to knowledge graphs.
     ● A few examples that rely purely on information retrieval: the most familiar Google, Amazon, Airbnb, Facebook, Shazam, Siri, etc.
     ● The information retrieved can be exact or approximate. SQL, NoSQL, knowledge graphs, etc. belong to exact retrieval.
     ● But there is a big field out there where approximate information retrieval is the main player. We are going to focus on that.
  34. Approximate Information Retrieval? Ha, convince me!
     ● Here are some use case examples in the modern era
       ○ Image / video search
       ○ Reverse image / video search
       ○ Document search
       ○ Chatbots / voice bots
       ○ Song search
       ○ Recommend similar videos / photos / anything
       ○ Classification (sentiment analysis, image classification, ...)
       ○ ...
  35. What’s wrong with traditional IR?
     ● Traditionally, we need to manually feature-engineer the data and write custom logic for each requirement
     ● Data storage and retrieval mechanisms vary across use cases, resulting in zero reusability of systems
     ● Different databases have different API interfaces and implementation practices, which introduces unwanted complexity
     ● Retrieval is inefficient and slow, and there is no measure to estimate the latency the IR system adds to the whole application
     ● Interoperability with modern Machine Learning pipelines is poor and a headache for adapter writers
  36. So, what’s available for modern IR?
     Glad you asked. I can suggest the best method if you are ready to try out deep learning in your project:
     1. Train a deep learning model over your data and let it learn the essence of (encode) your data
     2. Dissect it at its head or tail (that’s up to you)
     3. Feed your data to it and collect what comes out the other end
     4. Index that inside a vector database (AquilaDB) along with JSON metadata to give more meaning to the encoded vector
     5. When you want to perform IR, repeat step 3 on the new data and run a k-NN search on the vector database (AquilaDB)
     It’s that simple. No more hard work! (A toy end-to-end sketch follows below.)
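     To make the five steps concrete, here is a toy end-to-end sketch in which a plain numpy index stands in for the vector database, and a hash-based function stands in for the dissected model (both are illustrative stand-ins; AquilaDB plays the index role in practice):

         import numpy as np

         def encode(text):
             # Stand-in for steps 1-3: a real pipeline would use a dissected model
             vec = np.array(([hash(w) % 1000 for w in text.split()] + [0] * 8)[:8],
                            dtype=np.float32)
             return vec / (np.linalg.norm(vec) + 1e-9)

         # Step 4: index encoded vectors along with JSON-like metadata
         docs = ["the cat sat on the mat", "dogs are great pets", "stock prices fell today"]
         index = np.stack([encode(d) for d in docs])
         metadata = [{"text": d} for d in docs]

         # Step 5: encode the query the same way and do a k-NN search (k = 2)
         query = encode("cats and mats")
         nearest = np.argsort(index @ query)[::-1][:2]
         print([metadata[i] for i in nearest])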
  37. Introducing, AquilaDB
     The muscle memory for your Machine Learning applications
  38. ● a_മ്മ is a FOSS ML interest group focused on Malayalam language dataset generation and chatbots
     ● Eventually, we found some basic and common problems in ML application and engineering practices
     ● We decided to solve some of them for our own projects, and it turned out to be useful to everyone
     ● So, a_മ്മ is currently in an evolving stage, trying to find the perfect place to serve mankind
     @freakeinstein @smqbit @Haroldgomez777 @JESWINKNINAN
     a_മ്മ is not ‘amma’, it’s ‘e mma’. It is an umbrella branding (womb) for multiple futuristic projects.
     Logo: the blue colour represents male and the pink colour represents female, being part of one life form. Circles that shift apart represent the process of cell splitting.
  39. AquilaDB
     What?
     ● It’s a vector and document database
     ● It’s the muscle memory for Machine Learning applications
     ● It’s like Redis for ML
     ● It performs super fast k-NN search over large vector data
     Why?
     ● You can’t rely on end-to-end Machine Learning models for information retrieval (low latency)
     ● Keep document metadata along with the vectors to add more meaning
     ● Easy setup; start using it in minutes
     ● Language agnostic: build your apps in any programming language (not just Python)
  40. What’s wrong with traditional IR? SOLVED!*
     ● Traditionally, we need to manually feature-engineer the data and write custom logic for each requirement
     ● Data storage and retrieval mechanisms vary across use cases, resulting in zero reusability of systems
     ● Different databases have different API interfaces and implementation practices, which introduces unwanted complexity
     ● Retrieval is inefficient and slow, and there is no measure to estimate the latency the IR system adds to the whole application
     ● Interoperability with modern Machine Learning pipelines is poor and a headache for adapter writers
     * It’s a bit exaggerated.
  41. How to use AquilaDB
     Install it in minutes. Prerequisites: Docker (an example command follows below).
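     The install command was shown as an image on the slide; this is a sketch of the Docker-based setup, with the image name and gRPC port assumed from the project’s README at the time:

         # Pull and run AquilaDB (image name and port are assumptions from the README)
         docker pull ammaorg/aquiladb:latest
         docker run -d -i -p 50051:50051 -t ammaorg/aquiladb:latest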
  42. How to use AquilaDB
     Do everything in 6 lines of code (Python example). Prerequisites: nothing (a sketch of the client code follows below).
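     The code itself was shown as an image on the slide; this sketch follows the Python client usage documented in the project’s README at the time (the AquilaClient class and its method names are taken from there, so treat them as an approximation):

         from aquiladb import AquilaClient as acl

         # Connect to a running AquilaDB instance
         db = acl('localhost', 50051)

         # Index a vector together with JSON metadata
         sample = db.convertDocument([0.1, 0.2, 0.3, 0.4, 0.5], {"hello": "world"})
         db.addDocuments([sample])

         # k-NN search: find the 10 nearest documents to a query vector
         query = db.convertMatrix([0.1, 0.2, 0.3, 0.4, 0.5])
         result = db.getNearest(query, 10)
         print(result)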
  43. How to support? Go to the AquilaDB GitHub repository and try it out: https://github.com/a-mma/AquilaDB
  44. Demo time!
