2. What is ADAM?
ADAM is an advanced Named Entity Recognition module for domain-specific data, designed to deal with the absence of labelled data.
It leverages concepts from weak labelling, deep sequence models and active learning.
3. WHAT IS NAMED ENTITY RECOGNITION?
Sundar Pichai is the CEO of Google.
Person: Sundar Pichai
Position: CEO
Organisation: Google
4. WHAT CAN NER LOOK LIKE IN DOMAIN-SPECIFIC DATA?
Hospital Records…
“You will need to have your uterine bleeding evaluated. This continued
agitation may be caused by intra-parenchymal hemorrhage.”
Symptoms: uterine bleeding, continued agitation
Diagnosis: intra-parenchymal hemorrhage
5. WHAT CAN NER LOOK LIKE IN DOMAIN-SPECIFIC DATA?
Product Description…
“Pedigree is a complete and balanced food for dogs. Pedigree is rich in
proteins and nutrition.”
Brand: Pedigree Type: food for dogs
Nutrients: proteins
6. What will this session cover?
• Motivation for this problem.
• Why off-the-shelf solutions didn’t work for us.
• An approach to entity extraction from product-title-like text in the absence of labelled data.
• Comparison with other models.
• And takeaways.
7. PROBLEM STATEMENT
Product titles often have references to attributes which play a crucial role in driving the use-cases we will describe later.
Therefore, there is a need to extract these attributes from the product titles.
Hence ADAM!!!
8. What will ADAM Do?
resolute black vodka 180ml
resolute[B-b] black[I-b] vodka[B-c] 180ml
(Tagged product title)
A.D.A.M.
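The tagged title above can be written out in standard BIO form (B- begins an entity, I- continues it, O is outside). A minimal sketch with illustrative tag names, not necessarily ADAM's actual label set:

```python
# BIO representation of the tagged title above; tag names are illustrative,
# not necessarily ADAM's label set.
tagged_title = [
    ("resolute", "B-Brand"),
    ("black", "I-Brand"),
    ("vodka", "B-Category"),
    ("180ml", "O"),
]

def extract(tagged, entity):
    """Collect the surface form of one entity type from a BIO-tagged title."""
    return " ".join(tok for tok, tag in tagged if tag.endswith(entity))

brand = extract(tagged_title, "Brand")        # "resolute black"
category = extract(tagged_title, "Category")  # "vodka"
```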
12. EVOLVING KNOWLEDGE GRAPH
• A graph with entities (individual products or their attributes) as nodes, and edges describing the relationships between them.
• Given a seed graph, the idea is to evolve it from the data-set.
13. Challenges
• Domain-specific data (product titles in our case).
• Zero ground truth and no training data available whatsoever.
• Multiple sources of data, hence high variance:
• KESHKANTI HAIR CLEASNER SILK & SHINE 8 ML [1*960 PC] …….. a well-described stock-item
• N Deo Whitening Talc Touch100gm ……….. a not-so-well-described stock-item
• B. M. W. 2L ………… could have been a good stock-item description
• Short representations and extremely noisy:
• A H F SHAMPOO400mlM220 ………. very hard for a machine to understand
• HW Sandrop Hert 5 Ltr.
14. Some relatable previous works…
• Traditional Algorithms:
• Information extraction using a CRF with hand-crafted features.
(Citation: Ajinkya More, Attribute Extraction from Product Titles in eCommerce)
• Information extraction using weak labelling with the help of a knowledge base, on Twitter data.
(Citation: Alan Ritter, Sam Clark, Oren Etzioni, Named Entity Recognition in Tweets)
• Deep Learning Models:
• Lample et al., Neural Architectures for Named Entity Recognition
• Xuezhe Ma and Edward Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
15. Some relatable previous works…
• Off-the-shelf Tools:
• NLTK and Stanford NER
• Google’s train-and-play deep learning models
• CRF-suite
16. Why didn’t they work?
• These approaches leveraged either a large knowledge base or a large amount of labelled data, both of which are generally not available for industry applications.
• These models were trained on rather clean data-sets with much smaller variance than what we needed to deal with.
• Some of these approaches used hand-crafted features, which couldn’t scale on a data-set produced by millions of sources.
• Most pretrained tools like NLTK are trained on natural-language data-sets, which is simply not the case with product titles.
• The attributes that they can predict are not relevant to us.
17. Off-the-Shelf Tools’ Limitations
• They were trained on natural languages (i.e. languages with proper grammatical structure). Hence they work only on those types of sentences:
Barack/NNP Obama/NNP is/VBZ the/DT next/JJ president/NN
Person -> Barack Obama, Org. -> president
Nestle/NNP Maggi/NNP Noodles/NNP 100gm/CD
Person -> Nestle, Person -> Maggi Noodles
18. Hand-crafted Feature Limitations
• Worked well on the Walmart data-set.
• Was a disaster when tried on our data-set (because of high variance).
19. CRF Results on Our Data-set
CRF trained on hand-crafted features, using labelled data crawled from various websites.
• shri hanuman blended mustard oil 15kg jar
• tata tea elaichi 250gm
• relive fruity jelly war jar 50gm
It did well on popular products:
• maggi tomato ketchup 200g
• veet hair removal cream 100g
20. Why ADAM is better.
• Leverages weak labelled data.
• Leverages active learning.
• Immune to noise.
• Does not require any hand-crafted features as input.
22. ARCHITECTURE
The 3 Main Components of ADAM…
• Weak Label Generation
• State of the Art Sequence Tagging Model
• Active Learning approach
23. Weak Label Generation
• How we used an existing knowledge graph.
• How we improvised using information from other sources like the Amazon catalogue.
• In addition, how we leveraged the structure of the data-set (stock-item and stock-group).
26. Weak Labelling Algorithm
• We used a complicated rule-based string-matching algorithm, which annotates the different tokens present in a stock-item using our knowledge base.
• Then we use some constraints, once again a bunch of rules, to pick the well-annotated stock-items.
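As a rough illustration of the two steps above, here is a minimal sketch in Python; the knowledge base, the token-level matching and the coverage constraint are all simplified assumptions, not ADAM's actual rules:

```python
# A minimal sketch of rule-based weak labelling, assuming a hypothetical
# knowledge base of brand and category surface forms.
KNOWLEDGE_BASE = {
    "brand": {"resolute", "maggi", "veet"},
    "category": {"vodka", "ketchup", "cream"},
}

def weak_label(title):
    """Annotate each token with an entity type from the knowledge base, or 'O'."""
    labels = []
    for token in title.lower().split():
        tag = "O"
        for entity_type, surface_forms in KNOWLEDGE_BASE.items():
            if token in surface_forms:
                tag = entity_type
                break
        labels.append((token, tag))
    return labels

def passes_quality_check(labels, min_coverage=0.5):
    """Constraint: keep only stock-items where enough tokens received a label."""
    tagged = sum(1 for _, tag in labels if tag != "O")
    return tagged / len(labels) >= min_coverage

labels = weak_label("resolute black vodka 180ml")
# Only sufficiently covered annotations would enter the seed data-set.
seed_ok = passes_quality_check(labels)
```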
28. Seed Data-set Count
• We passed around 0.7 million stock-items through this process.
• Out of those, only 8 thousand passed the quality check.
• And that became our seed data-set.
31. WORD EMBEDDING LAYER
• Why couldn’t we use an existing embedding space or a pretrained set of vectors for this?
• How did we create our own? What data did we use?
• Used “skip-gram” with “hierarchical softmax” optimisation (Mikolov et al.).
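The skip-gram objective predicts context words from a centre word; in practice one would train it with hierarchical softmax via an existing implementation (e.g. gensim's Word2Vec with sg=1, hs=1). A minimal sketch of the (centre, context) pair generation at its core, with an illustrative window size:

```python
# Skip-gram training-pair generation: the (centre, context) pairs the model
# learns from. Window size and example corpus are illustrative only.
def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) pairs within a symmetric window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("maggi tomato ketchup 200g".split(), window=1)
```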
32. SKIP-GRAM ALGORITHM
Data-set we used.
• Stock-items provided by Tally’s
Product
• Product titles from Amazon’s
Catalogue and GS1 data
• Product titles crawled from
various websites.
• A total of around 13 million titles were used.
34. BI-LSTM LAYER
Sequential Training Model
• A word-level encoder which leverages sequential information to encode the tokens of the sentence.
• Why is another word-level encoder used? What more information does it encode?
35. CONDITIONAL RANDOM FIELDS
• Like other layers, this also uses context from neighbouring tokens and labels, but:
• Bi-LSTM only leverages input context.
• CRF is the only layer that leverages output-label context.
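The point about output-label context can be made concrete with Viterbi decoding, where the score of each label depends on the previous label through a transition term. A minimal sketch with toy labels and scores, not ADAM's trained parameters:

```python
# Viterbi decoding over toy emission and transition scores: the CRF layer's
# output-label context enters through the transition term.
def viterbi(emissions, transitions, labels):
    """Find the best label sequence given emission and transition scores."""
    # best[i][y] = best score of any path ending at position i with label y
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for i in range(1, len(emissions)):
        best.append({})
        back.append({})
        for y in labels:
            prev, score = max(
                ((p, best[i - 1][p] + transitions[(p, y)]) for p in labels),
                key=lambda t: t[1],
            )
            best[i][y] = score + emissions[i][y]
            back[i - 1][y] = prev
    # Backtrack from the best final label.
    y = max(best[-1], key=best[-1].get)
    path = [y]
    for b in reversed(back):
        y = b[y]
        path.append(y)
    return list(reversed(path))

labels = ["Brand", "Other"]
emissions = [{"Brand": 2.0, "Other": 1.0}, {"Brand": 1.5, "Other": 1.0}]
# A strong penalty on Brand -> Brand flips the second token to "Other",
# even though its own emission score prefers "Brand".
transitions = {("Brand", "Brand"): -5.0, ("Brand", "Other"): 0.0,
               ("Other", "Brand"): 0.0, ("Other", "Other"): 0.0}
path = viterbi(emissions, transitions, labels)  # ["Brand", "Other"]
```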
37. SOME RESULTS WITH BI-LSTM AS FINAL LAYER
• brit cow ghee 1ltr
• patanjali kesh kanti natural 200ml
• dabur chawanprash 500gm mrp
• smirnoff green apple triple distilled vodka 750ml
• eveready torch
38. IMPROVED RESULTS BECAUSE OF CRF
• brit cow ghee 1ltr
• patanjali kesh kanti natural 200ml
• dabur chawanprash 500gm mrp
• smirnoff green apple triple distilled vodka 750ml
• eveready torch
41. BASELINE RESULTS: WHY IT WASN’T ENOUGH
On a hold-out set of 1321 data points:
Surface-form match
• Brand: 50.1%
• Category: 44.4%
Complete sequence match: 30.2%
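The two metrics above can be sketched as follows, computed over (predicted, gold) pairs of token-label sequences; the label names are illustrative:

```python
# Per-attribute surface-form match and complete-sequence match, computed over
# (predicted, gold) token-label sequences. Label names are illustrative.
def surface_form_match(preds, golds, attribute):
    """Fraction of titles where the predicted surface form of an attribute
    exactly equals the gold surface form."""
    hits = 0
    for pred, gold in zip(preds, golds):
        p = " ".join(tok for tok, tag in pred if tag == attribute)
        g = " ".join(tok for tok, tag in gold if tag == attribute)
        hits += p == g
    return hits / len(golds)

def complete_sequence_match(preds, golds):
    """Fraction of titles where every token's label matches."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

preds = [[("maggi", "Brand"), ("ketchup", "Other")],
         [("veet", "Brand"), ("cream", "Category")]]
golds = [[("maggi", "Brand"), ("ketchup", "Category")],
         [("veet", "Brand"), ("cream", "Category")]]
brand_acc = surface_form_match(preds, golds, "Brand")  # 1.0
seq_acc = complete_sequence_match(preds, golds)        # 0.5
```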
42. ACTIVE LEARNING
• The first two parts give us a good baseline model. But why isn’t it good enough?
• One limitation of our automated weak-label generation process is that it is constrained by the completeness and quality of the knowledge base.
• So we need to get some data-points manually labelled.
• Hence, the aim is to generate samples consciously, leading to maximal improvement in the model with minimal labelling effort.
43. WHAT WE DID…
• Extrinsic sampling (a diversity-based sampling technique).
• Why didn’t uncertainty-based sampling work?
• Manually labelled those samples.
• Retrained the model on the augmented data-points using SGD and a smaller number of epochs.
• Tested the model on the hold-out set. If the model improved, repeat until you reach maturity.
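A diversity-based sampler can be sketched as a greedy k-centre selection over embedding vectors, each pick being the point farthest from everything already selected; the slides describe extrinsic sampling only at this level, so the details below are assumptions:

```python
# Greedy k-centre selection as a sketch of diversity-based sampling over
# embedding vectors; a simplified assumption, not ADAM's exact procedure.
def diversity_sample(vectors, k):
    """Greedily pick k points, each farthest from the already-selected set."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    selected = [0]  # start from the first point
    while len(selected) < k:
        # For each candidate, distance to its nearest selected point.
        best_i, best_d = None, -1.0
        for i, v in enumerate(vectors):
            if i in selected:
                continue
            d = min(dist(v, vectors[j]) for j in selected)
            if d > best_d:
                best_i, best_d = i, d
        selected.append(best_i)
    return selected

vectors = [(0, 0), (0.1, 0), (5, 5), (10, 0)]
picked = diversity_sample(vectors, 3)  # skips the near-duplicate (0.1, 0)
```

Note how the near-duplicate point is picked last: unlike uncertainty sampling, this spreads the labelling budget across the variance of the data-set.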
49. SOME OPTIMISTIC FINAL RESULTS
• n relive fruity jelly box 70ml
• zee citric acid 20g
• shri hanuman blended mustard oil 15kg jar
• shaving cream good morning 70gm
• Sandisc flash harddrive 100TB model TTB200
50. SOME NOT SO OPTIMISTIC RESULTS
• hawkins induction ltr heavy base pr ih
• mixed pickle 200g freshy 36pcs
• nutrela m oil 5ltr
51. METRICS
All the results below are generated on a hold-out set, i.e. a sample set whose entities were never part of the training data.
• Baseline accuracy:
• Iterative accuracy: Iter 0, Iter 1, Iter 2, …, Iter n
• Precision and recall: Iter 0, Iter 1, Iter 2, …, Iter n
54. COMPARISON WITH CRF-SUITE
• CRF-suite was very positionally biased and couldn’t generalise on relative positioning.
• The results on the hold-out set (a data-set with entities which were not present during training) were poor with CRF-suite, while switching to ADAM was a great improvement.
• Scaling was impossible with CRF-suite.
• <Will update some comparison metrics shortly>
55. OUR CONTRIBUTION
• We propose a novel entity extraction model for domain-specific data – short
and noisy – and in the absence of pre-labelled data.
• Our model builds upon a state-of-the-art model based on deep learning and
CRF and further leverages weak labelling and active learning techniques.
• We propose a novel extrinsic sampling technique for active learning, which performs better than uncertainty sampling for this task.
56. TAKE-AWAYS:
• How an industry-grade information extraction model can be built for domain-specific data.
• How to tackle the noise problem in the case of textual data.
• Why a deep NN model plays an important role in generalisation.
• Why active learning is a really important concept for dealing with the problem of zero labelled data.
57. FUTURE SCOPE OF ADAM
• Higher order attribute extraction.
• Good Night Advance Mat 12 pack.
• Britannia Goodday cashew and nuts cookies 50 gm.
• Relationship Extraction
58. CONCLUSION
Deep learning is surely making a mark in the field of NLP, but its industrialisation is still an open problem, mostly because the quality of textual data-sets is not very apt for the models to learn from.
Active learning is an interesting concept to tackle the above situation. Not only in the field of NLP: the same concept can be generalised to other domains of machine learning and AI as well.
59. ACKNOWLEDGMENTS
• Mikolov et al., word2vec
• https://arxiv.org/pdf/1707.05928v2.pdf for active learning.
• My teammates: Abishek Ahluwalia (Data Scientist II), Deepak Sharma (Lead Data Scientist), Ashish Anand Kulkarni (Director, Data Science)
1) Why did we have to solve the NER problem again even though it has been solved in many different ways? What are the real-life applications of ADAM? Why does its existence matter, not only at Clustr but for many other businesses too?
2) Ditto
3) Will explain through and through why we did what we did, and why and how we achieved our end goal.
4) Will show some results that justify our approach.
5) How this approach can inspire others in the ML domain.
1) We at Clustr deal with product titles, like you see on any e-commerce website.
2) <read through>
A good catalogue is defined by its coverage and its advanced search-and-select ability.
Attributes of those product titles can be a great filter for complicated searches.
ADAM can help automate them.
Market penetration, localisation and trend flow can all be automated with the help of ADAM.
Obviously these ontologies need extracted information to fill in nodes and edges.
Hence, if we can automate the extraction, we can build an ontology that evolves automatically.
1) a) Walmart published a paper which showed attribute extraction using a CRF, with hand-crafted features which I will describe later.
1) b) There have been a few papers where weakly labelled data is generated automatically using a healthy knowledge base and a strict string-matching algorithm.
2) A few good off-the-shelf tools, like NLTK, Google’s deep-learning models and CRF-suite, are good with natural-language data but not with ours.
<Read it through.>
Refer to the OpenTag paper and how they used a very specific domain to test their algorithm.
Since the data-sets we have are produced by millions of users, there is no specific way of writing or set of rules that they follow. Hence no hand-crafted features.
We have a special way to leverage a small knowledge base and the structure of the data to generate labelled data (seed data, to be exact).
We leverage the concept of active learning, whose aim is to get by with the minimum amount of manual labelling for training purposes. Details follow later.
Why this approach can adapt to noise (both because of the model and because of active learning).
No need to generate hand-crafted features.
Put only the picture on the slide and basically focus on the 3 components.
Describe why stock-groups are important.
Use the term resolution. How we use our knowledge base to resolve, and talk about the constraints.
Tell them the numbers to give an idea of the constraints.
Explain the whole architecture.
Introduce the data that we use, how we removed the digits, and demonstrate the embedding space (Mikolov et al.; introduce citations).
It is an advanced form of recurrent neural network.
<Describe the diagram>
Talk about its important enhancements, e.g. the forget gate.
Talk about an example of differentiating “apple” as a category when “gm” is present and as a brand when something else is.
Why was this needed?
Since it takes the representation into account, it can differentiate very well between ambiguous words, which is absent in word embeddings.
Bi-LSTM would have only used the sequential context of the symbols to make the decision, while CRF uses the context of sequential states.
Explain the concept of transition and emission energies you are trying to minimise.