2. What is ADAM?
ADAM is an advanced Named Entity Recognition module for domain-specific data, designed to deal with the absence of labelled data.
It leverages concepts from weak labelling, deep sequence models and active learning.
3. WHAT IS NAMED ENTITY RECOGNITION?
Sundar Pichai is the CEO of Google.
Person: Sundar Pichai
Position: CEO
Organisation: Google
4. WHAT CAN NER LOOK LIKE IN DOMAIN-SPECIFIC DATA?
Hospital Records…
“You will need to have your uterine bleeding evaluated. This continued
agitation may be caused by intra-parenchymal hemorrhage.”
Symptoms: uterine bleeding, continued agitation
Diagnosis: intra-parenchymal hemorrhage
5. WHAT CAN NER LOOK LIKE IN DOMAIN-SPECIFIC DATA?
Product Description…
“Pedigree is a complete and balanced food for dogs. Pedigree is rich in
proteins and nutrition.”
Brand: Pedigree Type: food for dogs
Nutrients: proteins
6. What will this session cover?
• Motivation for this problem.
• Why off-the-shelf solutions didn’t work for us.
• An approach to entity extraction from product-title-like text in the absence of labelled data.
• Comparison with other models.
• And takeaways.
7. PROBLEM STATEMENT
Product titles often have references to attributes which play a crucial role in driving the use-cases we will describe later.
Therefore, there is a need to extract these attributes from the product titles.
Hence ADAM!!!
8. What will ADAM Do?
resolute black vodka 180ml
resolute[B-b] black[I-b] vodka[B-c] 180ml
(Tagged product title)
A.D.A.M.
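The tagged title above can be written out in standard BIO form (B- begins an entity, I- continues it, O is outside). A minimal sketch with illustrative tag names, not necessarily ADAM's actual label set:

```python
# BIO representation of the tagged title above; tag names are illustrative,
# not necessarily ADAM's label set.
tagged_title = [
    ("resolute", "B-Brand"),
    ("black", "I-Brand"),
    ("vodka", "B-Category"),
    ("180ml", "O"),
]

def extract(tagged, entity):
    """Collect the surface form of one entity type from a BIO-tagged title."""
    return " ".join(tok for tok, tag in tagged if tag.endswith(entity))

brand = extract(tagged_title, "Brand")        # "resolute black"
category = extract(tagged_title, "Category")  # "vodka"
```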
12. EVOLVING KNOWLEDGE GRAPH
• A graph with entities (individual products or their attributes) as nodes, and edges describing the relationships between them.
• Given a seed graph, the idea is to evolve it from the data-set.
13. Challenges
• Domain-specific data (product titles in our case).
• Zero ground truth and no training data available whatsoever.
• Multiple sources of data, hence high variance:
• KESHKANTI HAIR CLEASNER SILK & SHINE 8 ML [1*960 PC] …….. a well-described stock-item
• N Deo Whitening Talc Touch100gm ……….. a not-so-well-described stock-item
• B. M. W. 2L ………… could have been a good stock-item description
• Short representations and extremely noisy:
• A H F SHAMPOO400mlM220 ………. very hard for a machine to understand
• HW Sandrop Hert 5 Ltr.
14. Some relatable previous works…
• Traditional Algorithms:
• Information extraction using a CRF with hand-crafted features.
(Citation: Ajinkya More, Attribute Extraction from Product Titles in eCommerce)
• Information extraction using weak labelling with the help of a knowledge base, on Twitter data.
(Citation: Alan Ritter, Sam Clark, Oren Etzioni, Named Entity Recognition in Tweets)
• Deep Learning Models:
• Lample et al., Neural Architectures for Named Entity Recognition
• Xuezhe Ma and Edward Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
15. Some relatable previous works…
• Off-the-shelf Tools:
• NLTK and Stanford NER
• Google’s train-and-play deep learning models
• CRF-suite
16. Why didn’t they work?
• These approaches leveraged either a large knowledge base or a large amount of labelled data, both of which are generally not available for industry applications.
• These models were trained on rather clean data-sets with much smaller variance than what we needed to deal with.
• Some of these approaches used hand-crafted features, which couldn’t scale on a data-set produced by millions of sources.
• Most pretrained tools like NLTK are trained on natural-language data-sets, which is simply not the case with product titles.
• The attributes that they can predict are not relevant to us.
17. Off-the-Shelf Tools’ Limitations
• They were trained on natural languages (i.e. languages with proper grammatical structure). Hence they work only on those types of sentences:
Barack/NNP Obama/NNP is/VBZ the/DT next/JJ president/NN
Person -> Barack Obama, Org. -> president
Nestle/NNP Maggi/NNP Noodles/NNP 100gm/CD
Person -> Nestle, Person -> Maggi Noodles
18. Hand-crafted Feature Limitations
• Worked well on the Walmart data-set.
• Was a disaster when tried on our data-set (because of high variance).
19. CRF Results on Our Data-set
CRF trained on hand-crafted features, using labelled data crawled from various websites.
• shri hanuman blended mustard oil 15kg jar
• tata tea elaichi 250gm
• relive fruity jelly war jar 50gm
It did well on popular products:
• maggi tomato ketchup 200g
• veet hair removal cream 100g
20. Why ADAM is better.
• Leverages weak labelled data.
• Leverages active learning.
• Immune to noise.
• Does not require any hand-crafted features as input.
22. ARCHITECTURE
The 3 Main Components of ADAM…
• Weak Label Generation
• State of the Art Sequence Tagging Model
• Active Learning approach
23. Weak Label Generation
• How we used an existing knowledge graph.
• How we improvised using information from other sources like the Amazon catalogue.
• In addition, how we leveraged the structure of the data-set (stock-item and stock-group).
26. Weak Labelling Algorithm
• We used a complicated rule-based string-matching algorithm, which annotates the different tokens present in a stock-item using our knowledge base.
• Then we use some constraints, once again a bunch of rules, to pick the well-annotated stock-items.
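As a rough illustration of the two steps above, here is a minimal sketch in Python; the knowledge base, the token-level matching and the coverage constraint are all simplified assumptions, not ADAM's actual rules:

```python
# A minimal sketch of rule-based weak labelling, assuming a hypothetical
# knowledge base of brand and category surface forms.
KNOWLEDGE_BASE = {
    "brand": {"resolute", "maggi", "veet"},
    "category": {"vodka", "ketchup", "cream"},
}

def weak_label(title):
    """Annotate each token with an entity type from the knowledge base, or 'O'."""
    labels = []
    for token in title.lower().split():
        tag = "O"
        for entity_type, surface_forms in KNOWLEDGE_BASE.items():
            if token in surface_forms:
                tag = entity_type
                break
        labels.append((token, tag))
    return labels

def passes_quality_check(labels, min_coverage=0.5):
    """Constraint: keep only stock-items where enough tokens received a label."""
    tagged = sum(1 for _, tag in labels if tag != "O")
    return tagged / len(labels) >= min_coverage

labels = weak_label("resolute black vodka 180ml")
# Only sufficiently covered annotations would enter the seed data-set.
seed_ok = passes_quality_check(labels)
```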
28. Seed Data-set Count
• We passed around 0.7 million stock-items through this process.
• Out of those, only 8 thousand passed the quality check.
• And that became our seed data-set.
31. WORD EMBEDDING LAYER
• Why couldn’t we use an existing embedding space or a pretrained set of vectors for this?
• How did we create our own? What data did we use?
• Used “skip-gram” with “hierarchical softmax” optimisation (Mikolov et al.).
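The skip-gram objective predicts context words from a centre word; in practice one would train it with hierarchical softmax via an existing implementation (e.g. gensim's Word2Vec with sg=1, hs=1). A minimal sketch of the (centre, context) pair generation at its core, with an illustrative window size:

```python
# Skip-gram training-pair generation: the (centre, context) pairs the model
# learns from. Window size and example corpus are illustrative only.
def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) pairs within a symmetric window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("maggi tomato ketchup 200g".split(), window=1)
```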
32. SKIP-GRAM ALGORITHM
Data-set we used.
• Stock-items provided by Tally’s
Product
• Product titles from Amazon’s
Catalogue and GS1 data
• Product titles crawled from
various websites.
• A total of around 13 million titles were used.
34. BI-LSTM LAYER
Sequential Training Model
• A word-level encoder which leverages sequential information to encode the tokens of the sentence.
• Why is another word-level encoder used? What more information does it encode?
35. CONDITIONAL RANDOM FIELDS
• Like other layers, this also uses context from neighbouring tokens and labels, but:
• Bi-LSTM only leverages input context.
• CRF is the only layer that leverages output-label context.
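The point about output-label context can be made concrete with Viterbi decoding, where the score of each label depends on the previous label through a transition term. A minimal sketch with toy labels and scores, not ADAM's trained parameters:

```python
# Viterbi decoding over toy emission and transition scores: the CRF layer's
# output-label context enters through the transition term.
def viterbi(emissions, transitions, labels):
    """Find the best label sequence given emission and transition scores."""
    # best[i][y] = best score of any path ending at position i with label y
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for i in range(1, len(emissions)):
        best.append({})
        back.append({})
        for y in labels:
            prev, score = max(
                ((p, best[i - 1][p] + transitions[(p, y)]) for p in labels),
                key=lambda t: t[1],
            )
            best[i][y] = score + emissions[i][y]
            back[i - 1][y] = prev
    # Backtrack from the best final label.
    y = max(best[-1], key=best[-1].get)
    path = [y]
    for b in reversed(back):
        y = b[y]
        path.append(y)
    return list(reversed(path))

labels = ["Brand", "Other"]
emissions = [{"Brand": 2.0, "Other": 1.0}, {"Brand": 1.5, "Other": 1.0}]
# A strong penalty on Brand -> Brand flips the second token to "Other",
# even though its own emission score prefers "Brand".
transitions = {("Brand", "Brand"): -5.0, ("Brand", "Other"): 0.0,
               ("Other", "Brand"): 0.0, ("Other", "Other"): 0.0}
path = viterbi(emissions, transitions, labels)  # ["Brand", "Other"]
```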
37. SOME RESULTS WITH BI-LSTM AS FINAL LAYER
• brit cow ghee 1ltr
• patanjali kesh kanti natural 200ml
• dabur chawanprash 500gm mrp
• smirnoff green apple triple distilled vodka 750ml
• eveready torch
38. IMPROVED RESULTS BECAUSE OF CRF
• brit cow ghee 1ltr
• patanjali kesh kanti natural 200ml
• dabur chawanprash 500gm mrp
• smirnoff green apple triple distilled vodka 750ml
• eveready torch
41. BASELINE RESULTS: WHY IT WASN’T ENOUGH
On a hold-out set of 1321 data points:
Surface-form match
• Brand: 50.1%
• Category: 44.4%
Complete sequence match: 30.2%
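The two metrics above can be sketched as follows, computed over (predicted, gold) pairs of token-label sequences; the label names are illustrative:

```python
# Per-attribute surface-form match and complete-sequence match, computed over
# (predicted, gold) token-label sequences. Label names are illustrative.
def surface_form_match(preds, golds, attribute):
    """Fraction of titles where the predicted surface form of an attribute
    exactly equals the gold surface form."""
    hits = 0
    for pred, gold in zip(preds, golds):
        p = " ".join(tok for tok, tag in pred if tag == attribute)
        g = " ".join(tok for tok, tag in gold if tag == attribute)
        hits += p == g
    return hits / len(golds)

def complete_sequence_match(preds, golds):
    """Fraction of titles where every token's label matches."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

preds = [[("maggi", "Brand"), ("ketchup", "Other")],
         [("veet", "Brand"), ("cream", "Category")]]
golds = [[("maggi", "Brand"), ("ketchup", "Category")],
         [("veet", "Brand"), ("cream", "Category")]]
brand_acc = surface_form_match(preds, golds, "Brand")  # 1.0
seq_acc = complete_sequence_match(preds, golds)        # 0.5
```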
42. ACTIVE LEARNING
• The first two parts give us a good baseline model. But why isn’t it good enough?
• One limitation of our automated weak-label generation process is that it is constrained by the completeness and quality of the knowledge base.
• So we need to get some data-points manually labelled.
• Hence, the aim is to generate samples consciously, leading to maximal improvement in the model with minimal labelling effort.
43. WHAT WE DID…
• Extrinsic sampling (a diversity-based sampling technique).
• Why didn’t uncertainty-based sampling work?
• Manually labelled those samples.
• Retrained the model on the augmented data-points using SGD and a smaller number of epochs.
• Tested the model on the hold-out set. If the model improved, repeat until you reach maturity.
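A diversity-based sampler can be sketched as a greedy k-centre selection over embedding vectors, each pick being the point farthest from everything already selected; the slides describe extrinsic sampling only at this level, so the details below are assumptions:

```python
# Greedy k-centre selection as a sketch of diversity-based sampling over
# embedding vectors; a simplified assumption, not ADAM's exact procedure.
def diversity_sample(vectors, k):
    """Greedily pick k points, each farthest from the already-selected set."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    selected = [0]  # start from the first point
    while len(selected) < k:
        # For each candidate, distance to its nearest selected point.
        best_i, best_d = None, -1.0
        for i, v in enumerate(vectors):
            if i in selected:
                continue
            d = min(dist(v, vectors[j]) for j in selected)
            if d > best_d:
                best_i, best_d = i, d
        selected.append(best_i)
    return selected

vectors = [(0, 0), (0.1, 0), (5, 5), (10, 0)]
picked = diversity_sample(vectors, 3)  # skips the near-duplicate (0.1, 0)
```

Note how the near-duplicate point is picked last: unlike uncertainty sampling, this spreads the labelling budget across the variance of the data-set.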
49. SOME OPTIMISTIC FINAL RESULTS
• n relive fruity jelly box 70ml
• zee citric acid 20g
• shri hanuman blended mustard oil 15kg jar
• shaving cream good morning 70gm
• Sandisc flash harddrive 100TB model TTB200
50. SOME NOT SO OPTIMISTIC RESULTS
• hawkins induction ltr heavy base pr ih
• mixed pickle 200g freshy 36pcs
• nutrela m oil 5ltr
51. METRICS
All the results below are generated on a hold-out set, i.e. a sample set whose entities were never part of the training data.
• Baseline accuracy:
• Iterative accuracy: Iter 0, Iter 1, Iter 2, …, Iter n
• Precision and recall: Iter 0, Iter 1, Iter 2, …, Iter n
54. COMPARISON WITH CRF-SUITE
• CRF-suite was very positionally biased and couldn’t generalise on relative positioning.
• The results on the hold-out set (a data-set with entities which were not present during training) were poor with CRF-suite, while switching to ADAM was a great improvement.
• Scaling was impossible with CRF-suite.
• <Will update some comparison metrics shortly>
55. OUR CONTRIBUTION
• We propose a novel entity extraction model for domain-specific data – short
and noisy – and in the absence of pre-labelled data.
• Our model builds upon a state-of-the-art model based on deep learning and
CRF and further leverages weak labelling and active learning techniques.
• We propose a novel extrinsic sampling technique for active learning, which performs better than uncertainty sampling for this task.
56. TAKE-AWAYS:
• How an industry-grade information extraction model can be built for domain-specific data.
• How to tackle the noise problem in the case of textual data.
• Why a deep NN model plays an important role in generalisation.
• Why active learning is a really important concept for dealing with the problem of zero labelled data.
57. FUTURE SCOPE OF ADAM
• Higher order attribute extraction.
• Good Night Advance Mat 12 pack.
• Britannia Goodday cashew and nuts cookies 50 gm.
• Relationship Extraction
58. CONCLUSION
Deep learning is surely making a mark in the field of NLP, but its industrialisation is still an open problem, mostly because the quality of textual data-sets is not very apt for the models to learn from.
Active learning is an interesting concept to tackle the above situation. Not only in the field of NLP: the same concept can be generalised to other domains of machine learning and AI as well.
59. ACKNOWLEDGMENTS
• Mikolov et al., word2vec
• https://arxiv.org/pdf/1707.05928v2.pdf for active learning.
• My teammates: Abishek Ahluwalia (Data Scientist II), Deepak Sharma (Lead Data Scientist), Ashish Anand Kulkarni (Director, Data Science)
1) Why did we have to solve the NER problem again even though it has been solved in many different ways? What are the real-life applications of ADAM? Why does its existence matter, not only at Clustr but for many other businesses too?
2) Ditto
3) Will explain through and through why we did what we did, and why and how we achieved our end goal.
4) Will show some results that justify our approach.
5) How this approach can inspire others in the ML domain.
1) We at Clustr deal with product titles, like you see on any e-commerce website.
2) <read through>
A good catalogue is defined by its coverage and its advanced search-and-select ability.
Attributes of those product titles can be a great filter for complicated searches.
ADAM can help automate them.
Market penetration, localisation and trend flow can all be automated with the help of ADAM.
Obviously these ontologies need extracted information to fill in nodes and edges.
Hence, if we can automate the extraction, we can build an ontology that evolves automatically.
1) a) Walmart published a paper which showed attribute extraction using a CRF, with hand-crafted features which I will describe later.
1) b) There have been a few papers where weakly labelled data is generated automatically using a healthy knowledge base and a strict string-matching algorithm.
2) A few good off-the-shelf tools, like NLTK, Google’s deep-learning models and CRF-suite, are good with natural-language data but not with ours.
<Read it through.>
Refer to the OpenTag paper and how they used a very specific domain to test their algorithm.
Since the data-sets we have are produced by millions of users, there is no specific way of writing or set of rules that they follow. Hence no hand-crafted features.
We have a special way to leverage a small knowledge base and the structure of the data to generate labelled data (seed data, to be exact).
We leverage the concept of active learning, whose aim is to get by with the minimum amount of manual labelling for training purposes. Details follow later.
Why this approach can adapt to noise (both because of the model and because of active learning).
No need to generate hand-crafted features.
Put only the picture on the slide and basically focus on the 3 components.
Describe why stock-groups are important.
Use the term resolution. How we use our knowledge base to resolve, and talk about the constraints.
Tell them the numbers to give an idea of the constraints.
Explain the whole architecture.
Introduce the data that we use, how we removed the digits, and demonstrate the embedding space (Mikolov et al.; introduce citations).
It is an advanced form of recurrent neural network.
<Describe the diagram>
Talk about its important enhancements, e.g. the forget gate.
Talk about an example of differentiating “apple” as a category when “gm” is present and as a brand when something else is.
Why was this needed?
Since it takes the representation into account, it can differentiate very well between ambiguous words, which is absent in word embeddings.
Bi-LSTM would have only used the sequential context of the symbols to make the decision, while CRF uses the context of sequential states.
Explain the concept of transition and emission energies you are trying to minimise.