1) The document discusses building a large-scale, production-ready prediction system in Python to classify support tickets.
2) It outlines the challenges, including dealing with noisy, imbalanced data and scaling to support millions of users.
3) The proposed solution involves natural language processing, model validation and selection, and making the system scalable through techniques like algorithm selection and building a distributed architecture.
11. Deployment related challenges
• Large user base (.5 million users)
• > 1,000 simultaneous requests
• Designed for global access, inside and outside the organization network
• Extremely short time to go live
14. Natural Language Processing
• Processes the input text
Data Pre-processing
• Handles all the data-related activities
Model Building
• Builds the machine learning model
• Learns from input data as well as from system use (continuous learning)
Model Database
• Holds the trained models as well as other needed data, such as logs
Prediction
• Predicts the classes for the given input data
Key Blocks of Solution
18. • Training data has a lot of words which do not add value for the prediction
• Examples include "the", "is", "or", etc.
• Call the function below, where text is the string from which the stop words need to be removed
• myStopWordList is the list of stop words

def removeStopWord(text):
    text = ' '.join([word for word in text.split() if word not in myStopWordList])
    return text
Stop word removal
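A minimal runnable version of the snippet above; the stop-word list here is purely illustrative (the deck's actual myStopWordList is not shown):

```python
# Illustrative stop-word list; stands in for the deck's myStopWordList.
myStopWordList = {'the', 'is', 'or', 'a', 'an'}

def removeStopWord(text):
    # Keep only the words that are not in the stop-word list.
    text = ' '.join([word for word in text.split() if word not in myStopWordList])
    return text

print(removeStopWord('the printer is offline'))  # -> 'printer offline'
```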
19. Tried nltk's Named Entity Recognition
There were many issues of entities being tagged wrongly and, in many cases, not being tagged at all
We needed a simple and foolproof way of tagging
For removing names we got a list of possible names from our internal systems
We followed a similar approach to stop word removal for this
Removing names
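A hedged sketch of the same pattern applied to names; the knownNames set below is hypothetical, standing in for the list pulled from internal systems:

```python
# Hypothetical name list; in the deck this comes from internal systems.
knownNames = {'alice', 'bob', 'carol'}

def removeNames(text):
    # Same pattern as stop-word removal: drop any token found in the name list.
    # Lower-casing makes the match case-insensitive.
    return ' '.join([word for word in text.split() if word.lower() not in knownNames])

print(removeNames('Bob reported the outage'))  # -> 'reported the outage'
```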
21. • Select features according to the k highest scores
• The output of the TF-IDF vectorizer can be fed to this to reduce the number of features and retain only the ones with the highest scores
• y_train is a list of labels in the same order as the X_train data
– The first element in y_train is the label corresponding to the first sentence in X_train
• Code snippet for this:
• ch2 = SelectKBest(chi2, k='all')
– In this case we have used the chi-squared test
– We have also opted to select all the features
• X_train = ch2.fit_transform(X_train, y_train)
• Using the chi-squared test ensures retaining only the most relevant features, where the most relevant features are those with the highest correlation with the labels. This test will weed out non-correlated features.
Handling unstructured text – Intelligent
Feature reduction
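The selection step above can be sketched end to end on a toy corpus; the ticket texts, labels, and k=3 below are illustrative, not from the deck:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels purely for illustration.
X_text = ['password reset please', 'printer jam again',
          'reset my password', 'printer out of paper']
y_train = ['access', 'hardware', 'access', 'hardware']

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_text)

# Keep only the 3 features most correlated with the labels
# (the deck uses k='all', which keeps everything).
ch2 = SelectKBest(chi2, k=3)
X_reduced = ch2.fit_transform(X_train, y_train)
print(X_train.shape[1], '->', X_reduced.shape[1])
```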
22. • X_train
– Holds the list of input data to be used for training
– Each item of the list is one sentence in the training corpus
• Assuming the text has gone through the required pre-processing for cleaning, we can now "convert a collection of raw documents to a matrix of TF-IDF features"
• Code snippet for this:
• vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english') **
• X_train = vectorizer.fit_transform(X_train)
• Note: ** There are many parameters available for this call. We can discuss the selected parameters.
Handling unstructured text - Vectorization
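A self-contained sketch of the vectorization call; the three ticket texts are made up for illustration, while the parameters match the slide:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the pre-processed training sentences.
X_train = ['cannot login to vpn', 'vpn connection drops', 'email not syncing']

# Same parameters as on the slide: sublinear_tf applies 1 + log(tf),
# max_df=0.5 drops terms appearing in more than half the documents,
# and the built-in English stop-word list is applied.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(X_train)
print(X_train.shape)  # (documents, retained features)
```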
24. – The given labelled data set has to be split between train and test sets
– The train set will be used for training the classifier
– The test set will be used for testing the classifier
– The split can be decided based on the available labelled data
– We went with 70:30, where 70% of the labelled data was used for training
Selecting training data for the classifier
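The 70:30 split described above can be done with sklearn's train_test_split; the toy data and random_state below are illustrative:

```python
from sklearn.model_selection import train_test_split

# 100 toy items standing in for the labelled corpus.
X = ['ticket text %d' % i for i in range(100)]
y = ['classA' if i % 2 == 0 else 'classB' for i in range(100)]

# 70:30 split as in the deck; random_state fixes the shuffle for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # 70 30
```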
25. • Sklearn has a huge repository of algorithms
• However, not all of them are relevant
• Criteria for selecting:
1. Is a supervised machine learning algorithm
2. Handles text classification
3. Can handle the size of the data
• Many algorithms satisfied criteria 1 and 2 above; however, when used for training they never completed the cycle
Selecting the classifier
26. • Sklearn has a standardized interface for training a classifier
• The difference between two classifiers is the parameters available for training
• Code snippet for classifier training:
– clf = XXXX(param1=val1, param2=val2, …)
– clf.fit(X_train, y_train)
– Where XXXX above refers to the relevant classifier
• It is advisable to structure the code so that new classifiers can easily be prototyped and tested
Training the classifier
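A minimal sketch of the training interface; LinearSVC is an assumed example classifier standing in for XXXX (the deck does not name which algorithms were shortlisted), and the toy data is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training sentences and labels for illustration only.
X_text = ['reset password', 'forgot password', 'printer jammed', 'printer offline']
y_train = ['access', 'access', 'hardware', 'hardware']

X_train = TfidfVectorizer().fit_transform(X_text)

# LinearSVC is one classifier meeting the criteria; any sklearn classifier
# with the same fit/predict interface can be swapped in for prototyping.
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:1]))
```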
27. • Once the classifier has been trained it can be used for predicting
• X_test – Holds the held-out list to be used for testing
• Code flow for the same:
x_test = vectorizer.transform(X_test)
x_test = ch2.transform(x_test)
pred = clf.predict(x_test)
• X_test is passed through the same pipeline – the vectorizer and the k-best transform – which was previously fitted with the training data
• pred – Holds the list of predicted labels
Predicting using the classifier
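The full transform-then-predict flow can be sketched with toy data; the corpus, labels, and the LinearSVC choice are all illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC  # illustrative classifier choice

train_text = ['reset my password', 'password expired',
              'printer jam', 'printer no toner']
y_train = ['access', 'access', 'hardware', 'hardware']
X_test = ['password locked', 'printer broken']

# Fit the pipeline on the training data.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
ch2 = SelectKBest(chi2, k='all')
X_train = ch2.fit_transform(X_train, y_train)
clf = LinearSVC().fit(X_train, y_train)

# Test data goes through the *fitted* vectorizer and selector:
# transform, never fit_transform, so no information leaks from the test set.
x_test = vectorizer.transform(X_test)
x_test = ch2.transform(x_test)
pred = clf.predict(x_test)
print(pred)
```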
29. • Once we have the predictions we need to evaluate whether the classifier is good enough
• For this we have to see if the precision, recall and F-score are good enough
• We can use the following code snippet to check this score:
score = metrics.f1_score(y_test, pred, average='weighted')
print("f1-score: %0.3f" % score)
This is a score between 0 and 1.
The higher the score, the better.
For example – f1-score: 0.801
The threshold we set for accepting this is based on our understanding of the domain
Model validation
30. • To get a more detailed understanding of how our classifier is performing we can use:
• print(metrics.classification_report(y_test, pred, target_names=categories))
• The above will give a class-wise break-up of precision, recall, F-score and support (the number of cases available for that class)
• It will also give these scores for the classifier as a whole
• Sample report:

             precision  recall  f1-score  support
Class1            0.99    0.97      0.98     4558
Class2            0.56    0.74      0.63       53
avg / total       0.81    0.81      0.80    19022

• From the above report we can see that the classifier as a whole is at an 80% F-score. Class1 has very good accuracy; Class2 is performing poorly.
• Hence, if there is a need to improve accuracy, a dedicated effort can be made to improve Class2's score.
Model validation (contd )
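A runnable sketch of the report call on toy labels; the counts below are made up for illustration, not the deck's actual 4558/53 split:

```python
from sklearn import metrics

# Toy ground truth and predictions purely to illustrate the report layout.
y_test = ['Class1'] * 8 + ['Class2'] * 4
pred = ['Class1'] * 7 + ['Class2'] + ['Class1'] + ['Class2'] * 3

# classification_report gives per-class precision, recall, f1-score and
# support, plus averaged rows for the classifier as a whole.
report = metrics.classification_report(y_test, pred,
                                       target_names=['Class1', 'Class2'])
print(report)
```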
31. • Based on benchmarking of the different algorithms, the best performing algorithm can be selected
• Parameters for selection will vary from domain to domain
• Key parameters which could be considered:
– F-score
– Precision
– Recall
– Model building time
– Prediction time
– Amount of training data
Algorithm selection
32. • Selecting the final model is an iterative process
• Tuning will be done based on:
– Algorithm selected
– Algorithm parameters
– Training data
– Training/testing data ratios
• Once satisfactory performance has been reached, the model will be built and can be used
Train / Re-Train loop
35. • Sizing the number of instances
– Benchmark the maximum capacity per instance – X
– Benchmark the maximum number of simultaneous requests needed – Y
– Calculate the number of instances:
• (Y / (X – 0.4X)) + 2
– Use each instance at only 60% of capacity
– Factor in 2 additional instances
– Size requests from within and outside the organization
– Size requests based on region
– Separate region-level farms
– Separate farms for users from within and outside the company
Building a scalable solution
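The sizing formula above can be expressed as a small helper; rounding up with ceil is an assumption on our part, since the slide only gives the raw formula:

```python
import math

def instances_needed(X, Y):
    # X: benchmarked maximum capacity per instance.
    # Y: benchmarked peak simultaneous requests.
    # Each instance runs at only 60% of capacity (X - 0.4X),
    # plus 2 spare instances; ceil rounds up to whole instances.
    return math.ceil(Y / (X - 0.4 * X)) + 2

print(instances_needed(100, 1000))  # 1000 / 60 -> 17 instances, + 2 spares = 19
```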
37. • ~50% reduction in the reassignment index
• Significant savings in effort due to this (> 100 person-months saved) within just 3 months of release
• First version of our solution released to production in under 2 months
Our results
39. Python
• Has excellent libraries for handling machine learning problems
• Can be used in live production environments
• We were able to achieve the needed scalability and performance using Python
• The language itself is easy to learn, and we can write maintainable code
Conclusions