SauravKumar-ContentFiltering-InternDay2015

presentation
Internship Report
Saurav Kumar, Software Engineering Intern
July 15, 2015
LinkedIn Bangalore

problem
Emails like these...

problem
There was no existing tool to identify which specific part
of a text causes it to be classified as spam.
My task was to build a tool that solves this problem.

tool 1: spam classification tool

spam classification tool
∙ Given a content source, title and body of content, this tool tabulates the scores of each classifier
∙ Query request is sent to BAM, and the response summary is presented in a table

tool 2: spam token classification tool

spam token classification tool
∙ This tool computes the contribution of each word (token) towards the overall score
∙ The UI allows you to set the number of tokens you want to examine

method
∙ Suppose we have content with tokens t1, t2, ..., tn
∙ We need to find the effect of each token on the total score
∙ Let the score of this content be S0
∙ Modify the ith token to make it a non-word, and obtain the score Si
∙ The difference wi = (S0 − Si) signifies the effect of the ith token on the score
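
The steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: `score_fn` is a hypothetical stand-in for a call to the classifier (the real tool queries BAM), the whitespace tokenization and the `MASK` replacement string are assumptions, and `toy_score` exists only to make the example runnable.

```python
from typing import Callable, List

MASK = "xqzxqz"  # assumed non-word; any string unknown to the classifier would do


def token_contributions(tokens: List[str],
                        score_fn: Callable[[str], float]) -> List[float]:
    """Return w_i = S0 - S_i for every token."""
    s0 = score_fn(" ".join(tokens))  # S0: score of the unmodified content
    contributions = []
    for i in range(len(tokens)):
        # Replace only the i-th token with a non-word and rescore.
        modified = tokens[:i] + [MASK] + tokens[i + 1:]
        s_i = score_fn(" ".join(modified))
        contributions.append(s0 - s_i)
    return contributions


def toy_score(text: str) -> float:
    """Hypothetical scorer: a fixed weight per known word, summed."""
    weights = {"free": 0.5, "winner": 0.25, "meeting": -0.25}
    return sum(weights.get(word.lower(), 0.0) for word in text.split())


tokens = "free prize for the winner of the meeting".split()
w = token_contributions(tokens, toy_score)
for token, w_i in zip(tokens, w):
    print(f"{token:>8}  {w_i:+.2f}")
```

With this sign convention a positive wi means the token pushes the score up, and a negative wi means it pulls the score down.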

method
Coloring
∙ Collect the score for each token and for each classifier
∙ Normalize the scores for each classifier
∙ Color the top k1% good words with green and top k2% bad words with red, with intensity proportional to their scores.
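
One possible reading of the coloring steps, for a single classifier, is sketched below. The slides do not state the normalization scheme or how the percentages are applied, so both are assumptions here: scores are divided by the largest absolute score, "good" tokens are those with negative scores, "bad" tokens are those with positive scores, and k1 and k2 are taken as percentages of the good and bad tokens respectively. All function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def normalize(scores: List[float]) -> List[float]:
    """Scale one classifier's token scores into [-1, 1] (assumed scheme)."""
    peak = max((abs(s) for s in scores), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in scores]
    return [s / peak for s in scores]


def pick_top(indices: List[int], norm: List[float], percent: float) -> List[int]:
    """Keep the top `percent` of the given token indices, ranked by |score|."""
    count = math.ceil(len(indices) * percent / 100.0)
    ranked = sorted(indices, key=lambda i: abs(norm[i]), reverse=True)
    return ranked[:count]


def color_tokens(scores: List[float], k1: float, k2: float
                 ) -> List[Optional[Tuple[str, float]]]:
    """Return (color, intensity) per token, or None for uncolored tokens."""
    norm = normalize(scores)
    good = [i for i, s in enumerate(norm) if s < 0]  # tokens lowering the score
    bad = [i for i, s in enumerate(norm) if s > 0]   # tokens raising the score
    colors: List[Optional[Tuple[str, float]]] = [None] * len(scores)
    for i in pick_top(good, norm, k1):
        colors[i] = ("green", abs(norm[i]))
    for i in pick_top(bad, norm, k2):
        colors[i] = ("red", abs(norm[i]))
    return colors


scores = [0.5, 0.0, 0.25, -0.25, -0.125]
colors = color_tokens(scores, k1=50, k2=50)
print(colors)
```

The intensity value in [0, 1] could then be mapped to an opacity or shade in the UI.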

Escalation a few weeks ago

Benefits
∙ Saves time and effort in finding specific spam text
∙ More insights with token-wise scores
∙ Can be used to test performance of a classifier
∙ Permalink of result is created, so links can be shared
∙ Content URN not required, so any text can be tested
∙ Method used is independent of classifier’s model

Assumptions
∙ Scoring from the classifier should be incremental (graded), and not 0-1
∙ The same classifiers should run for all the requests; a new end-point in BAM ensures this

Limitations
∙ For classifiers whose total score is either 0 or 1, this tool cannot extract any meaningful information
∙ For large content, a significant amount of time is required
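
Both limitations can be illustrated with a small sketch. Everything here is hypothetical: `binary_score` is a made-up 0/1 classifier, and `contributions` is a plain re-statement of the difference S0 − Si under the assumption that tokens are split on whitespace. With a 0/1 score, every difference is zero unless removing a single token happens to flip the decision, and a text of n tokens needs n + 1 scoring calls.

```python
from typing import Callable, List


def contributions(tokens: List[str],
                  score_fn: Callable[[str], float]) -> List[float]:
    """Compute S0 - S_i for each token by replacing it with a non-word."""
    s0 = score_fn(" ".join(tokens))
    out = []
    for i in range(len(tokens)):
        modified = tokens[:i] + ["xqzxqz"] + tokens[i + 1:]
        out.append(s0 - score_fn(" ".join(modified)))
    return out


calls = 0  # counts how many times the classifier is queried


def binary_score(text: str) -> float:
    """Hypothetical 0/1 classifier: spam if at least two trigger words appear."""
    global calls
    calls += 1
    triggers = {"free", "winner", "prize"}
    hits = sum(1 for word in text.split() if word in triggers)
    return 1.0 if hits >= 2 else 0.0


tokens = "free prize for the winner".split()
w = contributions(tokens, binary_score)
print(w)      # every entry is 0.0: no single token changes the 0/1 outcome
print(calls)  # n + 1 classifier calls for n tokens
```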

Figure: Measure of response time vs number of words

technologies used
∙ Play Framework
∙ D2 (Dynamic Discovery) for making Rest.li calls
∙ ParSeq for making parallel requests
∙ Stork for email
∙ Couchbase to store responses
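
Because each token needs its own scoring request, issuing the requests in parallel matters. The sketch below shows the idea in Python with a thread pool; it is an analogy only and does not use or imitate the ParSeq API (ParSeq is a Java library). `score_fn`, `toy_score`, the whitespace tokenization and the `max_workers` value are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parallel_contributions(tokens: List[str],
                           score_fn: Callable[[str], float],
                           max_workers: int = 8) -> List[float]:
    """Score the original text and all single-token variants concurrently."""
    variants = [" ".join(tokens)]  # index 0 is the unmodified content
    for i in range(len(tokens)):
        variants.append(" ".join(tokens[:i] + ["xqzxqz"] + tokens[i + 1:]))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        scores = list(pool.map(score_fn, variants))
    s0 = scores[0]
    return [s0 - s_i for s_i in scores[1:]]


def toy_score(text: str) -> float:
    """Hypothetical scorer standing in for a remote classifier call."""
    weights = {"free": 0.5, "meeting": -0.25}
    return sum(weights.get(word, 0.0) for word in text.split())


result = parallel_contributions("free lunch meeting".split(), toy_score)
print(result)
```

A thread pool suits this case because the work is dominated by waiting on network responses rather than by computation.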

challenges
∙ Dealing with R2 (Request/Response) timeouts
∙ Running an offline job after the client may have closed the connection

Thank You
Credits: Beamer(mtheme), ShareLaTeX
