presentation
Internship Report
Saurav Kumar, Software Engineering Intern
July 15, 2015
LinkedIn Bangalore
the problem
problem
Emails like these...
problem
There was no existing way to find which specific part
of a text causes it to be classified as spam.
My task was to build a tool to solve this problem.
tool 1: spam classification tool
spam classification tool
∙ Given a content source, title and body of content, this
tool tabulates the scores of each classifier
∙ A query request is sent to BAM, and the response
summary is presented in a table
tool 2: spam token classification tool
spam token classification tool
∙ Given a content source, title and body of content, this
tool computes the contribution of each word (token)
towards the overall score
∙ The UI allows you to set the number of tokens you
want to examine
method
method
∙ Suppose we have content with tokens t1, t2, ..., tn
∙ We need to find the effect of each token on the total
score
∙ Let the score of this content be S0
∙ Modify the ith token to make it a non-word, and
obtain the score Si
∙ The difference wi = (S0 − Si) signifies the effect of the ith
token on the score
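The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name token_contributions, the placeholder non-word "zzqx", and the toy scorer toy_score are all hypothetical, and the real tool obtained its scores from BAM rather than from a local function.

```python
from typing import Callable, List


def token_contributions(
    tokens: List[str],
    score: Callable[[str], float],
    placeholder: str = "zzqx",  # hypothetical non-word used to mask a token
) -> List[float]:
    """Return w_i = S0 - S_i for every token, as described on the slide."""
    s0 = score(" ".join(tokens))  # score of the unmodified content
    weights = []
    for i in range(len(tokens)):
        # Replace only the i-th token with a non-word and rescore.
        perturbed = tokens[:i] + [placeholder] + tokens[i + 1:]
        weights.append(s0 - score(" ".join(perturbed)))
    return weights


# Toy stand-in for a classifier (an assumption, not the real scorer).
SPAMMY = {"free": 0.4, "winner": 0.3, "click": 0.2}


def toy_score(text: str) -> float:
    return min(1.0, sum(SPAMMY.get(w.lower(), 0.0) for w in text.split()))


tokens = "Click here for a free prize".split()
weights = token_contributions(tokens, toy_score)
for token, w in zip(tokens, weights):
    print(f"{token:>6}  {w:+.2f}")
```

A positive wi means the score dropped when the token was masked, so the token was pushing the score up. Because the sketch only calls score, it does not depend on how the classifier works internally.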
method
Coloring
∙ Collect the score for each token and for each
classifier
∙ Normalize the scores for each classifier
∙ Color the top k1% of good words green and the top k2% of
bad words red, with intensity proportional to
their scores
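One possible reading of the coloring step, for a single classifier, is sketched below. The slides do not define the details, so several things here are assumptions: "bad" words are taken to be tokens with a positive contribution and "good" words those with a negative one, normalization divides by the largest absolute contribution, and the percentages are applied to the good and bad groups separately. The function name color_tokens is hypothetical.

```python
import math
from typing import List, Optional, Tuple

Color = Optional[Tuple[str, float]]  # (color name, intensity in [0, 1]) or None


def color_tokens(weights: List[float], k_good: float, k_bad: float) -> List[Color]:
    """Assign green/red with an intensity to the top k% good/bad tokens."""
    scale = max((abs(w) for w in weights), default=0.0)
    colors: List[Color] = [None] * len(weights)
    if scale == 0.0:
        return colors  # nothing to highlight

    # Normalize so the strongest token has magnitude 1.
    normalized = [w / scale for w in weights]

    # Rank bad tokens (positive) and good tokens (negative) by strength.
    bad = sorted((i for i, w in enumerate(normalized) if w > 0),
                 key=lambda i: -normalized[i])
    good = sorted((i for i, w in enumerate(normalized) if w < 0),
                  key=lambda i: normalized[i])

    for i in bad[: math.ceil(len(bad) * k_bad / 100)]:
        colors[i] = ("red", normalized[i])
    for i in good[: math.ceil(len(good) * k_good / 100)]:
        colors[i] = ("green", -normalized[i])
    return colors


colors = color_tokens([0.2, 0.0, -0.1, 0.4, -0.05], k_good=50, k_bad=50)
print(colors)
```

With several classifiers, the same function would be applied to each classifier's list of contributions in turn.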
demonstration
spam token classification tool
Escalation a few weeks ago
benefits
spam token classification tool
Benefits
∙ Saves time and effort in finding specific spam text
∙ More insights with token-wise scores
∙ Can be used to test performance of a classifier
∙ A permalink to the result is created, so results can be shared
∙ Content URN not required, so any text can be tested
∙ Method used is independent of the classifier’s model
assumptions and limitations
spam token classification tool
Assumptions
∙ Scoring from the classifier should be incremental, and
not 0-1
∙ The same classifiers should run for all requests: a new
endpoint in BAM ensures this
spam token classification tool
Limitations
∙ For classifiers whose total score is either 0 or 1,
this tool cannot extract any meaningful information
∙ For large content, a significant amount of time is
required
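Both limitations can be seen in a small sketch. Everything in it is assumed for illustration: the 0/1 scorer binary_score, its point table and threshold, and the call counter are hypothetical and stand in for a real classifier request.

```python
from typing import Callable, List


def token_contributions(
    tokens: List[str],
    score: Callable[[str], float],
    placeholder: str = "zzqx",  # hypothetical non-word
) -> List[float]:
    """w_i = S0 - S_i, masking one token at a time."""
    s0 = score(" ".join(tokens))
    return [
        s0 - score(" ".join(tokens[:i] + [placeholder] + tokens[i + 1:]))
        for i in range(len(tokens))
    ]


SPAM_POINTS = {"free": 4, "winner": 3, "click": 2}  # toy values
calls = 0  # counts scoring requests


def binary_score(text: str) -> int:
    """Toy 0/1 classifier: spam (1) if the points reach a fixed threshold."""
    global calls
    calls += 1
    points = sum(SPAM_POINTS.get(w.lower(), 0) for w in text.split())
    return 1 if points >= 4 else 0


tokens = "free winner click now".split()
weights = token_contributions(tokens, binary_score)
print(weights)  # every masked variant is still scored 1
print(calls)    # one request for S0 plus one per token
```

Masking any single token leaves the content above the threshold, so every difference is zero and no token stands out. The counter also shows why large content is slow: n tokens need n + 1 scoring requests.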
spam token classification tool
Figure: Measure of response time vs number of words
technologies used
technologies used
∙ Play Framework
∙ D2 (Dynamic Discovery) for making Rest.li calls
∙ ParSeq for making parallel requests
∙ Stork for email
∙ Couchbase to store responses
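The per-token requests are independent of one another, which is what makes parallel execution useful. The tool itself used ParSeq for this; the sketch below is not ParSeq but a rough Python analogue using the standard library's thread pool, with a hypothetical parallel_contributions helper and a toy scorer in place of a remote call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parallel_contributions(
    tokens: List[str],
    score: Callable[[str], float],
    placeholder: str = "zzqx",  # hypothetical non-word
    max_workers: int = 8,
) -> List[float]:
    """Score the original and all masked variants concurrently."""
    variants = [" ".join(tokens)]  # index 0 is the unmodified content
    variants += [
        " ".join(tokens[:i] + [placeholder] + tokens[i + 1:])
        for i in range(len(tokens))
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        scores = list(pool.map(score, variants))
    s0 = scores[0]
    return [s0 - s for s in scores[1:]]


def toy_score(text: str) -> float:
    """Toy scorer (an assumption): counts occurrences of the word 'spam'."""
    return float(text.split().count("spam"))


weights = parallel_contributions("buy spam now spam".split(), toy_score)
print(weights)
```

Threads suit this case because a real scorer would spend most of its time waiting on the network rather than computing.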
challenges
challenges
∙ Dealing with R2 (Request/Response) timeouts
∙ Running an offline job after the client may have closed
the connection
Questions?
Thank You
Credits: Beamer(mtheme), ShareLaTeX
