Text mining 
Fuzzy document classification 
Using Elasticsearch 
Lev Ozeryansky
Identity Card 
• Merger of Ankor and We!
• Owned by Hilan (publicly traded on the Tel Aviv Stock Exchange)
• Fast-growing IT integration company
• Over 2000 systems installed and maintained 
• Over 1000 leading customers: hi-tech, industry, academia, banks, insurance
• Strong technological team – over 45 engineers, professional services 
and project managers 
• Over 120 employees 
• Four main divisions – Infrastructure, Big Data, Cloud, Cyber
Technology Edge
What is classification 
• Document classification as document categorization. 
• Using classification. 
• Our classification data source. 
• What do we do with it?
• Java programmer. 
• .NET programmer.
Data source
The mathematics 
• Let C be the set of classes
• Let D be the set of documents
• Classification function f: maps each document in D to a class in C
Classification method 
• Cosine similarity 
• Function
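Cosine similarity is the standard measure cos(θ) = (A · B) / (‖A‖ ‖B‖) between two vectors A and B. The formula from the slide itself did not survive, so the following is only a minimal sketch of that standard definition in plain Python. The function name and the choice to return 0.0 for an all-zero vector are assumptions made here, not something stated in the talk.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1, orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

A value close to 1 means the document vector and the class vector point in nearly the same direction, which is what the classification relies on.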
Build document class vector
• Java programmer
  • Java
  • 5
  • Hibernate
• .NET programmer
  • C#
  • 5
  • NHibernate
Let's index the classifiers
• Add weights manually.
• For Java programmer:
  • Java = 0.7
  • 5 = 0.5
  • Hibernate = 0.3
• For .NET programmer:
  • C# = 0.7
  • 5 = 0.5
  • NHibernate = 0.3
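The weight table above could be held as one small document per class before it is indexed. The sketch below only builds JSON-ready dictionaries and does not call Elasticsearch. The field names ("class", "terms", "term", "weight") are hypothetical, since the talk does not show the actual mapping.

```python
import json

# Manually assigned term weights per class, as listed on the slide.
CLASS_WEIGHTS = {
    "Java programmer": {"Java": 0.7, "5": 0.5, "Hibernate": 0.3},
    ".NET programmer": {"C#": 0.7, "5": 0.5, "NHibernate": 0.3},
}


def to_index_documents(class_weights):
    """Turn the weight table into one JSON-ready document per class.

    The field names used here are hypothetical, not the author's mapping.
    """
    docs = []
    for name, weights in class_weights.items():
        terms = [{"term": t, "weight": w} for t, w in weights.items()]
        docs.append({"class": name, "terms": terms})
    return docs


docs = to_index_documents(CLASS_WEIGHTS)
print(json.dumps(docs[0], indent=2))
```

Keeping the weights next to each term means the classification vector for a class can later be read straight from its indexed document.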
DEMO
w-shingling 
• In natural language processing a w-shingling is a set of 
unique "shingles"—contiguous subsequences of tokens in 
a document. (Wikipedia) 
• Tokenization 
• Elasticsearch analyze mechanism
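To make the definition concrete, here is a small local sketch of w-shingling over a token list. The whitespace tokenizer is only a stand-in for Elasticsearch's analysis chain (which offers a shingle token filter of its own); the helper names are made up for this example.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def shingles(tokens, w):
    """Return the set of unique shingles, each a tuple of w contiguous tokens."""
    if w < 1:
        raise ValueError("w must be at least 1")
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


tokens = tokenize("a rose is a rose is a rose")
for shingle in sorted(shingles(tokens, 4)):
    print(" ".join(shingle))
```

With w = 4 the eight tokens yield five windows but only three unique shingles, because the set drops the repeats.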
DEMO
Classification process 
• Tokens array. 
• Classification query. 
• Use a terms query whose terms array is the tokens array
• Two vectors 
• Vector of filtered tokens 
• Classification vector
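The steps above can be sketched as two helpers: one builds the body of a terms query from the tokens array, the other lines up the document's tokens against a class's weighted terms to produce the two vectors. The field name is supplied by the caller and is hypothetical, and the binary "token present or not" encoding of the filtered-token vector is an assumption; the talk does not say how that vector is valued.

```python
def build_terms_query(field, tokens):
    """Body of an Elasticsearch terms query matching any of the tokens."""
    return {"query": {"terms": {field: sorted(set(tokens))}}}


def build_vectors(tokens, class_weights):
    """Return (filtered-token vector, classification vector) for one class.

    Both vectors are ordered by the class's terms, sorted alphabetically.
    Assumption: a token counts as 1.0 if present in the document, else 0.0.
    """
    terms = sorted(class_weights)
    present = set(tokens)
    token_vector = [1.0 if term in present else 0.0 for term in terms]
    class_vector = [class_weights[term] for term in terms]
    return token_vector, class_vector


doc_tokens = ["senior", "java", "developer", "hibernate"]
java_weights = {"java": 0.7, "5": 0.5, "hibernate": 0.3}

print(build_terms_query("terms.term", doc_tokens))
print(build_vectors(doc_tokens, java_weights))
```

Tokens that match no class term ("senior", "developer") are filtered out, so both vectors share the same dimensions and can be compared directly.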
DEMO
Classification process 
• SciPy to calculate distance.
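As a rough illustration of this last step, `scipy.spatial.distance.cosine` returns the cosine distance, that is 1 minus the cosine similarity, so the closest class is the one with the smallest value. The sample vectors, the `nearest_class` helper, and the rule of skipping classes with no matching tokens are assumptions for this sketch. A pure-Python fallback is included only so the example still runs where SciPy is not installed.

```python
try:
    # Cosine distance = 1 - cosine similarity.
    from scipy.spatial.distance import cosine
except ImportError:
    import math

    def cosine(u, v):
        """Fallback with the same meaning as SciPy's cosine distance."""
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        return 1.0 - dot / (norm_u * norm_v)


def nearest_class(token_vectors, class_vectors):
    """Return (best class, distances) using cosine distance per class.

    Classes whose token vector is all zeros are skipped (an assumption),
    because cosine distance is undefined for a zero vector.
    """
    distances = {
        name: float(cosine(token_vectors[name], class_vectors[name]))
        for name in class_vectors
        if any(token_vectors[name])
    }
    best = min(distances, key=distances.get)
    return best, distances


# Term order per class: language, "5", ORM library (weights from the slides).
class_vectors = {
    "Java programmer": [0.7, 0.5, 0.3],
    ".NET programmer": [0.7, 0.5, 0.3],
}
# Hypothetical document mentioning Java, 5 and Hibernate, but no .NET terms.
token_vectors = {
    "Java programmer": [1.0, 1.0, 1.0],
    ".NET programmer": [0.0, 1.0, 0.0],
}

best, distances = nearest_class(token_vectors, class_vectors)
print(best, distances)
```

Here the document matches all three Java terms but only the shared "5" term for .NET, so the Java class ends up with the smaller distance.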
Q&A
