Column Name Classification using Probabilistic
Models
Quinn Tran
Introduction:
User input for column names is not standardized: users can enter almost anything they
want as a column name, which makes user columns hard to categorize, store, and analyze.
The Data Governance team screens data to make sure it is labeled and encrypted. Currently
the team has an analyst manually checking whether column names and column data contain
sensitive information. An example use case is uploading a table without any metadata: the
input is a list of column names with no column data, and the output is each column name
labeled PII or OTHER, along with a category or none.
Information is classified into 17 categories, or buckets, based on meaning, such as
social security number. Column data is easy to check because sensitive data in a specific
format can be classified using regex. Column names, by contrast, are nearly gibberish
because of freeform user input. Because of the high volume of data uploaded, column
classification has to be further automated for storage and data analysis to scale. The
particular goal of this project is to automate the classification of columns by name,
using Natural Language Processing concepts to predict from a column name whether the
column holds sensitive data.
Technologies:
PII (Personally Identifiable Information) Columns: Columns that contain sensitive
information. Given a set of columns, PII columns have to be identified and categorized.
Affinity propagation: An algorithm to cluster words with similar syntax. It measures how
similar (in this case, by edit distance) pairs of column names are, and simultaneously
determines which column names should serve as exemplars (representatives of their specific
cluster). Messages are exchanged between column names until a definite set of exemplars
and corresponding clusters emerges.
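For illustration, here is a minimal sketch of this clustering step, assuming scikit-learn
and a few made-up column names (the post does not show the project's actual code):

import numpy as np
from sklearn.cluster import AffinityPropagation

def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

names = ["social security", "socialsecurity", "ssn", "socials", "userid"]

# Affinity propagation expects similarities, so negate the distances.
similarity = -np.array([[edit_distance(a, b) for b in names] for a in names])

ap = AffinityPropagation(affinity="precomputed").fit(similarity)
exemplars = {label: names[idx] for label, idx in enumerate(ap.cluster_centers_indices_)}
for name, label in zip(names, ap.labels_):
    print(name, "->", exemplars[label])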
LDA (Latent Dirichlet Allocation): An algorithm that groups words based on frequency,
and thus implicitly on meaning. It models topics through expectation maximization with a
Dirichlet distribution. LDA represents documents as a combination of topics, where each
topic is a cluster of related words and each word contributes to a specific topic with a
specific probability.
word2vec:​ Vector representation of a word. This is used to quickly compare words for
similarity.
 
Implementations:
First Approach:
Classifying:
Taking a syntactic approach, training column names and test column names
received from user input were grouped by affinity propagation. The training set is the
set of column names used to find patterns in existing column names and ultimately make
predictions about incoming column names. The test set is the set of column names used to
measure the accuracy of the program's predictions.
The groupings are based on edit distance, in order to group variations of the same
name such as social security, socialsecurity, ssn, and socials. Ideally, clusters would
have exemplars (names that represent their respective cluster) matching the names of the
15 PII categories and 2 OTHER categories. In practice, the number of clusters depended on
the number of names: not only were there more than 17 clusters, but unrelated words were
often grouped together because they were similar only syntactically.
For example:
pwd is PII and is labeled as PII programmatically. However, the most related names in
decreasing similarity are: emailaddress, tekpassword, address, dataid, stateid, opendate,
paymetdate, saledate, address, and userid. This is because there is no direct connection
between a name's meaning and its syntax. Since "pwd" has fewer characters, it has fewer
features to characterize it, so far more names can be labeled similar to "pwd" by edit
distance.
Still, clustering was a useful first step: it allowed column names to be compared against
a training set to classify whether or not each name was PII. The accuracy rate for
classifying (not categorizing into PII categories) was ~65%.
Categorizing:
Unnormalized cosine similarity was used to assign each PII column name to a PII
category via vector representations of each test column name and each cluster. Each
cluster's vector is the sum of the vectors of the training column names in that cluster,
which yields a generalized vector representation, a pattern of what a word in that
cluster looks like. There were many false positives, and the categorizations were only
passable given syntax alone: the training sets' PII categories reflected not just syntax
but also meaning, because the training labels were partly determined by column data,
which this program couldn't access.
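A minimal sketch of this categorizing step, with hypothetical random vectors standing in
for word2vec embeddings and only two of the real categories:

import numpy as np

dim = 50                      # illustrative embedding size
rng = np.random.default_rng(0)

# category -> vectors of the training column names clustered under it
clusters = {
    "emailaddress": [rng.normal(size=dim) for _ in range(3)],
    "ssnorein":     [rng.normal(size=dim) for _ in range(3)],
}

# Each cluster's vector is the sum of its members' vectors.
cluster_vecs = {cat: np.sum(vs, axis=0) for cat, vs in clusters.items()}

def categorize(test_vec, cluster_vecs):
    # Unnormalized cosine similarity is just the dot product.
    return max(cluster_vecs, key=lambda cat: test_vec @ cluster_vecs[cat])

print(categorize(rng.normal(size=dim), cluster_vecs))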
 
Second Approach:
Classifying:
LDA (Latent Dirichlet Allocation) was used to represent meaning implicitly, since it
applies expectation maximization at two scopes: word frequency across documents and word
frequency relative to the surrounding words. In this case there are 3 documents. The
first 2 documents held the training column names, separated into two groups: PII names
and OTHER names. The third document held the test column names.
This artificially provides "context clues" for LDA to associate PII names with one
another and ultimately cluster names more accurately. It reinforces the "meaning" of the
column names as PII or OTHER, and thus dramatically reduces the previous problem of
having too many false positives (names classified as PII even though they were actually
OTHER).
17 topics, or clusters, were chosen to ideally represent the 15 PII categories and 2
OTHER categories, so clusters were formed based on meaning first, and relatively quickly.
LDA returns topics: clusters of column names, where each cluster holds a set of related
names. Each topic is labeled PII or OTHER, with a probability that it is PII, and the
topics were sorted in ascending probability of being PII. By precedent (and
configuration) there were 2 OTHER topics and 15 PII topics, with the OTHER topics
configured to be the first 2 topics in the order.
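A minimal sketch of this setup, assuming gensim; the column names are illustrative (a
real run would use thousands of names), and the three "documents" are the PII training
names, the OTHER training names, and the test names:

from gensim import corpora
from gensim.models import LdaModel

pii_train   = ["spousessn", "useremail", "phonenumb", "dob"]
other_train = ["serviceid", "otherid", "subscrstatusid"]
test_names  = ["billingemail", "checknumb", "isautorenew"]

documents = [pii_train, other_train, test_names]  # the 3 documents described above

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# 17 topics, intended to mirror the 15 PII + 2 OTHER categories.
lda = LdaModel(corpus, id2word=dictionary, num_topics=17, passes=10)

# Inspect the names that contribute most to each topic.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))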
For example:
Input: list of test column names
Output:
17 topics, where each topic has an assigned probability that it is PII, and a list of all the
column names in that topic (training and test set).
PII, TOPIC 7, pii probability 0.202503450299
ipaddress
deviceid
fedtaxid
deviceuniqueidentifi
creditscor
contactphon
userid
ip
mtipaddress
citi
gopphonenumb
digestedpassword
dob
customernam
agencytaxpayerid
smsnumber
authorizationtoken
ueyodleepassword
mtwalletentryid
primarycontact
phonenumb
gopipaddress
gopaccountnumb
sourcebankaccountid
legalnam
useremail
mailingaddressfk
customerid
mtccexpdat
contactemail
yodleeaccountnumberhash
contactnam
iacencryptedssn
weblogin
lasttoken
spousessn
customerbillnam
iacsocialsecuritynumb
assistantnam
spousedateofbirth
 
yodleepassword
privatekey
aunameonccard
merchantuserid
Caveats:
Due to the sheer volume of names, and because the 17 topics' sizes were not equal,
adjustments had to be made: the 17 topics were scaled up to 200 topics, and the 2
OTHER topics were scaled up to 50 OTHER topics. 50 OTHER topics was chosen because it
sat right at the boundary between classifying virtually every name as OTHER and
classifying virtually every name as PII, for an arbitrary corpus (a large and structured
set of texts) size. With this configuration, each topic is small enough not to lose
precision (dropping names that contribute ~0% to the topic, which would effectively
erase them from the corpus), and virtually every name is represented by some topic.
Since a name can contribute to multiple topics, classifying a test column name means
picking the most likely topic for that particular name in the corpus. This was done
using gensim's get_term_topics(word_id, minimum_probability=None). Although
get_term_topics returns a list of the most probable topics for a test column name, with
probabilities for how much the name contributes to each respective topic, no weighted
average was taken to classify the test column name, because averages tend to err toward
false positives (classifying a test column name as PII when it really isn't, since by
construction there are many more PII-labeled topics than OTHER topics).
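Continuing the gensim sketch above, a minimal version of this classification step; the
assumption that the first 50 of the 200 topics are the OTHER topics follows the
configuration just described:

def classify(name, lda, dictionary, n_other_topics=50):
    word_id = dictionary.token2id[name]
    # Most probable topics for this name, with contribution probabilities.
    topics = lda.get_term_topics(word_id, minimum_probability=None)
    if not topics:                # name contributes to no topic
        return "OTHER"
    # Pick the single most likely topic -- no weighted average, to avoid
    # biasing toward PII (there are far more PII topics by construction).
    best_topic, _ = max(topics, key=lambda t: t[1])
    return "OTHER" if best_topic < n_other_topics else "PII"

print(classify("isautorenew", lda, dictionary))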
The demo measures how accurate the program is for a set of test column names:
accuracy rate: 0.79

Column Name              Actual  Program
serviceid                OTHER   OTHER
otherid                  OTHER   OTHER
subscrstatusid           OTHER   OTHER
shiptoaddrid             PII     PII
isautorenew              OTHER   PII
laststatuschangeeventid  OTHER   PII
name                     OTHER   PII
shiptoaddrid             PII     PII
Categorizing:
The assumption was made that each topic, or cluster, represented one category and
would accurately tell which category a test column name belonged to. Unfortunately,
because the training sets heavily favored the username category, almost every test
column name was categorized as a username. PII names were actually easier to categorize
by syntax or regex, since sensitive information is usually formatted a certain way.
Thus, reverting to the previous method of categorization, unnormalized cosine
similarity was used to assign each PII column name to a PII category and to rank the
PII categories in decreasing similarity to that name (a sketch follows the examples
below). The PII column names were then categorized more accurately.
For example, the assigned categories for each ​test column name​ are:
accountantbusinessnam​: individualnam, financialaccountnumb, ssnorein, secret,
emailaddress, nonbusinessaddress, creditcard, nonbusinessphon, usernam, dateofbirth
billingemail​: emailaddress
 
othermerchantaccountnumb​: financialaccountnumb, individualnam, ssnorein, creditcard,
nonbusinessphon, secret, emailaddress, nonbusinessaddress, usernam, dateofbirth
checknumb​: financialaccountnumb, individualnam, ssnorein, creditcard, nonbusinessphon,
secret, emailaddress, nonbusinessaddress, usernam, dateofbirth
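A minimal sketch of this ranking, reusing the hypothetical cluster_vecs, rng, and dim
from the earlier categorizing sketch:

def rank_categories(test_vec, cluster_vecs):
    # All categories, sorted by decreasing dot-product similarity.
    return sorted(cluster_vecs,
                  key=lambda cat: test_vec @ cluster_vecs[cat],
                  reverse=True)

print(rank_categories(rng.normal(size=dim), cluster_vecs))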
Results:
                           Approach 1 (Previous)  Approach 2 (Current)
Accuracy Rate              0.08                   0.79
Speed (5000 column names)  ~2-3 minutes           ~1 minute
The current approach compares to the alternatives as follows:
Previous Approach: The current approach is more accurate and requires less processing
time.
Manual Scanning: The current approach is scalable and faster.
Regex: The current approach doesn't have to encode every single naming case, because it
compares column names by similarity rather than exclusively by hand-coded, definitive
rules.
People's naming conventions are messy, for example:
zerooffset
spousessn
addr1
Future:
A recurring setback was that the training set was suboptimal in the first place. It was
heavily biased in favor of usernames, so there were many mis-categorizations. Column
names in the training set should be classified by the actual name alone, not by column
data. Since names are generally noisy, using column data to categorize a name adds extra
noise to the training set. For example, column names such as "memotext" and "comment"
were classified as PII because the actual data in the column was PII, even though the
names don't suggest that.
If the training sets were dynamically updated, clusters' vectors should be updated by
adding column name vectors logarithmically; otherwise more popular column names such as
"username" would contribute too much noise. One possible weighting scheme is sketched
below.
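A minimal sketch of one assumed damping scheme (the write-up doesn't pin one down):
weight the n-th vector added to a cluster by 1/log(n + 2), so popular categories like
"username" grow sub-linearly instead of drowning out the rest:

import math
import numpy as np

def update_cluster(cluster_vec, name_vec, n_members):
    weight = 1.0 / math.log(n_members + 2)   # +2 keeps the weight finite at n = 0
    return cluster_vec + weight * name_vec, n_members + 1

vec, count = np.zeros(50), 0
for _ in range(1000):                        # a flood of "username" variants arrives
    vec, count = update_cluster(vec, np.ones(50), count)
print(np.linalg.norm(vec))                   # far smaller than an unweighted sum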
 
Reference Technologies:
PII (Personally Identifiable Information) Columns: Columns that contain sensitive
information.
Affinity propagation:​ Clustering words with similar syntax by edit distance.
https://en.wikipedia.org/wiki/Affinity_propagation
LDA (Latent Dirichlet Allocation): Topic modeling through expectation maximization with
a Dirichlet distribution.
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
To program:
NLTK (Natural Language Toolkit) for processing user input: ​http://www.nltk.org/
gensim for topic modeling: ​https://radimrehurek.com/gensim/
word2vec:​ Vector representation of a word
https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html