Column Name Classification using Probabilistic
Models
Quinn Tran
Introduction:
User input for column names is not standardized: users can enter almost anything they
want as a column name, which makes user columns hard to categorize, store, and analyze.
The Data Governance team screens data to make sure it is labeled and encrypted. Currently
the team has an analyst manually checking whether column names and column data contain
sensitive information. An example use case is uploading a table without any metadata: the
input is a list of column names with no column data, and the output is each column name
labeled PII or OTHER, along with a category or none.
Information is classified into 17 categories, or buckets, based on meaning, such as
social security number. Column data is easy to check because sensitive data in a specific
format can be classified using regex. Column names, by contrast, are nearly gibberish
because of freeform user input. Because of the high volume of data uploaded, column
classification has to be further automated for storage and data analysis to scale. The
particular goal of this project is to automate the classification of columns by name,
using Natural Language Processing concepts to predict from a column name whether the
column holds sensitive data.
Technologies:
PII (Personally Identifiable Information) Columns: Columns that contain sensitive
information. Given a set of columns, PII columns have to be identified and categorized.
Affinity propagation: An algorithm to cluster words with similar syntax. It measures how
similar (in this case, by edit distance) pairs of column names are, and simultaneously
determines which column names should serve as exemplars (representatives of their specific
cluster). Messages are exchanged between column names until a definite set of exemplars
and corresponding clusters emerges.
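For illustration, here is a minimal sketch of this clustering step, assuming scikit-learn
and a few made-up column names (the post does not show the project's actual code):

import numpy as np
from sklearn.cluster import AffinityPropagation

def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

names = ["social security", "socialsecurity", "ssn", "socials", "userid"]

# Affinity propagation expects similarities, so negate the distances.
similarity = -np.array([[edit_distance(a, b) for b in names] for a in names])

ap = AffinityPropagation(affinity="precomputed").fit(similarity)
exemplars = {label: names[idx] for label, idx in enumerate(ap.cluster_centers_indices_)}
for name, label in zip(names, ap.labels_):
    print(name, "->", exemplars[label])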
LDA (Latent Dirichlet Allocation): An algorithm that groups words based on frequency,
and thus implicitly on meaning. It models topics through expectation maximization with a
Dirichlet distribution. LDA represents documents as a combination of topics, where each
topic is a cluster of related words and each word contributes to a specific topic with a
specific probability.
word2vec:​ Vector representation of a word. This is used to quickly compare words for
similarity.
 
Implementations:
First Approach:
Classifying:
Taking a syntactic approach, training column names and test column names
received from user input were grouped by affinity propagation. The training set is the
set of column names used to find patterns in existing column names and ultimately make
predictions about incoming column names. The test set is the set of column names used to
measure the accuracy of the program's predictions.
The groupings are based on edit distance, in order to group variations of the same
name such as social security, socialsecurity, ssn, and socials. Ideally, clusters would
have exemplars (names that represent their respective cluster) matching the names of the
15 PII categories and 2 OTHER categories. In practice, the number of clusters depended on
the number of names: not only were there more than 17 clusters, but unrelated words were
often grouped together because they were similar only syntactically.
For example:
pwd is PII and is labeled as PII programmatically. However, the most related names in
decreasing similarity are: emailaddress, tekpassword, address, dataid, stateid, opendate,
paymetdate, saledate, address, and userid. This is because there is no direct connection
between a name's meaning and its syntax. Since "pwd" has fewer characters, it has fewer
features to characterize it, so far more names can be labeled similar to "pwd" by edit
distance.
Still, clustering was a useful first step: it allowed column names to be compared against
a training set to classify whether or not each name was PII. The accuracy rate for
classifying (not categorizing into PII categories) was ~65%.
Categorizing:
Unnormalized cosine similarity was used to assign each PII column name to a PII
category via vector representations of each test column name and each cluster. Each
cluster's vector is the sum of the vectors of the training column names in that cluster,
which yields a generalized vector representation, a pattern of what a word in that
cluster looks like. There were many false positives, and the categorizations were only
passable given syntax alone: the training sets' PII categories reflected not just syntax
but also meaning, because the training labels were partly determined by column data,
which this program couldn't access.
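A minimal sketch of this categorizing step, with hypothetical random vectors standing in
for word2vec embeddings and only two of the real categories:

import numpy as np

dim = 50                      # illustrative embedding size
rng = np.random.default_rng(0)

# category -> vectors of the training column names clustered under it
clusters = {
    "emailaddress": [rng.normal(size=dim) for _ in range(3)],
    "ssnorein":     [rng.normal(size=dim) for _ in range(3)],
}

# Each cluster's vector is the sum of its members' vectors.
cluster_vecs = {cat: np.sum(vs, axis=0) for cat, vs in clusters.items()}

def categorize(test_vec, cluster_vecs):
    # Unnormalized cosine similarity is just the dot product.
    return max(cluster_vecs, key=lambda cat: test_vec @ cluster_vecs[cat])

print(categorize(rng.normal(size=dim), cluster_vecs))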
 
Second Approach:
Classifying:
LDA (Latent Dirichlet Allocation) was used to represent meaning implicitly, since it
applies expectation maximization at two scopes: word frequency across documents and word
frequency relative to the surrounding words. In this case there are 3 documents. The
first 2 documents held the training column names, separated into two groups: PII names
and OTHER names. The third document held the test column names.
This artificially provides "context clues" for LDA to associate PII names with one
another and ultimately cluster names more accurately. It reinforces the "meaning" of the
column names as PII or OTHER, and thus dramatically reduces the previous problem of
having too many false positives (names classified as PII even though they were actually
OTHER).
17 topics, or clusters, were chosen to ideally represent the 15 PII categories and 2
OTHER categories, so clusters were formed based on meaning first, and relatively quickly.
LDA returns topics: clusters of column names, where each cluster holds a set of related
names. Each topic is labeled PII or OTHER, with a probability that it is PII, and the
topics were sorted in ascending probability of being PII. By precedent (and
configuration) there were 2 OTHER topics and 15 PII topics, with the OTHER topics
configured to be the first 2 topics in the order.
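A minimal sketch of this setup, assuming gensim; the column names are illustrative (a
real run would use thousands of names), and the three "documents" are the PII training
names, the OTHER training names, and the test names:

from gensim import corpora
from gensim.models import LdaModel

pii_train   = ["spousessn", "useremail", "phonenumb", "dob"]
other_train = ["serviceid", "otherid", "subscrstatusid"]
test_names  = ["billingemail", "checknumb", "isautorenew"]

documents = [pii_train, other_train, test_names]  # the 3 documents described above

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# 17 topics, intended to mirror the 15 PII + 2 OTHER categories.
lda = LdaModel(corpus, id2word=dictionary, num_topics=17, passes=10)

# Inspect the names that contribute most to each topic.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))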
For example:
Input: list of test column names
Output:
17 topics, where each topic has an assigned probability that it is PII, and a list of all the
column names in that topic (training and test set).
PII, TOPIC 7, pii probability 0.202503450299
ipaddress
deviceid
fedtaxid
deviceuniqueidentifi
creditscor
contactphon
userid
ip
mtipaddress
citi
gopphonenumb
digestedpassword
dob
customernam
agencytaxpayerid
smsnumber
authorizationtoken
ueyodleepassword
mtwalletentryid
primarycontact
phonenumb
gopipaddress
gopaccountnumb
sourcebankaccountid
legalnam
useremail
mailingaddressfk
customerid
mtccexpdat
contactemail
yodleeaccountnumberhash
contactnam
iacencryptedssn
weblogin
lasttoken
spousessn
customerbillnam
iacsocialsecuritynumb
assistantnam
spousedateofbirth
 
yodleepassword
privatekey
aunameonccard
merchantuserid
Caveats:
Due to the sheer volume of names, and because the 17 topics' sizes were not equal,
adjustments had to be made: the 17 topics were scaled up to 200 topics, and the 2
OTHER topics were scaled up to 50 OTHER topics. 50 OTHER topics was chosen because it
sat right at the boundary between classifying virtually every name as OTHER and
classifying virtually every name as PII, for an arbitrary corpus (a large and structured
set of texts) size. With this configuration, each topic is small enough not to lose
precision (dropping names that contribute ~0% to the topic, which would effectively
erase them from the corpus), and virtually every name is represented by some topic.
Since a name can contribute to multiple topics, classifying a test column name means
picking the most likely topic for that particular name in the corpus. This was done
using gensim's get_term_topics(word_id, minimum_probability=None). Although
get_term_topics returns a list of the most probable topics for a test column name, with
probabilities for how much the name contributes to each respective topic, no weighted
average was taken to classify the test column name, because averages tend to err toward
false positives (classifying a test column name as PII when it really isn't, since by
construction there are many more PII-labeled topics than OTHER topics).
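Continuing the gensim sketch above, a minimal version of this classification step; the
assumption that the first 50 of the 200 topics are the OTHER topics follows the
configuration just described:

def classify(name, lda, dictionary, n_other_topics=50):
    word_id = dictionary.token2id[name]
    # Most probable topics for this name, with contribution probabilities.
    topics = lda.get_term_topics(word_id, minimum_probability=None)
    if not topics:                # name contributes to no topic
        return "OTHER"
    # Pick the single most likely topic -- no weighted average, to avoid
    # biasing toward PII (there are far more PII topics by construction).
    best_topic, _ = max(topics, key=lambda t: t[1])
    return "OTHER" if best_topic < n_other_topics else "PII"

print(classify("isautorenew", lda, dictionary))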
The demo measures how accurate the program is for a set of test column names:
accuracy rate: 0.79

Column Name              Actual  Program
serviceid                OTHER   OTHER
otherid                  OTHER   OTHER
subscrstatusid           OTHER   OTHER
shiptoaddrid             PII     PII
isautorenew              OTHER   PII
laststatuschangeeventid  OTHER   PII
name                     OTHER   PII
shiptoaddrid             PII     PII
Categorizing:
The assumption was made that each topic, or cluster, represented one category and
would accurately tell which category a test column name belonged to. Unfortunately,
because the training sets heavily favored the username category, almost every test
column name was categorized as a username. PII names were actually easier to categorize
by syntax or regex, since sensitive information is usually formatted a certain way.
Thus, reverting to the previous method of categorization, unnormalized cosine
similarity was used to assign each PII column name to a PII category and to rank the
PII categories in decreasing similarity to that name (a sketch follows the examples
below). The PII column names were then categorized more accurately.
For example, the assigned categories for each ​test column name​ are:
accountantbusinessnam​: individualnam, financialaccountnumb, ssnorein, secret,
emailaddress, nonbusinessaddress, creditcard, nonbusinessphon, usernam, dateofbirth
billingemail​: emailaddress
 
othermerchantaccountnumb​: financialaccountnumb, individualnam, ssnorein, creditcard,
nonbusinessphon, secret, emailaddress, nonbusinessaddress, usernam, dateofbirth
checknumb​: financialaccountnumb, individualnam, ssnorein, creditcard, nonbusinessphon,
secret, emailaddress, nonbusinessaddress, usernam, dateofbirth
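A minimal sketch of this ranking, reusing the hypothetical cluster_vecs, rng, and dim
from the earlier categorizing sketch:

def rank_categories(test_vec, cluster_vecs):
    # All categories, sorted by decreasing dot-product similarity.
    return sorted(cluster_vecs,
                  key=lambda cat: test_vec @ cluster_vecs[cat],
                  reverse=True)

print(rank_categories(rng.normal(size=dim), cluster_vecs))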
Results:
                           Approach 1 (Previous)  Approach 2 (Current)
Accuracy Rate              0.08                   0.79
Speed (5000 column names)  ~2-3 minutes           ~1 minute
The current approach compares to the alternatives as follows:
Previous Approach: The current approach is more accurate and requires less processing
time.
Manual Scanning: The current approach is scalable and faster.
Regex: The current approach doesn't have to encode every single naming case, because it
compares column names by similarity rather than exclusively by hand-coded, definitive
rules.
People's naming conventions are messy, for example:
zerooffset
spousessn
addr1
Future:
A recurring setback was that the training set was suboptimal in the first place. It was
heavily biased in favor of usernames, so there were many mis-categorizations. Column
names in the training set should be classified by the actual name alone, not by column
data. Since names are generally noisy, using column data to categorize a name adds extra
noise to the training set. For example, column names such as "memotext" and "comment"
were classified as PII because the actual data in the column was PII, even though the
names don't suggest that.
If the training sets were dynamically updated, clusters' vectors should be updated by
adding column name vectors logarithmically; otherwise more popular column names such as
"username" would contribute too much noise. One possible weighting scheme is sketched
below.
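A minimal sketch of one assumed damping scheme (the write-up doesn't pin one down):
weight the n-th vector added to a cluster by 1/log(n + 2), so popular categories like
"username" grow sub-linearly instead of drowning out the rest:

import math
import numpy as np

def update_cluster(cluster_vec, name_vec, n_members):
    weight = 1.0 / math.log(n_members + 2)   # +2 keeps the weight finite at n = 0
    return cluster_vec + weight * name_vec, n_members + 1

vec, count = np.zeros(50), 0
for _ in range(1000):                        # a flood of "username" variants arrives
    vec, count = update_cluster(vec, np.ones(50), count)
print(np.linalg.norm(vec))                   # far smaller than an unweighted sum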
 
Reference Technologies:
PII (Personally Identifiable Information) Columns: Columns that contain sensitive
information.
Affinity propagation:​ Clustering words with similar syntax by edit distance.
https://en.wikipedia.org/wiki/Affinity_propagation
LDA (Latent Dirichlet Allocation): Topic modeling through expectation maximization with
a Dirichlet distribution.
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
To program:
NLTK (Natural Language Toolkit) for processing user input: ​http://www.nltk.org/
gensim for topic modeling: ​https://radimrehurek.com/gensim/
word2vec:​ Vector representation of a word
https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html