Premise




 You can determine the personality profile of
 a person based on their browsing habits
Technology was the Solution!
Then I Read This…



               Us & Them
               The Science of Identity
               By David Berreby
The Different States of Knowledge


 What a person knows


 What a person knows they don’t know


 What a person doesn’t know they don’t know
Technology was NOT the Solution



   Identity and discovery are
   NOT a technology solution
Our Subject Material

        HTML content is unstructured


        You can’t trust that anything
        semantically valid will be present


        There are some pretty bad web
        practices on the interwebz
How We’ll Capture This Data




             Start with base linguistics

             Extend with available extras
The Basic Pieces




  Page Data: Scrapey Scrapey
  Keywords:  Without all the fluff
  Weighting: Word diets FTW
Capture Raw Page Data


             Semantic data on the web
             is sucktastic

             Assume 5-year-olds built
             the sites
             Language is the key
Extract Keywords



              We now have a big jumble
              of words. Let’s extract

              Why is “and” a top word?
              Stop words = sad panda
Weight Keywords


             All content is not created
             equal

             Meta and headers and
             semantics oh my!

             This is where we leech
             off the work of others
Questions to Keep in Mind

   Should I use regex to parse web
   content?

    How do users interact with page
    content?

   What key identifiers can be monitored
   to detect interest?
Fetching the Data: The Request

The Simple Way

  $html = file_get_contents('URL');


The Controlled Way

  $c = curl_init('URL');
Fetching the Data: cURL
 $req = curl_init($url);

 $options = array(
    CURLOPT_URL => $url,
    CURLOPT_HEADER => $header,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER => true,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_MAXREDIRS => 10
 );

 curl_setopt_array($req, $options);
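The slide configures the handle but stops before the request is actually made. A minimal sketch of the execute/check/clean-up steps that follow — demonstrated against a local `file://` URL (a hypothetical temp file) so it runs without network access; swap in a real `$url` in practice:

```php
<?php
// stand-in for a real page URL: a local file served over file://
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<html><title>demo</title></html>');
$url = 'file://' . $tmp;

$req = curl_init($url);
curl_setopt_array($req, array(
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
));

// curl_exec() returns the page body, or false on failure
$page_content = curl_exec($req);
if ($page_content === false) {
    echo 'Request failed: ' . curl_error($req);
}

curl_close($req);
unlink($tmp);
```

With `CURLOPT_RETURNTRANSFER` set, `$page_content` now holds the raw HTML that the following slides strip and tokenize.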
//list of findable / replaceable string characters
$find = array('/\r/', '/\n/', '/\s\s+/'); $replace = array(' ', ' ', ' ');

//perform page content modification
$mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);

$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);

natcasesort($mod_content);
//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();

//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
   $word = trim($word);
   if(strlen($word) > 2 && !in_array($word, $common_words)){
      //initialize on first sighting to avoid an undefined index notice
      if(!isset($searched_words[$word])) $searched_words[$word] = 0;
      $searched_words[$word]++;
   }
}

arsort($searched_words, SORT_NUMERIC);
Scraping Site Meta Data



 //load scraped page data as a valid DOM document
 $dom = new DOMDocument();
 @$dom->loadHTML($page_content);

 //scrape title
 $title = $dom->getElementsByTagName("title");
 $title = $title->item(0)->nodeValue;
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");

for ($i = 0; $i < $metas->length; $i++){
  $meta = $metas->item($i);
  if($meta->getAttribute("property")){
    if ($meta->getAttribute("property") == "og:description"){
      $dataReturn["description"] = $meta->getAttribute("content");
    }
  } else {
    if($meta->getAttribute("name") == "description"){
      $dataReturn["description"] = $meta->getAttribute("content");
    } else if($meta->getAttribute("name") == "keywords"){
      $dataReturn["keywords"] = $meta->getAttribute("content");
    }
  }
}
Weighting Important Data


              Tags you should care
              about: meta (including
              OG), title, description,
              h1+, header

              Bonus points for adding in
              content location modifiers
Weighting Important Tags


//our keyword weights
$weights = array("keywords"   => "3.0",
                 "meta"       => "2.0",
                 "header1"    => "1.5",
                 "header2"    => "1.2");

//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){
   $searched_words[$word]++;
}
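The slide marks where the modifier goes but never shows the weight being applied. One way to do it, assuming each word arrives paired with the tag it was scraped from (the `$tagged_words` input and the fallback weight of 1.0 are hypothetical, not from the talk):

```php
<?php
//our keyword weights, as on the slide
$weights = array("keywords" => 3.0, "meta" => 2.0,
                 "header1"  => 1.5, "header2" => 1.2);

$common_words = array('a', 'and', 'the');  //abbreviated stop list
$searched_words = array();

//hypothetical input: each word paired with the tag it was found in
$tagged_words = array(
    array('engine',   'keywords'),
    array('engine',   'header1'),
    array('identity', 'meta'),
);

foreach ($tagged_words as list($word, $source)) {
    $word = trim($word);
    if (strlen($word) > 2 && !in_array($word, $common_words)) {
        //words from untracked tags fall back to a weight of 1.0
        $modifier = isset($weights[$source]) ? $weights[$source] : 1.0;
        if (!isset($searched_words[$word])) $searched_words[$word] = 0;
        //weighted score instead of a bare ++
        $searched_words[$word] += $modifier;
    }
}

arsort($searched_words, SORT_NUMERIC);
```

A word found in both the keywords meta tag and an h1 now outscores one buried in body copy, which is the whole point of the weighting pass.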
Expanding to Phrases


            2-3 adjacent words making
            up a directly relevant callout

            Seems easy right? Just like
            single words

            Language gets wonky
            without stop words
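A sketch of the two-word case: slide over the ordered word list and count adjacent pairs. Note the input here is the list *before* stop-word removal — as the slide warns, phrases read strangely once "of", "the", etc. have been stripped. The sample `$words` array is illustrative only:

```php
<?php
//ordered word list, stop words still included
$words = array('identity', 'extraction', 'engine', 'of', 'the', 'web');

$phrases = array();

//count every run of two adjacent words (extend to 3 the same way)
for ($i = 0; $i < count($words) - 1; $i++) {
    $phrase = $words[$i] . ' ' . $words[$i + 1];
    if (!isset($phrases[$phrase])) $phrases[$phrase] = 0;
    $phrases[$phrase]++;
}

arsort($phrases, SORT_NUMERIC);
```

Stop-word filtering can then be applied to the *phrases* (e.g. drop any pair that is all stop words) rather than to the words feeding them.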
Working with Unknown Users



            The majority of users won’t
            be immediately targetable

            Use HTML5 LocalStorage &
            Cookie backup
Adding in Time Interactions

             Interaction with a site does
             not necessarily mean
             interest in it

             Time needs to also include
             an interaction component

             Gift buying seasons see
             interest variations
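One hypothetical way to fold the two signals together — time on page only counts when backed by interaction events, so an idle background tab scores nothing. The function name, the 300-second cap, and the log scaling are all illustrative choices, not the talk's formula:

```php
<?php
//hypothetical scoring: dwell time gated by interaction events
function interest_score($seconds_on_page, $interaction_events) {
    if ($interaction_events === 0) {
        return 0.0;  //presence without interaction: no signal
    }
    //cap time so a forgotten tab cannot dominate the score
    $capped = min($seconds_on_page, 300);
    //diminishing returns on repeated interactions
    return $capped * log(1 + $interaction_events);
}
```

Seasonal effects (the gift-buying point above) would then be a further multiplier layered on top of a score like this.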
Grouping Using Commonality




                  Common
                  Interests
      Interests               Interests
      User A                    User B
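The overlap in the Venn diagram above is just a set intersection over each user's extracted interest keywords. A minimal sketch with made-up interest lists:

```php
<?php
//per-user interest keywords (hypothetical data)
$user_a = array('golf', 'travel', 'photography', 'wine');
$user_b = array('wine', 'cooking', 'travel');

//the "Common Interests" region: keywords both users share
$common = array_values(array_intersect($user_a, $user_b));
```

`array_intersect()` keeps the ordering of the first array, so `$common` here comes back as travel, then wine; with weighted keywords you would intersect on the keys of the `$searched_words` maps instead.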
Building an Identity Extraction Engine


Editor's Notes

  • #4 Technology is the solution!
  • #8 We’ll be looking at unstructured web page data
  • #13 The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  • #15 Open graph protocol
  • #18 Different methods for making the request
  • #19 This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  • #20 Stripping irrelevant data
  • #21 Scraping site keywords