Premise




 You can determine the personality profile of
 a person based on their browsing habits
Technology was the Solution!
Then I Read This…



               Us & Them
               The Science of Identity
               By David Berreby
The Different States of Knowledge


 What a person knows


 What a person knows they don’t know


 What a person doesn’t know they don’t know
Technology was NOT the Solution



   Identity and discovery are
   NOT a technology solution
Our Subject Material

        HTML content is unstructured


        You can’t trust that anything
        semantically valid will be present


        There are some pretty bad web
        practices on the interwebz
How We’ll Capture This Data




             Start with base linguistics

             Extend with available extras
The Basic Pieces




  Page Data: Scrapey Scrapey
  Keywords:  Without all the fluff
  Weighting: Word diets FTW
Capture Raw Page Data


             Semantic data on the web
             is sucktastic

             Assume 5-year-olds built
             the sites
             Language is the key
Extract Keywords



              We now have a big jumble
              of words. Let’s extract

              Why is “and” a top word?
              Stop words = sad panda
Weight Keywords


             All content is not created
             equal

             Meta and headers and
             semantics oh my!

             This is where we leech
             off the work of others
Questions to Keep in Mind

   Should I use regex to parse web
   content?

    How do users interact with page
    content?

   What key identifiers can be monitored
   to detect interest?
Fetching the Data: The Request

The Simple Way

  $html = file_get_contents('URL');


The Controlled Way

  $c = curl_init('URL');
Fetching the Data: cURL
 $req = curl_init($url);

 $options = array(
    CURLOPT_URL => $url,
    CURLOPT_HEADER => $header,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER => true,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_MAXREDIRS => 10
 );

 curl_setopt_array($req, $options);
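The slide configures the handle but stops before the request is actually made. A minimal sketch of the execute/check/clean-up steps that follow — demonstrated against a local `file://` URL (a hypothetical temp file) so it runs without network access; swap in a real `$url` in practice:

```php
<?php
// stand-in for a real page URL: a local file served over file://
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<html><title>demo</title></html>');
$url = 'file://' . $tmp;

$req = curl_init($url);
curl_setopt_array($req, array(
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
));

// curl_exec() returns the page body, or false on failure
$page_content = curl_exec($req);
if ($page_content === false) {
    echo 'Request failed: ' . curl_error($req);
}

curl_close($req);
unlink($tmp);
```

With `CURLOPT_RETURNTRANSFER` set, `$page_content` now holds the raw HTML that the following slides strip and tokenize.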
//list of findable / replaceable string characters
$find = array('/\r/', '/\n/', '/\s\s+/'); $replace = array(' ', ' ', ' ');

//perform page content modification
$mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);

$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);

natcasesort($mod_content);
//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();

//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
   $word = trim($word);
   if(strlen($word) > 2 && !in_array($word, $common_words)){
      //initialize on first sighting to avoid an undefined index notice
      if(!isset($searched_words[$word])) $searched_words[$word] = 0;
      $searched_words[$word]++;
   }
}

arsort($searched_words, SORT_NUMERIC);
Scraping Site Meta Data



 //load scraped page data as a valid DOM document
 $dom = new DOMDocument();
 @$dom->loadHTML($page_content);

 //scrape title
 $title = $dom->getElementsByTagName("title");
 $title = $title->item(0)->nodeValue;
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");

for ($i = 0; $i < $metas->length; $i++){
  $meta = $metas->item($i);
  if($meta->getAttribute("property")){
    if ($meta->getAttribute("property") == "og:description"){
      $dataReturn["description"] = $meta->getAttribute("content");
    }
  } else {
    if($meta->getAttribute("name") == "description"){
      $dataReturn["description"] = $meta->getAttribute("content");
    } else if($meta->getAttribute("name") == "keywords"){
      $dataReturn["keywords"] = $meta->getAttribute("content");
    }
  }
}
Weighting Important Data


              Tags you should care
              about: meta (including
              OG), title, description,
              h1+, header

              Bonus points for adding in
              content location modifiers
Weighting Important Tags


//our keyword weights
$weights = array("keywords"   => "3.0",
                 "meta"       => "2.0",
                 "header1"    => "1.5",
                 "header2"    => "1.2");

//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){
   $searched_words[$word]++;
}
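The slide marks where the modifier goes but never shows the weight being applied. One way to do it, assuming each word arrives paired with the tag it was scraped from (the `$tagged_words` input and the fallback weight of 1.0 are hypothetical, not from the talk):

```php
<?php
//our keyword weights, as on the slide
$weights = array("keywords" => 3.0, "meta" => 2.0,
                 "header1"  => 1.5, "header2" => 1.2);

$common_words = array('a', 'and', 'the');  //abbreviated stop list
$searched_words = array();

//hypothetical input: each word paired with the tag it was found in
$tagged_words = array(
    array('engine',   'keywords'),
    array('engine',   'header1'),
    array('identity', 'meta'),
);

foreach ($tagged_words as list($word, $source)) {
    $word = trim($word);
    if (strlen($word) > 2 && !in_array($word, $common_words)) {
        //words from untracked tags fall back to a weight of 1.0
        $modifier = isset($weights[$source]) ? $weights[$source] : 1.0;
        if (!isset($searched_words[$word])) $searched_words[$word] = 0;
        //weighted score instead of a bare ++
        $searched_words[$word] += $modifier;
    }
}

arsort($searched_words, SORT_NUMERIC);
```

A word found in both the keywords meta tag and an h1 now outscores one buried in body copy, which is the whole point of the weighting pass.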
Expanding to Phrases


            2-3 adjacent words making
            up a directly relevant callout

            Seems easy right? Just like
            single words

            Language gets wonky
            without stop words
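A sketch of the two-word case: slide over the ordered word list and count adjacent pairs. Note the input here is the list *before* stop-word removal — as the slide warns, phrases read strangely once "of", "the", etc. have been stripped. The sample `$words` array is illustrative only:

```php
<?php
//ordered word list, stop words still included
$words = array('identity', 'extraction', 'engine', 'of', 'the', 'web');

$phrases = array();

//count every run of two adjacent words (extend to 3 the same way)
for ($i = 0; $i < count($words) - 1; $i++) {
    $phrase = $words[$i] . ' ' . $words[$i + 1];
    if (!isset($phrases[$phrase])) $phrases[$phrase] = 0;
    $phrases[$phrase]++;
}

arsort($phrases, SORT_NUMERIC);
```

Stop-word filtering can then be applied to the *phrases* (e.g. drop any pair that is all stop words) rather than to the words feeding them.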
Working with Unknown Users



            The majority of users won’t
            be immediately targetable

            Use HTML5 LocalStorage &
            Cookie backup
Adding in Time Interactions

             Interaction with a site does
             not necessarily mean
             interest in it

             Time needs to also include
             an interaction component

             Gift buying seasons see
             interest variations
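One hypothetical way to fold the two signals together — time on page only counts when backed by interaction events, so an idle background tab scores nothing. The function name, the 300-second cap, and the log scaling are all illustrative choices, not the talk's formula:

```php
<?php
//hypothetical scoring: dwell time gated by interaction events
function interest_score($seconds_on_page, $interaction_events) {
    if ($interaction_events === 0) {
        return 0.0;  //presence without interaction: no signal
    }
    //cap time so a forgotten tab cannot dominate the score
    $capped = min($seconds_on_page, 300);
    //diminishing returns on repeated interactions
    return $capped * log(1 + $interaction_events);
}
```

Seasonal effects (the gift-buying point above) would then be a further multiplier layered on top of a score like this.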
Grouping Using Commonality




                  Common
                  Interests
      Interests               Interests
      User A                    User B
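The overlap in the Venn diagram above is just a set intersection over each user's extracted interest keywords. A minimal sketch with made-up interest lists:

```php
<?php
//per-user interest keywords (hypothetical data)
$user_a = array('golf', 'travel', 'photography', 'wine');
$user_b = array('wine', 'cooking', 'travel');

//the "Common Interests" region: keywords both users share
$common = array_values(array_intersect($user_a, $user_b));
```

`array_intersect()` keeps the ordering of the first array, so `$common` here comes back as travel, then wine; with weighted keywords you would intersect on the keys of the `$searched_words` maps instead.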
Building an Identity Extraction Engine


Editor's Notes

  • #4 Technology is the solution!
  • #8 We’ll be looking at unstructured web page data
  • #13 The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  • #15 Open graph protocol
  • #18 Different methods for making the request
  • #19 This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  • #20 Stripping irrelevant data
  • #21 Scraping site keywords