Securing and Personalizing Commerce Using Identity Data Mining

As society becomes more reliant on mobile technology, money is becoming mobile as well. In this new realm of commerce, online identity matters significantly more.

When a payment is processed, it is important to understand not only who a person is but also what their broader interests and preferences are, so that personalized experiences, such as suggestions for new content and merchandise, can be delivered at an individual level.

Published in: Technology
  • Technology is the solution!
  • We’ll be looking at unstructured web page data
  • The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  • Open graph protocol
  • Different methods for making the request
  • This is why I prefer using cURL: it allows request customization, timeouts, redirect handling, etc.
  • Stripping irrelevant data
  • Scraping site keywords

    1. Securing & Personalizing Commerce Using Identity Data Mining. Jonathan LeBlanc, Developer Evangelist (PayPal). Github: http://github.com/jcleblanc Twitter: @jcleblanc
    2. The Problem: Commerce Relies on Static Data Contributions
    3. Premise: You can determine the personality profile of a person based on their usage habits. Personalization == Security
    4. Technology was the Solution!
    5. Then I Read This… "Us & Them: The Science of Identity" by David Berreby
    6. The Different States of Knowledge: what a person knows; what a person knows they don’t know; what a person doesn’t know they don’t know
    7. Technology was NOT the Solution: identity and discovery are NOT a technology solution
    8. Our Subject Material
    9. Our Subject Material: HTML content is poorly structured. You can’t trust that anything semantically valid will be present. There are some pretty bad web practices on the interwebz.
    10. How We’ll Capture This Data: start with base linguistics; extend with available extras
    11. The Basic Pieces: Page Data (scrapey scrapey); Keywords (without all the fluff); Weighting (word diets FTW)
    12. Capture Raw Page Data: Semantic data on the web is sucktastic. Assume 5 year olds built the sites. Language is the key.
    13. Extract Keywords: We now have a big jumble of words. Let’s extract. Why is “and” a top word? Stop words = sad panda.
    14. Weight Keywords: All content is not created equal. Meta and headers and semantics, oh my! This is where we leech off the work of others.
    15. Questions to Keep in Mind: Should I use regex to parse web content? How do users interact with page content? What key identifiers can be monitored to detect interest?
    16. Fetching the Data: The Request

        The Simple Way

            $html = file_get_contents(URL);

        The Controlled Way

            $c = curl_init(URL);
    17. Fetching the Data: cURL

            $req = curl_init($url);
            $options = array(
                CURLOPT_URL => $url,
                CURLOPT_HEADER => $header,
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_AUTOREFERER => true,
                CURLOPT_TIMEOUT => 15,
                CURLOPT_MAXREDIRS => 10
            );
            curl_setopt_array($req, $options);

            //run the request; the later slides read the result from $page_content
            $page_content = curl_exec($req);
    18. Stripping irrelevant data

            //list of findable / replaceable string characters
            $find = array('/\r/', '/\n/', '/\s\s+/');
            $replace = array(' ', ' ', ' ');

            //perform page content modification
            $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
            $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
            $mod_content = strip_tags($mod_content);
            $mod_content = strtolower($mod_content);
            $mod_content = preg_replace($find, $replace, $mod_content);
            $mod_content = trim($mod_content);
            $mod_content = explode(' ', $mod_content);
            natcasesort($mod_content);
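The cleanup steps on this slide can be wrapped into a single function and tried on a small page. This is a minimal sketch, not the author's code: the function name `html_to_words` and the sample markup are assumptions. It also adds one step the slide does not have. `strip_tags` removes tags without inserting whitespace, so `<h1>Red Shoes</h1><p>Buy</p>` would otherwise become `Red ShoesBuy`; the sketch pads every tag with a space first.

```php
<?php
// Hypothetical wrapper around the cleanup steps from the slide.
function html_to_words(string $page_content): array
{
    // drop script and style blocks, including their contents
    $text = preg_replace('#<script(.*?)>(.*?)</script>#is', ' ', $page_content);
    $text = preg_replace('#<style(.*?)>(.*?)</style>#is', ' ', $text);

    // assumption (not on the slide): pad tags with a space so that words
    // in adjacent elements are not fused together by strip_tags()
    $text = str_replace('<', ' <', $text);

    // drop remaining tags, lowercase, collapse whitespace runs to single spaces
    $text = strtolower(strip_tags($text));
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // split into words; an empty page yields an empty list
    return $text === '' ? [] : explode(' ', $text);
}

$sample = '<html><head><style>p { color: red; }</style></head>'
        . '<body><h1>Red Shoes</h1><p>Buy   red shoes</p>'
        . '<script>var x = 1;</script></body></html>';

print_r(html_to_words($sample)); // red, shoes, buy, red, shoes
```

The sort from the slide is left out here because the word list is only counted afterwards, and counting does not depend on order.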
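    19. Scraping site keywords

            //set up list of stop words (elided on the slide) and the final found stopped list
            $common_words = array('a', ..., 'zero');
            $searched_words = array();

            //extract list of keywords with number of occurrences
            foreach($mod_content as $word) {
                $word = trim($word);

                //discard any token that contains a non-letter character
                if (preg_match('/[^a-zA-Z]/', $word) == 1){
                    $word = '';
                }

                //count words longer than two letters that are not stop words
                if(strlen($word) > 2 && !in_array($word, $common_words)){
                    //start from 0 the first time a word is seen to avoid an undefined index
                    $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
                }
            }
            arsort($searched_words, SORT_NUMERIC);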
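To make the effect of the stop list concrete, here is a small self-contained sketch of the same counting step. The function name `count_keywords` and the two-word stop list are assumptions for illustration; the slide's full stop list is elided and is not reproduced here.

```php
<?php
// Hypothetical helper: counts keyword occurrences, skipping stop words.
// $stop_words is a tiny illustrative list, not the full list elided on the slide.
function count_keywords(array $words, array $stop_words): array
{
    $stop_lookup = array_flip($stop_words); // key lookup instead of in_array() per word
    $counts = [];
    foreach ($words as $word) {
        $word = trim($word);
        // skip empty tokens and tokens containing anything other than letters
        if ($word === '' || preg_match('/[^a-zA-Z]/', $word) === 1) {
            continue;
        }
        // keep words longer than two letters that are not stop words
        if (strlen($word) > 2 && !isset($stop_lookup[$word])) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts, SORT_NUMERIC); // most frequent first
    return $counts;
}

$words = ['red', 'shoes', 'and', 'red', 'laces', 'and', 'the', '50%', 'red'];
print_r(count_keywords($words, ['and', 'the'])); // "red" comes first with a count of 3
```

Without the stop list, "and" would tie for second place in this sample, which is the "why is 'and' a top word?" problem from the earlier slide.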
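    20. Scraping Site Meta Data

            //load scraped page data as a valid DOM document
            //(@ suppresses the parser warnings that malformed markup triggers)
            $dom = new DOMDocument();
            @$dom->loadHTML($page_content);

            //scrape title, falling back to an empty string if the page has none
            $titles = $dom->getElementsByTagName("title");
            $title = $titles->length > 0 ? $titles->item(0)->nodeValue : "";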
    21. Scraping Site Meta Data (continued)

            //collected meta data
            $dataReturn = array();

            //loop through all found meta tags
            $metas = $dom->getElementsByTagName("meta");
            for ($i = 0; $i < $metas->length; $i++){
                $meta = $metas->item($i);

                if($meta->getAttribute("property")){
                    //Open Graph tags use the property attribute
                    if ($meta->getAttribute("property") == "og:description"){
                        $dataReturn["description"] = $meta->getAttribute("content");
                    }
                } else {
                    //standard meta tags use the name attribute
                    if($meta->getAttribute("name") == "description"){
                        $dataReturn["description"] = $meta->getAttribute("content");
                    } else if($meta->getAttribute("name") == "keywords"){
                        $dataReturn["keywords"] = $meta->getAttribute("content");
                    }
                }
            }
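The title and meta scraping from these two slides can be combined into one function. The following is a sketch under assumptions: the function name `scrape_meta`, the shape of the returned array, collecting every `og:*` tag, and preferring the plain description over `og:description` are all choices made for illustration. It uses `libxml_use_internal_errors` rather than `@` to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: collects the title, description, keywords, and any og:* tags.
function scrape_meta(string $page_content): array
{
    $data = ['title' => '', 'description' => '', 'keywords' => '', 'og' => []];
    if ($page_content === '') {
        return $data; // loadHTML() rejects an empty string on PHP 8
    }

    $dom = new DOMDocument();
    // real-world markup is rarely valid: buffer parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($page_content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $property = strtolower($meta->getAttribute('property'));
        $name     = strtolower($meta->getAttribute('name'));
        $content  = $meta->getAttribute('content');

        if (strpos($property, 'og:') === 0) {
            $data['og'][substr($property, 3)] = $content; // e.g. og:type => "type"
        } elseif ($name === 'description' || $name === 'keywords') {
            $data[$name] = $content;
        }
    }

    // assumption: fall back to og:description only when no plain description exists
    if ($data['description'] === '' && isset($data['og']['description'])) {
        $data['description'] = $data['og']['description'];
    }
    return $data;
}

$html = '<html><head><title>Shoe Shop</title>'
      . '<meta name="keywords" content="shoes, laces">'
      . '<meta property="og:description" content="Great shoes">'
      . '</head><body></body></html>';
print_r(scrape_meta($html));
```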
    22. Weighting Important Data: Tags you should care about: meta (include OG), title, description, h1+, header. Bonus points for adding in content location modifiers.
    23. Weighting Important Tags

            //our keyword weights
            $weights = array("keywords" => "3.0",
                             "meta"     => "2.0",
                             "header1"  => "1.5",
                             "header2"  => "1.2");

            //add modifier here
            if(strlen($word) > 2 && !in_array($word, $common_words)){
                //start from 0 the first time a word is seen to avoid an undefined index
                $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
            }
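The slide leaves the modifier step as a placeholder. One way it might be filled in, as a sketch only: add the source's weight to a word's score instead of adding 1. The function name `weight_keywords`, the `$sources` structure, and the default weight of 1.0 for unlisted sources such as the page body are assumptions, not the author's implementation.

```php
<?php
// Weights from the slide, written as floats so they can be added directly.
$weights = ['keywords' => 3.0, 'meta' => 2.0, 'header1' => 1.5, 'header2' => 1.2];

// Hypothetical helper: $sources maps a source name to the words found there.
// A source with no entry in $weights (for example "body") counts 1.0 per occurrence.
function weight_keywords(array $sources, array $weights, array $stop_words): array
{
    $scores = [];
    foreach ($sources as $source => $words) {
        $modifier = $weights[$source] ?? 1.0;
        foreach ($words as $word) {
            if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $modifier;
            }
        }
    }
    arsort($scores, SORT_NUMERIC); // highest score first
    return $scores;
}

$sources = [
    'body'     => ['red', 'shoes', 'and', 'laces', 'shoes'],
    'header1'  => ['shoes'],
    'keywords' => ['laces'],
];
print_r(weight_keywords($sources, $weights, ['and', 'the']));
// laces: 1.0 + 3.0 = 4.0, shoes: 2.0 + 1.5 = 3.5, red: 1.0
```

In this sample "laces" appears less often than "shoes" but ranks higher, because one of its occurrences is in the keywords meta tag.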
    24. Expanding to Phrases: 2-3 adjacent words making up a direct, relevant callout. Seems easy, right? Just like single words. Language gets wonky without stop words.
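Phrase extraction over adjacent words can be sketched as simple n-gram counting. The function name `count_phrases` is an assumption, and leaving stop words in place is one reading of the slide's warning: removing them first would join words that were never adjacent in the original text.

```php
<?php
// Hypothetical helper: counts every run of $n adjacent words as one phrase.
// Stop words are deliberately left in the input so that only words that were
// truly adjacent on the page end up in the same phrase.
function count_phrases(array $words, int $n): array
{
    $counts = [];
    $last_start = count($words) - $n; // last index a full phrase can start at
    for ($i = 0; $i <= $last_start; $i++) {
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }
    arsort($counts, SORT_NUMERIC); // most frequent first
    return $counts;
}

$words = explode(' ', 'red running shoes and blue running shoes');
print_r(count_phrases($words, 2)); // "running shoes" comes first with a count of 2
```

Phrases such as "shoes and" show up in the result as well, so a later filtering step (for example, dropping phrases that start or end with a stop word) would likely still be needed.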
    25. Working with Unknown Users: The majority of users won’t be immediately targetable. Use HTML5 LocalStorage with a cookie backup.
    26. Adding in Time Interactions: Interaction with a site does not necessarily mean interest in it. Time needs to also include an interaction component. Gift buying seasons see interest variations.
    27. Grouping Using Commonality (diagram: User A’s interests and User B’s interests overlap in a shared region of common interests)
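The overlap in the diagram can be computed directly once each user has a keyword profile. This is a sketch under assumptions: the function names, the sample profiles, and the use of Jaccard similarity (shared interests divided by all distinct interests) as the grouping measure are illustrative choices that the slides do not specify.

```php
<?php
// Hypothetical helper: interests shared by two users, given keyword => score maps.
function common_interests(array $user_a, array $user_b): array
{
    return array_keys(array_intersect_key($user_a, $user_b));
}

// Hypothetical helper: Jaccard similarity, shared interests over all distinct interests.
function interest_similarity(array $user_a, array $user_b): float
{
    $shared = count(array_intersect_key($user_a, $user_b));
    $total  = count($user_a + $user_b); // array union keeps each key once
    return $total === 0 ? 0.0 : $shared / $total;
}

$user_a = ['shoes' => 3.5, 'laces' => 4.0, 'hiking' => 1.0];
$user_b = ['shoes' => 2.0, 'hiking' => 2.5, 'tents' => 1.0];

print_r(common_interests($user_a, $user_b));      // shoes, hiking
echo interest_similarity($user_a, $user_b), "\n"; // 2 shared of 4 distinct = 0.5
```

This version ignores the scores and only compares which keywords are present; a weighted comparison would be a natural next step.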
    28. Thank You! Questions? www.slideshare.com/jcleblanc Jonathan LeBlanc, Developer Evangelist (PayPal). Github: http://github.com/jcleblanc Twitter: @jcleblanc