Creating Operational Redundancy for Effective Web Data Mining
In this session, we will explore the principles behind building a highly scalable, efficient, and effective web data mining architecture, based on standard semantic principles of data collection. This type of standard collection will allow any company to turn unstructured web data into structurally sound, valuable content.

Published in: Technology
  • @DeveloperSteve Excellent point - the other thing to be aware of when implementing these principles in other countries is personal data retention law. In many countries, the rules on personal information cover several areas, including:
    - How long you can retain personal information
    - The security regulations behind the servers that the information is stored on
    - Whether that personal information needs to be made available to the users when requested
  • slide 20 European players need to remember the Cookie Laws

  • The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  • Open graph protocol
  • This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  • Stripping irrelevant data
  • Scraping site keywords
  • You can also play with the fade in / fade out to modify the lightness and highlighting

    1. Using Operational Redundancy
        Effective Web Data Mining
        Jonathan LeBlanc, Head of Developer Evangelism N.A. (PayPal)
        Github: http://github.com/jcleblanc
        Slides: http://slideshare.net/jcleblanc
        Twitter: @jcleblanc
    2. Premise
        - The interactions of a user can be used to personalize their experience
    3. Elements of Mining Redundancy
        - Website Data Mining
        - User Emotional State Mining
        - User Interaction Mining
    4. Our Subject Material
        - HTML content is poorly structured
        - There are some pretty bad web practices on the interwebz
        - You can’t trust that anything semantically valid will be present
    5. How We’ll Capture This Data
        - Start with base linguistics
        - Extend with available extras
    6. The Basic Pieces
        - Page Data: Scrapey Scrapey
        - Keywords: Without all the fluff
        - Weighting: Word diets FTW
    7. Capture Raw Page Data
        - Semantic data on the web is sucktastic
        - Assume 5 year olds built the sites
        - Language is the key
    8. Extract Keywords
        - We now have a big jumble of words. Let’s extract
        - Why is “and” a top word?
        - Stop words = sad panda
    9. Weight Keywords
        - All content is not created equal
        - Meta and headers and semantics oh my!
        - This is where we leech off the work of others
    10. Questions to Keep in Mind
        - Should I use regex to parse web content?
        - How do users interact with page content?
        - What key identifiers can be monitored to detect interest?
    11. Fetching the Data: cURL
        $req = curl_init($url);
        $options = array(
            CURLOPT_URL => $url,
            CURLOPT_HEADER => $header,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_AUTOREFERER => true,
            CURLOPT_TIMEOUT => 15,
            CURLOPT_MAXREDIRS => 10
        );
        curl_setopt_array($req, $options);
    12. Stripping irrelevant data
        //list of findable / replaceable string characters
        $find = array('/\r/', '/\n/', '/\s\s+/');
        $replace = array(' ', ' ', ' ');
        //perform page content modification: drop scripts, styles and tags
        $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
        $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
        $mod_content = strip_tags($mod_content);
        $mod_content = strtolower($mod_content);
        $mod_content = preg_replace($find, $replace, $mod_content);
        $mod_content = trim($mod_content);
        //split into single words and sort them
        $mod_content = explode(' ', $mod_content);
        natcasesort($mod_content);
    13. Scraping site keywords
        //set up list of stop words and the final found stopped list
        $common_words = array('a', ..., 'zero');
        $searched_words = array();
        //extract list of keywords with number of occurrences
        foreach($mod_content as $word) {
            $word = trim($word);
            if(strlen($word) > 2 && !in_array($word, $common_words)){
                //initialize the counter on first sight to avoid an undefined key warning
                if(!isset($searched_words[$word])){
                    $searched_words[$word] = 0;
                }
                $searched_words[$word]++;
            }
        }
        //most frequent words first
        arsort($searched_words, SORT_NUMERIC);
    14. Scraping Site Meta Data
        //load scraped page data as a valid DOM document
        $dom = new DOMDocument();
        @$dom->loadHTML($page_content);
        //scrape title
        $title = $dom->getElementsByTagName("title");
        $title = $title->item(0)->nodeValue;
    15. Scraping Site Meta Data (continued)
        //loop through all found meta tags
        $metas = $dom->getElementsByTagName("meta");
        for ($i = 0; $i < $metas->length; $i++){
            $meta = $metas->item($i);
            if($meta->getAttribute("property")){
                //Open Graph description
                if ($meta->getAttribute("property") == "og:description"){
                    $dataReturn["description"] = $meta->getAttribute("content");
                }
            } else {
                //standard description and keywords meta tags
                if($meta->getAttribute("name") == "description"){
                    $dataReturn["description"] = $meta->getAttribute("content");
                } else if($meta->getAttribute("name") == "keywords"){
                    $dataReturn["keywords"] = $meta->getAttribute("content");
                }
            }
        }
    16. Weighting Important Data
        - Tags you should care about: meta (include OG), title, description, h1+, header
        - Bonus points for adding in content location modifiers
    17. Weighting Important Tags
        //our keyword weights
        $weights = array(
            "keywords" => "3.0",
            "meta" => "2.0",
            "header1" => "1.5",
            "header2" => "1.2"
        );
        //add modifier here
        if(strlen($word) > 2 && !in_array($word, $common_words)){
            $searched_words[$word]++;
        }
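The "add modifier here" placeholder on slide 17 is left open in the deck. One possible way to fill it, as a rough sketch only: add the weight of the place where a word was found instead of adding 1. The function name `weightKeywords`, the source labels, and the default weight of 1.0 for unweighted text such as the body are all assumptions, not part of the talk.

```php
<?php
// Hypothetical helper, not from the slides: scores each word by where it appears.
function weightKeywords(array $sources, array $weights, array $commonWords): array
{
    $scores = array();
    foreach ($sources as $label => $text) {
        // Assumption: sources without an explicit weight (e.g. body text) count as 1.0.
        $weight = isset($weights[$label]) ? (float) $weights[$label] : 1.0;
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // Same filter as the slides: skip short words and stop words.
            if (strlen($word) > 2 && !in_array($word, $commonWords, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

// Weights as shown on slide 17; the sample texts below are made up.
$weights = array("keywords" => "3.0", "meta" => "2.0", "header1" => "1.5", "header2" => "1.2");
$sources = array(
    "keywords" => "mining, redundancy",
    "header1"  => "Web Data Mining",
    "body"     => "Mining the web for data and more data",
);
$scores = weightKeywords($sources, $weights, array("the", "and", "for"));
print_r($scores);
```

With this sample input, "mining" ends up on top because it appears in the keywords meta tag, the first-level header, and the body.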
    18. Expanding to Phrases
        - 2-3 adjacent words, making up a direct relevant callout
        - Seems easy right? Just like single words
        - Language gets wonky without stop words
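Slide 18 describes phrases as 2-3 adjacent words and warns that language gets wonky once stop words are stripped. A minimal sketch of one way to handle that, assuming phrases are built from the original word order and only dropped when they begin or end with a stop word. The function name `extractPhrases` and that filtering rule are assumptions, not the speaker's method.

```php
<?php
// Hypothetical helper, not from the slides: counts phrases of $min to $max adjacent words.
function extractPhrases(string $text, array $commonWords, int $min = 2, int $max = 3): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $phrases = array();
    for ($size = $min; $size <= $max; $size++) {
        for ($i = 0; $i + $size <= $total; $i++) {
            $slice = array_slice($words, $i, $size);
            // Assumption: stop words may sit inside a phrase, but not at either edge.
            if (in_array($slice[0], $commonWords, true)
                || in_array($slice[$size - 1], $commonWords, true)) {
                continue;
            }
            $phrase = implode(' ', $slice);
            $phrases[$phrase] = ($phrases[$phrase] ?? 0) + 1;
        }
    }
    arsort($phrases, SORT_NUMERIC);
    return $phrases;
}

$phrases = extractPhrases(
    "Web data mining is fun. Web data mining of the web.",
    array("is", "of", "the")
);
print_r($phrases);
```

Keeping stop words inside a phrase preserves callouts such as "mining is fun", while edge filtering removes fragments such as "of the". The sketch ignores sentence boundaries, so a phrase can span two sentences.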
    19. Adding in Time Interactions
        - Interaction with a site does not necessarily mean interest in it
        - Time needs to also include an interaction component
        - Gift buying seasons see interest variations
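Slide 19 says time on a page only signals interest when it comes with interaction. As an illustration only, time can be capped by the amount of interaction observed, so an idle tab earns nothing. The function name `interestScore` and the 30-second credit per interaction are hypothetical values chosen for the sketch, and seasonal variation is not modeled.

```php
<?php
// Hypothetical scoring rule, not from the slides: time only counts when backed by interaction.
function interestScore(float $secondsOnPage, int $interactions, float $secondsPerInteraction = 30.0): float
{
    // Each click, scroll or keypress "unlocks" a window of credited time (assumed: 30 s).
    $creditedSeconds = $interactions * $secondsPerInteraction;
    return min($secondsOnPage, $creditedSeconds);
}

$idleTab   = interestScore(600, 0); // ten minutes open, no interaction
$skimming  = interestScore(600, 4); // ten minutes open, four interactions
$engaged   = interestScore(45, 3);  // short but active visit
```

Under these assumptions the idle tab scores 0.0, the skimmed page is capped at 120.0 seconds, and the short active visit keeps its full 45.0 seconds.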
    20. Grouping Using Commonality
        - Interests: User A
        - Interests: User B
        - Interests: Common
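The overlap pictured on slide 20 can be sketched as a key intersection of two keyword profiles. The function name `commonInterests`, the sample profiles, and the choice to score a shared interest by the weaker of the two users' scores are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not from the slides: finds the interests two users share.
function commonInterests(array $userA, array $userB): array
{
    $common = array();
    // array_intersect_key keeps the entries of $userA whose keys also exist in $userB.
    foreach (array_intersect_key($userA, $userB) as $keyword => $scoreA) {
        // Assumption: shared strength is the weaker of the two users' scores.
        $common[$keyword] = min($scoreA, $userB[$keyword]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Made-up keyword profiles (keyword => weighted score).
$userA = array("php" => 5.5, "less" => 3.0, "curl" => 1.0);
$userB = array("less" => 4.0, "php" => 2.0, "sass" => 6.0);
$common = commonInterests($userA, $userB);
print_r($common);
```

Here both users share "less" and "php"; "curl" and "sass" drop out because only one user shows interest in each.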
    21. Using Color Theory
        - Products with a feel-good message: Happiness, energy, encouragement
        - Health care (but not food!): Relatable, calm, friendly, peace, security
        - Startups / innovative products: Creativity, imagination
        - Auction sites (but not sales sites!): Passion, stimulation, excitement, power
    22. What We’re Talking About
    23. The CSS Service Engine
        - lesscss.org
        - sass-lang.com
        - learnboost.github.com/stylus
    24. Design Engine Foundation: LESSPHP
        http://leafo.net/lessphp/
    25. The Basics of a Design Engine
        //create new LESS object
        $less = new lessc();
        //compile LESS code to CSS
        $less->checkedCompile('/path/styles.less', 'path/styles.css');
        //create new CSS file and return new file link
        echo "<link rel='stylesheet' href='http://path/styles.css' type='text/css' />";
    26. Passing Variables into LESSPHP
        //create a new LESS object
        $less = new lessc();
        //set the variables
        $less->setVariables(array(
            'color' => 'red',
            'base' => '960px'
        ));
        //compile LESS into CSS and unset variables
        echo $less->compile(".magic { color: @color; width: @base - 200; }");
        $less->unsetVariable('color');
    27. Implementing Color Functions
        - Lighten / Darken
        - Saturate / Desaturate
        - Adjust Hue
        - Mix Colors
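To make "Mix Colors" concrete, here is a small worked sketch in plain PHP of a linear blend between two hex colors. It is a simplification for illustration and not the LESS implementation (which also accounts for alpha); the function name `mixColors` and the 0 to 1 weight convention are assumptions.

```php
<?php
// Hypothetical helper, not from the slides: linear RGB blend of two "#rrggbb" colors.
// $weight is the share of the first color, from 0.0 to 1.0.
function mixColors(string $hexA, string $hexB, float $weight): string
{
    $a = sscanf($hexA, "#%02x%02x%02x");
    $b = sscanf($hexB, "#%02x%02x%02x");
    $mixed = array();
    for ($i = 0; $i < 3; $i++) {
        // Blend each channel, then round to the nearest whole value.
        $mixed[$i] = (int) round($a[$i] * $weight + $b[$i] * (1.0 - $weight));
    }
    return sprintf("#%02x%02x%02x", $mixed[0], $mixed[1], $mixed[2]);
}

$purple = mixColors("#ff0000", "#0000ff", 0.5); // equal parts red and blue
$red    = mixColors("#ff0000", "#0000ff", 1.0); // all first color
echo $purple, "\n", $red, "\n";
```

An even blend of pure red and pure blue gives `#800080`, and a weight of 1.0 returns the first color unchanged.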
    28. Managing Irrelevant Content
        - Remove / hide content based on user profile and state
    29. Managing Irrelevant Content
        //variables passed into LESS compilation
        $less->setVariables(array(
            "percent" => "80%",
        ));

        //LESS template (colors are unquoted so fade() receives color values)
        .highlight{
            @bg-color: #464646;
            @font-color: #eee;
            background-color: fade(@bg-color, @percent);
            color: fade(@font-color, @percent);
        }
    30. Acting on Disinterest / Boredom
        - Traits of the Bored: Distraction, Repetition, Tiredness
        - Reasons for Boredom: Lack of interest, Readiness
    31. Highlighting on Agitated Behavior
        - Highlight relevant content to reduce agitated behavior
    32. Acting Upon User Cues
        //variables passed into LESS script
        $less->setVariables(array(
            "percent" => "100%",
            "size-mod" => "2"
        ));
    33. Acting Upon User Cues
        //LESS script logic for color / size variations
        //(values are unquoted so mix() and the font-size sum receive colors and units)
        .highlight{
            @bg-calm: blue;
            @bg-action: red;
            @base-font: 14px;
            background-color: mix(@bg-calm, @bg-action, @percent);
            font-size: @size-mod + @base-font;
        }
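The slides pass fixed values for `percent` and `size-mod`. How those values might be derived from observed behavior is not shown, so the following is only a sketch under stated assumptions: an agitation score between 0 and 1 (hypothetical, for example produced by a behavior tracker) is scaled to the two variables, with the maximum matching the "100%" and "2" used on slide 32. The function name `highlightVariables` is invented for the example.

```php
<?php
// Hypothetical helper, not from the slides: maps an agitation score (0.0 to 1.0)
// to the two variables the LESS script on the slides expects.
function highlightVariables(float $agitation): array
{
    // Clamp out-of-range scores so the LESS values stay valid.
    $level = max(0.0, min(1.0, $agitation));
    return array(
        // Share of the calm color in mix(@bg-calm, @bg-action, @percent).
        "percent"  => ((int) round($level * 100)) . "%",
        // Font size bump added to the base font size (assumed range: 0 to 2).
        "size-mod" => (string) ((int) round($level * 2)),
    );
}

$calmUser     = highlightVariables(0.0);
$agitatedUser = highlightVariables(1.0);
print_r($agitatedUser);
```

The returned array has the same shape as the one passed to `$less->setVariables()` on slide 32, so it could be handed to that call directly.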
    34. Interaction and Emotion Plugin
        - jQuery Behavior Miner by Cedric Dugas
        - https://github.com/posabsolute/jquery-behavior-miner
    35. In the End…
        - What a person is interested in
        - What a person is doing
        - What their emotional state is
    36. Thank You! Questions?
        http://slideshare.com/jcleblanc
        Jonathan LeBlanc, Head of Developer Evangelism N.A. (PayPal)
        Github: http://github.com/jcleblanc
        Slides: http://slideshare.net/jcleblanc
        Twitter: @jcleblanc