Creating Operational Redundancy for Effective Web Data Mining

In this session, we will explore the principles behind building a highly scalable, efficient, and effective web data mining architecture, based on standard semantic principles of data collection. This type of standard collection will allow any company to turn unstructured web data into structurally sound, valuable content.

  • @DeveloperSteve Excellent point. The other thing to be aware of when implementing these principles in other countries is personal data retention law. In many countries the rules on personal information cover several areas, including:
    - How long you can retain personal information
    - The security regulations for the servers that the information is stored on
    - Whether that personal information must be made available to users on request
  • slide 20 European players need to remember the Cookie Laws
  • The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  • Open graph protocol
  • This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  • Stripping irrelevant data
  • Scraping site keywords
  • You can also play with the fade in / fade out to modify the lightness and highlighting

Transcript

  • 1. Using Operational Redundancy
    Effective Web Data Mining
    Jonathan LeBlanc
    Head of Developer Evangelism N.A. (PayPal)
    Github: http://github.com/jcleblanc
    Slides: http://slideshare.net/jcleblanc
    Twitter: @jcleblanc
  • 2. Premise
    - The interactions of a user can be used to personalize their experience
  • 3. Elements of Mining Redundancy
    - Website Data Mining
    - User Emotional State Mining
    - User Interaction Mining
  • 4. Our Subject Material
    - HTML content is poorly structured
    - There are some pretty bad web practices on the interwebz
    - You can’t trust that anything semantically valid will be present
  • 5. How We’ll Capture This Data
    - Start with base linguistics
    - Extend with available extras
  • 6. The Basic Pieces
    - Page Data: Scrapey Scrapey
    - Keywords: Without all the fluff
    - Weighting: Word diets FTW
  • 7. Capture Raw Page Data
    - Semantic data on the web is sucktastic
    - Assume 5 year olds built the sites
    - Language is the key
  • 8. Extract Keywords
    - We now have a big jumble of words. Let’s extract
    - Why is “and” a top word?
    - Stop words = sad panda
  • 9. Weight Keywords
    - All content is not created equal
    - Meta and headers and semantics oh my!
    - This is where we leech off the work of others
  • 10. Questions to Keep in Mind
    - Should I use regex to parse web content?
    - How do users interact with page content?
    - What key identifiers can be monitored to detect interest?
  • 11. Fetching the Data: cURL
      $req = curl_init($url);
      $options = array(
          CURLOPT_URL => $url,
          CURLOPT_HEADER => $header,
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_FOLLOWLOCATION => true,
          CURLOPT_AUTOREFERER => true,
          CURLOPT_TIMEOUT => 15,
          CURLOPT_MAXREDIRS => 10
      );
      curl_setopt_array($req, $options);
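
The slide sets the request options but does not show the request being executed. A minimal sketch of the remaining step, assuming the same options, might look like the following. The function name `fetch_page` and the choice to throw on failure are assumptions for illustration, not part of the talk.

```php
<?php
// Hypothetical helper: fetch_page() does not appear in the slides.
function fetch_page(string $url, bool $header = false): string
{
    $req = curl_init();
    curl_setopt_array($req, array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => $header,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_TIMEOUT        => 15,     // seconds
        CURLOPT_MAXREDIRS      => 10,
    ));

    // curl_exec() returns false on failure when RETURNTRANSFER is set
    $page_content = curl_exec($req);
    if ($page_content === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($req));
    }

    return $page_content;
}
```

Calling `fetch_page($url)` inside a `try` block lets a crawler skip unreachable pages rather than stop on the first failure.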
  • 12. //list of findable / replaceable string characters
      $find = array('/\r/', '/\n/', '/\s\s+/');
      $replace = array(' ', ' ', ' ');
      //perform page content modification
      $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
      $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
      $mod_content = strip_tags($mod_content);
      $mod_content = strtolower($mod_content);
      $mod_content = preg_replace($find, $replace, $mod_content);
      $mod_content = trim($mod_content);
      $mod_content = explode(' ', $mod_content);
      natcasesort($mod_content);
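
One detail worth illustrating: `strip_tags()` removes tags without leaving a space, so `<h1>Web Mining</h1><p>Mining` collapses to `Web MiningMining`. The sketch below is one possible variation on the cleanup above that replaces tags with spaces and splits on non-word characters. The function name `extract_words` is hypothetical, and the split pattern assumes lower-cased ASCII English text.

```php
<?php
// Hypothetical helper wrapping the cleanup steps; not from the slides.
function extract_words(string $page_content): array
{
    // Drop script and style blocks together with their contents
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $page_content);
    $text = preg_replace('#<style\b[^>]*>.*?</style>#is', ' ', $text);

    // Replace remaining tags with a space so neighbouring elements stay separate words
    $text = preg_replace('#<[^>]+>#', ' ', $text);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = strtolower($text);

    // Assumption: ASCII English text; split on anything that is not a letter, digit, or apostrophe
    return preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
}

$words = extract_words('<h1>Web Mining</h1><script>var x = 1;</script><p>Mining the web, again.</p>');
print_r($words);
```

Unlike the slide, this version leaves the words in page order, which matters later if adjacent words are to be combined into phrases.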
  • 13. //set up list of stop words and the final found stopped list
      $common_words = array('a', ..., 'zero');
      $searched_words = array();
      //extract list of keywords with number of occurrences
      foreach($mod_content as $word) {
          $word = trim($word);
          if(strlen($word) > 2 && !in_array($word, $common_words)){
              //start the count at 0 the first time a word is seen
              $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
          }
      }
      arsort($searched_words, SORT_NUMERIC);
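
The counting step can be packaged as a small function. The following is a sketch under a few assumptions: the name `count_keywords` is hypothetical, the three-word stop list is illustrative only (the slide's full list is elided), and the stop list is flipped into keys so each lookup avoids a linear `in_array()` scan.

```php
<?php
// Hypothetical helper; the stop list below is a tiny illustrative sample.
function count_keywords(array $words, array $stop_words): array
{
    // Flip the stop list so membership checks are key lookups
    $stop = array_flip($stop_words);
    $counts = array();

    foreach ($words as $word) {
        $word = trim($word);
        // Same filter as the slide: longer than two characters and not a stop word
        if (strlen($word) > 2 && !isset($stop[$word])) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    // Most frequent keywords first
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$stop_words = array('and', 'the', 'for');
$counts = count_keywords(array('web', 'mining', 'and', 'the', 'web', 'data', 'of'), $stop_words);
print_r($counts);
```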
  • 14. Scraping Site Meta Data
      //load scraped page data as a valid DOM document
      $dom = new DOMDocument();
      @$dom->loadHTML($page_content);
      //scrape title
      $title = $dom->getElementsByTagName("title");
      $title = $title->item(0)->nodeValue;
  • 15. //loop through all found meta tags
      $metas = $dom->getElementsByTagName("meta");
      for ($i = 0; $i < $metas->length; $i++){
          $meta = $metas->item($i);
          if($meta->getAttribute("property")){
              if ($meta->getAttribute("property") == "og:description"){
                  $dataReturn["description"] = $meta->getAttribute("content");
              }
          } else {
              if($meta->getAttribute("name") == "description"){
                  $dataReturn["description"] = $meta->getAttribute("content");
              } else if($meta->getAttribute("name") == "keywords"){
                  $dataReturn["keywords"] = $meta->getAttribute("content");
              }
          }
      }
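
The two scraping slides can be combined into one function. The sketch below makes a few assumptions that are not in the talk: the name `scrape_meta` is hypothetical, a missing title is reported as `null` instead of causing an error, libxml's internal error buffer is used in place of the `@` operator, and an Open Graph description is preferred over a plain one regardless of tag order.

```php
<?php
// Hypothetical helper; the "Open Graph wins" rule is an assumption.
function scrape_meta(string $page_content): array
{
    $data = array('title' => null, 'description' => null, 'keywords' => null);

    // Buffer parse warnings from messy HTML instead of suppressing them with @
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($page_content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Pages without a <title> would otherwise fail on item(0)->nodeValue
    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        $name = strtolower($meta->getAttribute('name'));
        $content = $meta->getAttribute('content');

        if ($property === 'og:description') {
            $data['description'] = $content;
        } elseif ($name === 'description' && $data['description'] === null) {
            $data['description'] = $content;
        } elseif ($name === 'keywords') {
            $data['keywords'] = $content;
        }
    }

    return $data;
}

$html = '<html><head><title> Demo </title>'
      . '<meta name="description" content="plain">'
      . '<meta property="og:description" content="og text">'
      . '<meta name="keywords" content="a, b">'
      . '</head><body></body></html>';
$meta = scrape_meta($html);
print_r($meta);
```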
  • 16. Weighting Important Data
    - Tags you should care about: meta (include OG), title, description, h1+, header
    - Bonus points for adding in content location modifiers
  • 17. Weighting Important Tags
      //our keyword weights
      $weights = array(
          "keywords" => "3.0",
          "meta" => "2.0",
          "header1" => "1.5",
          "header2" => "1.2"
      );
      //add modifier here
      if(strlen($word) > 2 && !in_array($word, $common_words)){
          //start the count at 0 the first time a word is seen
          $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
      }
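
The slide leaves the modifier itself as a placeholder. One way it could be filled in, shown here purely as a sketch, is to add the weight of the source a word came from in place of a flat count of 1. The function name `weight_keywords`, the grouping of words by source, and the default weight of 1.0 for unlisted sources such as body text are all assumptions.

```php
<?php
// Weights from the slide, written as floats
$weights = array('keywords' => 3.0, 'meta' => 2.0, 'header1' => 1.5, 'header2' => 1.2);

// Hypothetical helper: $words_by_source maps a source name to the words found there.
function weight_keywords(array $words_by_source, array $weights, array $common_words): array
{
    $scores = array();

    foreach ($words_by_source as $source => $words) {
        // Assumption: sources without a listed weight (e.g. body text) count as 1.0
        $weight = $weights[$source] ?? 1.0;

        foreach ($words as $word) {
            if (strlen($word) > 2 && !in_array($word, $common_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }

    // Highest weighted keywords first
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

$scores = weight_keywords(array(
    'body'     => array('mining', 'data', 'data', 'the'),
    'header1'  => array('mining'),
    'keywords' => array('mining'),
), $weights, array('the'));
print_r($scores);
```

In this sample, "mining" appears only once in the body but outranks "data" because it also appears in a header and in the keywords meta tag.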
  • 18. Expanding to Phrases
    - 2-3 adjacent words, making up a direct relevant callout
    - Seems easy right? Just like single words
    - Language gets wonky without stop words
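
A rough sketch of phrase extraction follows. It assumes the words are still in page order (sorting them first would destroy adjacency) and that stop words have not yet been removed. The rule used here, keeping stop words inside a phrase but rejecting phrases that begin or end with one, is an assumption chosen for illustration, as is the function name `extract_phrases`.

```php
<?php
// Hypothetical helper: counts 2- and 3-word phrases from words in page order.
function extract_phrases(array $words, array $stop_words, int $min = 2, int $max = 3): array
{
    $words = array_values($words);
    $stop = array_flip($stop_words);
    $total = count($words);
    $counts = array();

    for ($i = 0; $i < $total; $i++) {
        for ($n = $min; $n <= $max && $i + $n <= $total; $n++) {
            // Assumption: a phrase may contain a stop word but not start or end with one
            if (isset($stop[$words[$i]]) || isset($stop[$words[$i + $n - 1]])) {
                continue;
            }
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // Most frequent phrases first
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$phrases = extract_phrases(
    array('mining', 'of', 'web', 'data', 'and', 'web', 'data'),
    array('of', 'and')
);
print_r($phrases);
```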
  • 19. Adding in Time Interactions
    - Interaction with a site does not necessarily mean interest in it
    - Time needs to also include an interaction component
    - Gift buying seasons see interest variations
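
The talk does not show how time and interaction are combined. As one hypothetical illustration, time on a page could be counted only while interaction events (scrolls, clicks, key presses) keep arriving, with long silent gaps capped. The function name `engaged_seconds`, the event timestamps, and the 30-second idle limit are all assumptions.

```php
<?php
// Hypothetical helper: $event_times holds interaction timestamps in seconds.
function engaged_seconds(array $event_times, int $idle_limit = 30): int
{
    sort($event_times, SORT_NUMERIC);
    $engaged = 0;

    for ($i = 1; $i < count($event_times); $i++) {
        $gap = $event_times[$i] - $event_times[$i - 1];
        // A long gap suggests the page was left open, so cap its contribution
        $engaged += min($gap, $idle_limit);
    }

    return $engaged;
}

// Five events spread over 310 seconds, with one long idle stretch
$seconds = engaged_seconds(array(0, 5, 12, 300, 310));
echo $seconds, "\n";
```

Here the raw time on page is 310 seconds, but only 52 seconds are counted as engaged because the 288-second gap is capped at 30.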
  • 20. Grouping Using Commonality
    - Interests: User A
    - Interests: User B
    - Interests: Common
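
If each user's interests are stored as a keyword-to-score array, the common region can be sketched as a key intersection. The function name `common_interests`, the sample profiles, and the choice to score a shared interest by the weaker of the two scores are assumptions for illustration.

```php
<?php
// Hypothetical helper: both arguments map keyword => interest score.
function common_interests(array $user_a, array $user_b): array
{
    $common = array();

    // array_intersect_key() keeps the entries of $user_a whose keys also exist in $user_b
    foreach (array_intersect_key($user_a, $user_b) as $keyword => $score) {
        // Assumption: the shared strength is the weaker of the two scores
        $common[$keyword] = min($score, $user_b[$keyword]);
    }

    arsort($common, SORT_NUMERIC);
    return $common;
}

$user_a = array('photography' => 5.5, 'travel' => 2.0, 'cooking' => 1.0);
$user_b = array('travel' => 4.0, 'photography' => 1.5, 'cycling' => 3.0);
$common = common_interests($user_a, $user_b);
print_r($common);
```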
  • 21. Using Color Theory
    - Products with a feel-good message: happiness, energy, encouragement
    - Health care (but not food!): relatable, calm, friendly, peace, security
    - Startups / innovative products: creativity, imagination
    - Auction sites (but not sales sites!): passion, stimulation, excitement, power
  • 22. What We’re Talking About
  • 23. The CSS Service Engine
    - lesscss.org
    - sass-lang.com
    - learnboost.github.com/stylus
  • 24. Design Engine Foundation: LESS + PHP
    http://leafo.net/lessphp/
  • 25. The Basics of a Design Engine
      //create new LESS object
      $less = new lessc();
      //compile LESS code to CSS
      $less->checkedCompile('/path/styles.less', 'path/styles.css');
      //create new CSS file and return new file link
      echo '<link rel="stylesheet" href="http://path/styles.css" type="text/css" />';
  • 26. Passing Variables into LESSPHP
      //create a new LESS object
      $less = new lessc();
      //set the variables
      $less->setVariables(array(
          'color' => 'red',
          'base' => '960px'
      ));
      //compile LESS into CSS and unset variables
      echo $less->compile(".magic { color: @color; width: @base - 200; }");
      $less->unsetVariable('color');
  • 27. Implementing Color Functions
    - Lighten / Darken
    - Saturate / Desaturate
    - Adjust Hue
    - Mix Colors
  • 28. Managing Irrelevant Content
    - Remove / hide content based on user profile and state
  • 29. Managing Irrelevant Content
      //variables passed into LESS compilation
      $less->setVariables(array(
          "percent" => "80%",
      ));
      //LESS template
      .highlight{
          @bg-color: #464646;
          @font-color: #eee;
          background-color: fade(@bg-color, @percent);
          color: fade(@font-color, @percent);
      }
  • 30. Acting on Disinterest / Boredom
    - Traits of the bored: distraction, repetition, tiredness
    - Reasons for boredom: lack of interest, readiness
  • 31. Highlighting on Agitated Behavior
    - Highlight relevant content to reduce agitated behavior
  • 32. Acting Upon User Cues
      //variables passed into LESS script
      $less->setVariables(array(
          "percent" => "100%",
          "size-mod" => "2"
      ));
  • 33. Acting Upon User Cues
      //LESS script logic for color / size variations
      .highlight{
          @bg-calm: blue;
          @bg-action: red;
          @base-font: 14px;
          background-color: mix(@bg-calm, @bg-action, @percent);
          font-size: @size-mod + @base-font;
      }
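
The slides hard-code the values passed into LESS. As a hypothetical illustration of how they might be derived from behavior, the sketch below maps an agitation score between 0 and 1 to the two variables used by the template. The score itself, the function name `highlight_variables`, and the maximum size bump of 4 are assumptions. In LESS, the third argument of `mix()` is the weight given to the first color, so a lower percentage shifts the background away from the calm color toward the action color.

```php
<?php
// Hypothetical mapping from an agitation score (0 = calm, 1 = agitated) to LESS variables.
function highlight_variables(float $agitation): array
{
    // Clamp out-of-range scores
    $agitation = max(0.0, min(1.0, $agitation));

    return array(
        // Share of the calm color in mix(@bg-calm, @bg-action, @percent)
        'percent'  => round((1.0 - $agitation) * 100) . '%',
        // Amount added to the base font size; 4 is an arbitrary upper bound
        'size-mod' => (string) round($agitation * 4),
    );
}

$variables = highlight_variables(0.25);
print_r($variables);
```

The resulting array has the same shape as the one passed to `$less->setVariables()` in the slides.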
  • 34. Interaction and Emotion Plugin
    - jQuery Behavior Miner by Cedric Dugas
    - https://github.com/posabsolute/jquery-behavior-miner
  • 35. In the End…
    - What a person is interested in
    - What a person is doing
    - What their emotional state is
  • 36. Thank You! Questions?
    http://slideshare.com/jcleblanc
    Jonathan LeBlanc
    Head of Developer Evangelism N.A. (PayPal)
    Github: http://github.com/jcleblanc
    Slides: http://slideshare.net/jcleblanc
    Twitter: @jcleblanc