Creating Operational Redundancy for Effective Web Data Mining


In this session, we will explore the principles behind building a highly scalable, efficient, and effective web data mining architecture, based on standard semantic principles of data collection. This type of standard collection will allow any company to turn unstructured web data into structurally sound, valuable content.

Published in: Technology

  • @DeveloperSteve Excellent point. The other thing to be aware of when implementing these principles in other countries is personal data retention law. In many countries, the rules on personal information cover several areas, including:
    - How long you can retain personal information
    - The security regulations behind the servers that the information is stored on
    - Whether that personal information needs to be made available to the users when requested
  • Slide 20: European players need to remember the cookie laws.
Slide notes
  • The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  • Open graph protocol
  • This is why I prefer using cURL: it allows customized requests, timeouts, redirect handling, etc.
  • Stripping irrelevant data
  • Scraping site keywords
  • You can also play with the fade in / fade out to modify the lightness and highlighting
Transcript

    • 1. Using Operational Redundancy: Effective Web Data Mining. Jonathan LeBlanc, Head of Developer Evangelism N.A. (PayPal). GitHub: @jcleblanc
    • 2. Premise: The interactions of a user can be used to personalize their experience.
    • 3. Elements of Mining Redundancy: Website Data Mining; User Emotional State Mining; User Interaction Mining
    • 4. Our Subject Material: HTML content is poorly structured. There are some pretty bad web practices on the interwebz. You can't trust that anything semantically valid will be present.
    • 5. How We'll Capture This Data: Start with base linguistics. Extend with available extras.
    • 6. The Basic Pieces: Page Data (Scrapey Scrapey); Keywords (Without all the fluff); Weighting (Word diets FTW)
    • 7. Capture Raw Page Data: Semantic data on the web is sucktastic. Assume 5 year olds built the sites. Language is the key.
    • 8. Extract Keywords: We now have a big jumble of words. Let's extract. Why is "and" a top word? Stop words = sad panda.
    • 9. Weight Keywords: All content is not created equal. Meta and headers and semantics oh my! This is where we leech off the work of others.
    • 10. Questions to Keep in Mind: Should I use regex to parse web content? How do users interact with page content? What key identifiers can be monitored to detect interest?
    • 11. Fetching the Data: cURL
        $req = curl_init($url);
        $options = array(
            CURLOPT_URL => $url,
            CURLOPT_HEADER => $header,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_AUTOREFERER => true,
            CURLOPT_TIMEOUT => 15,
            CURLOPT_MAXREDIRS => 10
        );
        curl_setopt_array($req, $options);
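The slide stops after setting the options, so the request itself is never executed. A minimal sketch of the remaining step, assuming a hypothetical wrapper named `fetch_page` (not from the talk) that turns a failed transfer into an exception; `CURLOPT_HEADER` is set to `false` here as an assumption so that only the body is returned:

```php
<?php
// Hypothetical wrapper around the cURL options shown on slide 11.
function fetch_page(string $url, int $timeout = 15): string
{
    $req = curl_init();
    curl_setopt_array($req, array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false,    // assumption: body only, no response headers
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_MAXREDIRS      => 10,
    ));

    // curl_exec() returns false on failure when RETURNTRANSFER is enabled.
    $body = curl_exec($req);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($req));
    }
    return $body;
}
```

A caller would wrap `fetch_page($url)` in a `try` / `catch` block and pass the returned string on as `$page_content`.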
    • 12.
        //list of findable / replaceable string characters
        $find = array('/\r/', '/\n/', '/\s\s+/');
        $replace = array(' ', ' ', ' ');
        //perform page content modification
        $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
        $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
        $mod_content = strip_tags($mod_content);
        $mod_content = strtolower($mod_content);
        $mod_content = preg_replace($find, $replace, $mod_content);
        $mod_content = trim($mod_content);
        $mod_content = explode(' ', $mod_content);
        natcasesort($mod_content);
    • 13.
        //set up list of stop words and the final found stopped list
        $common_words = array('a', ..., 'zero');
        $searched_words = array();
        //extract list of keywords with number of occurrences
        foreach ($mod_content as $word) {
            $word = trim($word);
            if (strlen($word) > 2 && !in_array($word, $common_words)) {
                //start the counter at 1 the first time a word is seen
                $searched_words[$word] = isset($searched_words[$word]) ? $searched_words[$word] + 1 : 1;
            }
        }
        arsort($searched_words, SORT_NUMERIC);
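Slides 12 and 13 can be combined into one function. The following is a sketch under stated assumptions rather than the speaker's code: `extract_keywords` is a hypothetical name, the stop word list is a two-word stand-in for the elided list on the slide, and a space is inserted before each tag because `strip_tags` would otherwise glue together words from adjacent elements.

```php
<?php
// Hypothetical helper: count keyword occurrences in raw HTML.
function extract_keywords(string $html, array $stop_words): array
{
    // Drop script and style blocks, which contain no readable language.
    $text = preg_replace('#<script(.*?)>(.*?)</script>#is', ' ', $html);
    $text = preg_replace('#<style(.*?)>(.*?)</style>#is', ' ', $text);

    // Pad tags with a space so "</h1><p>" does not join two words.
    $text = strtolower(strip_tags(str_replace('<', ' <', $text)));

    // Split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    // Most frequent words first.
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><h1>Data mining</h1><p>Mining the web and the data.</p>'
      . '<script>var x = 1;</script></body></html>';

// Illustrative stop words only; the talk's full list is not reproduced here.
$keywords = extract_keywords($html, array('and', 'the'));
print_r($keywords);
```

Splitting with a character class instead of `explode(' ', ...)` also removes punctuation, so "data." and "data" are counted as the same word.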
    • 14. Scraping Site Meta Data
        //load scraped page data as a valid DOM document
        $dom = new DOMDocument();
        @$dom->loadHTML($page_content);
        //scrape title
        $title = $dom->getElementsByTagName("title");
        $title = $title->item(0)->nodeValue;
    • 15.
        //loop through all found meta tags
        $metas = $dom->getElementsByTagName("meta");
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute("property")) {
                //Open Graph tags use the "property" attribute
                if ($meta->getAttribute("property") == "og:description") {
                    $dataReturn["description"] = $meta->getAttribute("content");
                }
            } else {
                //standard meta tags use the "name" attribute
                if ($meta->getAttribute("name") == "description") {
                    $dataReturn["description"] = $meta->getAttribute("content");
                } else if ($meta->getAttribute("name") == "keywords") {
                    $dataReturn["keywords"] = $meta->getAttribute("content");
                }
            }
        }
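Slides 14 and 15 can be folded into a single function that tolerates pages with no title or no meta tags. This is a sketch, not the speaker's implementation: `scrape_meta` is a hypothetical name, and preferring the Open Graph description over the plain one is an assumption (on the slide, whichever tag appears last wins).

```php
<?php
// Hypothetical helper: collect title, description, and keywords from a page.
function scrape_meta(string $page_content): array
{
    $data = array('title' => null, 'description' => null, 'keywords' => null);
    if ($page_content === '') {
        return $data; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Real-world markup is often invalid, so parser warnings are silenced.
    @$dom->loadHTML($page_content);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $content = $meta->getAttribute('content');
        if ($meta->getAttribute('property') === 'og:description') {
            // Assumption: the Open Graph description always takes priority.
            $data['description'] = $content;
        } elseif ($meta->getAttribute('name') === 'description'
            && $data['description'] === null) {
            $data['description'] = $content;
        } elseif ($meta->getAttribute('name') === 'keywords') {
            $data['keywords'] = $content;
        }
    }
    return $data;
}

$page = '<html><head><title> Example </title>'
      . '<meta name="description" content="plain">'
      . '<meta property="og:description" content="og">'
      . '<meta name="keywords" content="a, b">'
      . '</head><body></body></html>';
print_r(scrape_meta($page));
```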
    • 16. Weighting Important Data: Tags you should care about: meta (include OG), title, description, h1+, header. Bonus points for adding in content location modifiers.
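    • 17. Weighting Important Tags
        //our keyword weights
        $weights = array(
            "keywords" => "3.0",
            "meta" => "2.0",
            "header1" => "1.5",
            "header2" => "1.2"
        );
        //add modifier here
        if (strlen($word) > 2 && !in_array($word, $common_words)) {
            //start the counter at 1 the first time a word is seen
            $searched_words[$word] = isset($searched_words[$word]) ? $searched_words[$word] + 1 : 1;
        }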
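The "add modifier here" comment on slide 17 leaves the weighting step open. One possible reading, sketched below, adds the weight of the source tag instead of 1 for each occurrence. The function name `weigh_keywords`, the input shape (words grouped by source), and the `body` weight of 1.0 are all assumptions for illustration.

```php
<?php
// Weights from slide 17, plus an assumed default of 1.0 for plain body text.
$weights = array(
    'keywords' => 3.0,
    'meta'     => 2.0,
    'header1'  => 1.5,
    'header2'  => 1.2,
    'body'     => 1.0,
);

// Hypothetical helper: score each word by the weight of the tag it came from.
function weigh_keywords(array $words_by_source, array $weights, array $stop_words): array
{
    $scores = array();
    foreach ($words_by_source as $source => $words) {
        $weight = $weights[$source] ?? 1.0; // unknown sources count as body text
        foreach ($words as $word) {
            $word = strtolower(trim($word));
            if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

$scores = weigh_keywords(
    array(
        'keywords' => array('mining'),
        'header1'  => array('data', 'mining'),
        'body'     => array('the', 'data', 'web'),
    ),
    $weights,
    array('the', 'and') // illustrative stop words only
);
print_r($scores);
```

In this example "mining" scores 3.0 + 1.5 and outranks "data" (1.5 + 1.0), even though both appear twice.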
    • 18. Expanding to Phrases: 2-3 adjacent words, making up a direct relevant callout. Seems easy right? Just like single words. Language gets wonky without stop words.
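The phrase idea on slide 18 can be sketched as counting 2 and 3 word sequences. Because the slide warns that language gets wonky without stop words, this sketch keeps stop words inside a phrase and only rejects phrases that begin or end with one; that rule, the function name `extract_phrases`, and the sample stop words are assumptions, not something stated in the talk.

```php
<?php
// Hypothetical helper: count phrases of $min to $max adjacent words.
function extract_phrases(string $text, array $stop_words, int $min = 2, int $max = 3): array
{
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $counts = array();

    for ($size = $min; $size <= $max; $size++) {
        for ($i = 0; $i + $size <= $total; $i++) {
            $gram = array_slice($words, $i, $size);

            // Assumption: a phrase may contain a stop word, but not start or end with one.
            if (in_array($gram[0], $stop_words, true)
                || in_array($gram[$size - 1], $stop_words, true)) {
                continue;
            }

            $phrase = implode(' ', $gram);
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$phrases = extract_phrases(
    'Web data mining and web data collection',
    array('and', 'the', 'of')
);
print_r($phrases);
```

This sketch ignores sentence boundaries, so a phrase can span the end of one sentence and the start of the next.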
    • 19. Adding in Time Interactions: Interaction with a site does not necessarily mean interest in it. Time needs to also include an interaction component. Gift buying seasons see interest variations.
    • 20. Grouping Using Commonality: Interests (User A); Interests (User B); Interests (Common)
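The overlap shown on slide 20 can be sketched as an intersection of two keyword profiles. The profile shape (keyword mapped to score), the function name `common_interests`, and the choice to score a shared interest by the lower of the two values are all assumptions made for this example.

```php
<?php
// Hypothetical helper: find the interests two user profiles have in common.
function common_interests(array $user_a, array $user_b): array
{
    $common = array();
    foreach (array_intersect_key($user_a, $user_b) as $keyword => $score_a) {
        // Assumption: a shared interest is only as strong as its weaker side.
        $common[$keyword] = min($score_a, $user_b[$keyword]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Made-up profiles for illustration.
$user_a = array('cycling' => 4.5, 'cameras' => 2.0, 'coffee' => 1.0);
$user_b = array('cameras' => 3.0, 'coffee' => 0.5, 'hiking' => 2.5);

$shared = common_interests($user_a, $user_b);
print_r($shared);
```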
    • 21. Using Color Theory: Products with a feel-good message: happiness, energy, encouragement; Health care (but not food!): relatable, calm, friendly, peace, security; Startups / innovative products: creativity, imagination; Auction sites (but not sales sites!): passion, stimulation, excitement, power
    • 22. What We’re Talking About
    • 23. The CSS Service
    • 24. Engine Foundation: LESSPHP+
    • 25. The Basics of a Design Engine
        //create new LESS object
        $less = new lessc();
        //compile LESS code to CSS
        $less->checkedCompile('/path/styles.less', 'path/styles.css');
        //create new CSS file and return new file link
        echo "<link rel='stylesheet' href='http://path/styles.css' type='text/css' />";
    • 26. Passing Variables into LESSPHP
        //create a new LESS object
        $less = new lessc();
        //set the variables
        $less->setVariables(array(
            "color" => "red",
            "base" => "960px"
        ));
        //compile LESS into CSS and unset variables
        echo $less->compile(".magic { color: @color; width: @base - 200; }");
        $less->unsetVariable("color");
    • 27. Implementing Color Functions: Lighten / Darken; Saturate / Desaturate; Adjust Hue; Mix Colors
    • 28. Managing Irrelevant Content: Remove / hide content based on user profile and state.
    • 29. Managing Irrelevant Content
        //variables passed into LESS compilation
        $less->setVariables(array(
            "percent" => "80%",
        ));
        //LESS template (colors are unquoted so fade() receives color values, not strings)
        .highlight {
            @bg-color: #464646;
            @font-color: #eee;
            background-color: fade(@bg-color, @percent);
            color: fade(@font-color, @percent);
        }
    • 30. Traits of the Bored: Distraction, Repetition, Tiredness. Reasons for Boredom: Lack of interest, Readiness. Acting on Disinterest / Boredom.
    • 31. Highlighting on Agitated Behavior: Highlight relevant content to reduce agitated behavior.
    • 32. Acting Upon User Cues
        //variables passed into LESS script
        $less->setVariables(array(
            "percent" => "100%",
            "size-mod" => "2"
        ));
    • 33. Acting Upon User Cues
        //LESS script logic for color / size variations
        //(values are unquoted so mix() and the addition operate on colors and units, not strings)
        .highlight {
            @bg-calm: blue;
            @bg-action: red;
            @base-font: 14px;
            background-color: mix(@bg-calm, @bg-action, @percent);
            font-size: @size-mod + @base-font;
        }
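To see what the `mix()` call on slide 33 produces as `@percent` changes, the blend can be reproduced in plain PHP. This is a sketch of the linear blend that LESS `mix()` reduces to for fully opaque colors, where the weight is the share of the first color; `mix_rgb` is a hypothetical helper and not part of LESSPHP.

```php
<?php
// Hypothetical helper: blend two opaque RGB colors.
// $percent is the share of the first color, as in LESS mix(color1, color2, weight).
function mix_rgb(array $c1, array $c2, float $percent): array
{
    // Clamp the weight to the 0..100 range and convert it to a fraction.
    $w = max(0.0, min(100.0, $percent)) / 100.0;

    // Blend each channel (red, green, blue) independently.
    return array_map(
        fn($a, $b) => (int) round($a * $w + $b * (1 - $w)),
        $c1,
        $c2
    );
}

$calm   = array(0, 0, 255);   // blue, as @bg-calm
$action = array(255, 0, 0);   // red, as @bg-action

print_r(mix_rgb($calm, $action, 100)); // fully calm
print_r(mix_rgb($calm, $action, 50));  // halfway between calm and action
```

With `percent` at 100%, as on slide 32, the highlight stays fully on the calm color; lowering it shifts the background toward the action color.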
    • 34. Interaction and Emotion Plugin: jQuery Behavior Miner by Cedric Dugas
    • 35. In the End…: What a person is interested in. What a person is doing. What their emotional state is.
    • 36. You! Questions? Jonathan LeBlanc, Head of Developer Evangelism N.A. (PayPal). GitHub: @jcleblanc