Web Mining in the Cloud
Ken Krugler, Bixo Labs, Inc.
ACM Silicon Valley Data Mining Camp, 01 November 2009
Hadoop/Cascading/Bixo in EC2
About me
- Background in vertical web crawl
  - Krugle search engine for open source code
  - Bixo open source web mining toolkit
- Consultant for companies using EC2
  - Web mining
  - Data processing
- Founder of Bixo Labs
  - Elastic web mining platform
  - http://bixolabs.com
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee…
Web Mining 101
- Extracting & Analyzing Web Data
- More Than Just Search
- Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…
4 Steps in Web Mining
- Collect - fetch content from web
- Parse - extract data from formats
- Analyze - tokenize, rate, classify, cluster
- Produce - "useful data"
Web Mining versus Data Mining
- Scale - 10 million isn't a big number
- Access - public but restricted
  - Special implicit rules apply
- Structure - not much
How to Mine Large Scale Web Data?
- Start with scalable map-reduce platform
- Add a workflow API layer
- Mix in a web crawling toolkit
- Write your custom data processing code
- Run in an elastic cloud environment
One Solution - the HECB Stack
- Bixo
- Cascading
- Hadoop
- EC2
EC2 - Amazon Elastic Compute Cloud
- True cost of non-cloud environment
  - Cost of servers & networking (2 year life)
  - Cost of colo (6 servers/rack)
  - Cost of OPS salary (15% of FTE/cluster)
  - Managing servers is no fun
- Web mining is perfect for the cloud
  - "Bursty" => savings are even greater
  - Data is distilled, so no transfer $$$ pain
Why Hadoop?
- Perfect for processing lots of data
  - Map-reduce
  - Distributed file system
- Open source, large community, etc.
- Runs well in EC2 clusters
- Elastic MapReduce as an option
Why Cascading?
- API on top of Hadoop
- Supports efficient, reliable workflows
- Reduces painful low-level MR details
- Build workflows using the "pipe" model (sketch below)
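To make the "pipe" model concrete, here is a minimal keyword-count flow written against the Cascading 1.x API of that era, following the word-count pattern from Cascading's own documentation. The field names, paths and regex are illustrative, not taken from the talk.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class KeywordCountFlow {
    public static void main(String[] args) {
        // Source: one line of page text per record; sink: keyword counts.
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // Pipe model: each operation is a pipe; pipes chain into an assembly.
        Pipe assembly = new Pipe("keywords");
        // Split each line into word tokens.
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\w+"));
        // Group identical words together, then count each group.
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // Cascading checks the resulting DAG (e.g. missing fields) before running.
        Flow flow = new FlowConnector(new Properties()).connect(source, sink, assembly);
        flow.complete();
    }
}
```

The point of the model is that you think in records and fields, and Cascading plans and validates the underlying map-reduce jobs for you.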
Why Bixo?
- Plugs into a Cascading-based workflow
  - Scales with the Hadoop cluster
  - Runs well in EC2
- Handles grungy web crawling details
  - Polite yet efficient fetching
  - Errors, web servers that lie
  - Parsing lots of formats, broken HTML
- Open source toolkit for web mining apps
SEO Keyword Data Mining
- Example of a typical web mining task
- Find common keywords (1, 2, 3 word terms) - see the sketch below
  - Do a domain-centric web crawl
  - Parse pages to extract title, meta, h1, links
  - Output keywords sorted by frequency
- Compare to competitor site(s)
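The "find common keywords" step amounts to sliding a 1-3 word window over page text and counting term frequencies. A plain-Java sketch of that idea (not the actual Bixo/Cascading operation, which used special tokenization):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordTerms {

    /** Count 1, 2 and 3 word terms in a block of page text. */
    public static Map<String, Integer> countTerms(String text) {
        // Crude tokenization; the real workflow used smarter, special-case tokenization.
        List<String> tokens = new ArrayList<String>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                tokens.add(w);
            }
        }

        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int size = 1; size <= 3; size++) {
            for (int i = 0; i + size <= tokens.size(); i++) {
                String term = String.join(" ", tokens.subList(i, i + size));
                Integer old = counts.get(term);
                counts.put(term, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }
}
```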
Workflow
Custom Code for Example
- Filtering URLs inside the domain (sketch below)
  - Non-English content
  - User-generated content (forums, etc.)
- Generating keywords from text
  - Special tokenization
  - One, two, three word phrases
- But 95% of the code was generic
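A sketch of the kind of URL filtering described above; the patterns here are invented examples, since the real filters were specific to the sites being crawled:

```java
import java.util.regex.Pattern;

public class DomainUrlFilter {
    // Hypothetical patterns: skip obvious non-English sections and user-generated content.
    private static final Pattern NON_ENGLISH =
            Pattern.compile("/(de|fr|es|ja|zh)(/|$)");
    private static final Pattern USER_GENERATED =
            Pattern.compile("/(forum|forums|community|comments)(/|$)");

    private final String domain;

    public DomainUrlFilter(String domain) {
        this.domain = domain;
    }

    /** Return true if the URL should be fetched. */
    public boolean accept(String url) {
        if (!url.contains(domain)) {
            return false;                       // stay inside the target domain
        }
        if (NON_ENGLISH.matcher(url).find()) {
            return false;                       // skip non-English content
        }
        if (USER_GENERATED.matcher(url).find()) {
            return false;                       // skip forums and similar UGC
        }
        return true;
    }
}
```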
End Result in Data Mining Tool
What Next?
- Another example - mining mailing lists
- Go straight to Summary/Q&A
- Talk about Public Terabyte Dataset
- Write tweets, posts & emails
- Find people to meet in the lobby
Another Example - HUGMEE
- Hadoop
- Users who
- Generate the
- Most
- Effective
- Emails
Helpful Hadoopers
- Use mailing list archives for data (collect)
- Parse mbox files and emails (parse)
- Score based on key phrases (analyze)
- End result is score/name pair (produce)
Scoring Algorithm
- Very sophisticated point system (sketch below)
- "thanks" == 5
- "owe you a beer" == 50
- "worship the ground you walk on" == 100
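A sketch of the scoring idea, using the point values from the slide; the matching details and the "thanks in advance" exclusion (mentioned later in the notes) are my own guesses at how it might be wired up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HelpfulnessScorer {
    // Point values from the slide; phrase matching details are assumptions.
    private static final Map<String, Integer> PHRASE_SCORES = new LinkedHashMap<String, Integer>();
    static {
        PHRASE_SCORES.put("worship the ground you walk on", 100);
        PHRASE_SCORES.put("owe you a beer", 50);
        PHRASE_SCORES.put("thanks", 5);
    }

    /** Score the new (non-quoted) text of a reply. */
    public static int score(String replyText) {
        String text = replyText.toLowerCase();
        // Ignore the "thanks in advance ..." signoff, which isn't real gratitude.
        text = text.replaceAll("thanks in advance[^.!\\n]*", "");

        int total = 0;
        for (Map.Entry<String, Integer> e : PHRASE_SCORES.entrySet()) {
            int from = 0;
            while ((from = text.indexOf(e.getKey(), from)) >= 0) {
                total += e.getValue();
                from += e.getKey().length();
            }
        }
        return total;
    }
}
```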
High Level Steps
- Collect emails (see the sketch below)
  - Fetch mod_mbox generated page
  - Parse it to extract links to mbox files
  - Fetch mbox files
  - Split into separate emails
- Parse emails
  - Extract key headers (messageId, email, etc)
  - Parse body to identify quoted text
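For the split and header-extraction steps, mbox archives separate messages with lines that begin "From ", and the interesting headers sit before the first blank line. A simplified plain-Java sketch (it ignores header folding and uses none of the actual Bixo/Tika code):

```java
import java.util.ArrayList;
import java.util.List;

public class MboxSplitter {

    /** Split an mbox file's contents into individual raw messages. */
    public static List<String> splitMessages(String mboxContents) {
        List<String> messages = new ArrayList<String>();
        StringBuilder current = null;
        for (String line : mboxContents.split("\r?\n")) {
            if (line.startsWith("From ")) {        // mbox "From " separator line
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    /** Pull a single header (e.g. "Message-ID", "In-Reply-To") out of a raw message. */
    public static String getHeader(String rawMessage, String name) {
        for (String line : rawMessage.split("\r?\n")) {
            if (line.isEmpty()) {
                break;                             // end of the header block
            }
            if (line.toLowerCase().startsWith(name.toLowerCase() + ":")) {
                return line.substring(name.length() + 1).trim();
            }
        }
        return null;
    }
}
```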
High Level Steps
- Analyze emails
  - Find key phrases in replies (ignore signoff)
  - Score emails by phrases
  - Group & sum by message ID
  - Group & sum by email address
- Produce ranked list
  - Toss email addresses with no love
  - Sort by summed score
Workflow
Building the Flow
mod_mbox Page
Custom Operation
Validate
This Hug’s for Ted!
Produce
Public Terabyte Dataset
- Sponsored by Concurrent/Bixolabs
- High quality crawl of top domains
  - HECB stack using Elastic MapReduce
- Hosted by Amazon in S3, free to EC2 users
- Crawl & processing code available
- Questions, input? http://bixolabs.com/PTD/
Summary
- HECB stack works well for web mining
  - Cheaper than typical colo option
  - Scales to hundreds of millions of pages
  - Reliable and efficient workflow
- Web mining has high & increasing value
  - Search engine optimization, advertising
  - Social networks, reputation
  - Competitive pricing
  - Etc, etc, etc.
Any Questions?
- My email: [email_address]
- Bixo mailing list: http://tech.groups.yahoo.com/group/bixo-dev/
 
My talk at the ACM Data Mining Unconference on 01 Nov 2009. How to use an open source stack (Hadoop, Cascading, Bixo) in EC2 for cost effective, scalable and reliable web mining.

Speaker notes:
  • Over the prior 4 years I had a startup called Krugle, which provided code search for open source projects and inside large companies. We did a large, 100M page crawl of the "programmer's web" to gather information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It's a toolkit for building web mining workflows, and I'll be talking more about that later. Several companies paid me to integrate Bixo into an existing data processing environment. And that in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. Elastic means the size of the system can easily be changed to match the web mining task.
  • This is the world that many of you live in: analyzing data to find important patterns. Here's an example of output from the QlikView business intelligence tool. It was used to help analyze the relative prevalence of keywords on two competing web sites. Here you see two-word terms that often occur on McAfee's site, but not on Symantec's, which is very useful data for anybody who worries about search engine optimization.
  • You all know about analyzing data to find important patterns that get managers all worked up…
  • But how do you get to this point? How do you use the web as the source for the data you're analyzing? That's what I'm going to be talking about here.
  • Quick intro to web mining, so we're on the same page. Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing.
  • It's common to confuse web crawling with fetching. Crawling is the process of automatically finding new pages by extracting links from fetched pages. But for many web mining applications, you have a "white list" of pre-defined URLs. In either case, though, you need to reliably, efficiently and politely fetch pages. Content comes in a variety of formats - typically HTML, but also PDF, Word, zip archives, etc. You need to parse these formats to extract the key data - typically text, but it could be image data. Often the analyze step will include aspects of machine learning - classification, clustering. "Useful data" covers a lot of ground, because there are a lot of ways to use the output of web mining. Generating an index is one of the most common, because people think about search as the goal. But for data mining, the end result at this point is often highly reduced data that is input to traditional data mining tools.
  • What are the key differences between web mining and traditional data mining? I'm saying "traditional" because the face of data mining is clearly changing, but if you look at most vendor tools, the focus is on what I'd call traditional data mining. Scale - 10M is big for data mining, but not for web mining. Access - with data mining, once you defeated Mongor, keeper of the database access keys, you were golden. Web pages are typically public, but they're a shared resource, so implicit rules apply - like "don't bring my web site to its knees". Web mining for data mining breaks the traditional implicit contract, so extra caution applies: the contract is that I let you crawl me, and you drive traffic to me when your search index goes live. But with data mining, there often isn't an index as the end result. And with mining databases, there's explicit structure, which is mostly lacking from web pages.
  • If it doesn't scale, then it won't handle the quantity of data you'll ultimately want to process from the web. If you can't create real workflows, it will never be reliable or efficient. If you don't use specialized web crawling code, you'll get blacklisted. Because you're trying to distill down large data, there's often some custom processing. And if you don't run it in a cloud environment, you'll be wasting money - I'll explain why in a few slides.
  • I’m focusing on one particular solution to the challenges of web mining that I just described. It’s the “HECB” stack. I’m going to talk about these from the bottom up, which is EC2 first, then Hadoop…but the acronym didn’t work as well.
  • At Krugle we ran two clusters, one of 11 servers, and a smaller 4 server cluster. In the end, our actual utilization ratio was probably less than 20%. Even with close to 100% utilization, the break-even point for EC2 vs. colo is somewhere between 50 and 200 servers, depending on who you talk to. If utilization was 20%, then break-even would be 250 to 1000 servers (see the sketch below). Mining for search doesn't work so well in this model - the cluster should be always crawling (ABC), so it's not as bursty. And transferring raw content, parsed output, and the index will generate lots of transfer charges. But for web mining that's focused on data mining, the data is distilled, so this isn't an issue.
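The utilization adjustment in that note is just the full-utilization break-even divided by the actual utilization; a tiny sketch of the arithmetic:

```java
public class BreakEven {
    /** Servers needed for colo to beat EC2, adjusted for actual utilization. */
    static double adjustedBreakEven(double breakEvenAtFullUtilization, double utilization) {
        return breakEvenAtFullUtilization / utilization;
    }

    public static void main(String[] args) {
        // 50-200 servers at ~100% utilization becomes 250-1000 at 20% utilization.
        System.out.println(adjustedBreakEven(50, 0.20));   // 250.0
        System.out.println(adjustedBreakEven(200, 0.20));  // 1000.0
    }
}
```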
  • Map-reduce - how do you parallelize the processing of lots of data so that you can do the work on many servers? The answer is map-reduce. HDFS - how do you store lots of data in a fault-tolerant, cost-effective manner? How do you make sure the data (the big stuff) moves as little as possible during processing? The answer is the Hadoop distributed file system. It's open source, so there's lots of support, consultants, rapid bug fixes, etc. Large companies are using it, especially Yahoo. Elastic MapReduce is a special service built on top of EC2 where it's easier to run Hadoop jobs, because you have access to pre-configured Hadoop clusters, special tools, etc.
  • If you ever had to write a complex workflow using Hadoop, you know the answer. It frees you from the lower-level details of thinking in map-reduce. You can think about the workflow as operations on records with fields. And in data mining, the workflow often winds up being very complex. Because you can build workflows out of a mix of pre-defined & custom pipes, it's a real toolkit. Chris explains it as: MR is assembly, and Cascading is C. Sometimes it feels more like C++ :) A key aspect of reliable workflows is Cascading's ability to check your workflow (the DAG it builds) and find cases where fields aren't available for operations. That solves a key problem we ran into when customizing Nutch at Krugle.
  • Does the world really need yet another web crawler? No, but it does need a web mining toolkit. Two companies agreed to sponsor work on Bixo as an open source project. Polite yet efficient - there's a tension between those two goals that's hard to resolve. If you do a crawl of any reasonable size, you'll run into lots of errors. Even if a web server says "I swear to you, I'm sending you a 20K HTML file in English", it turns out to be a 50K text file in Russian using the Cyrillic character set. And because it's open source, you get the benefit of a community of users. They contribute re-usable toolkit components.
  • Whenever I show a workflow diagram like this, I make a joke about it being intuitively obvious. Which, obviously, it’s not. And in fact the full workflow is a bit bigger, as I left out the second stage that describes more of the keyword analysis. But the key point is that the blue color items are provided by Cascading. And the green color items are provided by Bixo. So what’s left are two yellow items, which represent the two points of customization.
  • There were two main pieces of custom code that needed to be written. One was some URL filtering to focus on the right content inside the web sites: avoiding non-English pages by specific URL patterns, and the same kind of thing for forums and such, since those pages weren't part of what could easily be optimized. The other was the keyword generation from page text. And if enough people need this type of support, since Bixo is open source it will likely become part of the toolkit.
  • Finally we can actually use a traditional data mining tool to help make sense of the digested data. There are many things we could do in addition: clustering of results to improve the keyword analysis (larger sites have "areas of interest"), identifying broken links and typos, and identifying personal data - email addresses, phone numbers.
  • I try to limit presentations to 20 slides - so I've hit that limit. In the spirit of the unconference - let me know what you'd like to do next.
  • Let's use a real example now of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up. But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.
  • How do you figure out the most helpful Hadoopers? As we discussed previously, it's a classic web mining problem. Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases (next slide)?
  • Parsing the mod_mbox page is simple with Tika's HtmlParser. I cheated a bit when parsing emails - some users like Owen have many aliases, so I hand-generated an alias resolution table (see the sketch below).
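A sketch of the hand-generated alias table idea; the addresses below are invented placeholders, not entries from the real table:

```java
import java.util.HashMap;
import java.util.Map;

public class AliasResolver {
    // Hand-generated table mapping known alternate addresses to one canonical address.
    // These entries are invented placeholders.
    private static final Map<String, String> ALIASES = new HashMap<String, String>();
    static {
        ALIASES.put("owen@example.com", "owen@example.org");
        ALIASES.put("owen.o@example.net", "owen@example.org");
    }

    /** Collapse a sender's address to its canonical form so scores group correctly. */
    public static String canonicalAddress(String emailAddress) {
        String key = emailAddress.toLowerCase().trim();
        String canonical = ALIASES.get(key);
        return canonical != null ? canonical : key;
    }
}
```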
  • We need to ignore "thanks" in the "thanks in advance for doing my job for me" signoff. Generate two tuples for each email: one with messageId/name/address, and one with the reply-to messageId/score. The group/sum aspect is a classic reduce operation (sketched below).
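A plain-Java rendering of the join and group/sum described in that note; the real version was a set of Cascading grouping operations over tuples, so this shows only the logic, not the actual code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScoreAggregator {

    /** One scored reply: which message it replied to, and the phrase score it earned. */
    public static class ReplyScore {
        final String repliedToMessageId;
        final int score;
        public ReplyScore(String repliedToMessageId, int score) {
            this.repliedToMessageId = repliedToMessageId;
            this.score = score;
        }
    }

    /**
     * Join each scored reply back to the sender of the original message,
     * then sum per sender - the "group and sum" reduce steps from the note.
     */
    public static Map<String, Integer> totalBySender(Map<String, String> senderByMessageId,
                                                     List<ReplyScore> replyScores) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (ReplyScore reply : replyScores) {
            String sender = senderByMessageId.get(reply.repliedToMessageId);
            if (sender == null) {
                continue;    // reply to a message not in the archive
            }
            Integer old = totals.get(sender);
            totals.put(sender, (old == null ? 0 : old) + reply.score);
        }
        return totals;
    }
}
```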
  • I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but… the key point is that only the purple is stuff I had to actually create. Some lines are purple as well, since that workflow (DAG) is also something I defined - see the next page. But only two custom operations were actually needed - parsing the mbox page and calculating the score. Running took about 30 minutes - mostly waiting until it was OK to politely do another fetch. It downloaded 150MB of mbox files and found 409 unique email addresses with at least one positive reply.
  • This is most of the code needed to create the workflow for this data mining app. Lots of oatmeal code - which is good; you don't want to be writing tricky code here. We could optimize, but that would be a mistake… most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.
  • Example of the top-level pages that were fetched in the first phase. These then needed to be parsed to extract links to mbox files.
  • Example of one of the two custom operations: parsing the mod_mbox page. It uses Tika to extract the mbox IDs, and emits a tuple with the URL for each mbox ID (see the sketch below).
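A sketch of what such a custom Cascading operation can look like, assuming the Cascading 1.x Function API. The real operation used Tika's HtmlParser to find the mbox links; this sketch substitutes a regex, and guesses the "url"/"content" field names, to stay self-contained:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

/**
 * Takes a fetched mod_mbox index page and emits one tuple per linked mbox file.
 * Illustrative only - not the actual Bixo/Tika-based operation from the talk.
 */
public class MboxLinkExtractor extends BaseOperation implements Function {

    private static final Pattern MBOX_LINK = Pattern.compile("href=\"([^\"]+\\.mbox[^\"]*)\"");

    public MboxLinkExtractor() {
        super(new Fields("mboxUrl"));   // the field this operation declares
    }

    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        String baseUrl = functionCall.getArguments().getString("url");
        String html = functionCall.getArguments().getString("content");

        Matcher m = MBOX_LINK.matcher(html);
        while (m.find()) {
            // Emit one tuple per mbox link, resolved against the page URL.
            String mboxUrl = baseUrl.endsWith("/") ? baseUrl + m.group(1) : baseUrl + "/" + m.group(1);
            functionCall.getOutputCollector().add(new Tuple(mboxUrl));
        }
    }
}
```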
  • Curve looks right - exponential decay. 409 unique email addresses that got some love from somebody.
  • And the winner is… Ted Dunning. I know - I should have colored the elephant yellow.
  • A list of the usual suspects. Coincidentally, Ted helped me derive the scoring algorithm I used… hmm.