The Bixo Web Mining Toolkit


Ken Krugler's talk at the Hadoop User Group (HUG) on the Bixo Web Mining Toolkit

Published in: Technology, Business

  1. Bixo - Web Mining Toolkit, 23 Sep 2009. Ken Krugler, TransPac Software, Inc. My background: I did a startup called Krugle from 2005 to 2008. We used Nutch to do a vertical crawl of the web, looking for technical software pages, and mined those pages for references to open source projects. I used that experience to create Bixo, an open source web mining toolkit built on top of Hadoop, Cascading, and Tika.
  2. Web Mining 101: Extracting & Processing Web Data. More than just search: business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts… A quick intro to web mining, so we’re on the same page. Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have high and growing value, and this is what Bixo focuses on.
  3. 4 Steps in Mining. Collect: fetch content from the web. Parse: extract data from formats. Analyze: tokenize, rate, classify, cluster. Produce: an index, a report. Note for search: this does not include serving up the search results. Why do I bring this up? To help clarify why web mining is not the same as vertical search (next slide).
  4. Vertical Search. Vertical crawl to get specific content. Common use case for Nutch, Heritrix. But web mining often has a different outcome, and specialized processing of data. Most people think of vertical search when they think of specialized web mining. Lots of people have been doing this, using OSS like Nutch & Heritrix. The end result is typically a Lucene index, plus the content, inverted links, etc. Typical web mining is not the same as vertical search: it often uses a white list, versus crawling to discover links, and does more specialized processing of the data. These differences help answer the question of (next slide)…
  5. Why Bixo? Response to the needs of commercial projects: plug into a Cascading-based workflow; low IT time/skill requirements; run well in the AWS EC2 environment; flexible I/O support for AWS (S3, HBase); toolkit for building custom solutions, such as fetching a white list (parse/index, data mine) or scraping a white list (social popularity). Does the world really need yet another web crawler? No, but it does need a web mining toolkit. Two companies agreed to sponsor work on Bixo as an open source project. On the point of running well in an EC2 environment: even though many web mining tasks can be handled on a single computer, you very quickly run into issues of scale if you can’t handle 100M+ pages.
  6. Bixo Overview. MIT-licensed open source project. In use by three companies. “Pipe” model for building workflows. Runs on top of Hadoop/Cascading. Full disclosure: Bixo makes heavy use of Cascading, which is under the GPL, so if you want to sell a product based on Bixo, you need to talk to Chris Wensel. The pipe model comes from our use of Cascading to define the workflows.
  7. What is Cascading? API for Hadoop data processing workflows. Operations on tuples with named fields. Workflows created from pipes. Reduces painful low-level MR details. Key for complex/reliable workflows. I know Chris Wensel has previously talked about Cascading here, but just to make sure we’re all on the same page: a “tuple” is like a row in a database, with named fields and values. An example of a tuple is the result of fetching a page, which has URL, time of fetch, content, headers, response rate, etc. Because you can build workflows out of a mix of pre-defined & custom pipes, it’s a real toolkit. Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :) A key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds): it finds cases where fields aren’t available for operations. This solves a key problem we ran into when customizing Nutch at Krugle.
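The idea of tuples with named fields, and of checking a workflow before it runs, can be illustrated with a minimal sketch. This is plain Python, not the Cascading API; `Step`, `check_plan`, and `run` are hypothetical names invented for the illustration, and the sample tuple carries only a few of the fields mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Step:
    """One operation in a linear workflow (hypothetical, not a Cascading class)."""
    name: str
    needs: Set[str]               # fields the operation reads
    adds: Set[str]                # fields the operation appends to each tuple
    fn: Callable[[Dict], Dict]    # maps a tuple to the new field values


def check_plan(source_fields, steps: List[Step]) -> Set[str]:
    """Fail before running if a step needs a field that nothing upstream provides."""
    available = set(source_fields)
    for step in steps:
        missing = step.needs - available
        if missing:
            raise ValueError(f"step '{step.name}' is missing fields: {sorted(missing)}")
        available |= step.adds
    return available


def run(tuples: List[Dict], steps: List[Step]) -> List[Dict]:
    """Apply each step in order, merging its output fields into every tuple."""
    for step in steps:
        tuples = [{**t, **step.fn(t)} for t in tuples]
    return tuples


# A fetched page as a tuple with named fields (sample values are made up).
fetched = [{"url": "http://example.com/", "status": 200, "content": "<html>hi</html>"}]
steps = [
    Step("length", {"content"}, {"length"}, lambda t: {"length": len(t["content"])}),
    Step("ok", {"status"}, {"ok"}, lambda t: {"ok": t["status"] == 200}),
]
check_plan(fetched[0].keys(), steps)   # raises if a field is unavailable
result = run(fetched, steps)
```

The point of the check is that a missing field is reported when the plan is built, rather than partway through a long-running job.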
  8. Architecture. This architecture looks nice and squeaky clean, and in general it is. One issue is that the fetch phase of Bixo does not fit well into the MR model. External resource constraints mean you can’t treat it like a regular job. So there are lots of threads in a special reduce phase, with corresponding issues: stack size and error handling.
  9. HUGMEE: Hadoop Users who Generate the Most Effective Emails. Let’s now use a real example of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up. But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.
  10. Helpful Hadoopers. Use mailing list archives for data (collect). Parse mbox files and emails (parse). Score based on key phrases (analyze). End result is score/name pair (produce). How do you figure out the most helpful Hadoopers? As we discussed previously, it’s a classic web mining problem. Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases (next slide)?
  11. Scoring Algorithm. Very sophisticated point system: “thanks” == 5; “owe you a beer” == 50; “worship the ground you walk on” == 100.
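The point system above can be sketched in a few lines of plain Python. The phrase weights come from the slide; the function name `score_text`, the case-insensitive matching, and the choice to count every occurrence of a phrase are assumptions, since the talk does not say how matches were counted.

```python
# Phrase weights as given on the slide.
PHRASE_SCORES = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text: str) -> int:
    """Sum the weights of every key phrase occurrence (assumed: case-insensitive)."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase) for phrase, points in PHRASE_SCORES.items())


example_score = score_text("Thanks, I owe you a beer!")
```

Note that plain substring matching would also count the “thanks” in a “thanks in advance” sign-off, so the text passed in should already have quoted material and sign-offs removed.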
  12. High Level Steps. Collect emails: fetch the mod_mbox-generated page; parse it to extract links to mbox files; fetch the mbox files; split them into separate emails. Parse emails: extract key headers (messageId, email, etc.); parse the body to identify quoted text. Parsing the mod_mbox page is simple with Tika’s HtmlParser. I cheated a bit when parsing emails: some users, like Owen, have many aliases, so I hand-generated an alias resolution table.
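The “parse emails” step can be sketched with Python’s standard `email` package. This is an illustration only, not Bixo’s parser: it assumes a single-part plain-text message, treats any line starting with “>” as quoted text, and uses a made-up `ALIASES` table to stand in for the hand-generated alias resolution table. The function name `parse_email` and the output field names are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical alias table: maps alternate addresses to one canonical address.
ALIASES = {"alice@work.example": "alice@example.org"}


def parse_email(raw: str) -> dict:
    """Extract key headers and the non-quoted part of a plain-text email."""
    msg = message_from_string(raw)
    _, address = parseaddr(msg.get("From", ""))
    address = address.lower()
    body = msg.get_payload()  # a str for single-part messages (assumed here)
    reply_lines = [line for line in body.splitlines()
                   if not line.lstrip().startswith(">")]  # drop quoted text
    return {
        "messageId": msg.get("Message-ID"),
        "inReplyTo": msg.get("In-Reply-To"),
        "email": ALIASES.get(address, address),
        "replyText": "\n".join(reply_lines).strip(),
    }


raw = (
    "From: Alice <Alice@Work.Example>\n"
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "Subject: Re: question\n"
    "\n"
    "> How do I set the block size?\n"
    "Thanks, that worked.\n"
)
parsed = parse_email(raw)
```

Only `replyText` would be passed on for scoring, so praise that appears in quoted text is not counted twice.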
  13. High Level Steps. Analyze emails: find key phrases in replies (ignore signoff); score emails by phrases; group & sum by message ID; group & sum by email address. Produce ranked list: toss email addresses with no love; sort by summed score. Need to ignore “thanks” in a “thanks in advance for doing my job for me” signoff. Generate two tuples for each email: one with messageId/name/address, and one with reply-to messageId/score. The group/sum aspect is a classic reduce operation.
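The group-and-sum steps above can be sketched in memory with plain Python dictionaries. In the talk this work runs as Cascading/MR jobs; the function name `rank`, the input shapes, and the sample addresses below are assumptions made for the illustration.

```python
from collections import defaultdict


def rank(authors, reply_scores):
    """Rank addresses by the praise their messages received.

    authors:      list of (message_id, address) pairs, one per email
    reply_scores: list of (in_reply_to_message_id, score) pairs, one per reply
    """
    # Group & sum by message ID: total praise each original message received.
    per_message = defaultdict(int)
    for message_id, score in reply_scores:
        per_message[message_id] += score

    # Join to the message author, then group & sum by email address.
    author_of = dict(authors)
    per_address = defaultdict(int)
    for message_id, score in per_message.items():
        address = author_of.get(message_id)
        if address is not None:  # replies to messages outside the archive are skipped
            per_address[address] += score

    # Toss addresses with no love, then sort by summed score (highest first).
    ranked = [(a, s) for a, s in per_address.items() if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


authors = [("<1>", "ted@example.org"), ("<2>", "ann@example.org"),
           ("<3>", "ted@example.org"), ("<4>", "bob@example.org")]
reply_scores = [("<1>", 5), ("<1>", 50), ("<3>", 100),
                ("<2>", 5), ("<4>", 0), ("<9>", 5)]
ranking = rank(authors, reply_scores)
```

Each of the two sums is a grouping by key followed by an aggregation, which is why it maps directly onto a reduce.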
  14. Workflow. I think this slide is pretty self-explanatory: two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but… The key point is that only the purple is stuff that I actually had to create. Some lines are purple as well, since that workflow (DAG) is also something I defined (see next page). But only two custom operations were actually needed: parsing mbox_page and calculating the score. Running took about 30 minutes, mostly waiting politely until it was OK to do another fetch. Downloaded 150MB of mbox files. 409 unique email addresses with at least one positive reply.
  15. Building the Flow. Most of the code needed to create the workflow for this data mining app. Lots of oatmeal code, which is good. You don’t want to be writing tricky code here. Could optimize, but that would be a mistake…most web mining is programmer-constrained. So just use more servers in EC2: cheaper & faster.
  16. mod_mbox Page. Example of the top-level pages that were fetched in the first phase. These then needed to be parsed to extract links to mbox files.
  17. Custom Operation. Example of one of the two custom operations: parsing the mod_mbox page. Uses Tika to extract IDs. Emits a tuple with a URL for each mbox ID.
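A rough sketch of what such an operation does, using Python’s standard `html.parser` in place of Tika: it scans the page for links, pulls out mbox IDs, and emits one tuple with a URL per ID. The link format matched here (a six-digit `YYYYMM.mbox` prefix in each `href`) is an assumption for illustration, as are the names `MboxLinkParser` and `mbox_urls`; the slide’s actual code is not reproduced in the transcript.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MboxLinkParser(HTMLParser):
    """Collects distinct mbox IDs from anchor hrefs (assumed form: YYYYMM.mbox/...)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = re.match(r"(\d{6}\.mbox)", href)
        if match and match.group(1) not in self.ids:
            self.ids.append(match.group(1))


def mbox_urls(page_url: str, html: str) -> list:
    """Emit one tuple (here a dict with a 'url' field) per mbox ID found on the page."""
    parser = MboxLinkParser()
    parser.feed(html)
    return [{"url": urljoin(page_url, mbox_id)} for mbox_id in parser.ids]


page = ('<a href="200908.mbox/thread">Aug</a>'
        '<a href="200909.mbox/thread">Sep</a>'
        '<a href="200909.mbox/date">by date</a>'
        '<a href="/about">About</a>')
tuples = mbox_urls("http://mail.example.org/list/", page)
```

The emitted URLs would then feed the second fetch cycle, which downloads the mbox files themselves.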
  18. Validate. Curve looks right: exponential decay. 409 unique email addresses that got some love from somebody.
  19. This Hug’s for Ted! And the winner is…Ted Dunning. I know, I should have colored the elephant yellow.
  20. Produce. A list of the usual suspects. Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.
  21. Use Bixo to… Find +/- product comments on forums. Compare web site quality. Track social network popularity. Derive optimized SEO terms. Scrape and analyze pricing data. The previous example could easily be changed to “find opinion makers on forums”. Many other use cases. All involve the web mining workflow: fetch, parse, analyze, produce.
  22. Summary. Bixo is a web mining toolkit. Built on Hadoop, Cascading, Tika. Young project but used commercially. Future: Mahout, monitoring, HBase, URL DB, cleanup, bug fixes, rinse, repeat. Lots to be done, of course, but moving fast.
  23. Resources. Web: List: Source: Bugs: URLs to find out more about the Bixo project. Stefan Groschupf from 101tec helped with initial Bixo coding. His company provides infrastructure for the project, hence its appearance in the URLs above.
  24. Any Questions?