Bixo - Web Mining Toolkit                                                                   23 Sep 2009




                   Web Mining Toolkit




                                            Ken Krugler
                                    TransPac Software, Inc.




             My background - did a startup called Krugle from 2005 - 2008
             Used Nutch to do a vertical crawl of the web, looking for technical software
             pages.
             Mined pages for references to open source projects.


             Used experience to create Bixo, an open source web mining toolkit
             Built on top of Hadoop, Cascading, Tika.








                       Web Mining 101

                        Extracting & Processing Web Data
                        More Than Just Search
                        Business intelligence, competitive intelligence,
                        events, people, companies, popularity, pricing,
                        social graphs, Twitter feeds, Facebook friends,
                        support forums, shopping carts…




             Quick intro to web mining, so we’re on the same page


             Most people think about the big search companies when they think about web
             mining.
             Search is clearly the biggest web mining category, and generates the most
             revenue.
             But other types of web mining have value that is high and growing.
             This is what Bixo focuses on.








                      4 Steps in Mining

                        Collect - fetch content from web
                        Parse - extract data from formats
                        Analyze - tokenize, rate, classify, cluster
                        Produce - an index, a report
                        Search




             Note - does not include serving up the search results
             Why do I bring this up? To help clarify why web mining is not the same as
             vertical search (next slide)








                         Vertical Search

                         Vertical crawl to get specific content
                         Common use case for Nutch, Heritrix
                         But web mining often has different outcome
                         And specialized processing of data




             Most people think of vertical search when they think of specialized web
             mining.
             Lots of people have been doing this, using OSS like Nutch & Heritrix.
             End result is typically a Lucene index, plus the content, inverted links, etc.


             Typical web mining is not the same as vertical search.
             Often uses a white list, versus crawling to discover links.
             More specialized processing of the data.
             And these differences help answer the question of (next slide)…








                              Why Bixo?

                        Response to needs of commercial projects
                         – Plug into Cascading-based workflow
                         – Low IT time/skill requirements
                         – Run well in AWS EC2 environment
                         – Flexible I/O support for AWS - S3, HBase
                         – Toolkit for building custom solutions
                             • Fetch white list (parse/index, data mine)
                             • Scrape white list (social popularity)




             Does the world really need yet another web crawler?
             No, but it does need a web mining toolkit


             Two companies agreed to sponsor work on Bixo as an open source project.


             On the point of running well in an EC2 environment…
             Even though many web mining tasks can be handled on a single computer,
             you very quickly run into issues of scale if you can’t handle 100M+
             pages.








                            Bixo Overview

                        MIT license open source project
                        In use by three companies
                        “Pipe” model for building workflows
                        Runs on top of Hadoop/Cascading




             Full disclosure - Bixo makes heavy use of Cascading, which is under GPL.
             So if you want to sell a product based on Bixo, you need to talk to Chris
             Wensel.


             The pipe model comes from our use of Cascading to define the workflows.








                     What is Cascading

                        API for Hadoop data processing workflows
                        Operations on tuples with named fields
                        Workflows created from pipes
                        Reduces painful low-level MR details
                        Key for complex/reliable workflows




             I know Chris Wensel has previously talked about Cascading here, but just to
             make sure we’re all on the same page…


             “tuple” is like a row in a database. Named fields with values.
             Example of tuple - result of fetching a page, has URL, time of fetch, content,
             headers, response rate, etc.


             Because you can build workflows out of a mix of pre-defined & custom pipes,
             it’s a real toolkit.


             Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels
             more like C++ :)


             Key aspect of reliable workflows is Cascading’s ability to check your
             workflow (the DAG it builds)
             Finds cases where fields aren’t available for operations.
             Solves a key problem we ran into when customizing Nutch at Krugle
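
To illustrate that kind of plan-time check, here is a rough Python sketch. It is not Cascading's actual API: the Operation class and the field names are made up for this example. The idea is to walk the workflow before running anything and flag any step that needs a field nobody upstream provides. This version only handles a linear chain of steps, whereas Cascading checks the full DAG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """Hypothetical pipe step: consumes some named fields, adds others."""
    name: str
    needs: frozenset
    adds: frozenset


def check_workflow(source_fields, operations):
    """Return a list of error strings; an empty list means the plan is valid."""
    available = set(source_fields)
    errors = []
    for op in operations:
        missing = op.needs - available
        if missing:
            errors.append(f"{op.name}: missing fields {sorted(missing)}")
        # Fields added by this step become available to later steps.
        available |= op.adds
    return errors


# Fields of a fetched-page tuple, loosely following the example in the notes.
fetched = ["url", "fetch_time", "content", "headers", "response_rate"]
plan = [
    Operation("parse", frozenset({"content"}), frozenset({"text", "links"})),
    Operation("score", frozenset({"text", "author"}), frozenset({"score"})),
]
print(check_workflow(fetched, plan))
```

Here the "score" step is reported because no earlier step produces an "author" field, so the mistake shows up before any job is launched.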








                            Architecture




             This architecture looks nice and squeaky clean - and in general it is.
             One issue is with the fetch phase of bixo not fitting well into the MR model.
             External resource constraints mean you can’t treat it like a regular job.
             So lots of threads in a special reduce phase, with corresponding issues
             -Stack size
             -Error handling








                                HUGMEE

                        Hadoop
                        Users who
                        Generate the
                        Most
                        Effective
                        Emails




             Let’s use a real example now of using Bixo to do web mining.

             Imagine that the Apache Foundation decided to honor people who make
             significant contributions to the Hadoop community.


             In a typical company, determining the winner would depend on political
             maneuvering, bribes, and sucking up.


             But the Apache Foundation could decide to go for a quantitative approach for
             the HUGMEE award.








                    Helpful Hadoopers

                        Use mailing list archives for data (collect)
                        Parse mbox files and emails (parse)
                        Score based on key phrases (analyze)
                        End result is score/name pair (produce)




             How do you figure out the most helpful Hadoopers?
             As we discussed previously, it’s a classic web mining problem


             Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files.


             How do we score based on key phrases (next slide)?








                     Scoring Algorithm

                        Very sophisticated point system
                        “thanks” == 5
                        “owe you a beer” == 50
                        “worship the ground you walk on” == 100
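
The point system could be sketched in a few lines of Python. The phrases and points come from the slide. Counting every occurrence with a plain, case-insensitive substring match is an assumption for illustration, not necessarily how the real scoring operation works.

```python
# Points per key phrase, taken from the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text):
    """Sum the points for every occurrence of each key phrase in the text."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())


print(score_text("Thanks! I owe you a beer."))  # 55
```

A substring match is deliberately crude: it would also count "thanks" inside "thanksgiving", which is fine for a toy award but worth tightening for anything serious.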








                       High Level Steps

                        Collect emails
                         – Fetch mod_mbox generated page
                         – Parse it to extract links to mbox files
                         – Fetch mbox files
                         – Split into separate emails
                        Parse emails
                         – Extract key headers (messageId, email, etc)
                         – Parse body to identify quoted text




             Parsing the mod_mbox page is simple with Tika’s HtmlParser


             Cheated a bit when parsing emails - some users like Owen have many aliases
             So hand-generated alias resolution table.
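
A minimal sketch of the email parsing step, using Python's standard email module rather than the code from the talk. The sample message, the ALIASES table contents, and the rule that quoted text is any line starting with ">" are all assumptions made for this example.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical hand-built alias table: alternate address -> canonical address.
ALIASES = {"alice@apache.example": "alice@example.com"}

RAW = """\
From: Alice Example <alice@example.com>
Message-ID: <2@example.com>
In-Reply-To: <1@example.com>
Subject: Re: job fails

> Try raising the heap size.
Thanks, that fixed it!
"""


def parse_email(raw):
    """Extract key headers and the non-quoted part of a plain-text email."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg["From"])
    address = address.lower()
    body = msg.get_payload()
    # Treat lines starting with ">" as quoted text from the earlier message.
    new_lines = [ln for ln in body.splitlines()
                 if not ln.lstrip().startswith(">")]
    return {
        "message_id": msg["Message-ID"],
        "reply_to_id": msg["In-Reply-To"],  # None when the header is absent
        "name": name,
        "address": ALIASES.get(address, address),
        "new_text": "\n".join(new_lines).strip(),
    }


print(parse_email(RAW))
```

Dropping the quoted lines matters for scoring: a "thanks" that appears only in quoted text belongs to an earlier message and should not be counted again.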








                      High Level Steps

                        Analyze emails
                         – Find key phrases in replies (ignore signoff)
                         – Score emails by phrases
                         – Group & sum by message ID
                         – Group & sum by email address
                        Produce ranked list
                         – Toss email addresses with no love
                         – Sort by summed score




             Need to ignore “thanks” in “thanks in advance for doing my job for me”
             signoff.
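
One hedged way to handle that signoff is to blank out the "thanks in advance" phrase before counting. The regular expressions here are an assumption for illustration; the talk does not say how the real operation detects signoffs.

```python
import re

# Matches the signoff phrase regardless of case or spacing.
SIGNOFF = re.compile(r"\bthanks\s+in\s+advance\b", re.IGNORECASE)
THANKS = re.compile(r"\bthanks\b", re.IGNORECASE)


def count_thanks(text):
    """Count 'thanks' occurrences, ignoring 'thanks in advance' signoffs."""
    without_signoff = SIGNOFF.sub(" ", text)
    return len(THANKS.findall(without_signoff))


print(count_thanks("Thanks for the patch. Thanks in advance for the next one!"))
```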


             Generate two tuples for each email:
             -one with messageId/name/address
             -one with reply-to messageId/score


             Group/sum aspect is classic reduce operation.
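
The two group-and-sum passes can be sketched in plain Python, with dictionaries standing in for the reduce steps. The record layout and sample data are invented for the example. The sketch assumes that a reply's score is credited to the author of the message it replies to, which is how the two tuples above get joined.

```python
from collections import defaultdict


def rank(emails):
    """emails: dicts with message_id, address, reply_to_id, and score keys."""
    emails = list(emails)
    # Tuple 1: message ID -> author address.
    author_of = {e["message_id"]: e["address"] for e in emails}
    # Tuple 2: (reply-to message ID, score), grouped and summed by message ID.
    by_message = defaultdict(int)
    for e in emails:
        if e["reply_to_id"] is not None:
            by_message[e["reply_to_id"]] += e["score"]
    # Join on message ID, then group and sum by email address.
    by_address = defaultdict(int)
    for message_id, total in by_message.items():
        if message_id in author_of:
            by_address[author_of[message_id]] += total
    # Toss addresses with no love, then sort by summed score (highest first).
    ranked = [(addr, s) for addr, s in by_address.items() if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


sample = [
    {"message_id": "<1>", "address": "ted@example.com", "reply_to_id": None, "score": 0},
    {"message_id": "<2>", "address": "ann@example.com", "reply_to_id": "<1>", "score": 5},
    {"message_id": "<3>", "address": "bob@example.com", "reply_to_id": "<1>", "score": 50},
    {"message_id": "<4>", "address": "ted@example.com", "reply_to_id": "<2>", "score": 5},
    {"message_id": "<5>", "address": "bob@example.com", "reply_to_id": "<4>", "score": 0},
]
print(rank(sample))
```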








                                Workflow




             I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom
             Cascading operations, 6 MR jobs.


             OK, actually not so clear, but…
             Key point is that only purple is stuff that I had to actually create
             Some lines are purple as well, since that workflow (DAG) is also something I
             defined - see next page.
             But only two custom operations actually needed - parsing the mod_mbox page
             and calculating the score


             Running took about 30 minutes - mostly politely waiting until it was OK to
             politely do another fetch.
             Downloaded 150MB of mbox files
             409 unique email addresses with at least one positive reply.
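
To show why polite fetching dominates the run time, here is a small Python sketch of per-host scheduling. It is not Bixo's fetcher. The fixed delay and the example URLs are assumptions. Each host gets at most one request per delay interval, so many URLs on a single server serialize no matter how many machines you have.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def polite_schedule(urls, delay_seconds):
    """Return (start_time, url) pairs with one request per host per delay."""
    next_free = defaultdict(float)  # host -> earliest allowed start time
    plan = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_free[host]
        plan.append((start, url))
        next_free[host] = start + delay_seconds
    return sorted(plan)


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
for start, url in polite_schedule(urls, delay_seconds=30):
    print(start, url)
```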








                      Building the Flow




             This is most of the code needed to create the workflow for this data mining app.


             Lots of oatmeal code - which is good. Don’t want to be writing tricky code
             here.


             Could optimize, but that would be a mistake…most web mining is
             programmer-constrained.
             So just use more servers in EC2 - cheaper & faster.








                       mod_mbox Page




             Example of the top-level pages that were fetched in first phase.


             Then needed to be parsed to extract links to mbox files.








                     Custom Operation




             Example of one of the two custom operations
             Parsing mod_mbox page
             Uses Tika to extract IDs
             Emits tuple with URL for each mbox ID
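
The original operation uses Tika. As a stand-in, this Python sketch does the same job with the standard library's html.parser. The link format (hrefs such as "200909.mbox/thread") and the example URL are assumptions about what a mod_mbox index page looks like, not something taken from the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MboxLinkParser(HTMLParser):
    """Collect distinct mbox IDs (e.g. '200909.mbox') from anchor hrefs."""

    def __init__(self):
        super().__init__()
        self.mbox_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        first = href.split("/", 1)[0]  # '200909.mbox/thread' -> '200909.mbox'
        if first.endswith(".mbox") and first not in self.mbox_ids:
            self.mbox_ids.append(first)


def mbox_urls(page_url, html):
    """Return one absolute URL per mbox ID found on the page."""
    parser = MboxLinkParser()
    parser.feed(html)
    return [urljoin(page_url, mbox_id) for mbox_id in parser.mbox_ids]


PAGE_URL = "http://mail-archives.example.org/mod_mbox/hadoop-user/"
PAGE_HTML = (
    '<a href="200908.mbox/thread">Thread</a> '
    '<a href="200908.mbox/date">Date</a> '
    '<a href="200909.mbox/thread">Thread</a> '
    '<a href="/help">Help</a>'
)
for url in mbox_urls(PAGE_URL, PAGE_HTML):
    print(url)
```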








                                Validate




             Curve looks right - exponential decay.
             409 unique email addresses that got some love from somebody.








                    This Hug’s for Ted!




             And the winner is…Ted Dunning


             I know - I should have colored the elephant yellow.








                                 Produce




             A list of the usual suspects

             Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.








                            Use Bixo to…

                        Find +/- product comments on forums
                        Compare web site quality
                        Track social network popularity
                        Derive optimized SEO terms
                        Scrape and analyze pricing data




             Previous example could be easily changed to “find opinion makers on forums”


             Many other use cases


             All involve web mining workflow - fetch, parse, analyze, produce








                               Summary

                        Bixo is a web mining toolkit
                        Built on Hadoop, Cascading, Tika
                        Young project but used commercially
                        Future - Mahout, monitoring, HBase, URL
                        DB, cleanup, bug fixes, rinse, repeat




             Lots to be done, of course, but moving fast








                             Resources

                        Web: http://bixo.101tec.com
                        List: http://tech.groups.yahoo.com/group/bixo-dev/
                        Source: http://github.com/emi/bixo/tree
                        Bugs: http://oss.101tec.com/jira/browse/bixo




             URLs to find out more about the Bixo project.


             Stefan Groschupf from 101tec helped with initial Bixo coding.
             His company provides infrastructure for the project, thus 101tec.com in the
             URLs above








                        Any Questions?




                                                 24

More Related Content

What's hot

Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for HadoopHadoop User Group
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Joydeep Sen Sarma
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Joydeep Sen Sarma
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceJoydeep Sen Sarma
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureRyan Hennig
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summitdrewz lin
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 

What's hot (20)

Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Cloud Optimized Big Data
Cloud Optimized Big DataCloud Optimized Big Data
Cloud Optimized Big Data
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant Conference
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summit
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 
Hadoop sqoop
Hadoop sqoop Hadoop sqoop
Hadoop sqoop
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 

Viewers also liked

Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce APITom Croucher
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709Hadoop User Group
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop User Group
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21Hadoop User Group
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21Hadoop User Group
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataYahoo Developer Network
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Hadoop User Group
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 

Viewers also liked (20)

Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In Python
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
 
Searching At Scale
Searching At ScaleSearching At Scale
Searching At Scale
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Mumak
MumakMumak
Mumak
 
File Context
File ContextFile Context
File Context
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
 
HUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - FacebookHUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - Facebook
 
Cloudera Desktop
Cloudera DesktopCloudera Desktop
Cloudera Desktop
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
January 2011 HUG: Howl Presentation
January 2011 HUG: Howl PresentationJanuary 2011 HUG: Howl Presentation
January 2011 HUG: Howl Presentation
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 

Similar to The Bixo Web Mining Toolkit

Elastic Web Mining
Elastic Web MiningElastic Web Mining
Elastic Web MiningKen Krugler
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28korusamol
 
Inaugural address manjusha - Indicthreads cloud computing conference 2011
Inaugural address manjusha -  Indicthreads cloud computing conference 2011Inaugural address manjusha -  Indicthreads cloud computing conference 2011
Inaugural address manjusha - Indicthreads cloud computing conference 2011IndicThreads
 
Cloud Computing for Barcamp NOLA 2009
Cloud Computing for Barcamp NOLA 2009Cloud Computing for Barcamp NOLA 2009
Cloud Computing for Barcamp NOLA 2009Steven Evatt
 
Real World Azure - Dev
Real World Azure - DevReal World Azure - Dev
Real World Azure - DevClint Edmonson
 
Building Cloud Native Applications
Building Cloud Native Applications Building Cloud Native Applications
Building Cloud Native Applications Munish Gupta
 
Drupalcamp New York 2009
Drupalcamp New York 2009Drupalcamp New York 2009
Drupalcamp New York 2009Tom Deryckere
 
Doing More With Less: The Economics of Open Source Database Adoption
Doing More With Less: The Economics of Open Source Database AdoptionDoing More With Less: The Economics of Open Source Database Adoption
Doing More With Less: The Economics of Open Source Database AdoptionEDB
 
The Perfect Storm: The Impact of Analytics, Big Data and Analytics
The Perfect Storm: The Impact of Analytics, Big Data and AnalyticsThe Perfect Storm: The Impact of Analytics, Big Data and Analytics
The Perfect Storm: The Impact of Analytics, Big Data and AnalyticsInside Analysis
 
Skb web2.0
Skb web2.0Skb web2.0
Skb web2.0animove
 
Alternative Database Technology in the Cloud
Alternative Database Technology in the CloudAlternative Database Technology in the Cloud
Alternative Database Technology in the CloudBret Piatt
 
Welcome and Introduction to A Morning with MongoDB Petah Tikvah
Welcome and Introduction to A Morning with MongoDB Petah TikvahWelcome and Introduction to A Morning with MongoDB Petah Tikvah
Welcome and Introduction to A Morning with MongoDB Petah TikvahMongoDB
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
Morningwithmongodbisrael 121217184113-phpapp02
Morningwithmongodbisrael 121217184113-phpapp02Morningwithmongodbisrael 121217184113-phpapp02
Morningwithmongodbisrael 121217184113-phpapp02Andrei Colta
 
Mobile Backend Apps and APIs meetup London overview of BaaS APIs and discussi...
Mobile Backend Apps and APIs meetup London overview of BaaS APIs and discussi...Mobile Backend Apps and APIs meetup London overview of BaaS APIs and discussi...
Mobile Backend Apps and APIs meetup London overview of BaaS APIs and discussi...Taras Filatov
 
10 things ever architect should know about the Windows Azure Platform - ericnel
10 things ever architect should know about the Windows Azure Platform -  ericnel10 things ever architect should know about the Windows Azure Platform -  ericnel
10 things ever architect should know about the Windows Azure Platform - ericnelEric Nelson
 

The Bixo Web Mining Toolkit

  • 1. Bixo - Web Mining Toolkit 23 Sep 2009 Web Mining Toolkit Ken Krugler TransPac Software, Inc. My background - did a startup called Krugle from 2005 - 2008. Used Nutch to do a vertical crawl of the web, looking for technical software pages. Mined pages for references to open source projects. Used experience to create Bixo, an open source web mining toolkit. Built on top of Hadoop, Cascading, Tika.
  • 2. Web Mining 101: Extracting & Processing Web Data. More Than Just Search. Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts… Quick intro to web mining, so we’re on the same page. Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing. This is what Bixo focuses on.
  • 3. 4 Steps in Mining: Collect - fetch content from web. Parse - extract data from formats. Analyze - tokenize, rate, classify, cluster. Produce - an index, a report. Search. Note - does not include serving up the search results. Why do I bring this up? To help clarify why web mining is not the same as vertical search (next slide).
  • 4. Vertical Search: Vertical crawl to get specific content. Common use case for Nutch, Heritrix. But web mining often has different outcome, and specialized processing of data. Most people think of vertical search when they think of specialized web mining. Lots of people have been doing this, using OSS like Nutch & Heritrix. End result is typically a Lucene index, plus the content, inverted links, etc. Typical web mining is not the same as vertical search. Often uses a white list, versus crawling to discover links. More specialized processing of the data. And these differences help answer the question of (next slide)…
  • 5. Why Bixo? Response to needs of commercial projects – Plug into Cascading-based workflow – Low IT time/skill requirements – Run well in AWS EC2 environment – Flexible I/O support for AWS - S3, HBase – Toolkit for building custom solutions • Fetch white list (parse/index, data mine) • Scrape white list (social popularity). Does the world really need yet another web crawler? No, but it does need a web mining toolkit. Two companies agreed to sponsor work on Bixo as an open source project. On the point of running well in an EC2 environment… Even though there are many web mining tasks that can be handled on a single computer, you very quickly run into issues of scale if you can’t handle upwards of 100M+ pages.
  • 6. Bixo Overview: MIT license open source project. In use by three companies. “Pipe” model for building workflows. Runs on top of Hadoop/Cascading. Full disclosure - Bixo makes heavy use of Cascading, which is under GPL. So if you want to sell a product based on Bixo, you need to talk to Chris Wensel. The pipe model comes from our use of Cascading to define the workflows.
  • 7. What is Cascading? API for Hadoop data processing workflows. Operations on tuples with named fields. Workflows created from pipes. Reduces painful low-level MR details. Key for complex/reliable workflows. I know Chris Wensel has previously talked about Cascading here, but just to make sure we’re all on the same page… “tuple” is like a row in a database. Named fields with values. Example of tuple - result of fetching a page, has URL, time of fetch, content, headers, response rate, etc. Because you can build workflows out of a mix of pre-defined & custom pipes, it’s a real toolkit. Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :) Key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds). Finds cases where fields aren’t available for operations. Solves a key problem we ran into when customizing Nutch at Krugle.
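To make the "tuples with named fields" and "check your workflow" points a bit more concrete, here is a minimal sketch in plain Python. It is not Cascading's API: the `Step` class, the `check_plan` function and the field names are all made up for illustration, and the check only handles a straight line of steps rather than a real DAG.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Step:
    """A hypothetical workflow operation that declares the fields it uses."""
    name: str
    needs: FrozenSet[str]  # fields the operation reads
    adds: FrozenSet[str]   # fields the operation emits


def check_plan(source_fields: Iterable[str], steps: List[Step]) -> Set[str]:
    """Fail before any data is processed if a step reads a missing field."""
    available = set(source_fields)
    for step in steps:
        missing = step.needs - available
        if missing:
            raise ValueError(f"{step.name} is missing fields: {sorted(missing)}")
        available |= step.adds
    return available


# A fetch result as a tuple of named fields (all values are invented).
fetched = {
    "url": "http://example.com/",
    "fetch_time": 1253664000,
    "content": b"<html></html>",
    "headers": {},
    "response_rate": 0,
}

plan = [
    Step("parse", frozenset({"url", "content"}), frozenset({"text"})),
    Step("score", frozenset({"text"}), frozenset({"score"})),
]

# Passes: every field a step reads is produced earlier in the plan.
fields = check_plan(fetched.keys(), plan)
```

The useful property is that a step reading a field nobody produces is reported when the plan is built, not halfway through a long job.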
  • 8. Architecture: This architecture looks nice and squeaky clean - and in general it is. One issue is with the fetch phase of Bixo not fitting well into the MR model. External resource constraints mean you can’t treat it like a regular job. So lots of threads in a special reduce phase, with corresponding issues: stack size, error handling.
  • 9. HUGMEE: Hadoop Users who Generate the Most Effective Emails. Let’s use a real example now of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up. But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.
  • 10. Helpful Hadoopers: Use mailing list archives for data (collect). Parse mbox files and emails (parse). Score based on key phrases (analyze). End result is score/name pair (produce). How do you figure out the most helpful Hadoopers? As we discussed previously, it’s a classic web mining problem. Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases (next slide)?
  • 11. Scoring Algorithm: Very sophisticated point system. “thanks” == 5, “owe you a beer” == 50, “worship the ground you walk on” == 100.
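The point system from the scoring slide fits in a few lines of Python. This is only a sketch: the three phrases and their points come from the slide, but the matching rule (case-insensitive substring match, every occurrence counted) is an assumption, and `score_text` is a made-up name.

```python
# Points per key phrase, as listed on the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text: str) -> int:
    """Sum the points for every key phrase occurrence in the text.

    Assumption: matching is a case-insensitive substring search.
    """
    lowered = text.lower()
    return sum(
        points * lowered.count(phrase)
        for phrase, points in PHRASE_POINTS.items()
    )


print(score_text("Thanks, I owe you a beer!"))  # 5 + 50 = 55
```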
  • 12. High Level Steps: Collect emails – Fetch mod_mbox generated page – Parse it to extract links to mbox files – Fetch mbox files – Split into separate emails. Parse emails – Extract key headers (messageId, email, etc) – Parse body to identify quoted text. Parsing the mod_mbox page is simple with Tika’s HtmlParser. Cheated a bit when parsing emails - some users like Owen have many aliases. So hand-generated alias resolution table.
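The "parse emails" step can be sketched with Python's standard `email` package. This is not how Bixo does it, just an illustration under a few assumptions: the message is single-part plain text, quoted text is any line starting with ">", and the sample message, the `parse_email` name and the output field names are all invented.

```python
from email import message_from_string
from email.utils import parseaddr

# An invented reply, as it might look after splitting an mbox file.
RAW = """\
From: Alice Example <alice@example.com>
Message-ID: <2@example.com>
In-Reply-To: <1@example.com>
Subject: Re: job fails

> How do I fix this?
Set the heap size higher.
Thanks for the report.
"""


def parse_email(raw: str) -> dict:
    """Extract key headers and drop quoted lines from the body."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg["From"])
    body = msg.get_payload()  # a str for single-part messages
    new_lines = [
        line for line in body.splitlines()
        if not line.lstrip().startswith(">")  # assumed quoting convention
    ]
    return {
        "message_id": msg["Message-ID"],
        "reply_to_id": msg["In-Reply-To"],
        "name": name,
        "address": address,
        "new_text": "\n".join(new_lines),
    }


parsed = parse_email(RAW)
```

Dropping the quoted lines matters for the later scoring step, since a "thanks" inside quoted text belongs to an earlier message.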
  • 13. High Level Steps: Analyze emails – Find key phrases in replies (ignore signoff) – Score emails by phrases – Group & sum by message ID – Group & sum by email address. Produce ranked list – Toss email addresses with no love – Sort by summed score. Need to ignore “thanks” in “thanks in advance for doing my job for me” signoff. Generate two tuples for each email: one with messageId/name/address, one with reply-to messageId/score. Group/sum aspect is classic reduce operation.
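The two-tuple trick and the group/sum steps can be sketched in memory with plain Python. In the real workflow these are Cascading operations running as MR jobs. Everything here is illustrative: the `rank` function, the message IDs and the addresses are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank(
    authors: List[Tuple[str, str]],       # (message_id, address)
    reply_scores: List[Tuple[str, int]],  # (reply_to_message_id, score)
) -> List[Tuple[str, int]]:
    """Group and sum scores by message ID, then by email address."""
    by_message: Dict[str, int] = defaultdict(int)
    for message_id, score in reply_scores:
        by_message[message_id] += score

    by_address: Dict[str, int] = defaultdict(int)
    for message_id, address in authors:
        by_address[address] += by_message.get(message_id, 0)

    # Toss addresses with no love, then sort by summed score.
    loved = [(addr, total) for addr, total in by_address.items() if total > 0]
    return sorted(loved, key=lambda pair: pair[1], reverse=True)


authors = [
    ("<1>", "ted@example.com"),
    ("<2>", "owen@example.com"),
    ("<3>", "ted@example.com"),
    ("<4>", "nobody@example.com"),
]
reply_scores = [("<1>", 5), ("<1>", 50), ("<2>", 5), ("<3>", 100)]

print(rank(authors, reply_scores))
# [('ted@example.com', 155), ('owen@example.com', 5)]
```

The first grouping credits a message with the love found in its replies, and the second rolls those message totals up to the person who wrote them.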
  • 14. Workflow: I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but… Key point is that only purple is stuff that I had to actually create. Some lines are purple as well, since that workflow (DAG) is also something I defined - see next page. But only two custom operations actually needed - parsing mbox_page and calculating score. Running took about 30 minutes - mostly politely waiting until it was OK to politely do another fetch. Downloaded 150MB of mbox files. 409 unique email addresses with at least one positive reply.
  • 15. Building the Flow: Most of the code needed to create the workflow for this data mining app. Lots of oatmeal code - which is good. Don’t want to be writing tricky code here. Could optimize, but that would be a mistake…most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.
  • 16. mod_mbox Page: Example of the top-level pages that were fetched in first phase. Then needed to be parsed to extract links to mbox files.
  • 17. Custom Operation: Example of one of the two custom operations. Parsing mod_mbox page. Uses Tika to extract IDs. Emits tuple with URL for each mbox ID.
  • 18. Validate: Curve looks right - exponential decay. 409 unique email addresses that got some love from somebody.
  • 19. This Hug’s for Ted! And the winner is…Ted Dunning. I know - I should have colored the elephant yellow.
  • 20. Produce: A list of the usual suspects. Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.
  • 21. Use Bixo to…: Find +/- product comments on forums. Compare web site quality. Track social network popularity. Derive optimized SEO terms. Scrape and analyze pricing data. Previous example could be easily changed to “find opinion makers on forums”. Many other use cases. All involve web mining workflow - fetch, parse, analyze, produce.
  • 22. Summary: Bixo is a web mining toolkit. Built on Hadoop, Cascading, Tika. Young project but used commercially. Future - Mahout, monitoring, HBase, URL DB, cleanup, bug fixes, rinse, repeat. Lots to be done, of course, but moving fast.
  • 23. Resources: Web: http://bixo.101tec.com List: http://tech.groups.yahoo.com/group/bixo-dev/ Source: http://github.com/emi/bixo/tree Bugs: http://oss.101tec.com/jira/browse/bixo URLs to find out more about the Bixo project. Stefan Groschupf from 101tec helped with initial Bixo coding. His company provides infrastructure for project, thus 101tec.com in URLs above.
  • 24. Any Questions?