Introduction to Data Mining for Web Applications
Paul-Alexandru Chirita, Ph.D.

About Me
  • Education:
      • Ph.D., Information Retrieval & Data Mining, Univ. of Hannover, Germany
      • B.Sc., Ecole Polytechnique, Paris, France + “Politehnica” Univ. Bucharest, CS Dept.
  • Roughly 8 yrs. in IT, out of which 7 in IR & DM
  • Now in Adobe Romania (L3S, Yahoo!, Schlumberger and others in the past)

Web Mining
The application of Data Mining algorithms to discover patterns in the Web. Three dimensions:
  • Usage Mining
      • Analyzes various access logs in order to provide input to Business Decisions
      • By far the most used, with the highest ROI
  • Content Mining
      • Analyzes Web page content in order to extract useful information (e.g., keywords, topic, content type, sentiment, etc.)
  • Structure Mining
      • Also known as “Link Analysis”
      • Investigates the hyperlink structure of the Web to improve current algorithms
Agenda
  • Client side tools
      • Google Analytics
      • Omniture
  • Server side tools
      • AW-Stats
      • Webalizer / AWF-Full
  • Advanced analytics
Client side tools
  • Purpose:
      • Return basic information about traffic on your Web site, plus SEO data
      • Most of them are also (partly) integrated with Monetization Tools (e.g., AdWords)
  • Pros:
      • Hosted by third party sites, zero or minimal cost for you
      • Easy to implement and integrate, no maintenance
  • Cons:
      • The client side tracking code adds page-load overhead (~200-600 ms additional response time)
      • If your traffic increases “too much” you have to pay

Client-side tools: Google Analytics
  • Free, and well-engineered!
  • Shows statistics about:
      • Basic stuff: Visits, Pages, etc.
      • Visitor profiles: Browser, OS, Language/Locale
      • Visitor loyalty: how many times each visitor returned to your site, when they last did so, and for how long
      • Trends: is your traffic & popularity growing or decreasing
      • Traffic sources: Entry/Exit pages, Referring sites & search engines
  • Some customization planned for the near-term future
  • Good for personal or small scale sites
  • https://www.google.com/analytics
Client-side tools: Google Analytics [2]
Omniture: Site Catalyst
  • Low price per thousand entries, but may become costly if you have a lot of traffic (millions of visits per day) or many dozens of sensors
  • Same statistics as Google Analytics, but you can drill down very deep:
      • Statistics per hour of day, per file type (html, cfm, etc.), per action type (download, view page, etc.)
      • Visitor segmentation down to the level of city
      • Purchases, promotions, and many metrics for e-commerce (e.g., how many products added to the cart have actually been checked out)
  • Most importantly, you can define ANY metric you want! (e.g., how many people click on my survey link, how many of them fill it in, etc.)
  • www.omniture.com
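The survey example above is a two-step funnel metric. The sketch below shows one way such a metric could be computed from raw events. The event names and the `funnel` helper are hypothetical and are not part of any SiteCatalyst API.

```python
from collections import defaultdict

# Hypothetical event stream: (visitor_id, event_name). The event names are
# illustrative assumptions, not SiteCatalyst identifiers.
events = [
    ("v1", "survey_link_click"),
    ("v1", "survey_submitted"),
    ("v2", "survey_link_click"),
    ("v3", "page_view"),
    ("v4", "survey_link_click"),
    ("v4", "survey_submitted"),
]

def funnel(events, steps):
    """For each step, count the distinct visitors who reached it after
    having reached every earlier step (in stream order)."""
    progress = defaultdict(int)  # visitor -> index of the next expected step
    for visitor, name in events:
        i = progress[visitor]
        if i < len(steps) and name == steps[i]:
            progress[visitor] = i + 1
    return [sum(1 for p in progress.values() if p > k) for k in range(len(steps))]

clicked, submitted = funnel(events, ["survey_link_click", "survey_submitted"])
completion_rate = submitted / clicked if clicked else 0.0
print(clicked, submitted, round(completion_rate, 2))
```

The same helper works for longer funnels, such as the cart-to-checkout metric mentioned above, by passing more step names.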
Omniture: Site Catalyst [2]
Server side tools
  • Purpose:
      • Return basic information about traffic on your Web Site
      • Similar to the client-side tools, but currently more focused on Reliability & Application Improvements
  • Pros:
      • Most importantly, zero bandwidth overhead for your app (Every ms counts!)
      • Show a lot of developer specific information (errors, visitor browsers/OS, etc.)
      • Very easy to install
  • Cons:
      • Usually open source, but hard to extend with your own metrics

FREE Server side tools
  • Similar statistics as with the Client Side tools, but…
      • Less business specific information (do not include Visitor Loyalty, Trends, etc.)
      • More developer specific data (errors & error types, HTTP status codes, etc.)
  • Good for medium and large scale sites
  • http://awstats.sourceforge.net/
  • http://www.stedee.id.au/awffull/
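To make the "developer specific data" concrete, here is a minimal sketch of the kind of report these tools produce, a tally of HTTP status codes from a server access log. It assumes the log is in Common Log Format. The sample lines are made up, and this is not how AWStats or Webalizer are implemented.

```python
import re
from collections import Counter

# Made-up lines in Common Log Format, for illustration only.
SAMPLE_LOG = """\
192.0.2.1 - - [28/Sep/2010:06:49:42 +0000] "GET /index.html HTTP/1.1" 200 5120
192.0.2.2 - - [28/Sep/2010:06:49:43 +0000] "GET /missing.html HTTP/1.1" 404 312
192.0.2.1 - - [28/Sep/2010:06:49:44 +0000] "POST /search HTTP/1.1" 500 0
192.0.2.3 - - [28/Sep/2010:06:49:45 +0000] "GET /index.html HTTP/1.1" 200 5120
"""

# Fields: host ident user [timestamp] "request" status size
CLF = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\S+)$')

def status_counts(log_text):
    """Count responses per HTTP status code, skipping unparseable lines."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CLF.match(line)
        if m:
            counts[m.group(2)] += 1
    return counts

counts = status_counts(SAMPLE_LOG)
# 4xx and 5xx responses are the ones a developer usually wants to inspect.
errors = sum(n for status, n in counts.items() if status[0] in "45")
print(dict(counts), "error responses:", errors)
```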
Server side tools: AW Stats
Server side tools: Webalizer / AWF-Full
Paid Server side tools
  • Overcome most limitations of the free tools
      • Log everything into text files (see next Section)
      • Provide some sort of SQL-like query language which helps you define any type of query you want
      • Run reports much faster
  • The most expensive of them all, meant for professional use
  • http://www.splunk.com/
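The idea of querying logs with an SQL-like language can be tried on a small scale with Python's built-in `sqlite3` module. This is only a sketch of the concept, not Splunk's query language. The table layout and the row values are assumptions chosen to resemble a query log.

```python
import sqlite3

# Illustrative rows: timestamp, hashed IP, hashed user ID, query.
rows = [
    ("Sep 28 06:49:42", "ip_a", "anonymous", "spell checker"),
    ("Sep 28 06:49:42", "ip_b", "anonymous", "javascript"),
    ("Sep 28 06:49:43", "ip_c", "user_1", "module"),
    ("Sep 28 06:49:44", "ip_d", "user_2", "delete spread"),
    ("Sep 28 06:49:45", "ip_a", "anonymous", "javascript"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ts TEXT, ip TEXT, user_id TEXT, query TEXT)")
conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", rows)

# Ad hoc report: most frequent queries, with the number of hits that came
# from logged-in (non-anonymous) users.
report = conn.execute(
    """
    SELECT query,
           COUNT(*) AS hits,
           SUM(user_id != 'anonymous') AS logged_in_hits
    FROM log
    GROUP BY query
    ORDER BY hits DESC, query ASC
    """
).fetchall()
print(report)
```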
How is this done in the heavy weight category ;-)
  • Multiple log files, one per each functionality checked
  • As simple as possible (see next slide for an example)
  • The main guideline is to be able to parse any log file and generate statistics using only the command line
  • Example: Tab separated

Sample log
Date & Time	IP (hashed)	User ID (hashed)	Query	Parameters
Sep 28 06:49:42	Ea9hjnc4ufTfU	anonymous	spell checker	:0:10:en_US:en_US:0:0
Sep 28 06:49:42	8NCTsHqR366	anonymous	javascript	:0:10:fr_FR:fr_FR:0:1
Sep 28 06:49:42	K4nD5xy/R5fw	anonymous	text	:0:10:en_US:en_US:0:1
Sep 28 06:49:43	lRqBaIaUWxna	yxDkhBEqC6xxR8z=	module	:0:10:en_US:en_US:0:0
Sep 28 06:49:44	jMjJpy6bHAdb	hPFLKaMNeShD0=	delete spread	:0:10:en_US:en_US:0:0
Sep 28 06:49:44	r3xgRLagX1cQ6	anonymous	_x	:0:10:ru_RU:ru_RU:0:0
Sep 28 06:49:45	b2DLBl3VTT67Q	anonymous	anti a	:0:10:de_DE:de_DE:0:0
Sep 28 06:49:45	KaKiB2ITEdPeM	VcLic9CIy4QxVtJQ=	create a star	:0:10:en_US:en_US:0:0
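A log in this shape takes only a few lines to parse. The sketch below reads three rows taken from the sample log and derives two simple statistics. The meaning of the colon-separated parameters is not stated in the slides, so treating the fourth one as a locale is an assumption.

```python
import re
from collections import Counter

# Three rows taken from the sample log above (tab-separated fields).
SAMPLE = (
    "Sep 28 06:49:42\tEa9hjnc4ufTfU\tanonymous\tspell checker\t:0:10:en_US:en_US:0:0\n"
    "Sep 28 06:49:42\t8NCTsHqR366\tanonymous\tjavascript\t:0:10:fr_FR:fr_FR:0:1\n"
    "Sep 28 06:49:43\tlRqBaIaUWxna\tyxDkhBEqC6xxR8z=\tmodule\t:0:10:en_US:en_US:0:0\n"
)

def parse_line(line):
    """Split one log line into named fields. Runs of tabs are collapsed,
    which assumes that no field is ever empty."""
    ts, ip, user, query, params = re.split(r"\t+", line.rstrip("\n"))
    return {"ts": ts, "ip": ip, "user": user, "query": query,
            "params": params.split(":")}

records = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

# ASSUMPTION: the fourth colon-separated parameter is the user's locale.
by_locale = Counter(r["params"][3] for r in records)
anonymous_share = sum(r["user"] == "anonymous" for r in records) / len(records)
print(by_locale, round(anonymous_share, 2))
```

Keeping one record per line with a fixed separator is what makes this kind of ad hoc analysis cheap. The same counts could be produced with standard command-line text tools.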
What can be done using this data
  • You can basically measure everything ;-)
  • Plus you can enable loads of new features:
      • Personalization for search, sold/promoted products, etc.
      • Browsing recommendations
      • Improve site organization (make popular pages more accessible, promote some other pages and track their traffic increase, etc.)
      • Search suggestions
      • Advertising (keyword selection, etc.)
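As one illustration of these features, search suggestions can be derived directly from the Query column of a query log. The sketch below ranks past queries that share a typed prefix by how often they occurred. The query list and the `suggest` helper are hypothetical, and a production system would likely use a more elaborate ranking.

```python
from collections import Counter

# Hypothetical query log; in practice this would come from the Query column
# of the log files.
queries = ["spell checker", "spell check", "spell checker", "javascript",
           "java", "javascript", "javascript", "create a star"]

def suggest(prefix, query_counts, k=3):
    """Return up to k past queries starting with prefix, most frequent first
    (ties broken alphabetically)."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in query_counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]

counts = Counter(q.lower() for q in queries)
print(suggest("ja", counts))
print(suggest("spell", counts))
```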
Personalized search and promotions
  • Show different results/ads to different users
Browsing recommendations
Search suggestions
How To Web - Introduction To Data Mining For Web Applications

Editor's Notes

  1. “Secrets”, “secrecy mania”: people are not informed. Information is kept at the managers' level. People feel misinformed. Loss of trust in the direct manager. Rumors are encouraged. People do not know what they must do to advance in their careers because managers do not say: how one advances; what the directions for advancement are; in which direction the team is heading; what the campus strategy is.
  12. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  13. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  14. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  15. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  16. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  17. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  18. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  19. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  20. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  21. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului