Web Mining and IA_MS..
  • Abstract: With the rapid growth of the Internet, web mining has emerged as a new area of research comprising web content mining (WCM), web structure mining (WSM), and web usage mining (WUM). WCM refers to the discovery of useful information from web content, including text, images, audio, and video. WSM focuses on the web’s hyperlink structure and topology. WUM analyzes server logs to find interesting patterns. Web mining has revealed some very interesting features of the Internet. For example, the Internet is a scale-free network (SFN): its vertex connectivities follow a scale-free power-law distribution rather than the distribution of a random network. SFNs are governed by robust self-organizing phenomena that go beyond the individual systems. The Internet has also been found to be self-similar. A better understanding of the Internet could help people design more robust artificial networks that could ultimately provide information assurance. This talk reports state-of-the-art findings and techniques of web mining and discusses its future directions.
  • What do you do online?

Transcript

  • 1. There are only 10 types of people in the world: Those who understand binary, and those who don't.
  • 2. Web Mining and Information Assurance. Dr. Xueping Li, Dept. of Industrial & Information Engineering, University of Tennessee
  • 3. Outline
    • Introduction to Web Mining
    • Web content mining
    • Web usage mining
    • Web structure mining
    • Complex Networks
  • 4. What Is Data Mining?
    • Data mining (knowledge discovery in databases):
      • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
    • Alternative names and their “inside stories”:
      • Data mining: a misnomer?
      • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
    Source: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
  • 5. Why Data Mining? — Potential Applications
    • Database analysis and decision support
      • Market analysis and management
        • target marketing, customer relation management, market basket analysis, cross selling, market segmentation
      • Risk analysis and management
        • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
      • Fraud detection and management
    • Other Applications
      • Text mining (news group, email, documents) and Web analysis .
      • Intelligent query answering
    Motivation: “Necessity is the Mother of Invention”
    Source: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
  • 6. What is Web Mining?
    • Discovering useful information from the World-Wide Web and its usage patterns
    • Applications
      • Web search e.g., Google, Yahoo,…
      • Vertical Search e.g., FatLens, Become,…
      • Recommendations e.g., Amazon.com
      • Advertising e.g., Google, Yahoo
      • Web site design e.g., landing page optimization
  • 7. Structured vs. Web data mining
    • Traditional data mining
      • data is structured and relational
      • well-defined tables, columns, rows, keys, and constraints.
    • Web data
      • Readily available data rich in features and patterns
        • Text, image, audio, video
      • Spontaneous formation and evolution of
        • topic-induced graph clusters
        • hyperlink-induced communities
      • Challenges
        • Content includes truth, lies, obsolete information, contradictions, …
        • Uncontrolled quality, widely distributed, rapidly changing, heterogeneous/complex data types, no consistent semantics or structure within or across objects, etc. (XHTML & XML?)
  • 8. Size of the Web
    • Number of pages
      • Technically, infinite
        • Because of dynamically generated content
        • Lots of duplication (30-40%)
      • Best estimate of “unique” static HTML pages comes from search engine claims
        • Google = 8 billion, Yahoo = 20 billion
        • Lots of marketing hype
    • Number of unique web sites
      • Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
  • 9. Growth of the Internet * Fig. source: Douglas E. Comer, Computer Networks and Internets with Internet Applications, 4e, Pearson Prentice Hall, 2004
  • 10. Web Mining Taxonomy
    • Web content mining ( WCM )
    • Web usage mining (WUM)
    • Web structure mining (WSM)
    [Diagram: Web Mining branches into Web Content Mining, Web Usage Mining, and Web Structure Mining]
  • 11. WCM & WUM
  • 12.  
  • 13. Main source of the data: Log files
    • The main source of data about the activity of our web server is its log files
    • Typical line of a Log file:
      • 2005-05-29 04:13:40 128.2.215.4 - W3SVC1 WM 160.36.231.167 80 GET /Kdd/wm/wm.zip - 206 64 1507568 551 1816312 HTTP/1.1 www.utk.edu Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0) - http://li.utk.edu/kdd/wm
    • E.g. log files on WinNT/2000 reside in the \winnt\system32\logfiles system directory
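
    A minimal sketch of how the sample line above could be split into named fields. The field names are an assumption: the slide does not show the log's field header, so the list below follows the usual IIS W3C extended-log order, which happens to match the 21 space-separated values in the sample. The function name `parse_log_line` is hypothetical.

```python
# Assumed field order (usual IIS W3C extended log layout); the slide does not
# show the "#Fields" header, so treat these names as a guess.
FIELDS = [
    "date", "time", "c-ip", "cs-username", "s-sitename", "s-computername",
    "s-ip", "s-port", "cs-method", "cs-uri-stem", "cs-uri-query",
    "sc-status", "sc-win32-status", "sc-bytes", "cs-bytes", "time-taken",
    "cs-version", "cs-host", "cs(User-Agent)", "cs(Cookie)", "cs(Referer)",
]


def parse_log_line(line):
    """Split one space-delimited log line into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    # A lone "-" means the field was empty for this request.
    return {name: (None if value == "-" else value)
            for name, value in zip(FIELDS, values)}


sample = ("2005-05-29 04:13:40 128.2.215.4 - W3SVC1 WM 160.36.231.167 80 GET "
          "/Kdd/wm/wm.zip - 206 64 1507568 551 1816312 HTTP/1.1 www.utk.edu "
          "Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+NT+5.0) - "
          "http://li.utk.edu/kdd/wm")

record = parse_log_line(sample)
print(record["c-ip"], record["cs-method"], record["cs-uri-stem"], record["sc-status"])
```

    Spaces inside the user-agent string are logged as `+`, which is why a plain whitespace split is enough for this format.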
  • 14. What kind of problems do we solve?
    • Personalization of web services:
      • Preparing offers (discounts, products, contents) customized for each particular user
    • Understanding of what is going on at the web server:
      • Customer groups identification, behavioral patterns
      • … the goal is to better organize web services
      • … optimization of site navigation
    • Better “banner ad” selection to increase the probability of being clicked by the user
      • … it is not hard to increase the probability
    • Building the psychological profiles based on the texts read by the user
      • … to get more info about the user than he has about himself 
    • Etc. etc. etc.
  • 15. Data analysis methods
    • Log files include sequences of events (click-streams):
      • … methods for analyzing event sequences are usually modified classical methods from the area of Data-Mining for analysis of very large databases
      • Basic methods are modified methods for induction of association rules, clustering, decision trees
    • Other analytic methods are from the areas of Text-Mining , Statistics and Machine-Learning
  • 16. Fig. A General Architecture for Web Usage Mining
  • 17. WUM - web usage miner
    • main goal: navigation pattern discovery
      • sequence of pages through the website
      • typical patterns
      • optimization of site navigation
    • three steps
      • log file cleaning
      • pattern analysis
      • visualization
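
    The three steps above can be sketched roughly as follows. This is an illustration only, not the WUM tool itself: the records, the ignored file suffixes, and the function names are all hypothetical, and "visualization" is reduced to a plain-text report.

```python
from collections import Counter

# Hypothetical, already-parsed log records: (user, path, HTTP status).
records = [
    ("u1", "/index.html", 200), ("u1", "/logo.gif", 200),
    ("u1", "/products.html", 200), ("u1", "/order.html", 200),
    ("u2", "/index.html", 200), ("u2", "/products.html", 200),
    ("u2", "/missing.html", 404), ("u2", "/order.html", 200),
    ("u3", "/index.html", 200), ("u3", "/style.css", 200),
    ("u3", "/products.html", 200),
]

# Assumed list of embedded-resource suffixes that are not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def clean(records):
    """Step 1: keep successful page requests, drop embedded resources."""
    return [(user, path) for user, path, status in records
            if status == 200 and not path.lower().endswith(IGNORED_SUFFIXES)]


def transitions(page_views):
    """Step 2: count page -> next-page moves within each user's sequence."""
    last_page = {}
    counts = Counter()
    for user, page in page_views:
        if user in last_page:
            counts[(last_page[user], page)] += 1
        last_page[user] = page
    return counts


def report(counts):
    """Step 3: a plain-text stand-in for visualization."""
    for (src, dst), n in counts.most_common():
        print(f"{src} -> {dst}: {n}")


counts = transitions(clean(records))
report(counts)
```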
  • 18. Association rules example
    • Items={milk, coke, pepsi, beer, juice}.
    • Support threshold = 3 baskets (an itemset is frequent if it appears in at least 3 baskets).
      • B1 = {m, c, b} B2 = {m, p, j}
      • B3 = {m, b} B4 = {c, j}
      • B5 = {m, p, b} B6 = {m, c, b, j}
      • B7 = {c, b, j} B8 = {b, c}
    • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
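
    A brute-force sketch that recomputes the frequent itemsets for the eight baskets above. It is not an optimized Apriori implementation; it simply counts every candidate of each size and stops at the first size with no frequent set. The function name and the single-letter item codes are just for illustration.

```python
from itertools import combinations

# The eight baskets from the slide (m=milk, c=coke, p=pepsi, b=beer, j=juice).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
MIN_SUPPORT = 3


def frequent_itemsets(baskets, min_support):
    """Return {itemset (sorted tuple): support} for all frequent itemsets."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            support = sum(1 for basket in baskets if set(candidate) <= basket)
            if support >= min_support:
                result[candidate] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return result


frequent = frequent_itemsets(baskets, MIN_SUPPORT)
for itemset, support in frequent.items():
    print(set(itemset), support)
```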
  • 19. Association Rules
    • If-then rules about the contents of baskets.
    • {i_1, i_2, …, i_k} -> j means: “if a basket contains all of i_1, …, i_k, then it is likely to contain j.”
    • Confidence of this association rule is the probability of j given i 1 ,…, i k .
  • 20. Association rules example (cont.)
      • B1 = {m, c, b} B2 = {m, p, j}
      • B3 = {m, b} B4 = {c, j}
      • B5 = {m, p, b} B6 = {m, c, b, j}
      • B7 = {c, b, j} B8 = {b, c}
    • An association rule: {m, b} -> c .
      • Confidence = 2/4 = 50%.
      • Of the four baskets containing {m, b} (B1, B3, B5, B6), B1 and B6 contain c (+); B3 and B5 do not (−).
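
    The confidence of {m, b} -> c can be checked with a few lines. This is a small sketch over the same eight baskets; the helper names `support` and `confidence` are illustrative.

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from basket counts."""
    with_antecedent = support(antecedent, baskets)
    if with_antecedent == 0:
        return 0.0
    return support(antecedent | {consequent}, baskets) / with_antecedent


rule_confidence = confidence({"m", "b"}, "c", baskets)
print(f"confidence({{m, b}} -> c) = {rule_confidence:.0%}")
```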
  • 21. Association rules in Web-logs
    • Searching for rules that connect two or more events , e.g.
      • 60% of the users that visited URL/company/product , also visited company/product/product1.html
      • 30% of the users that visited URL/company/special-offer/ also visited company/product2.html
  • 22. Profiling using time dimension
    • Searching for rules that connect two or more events taking into account time dimension:
      • 30% of the users that visited URL/company/product/product1.html also searched in the last week words W1 and W2 on Yahoo
      • 60% of the users that ordered product1 also ordered product2 within the next 15 days
  • 23. Classification rules
    • Identification of behavior for groups of users; additional information can be obtained from cookies, registration, etc.:
      • Users that frequently visit page /company/products/product3.html are from educational institutions
      • 50% of the users that visited /company/products/product4.html are in age group of 20-25 and live at the sea coast
  • 24. Real-Time Data-Analysis
    • At some web servers there are too many hits to be saved and analyzed off-line:
      • … we have a data stream – no time or space for off-line data analysis (e.g. search engines, shops, banks, news, …)
      • … we would like to understand what is going on to detect e.g. anomalies or changes in trends
    • The solution is in using special type of methods for online event analysis:
      • Methods are able to analyze non-stationary data
      • At each moment results (models) are in human readable form (e.g. decision trees, rules, …)
      • … no need to save Log files
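
    One simple way to analyze a stream without storing it is to keep an exponentially weighted mean and variance and flag values that deviate strongly. The slide does not name a specific method, so this technique, the class name `StreamMonitor`, its parameters, and the hit counts are all assumptions chosen for illustration.

```python
class StreamMonitor:
    """Tracks an exponentially weighted mean/variance of a numeric stream."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=10):
        self.alpha = alpha          # weight of the newest observation
        self.threshold = threshold  # flag values beyond this many std devs
        self.warmup = warmup        # observations to see before flagging
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def update(self, x):
        """Return True if x looks anomalous, then fold x into the model."""
        self.seen += 1
        if self.mean is None:
            self.mean = float(x)
            return False
        deviation = x - self.mean
        std = self.var ** 0.5
        anomalous = (self.seen > self.warmup and std > 0
                     and abs(deviation) > self.threshold * std)
        # Old data fades out geometrically, so the model can follow
        # non-stationary traffic without keeping any history.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


# Hypothetical hits-per-minute stream with one burst.
hits = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 100, 500, 101]
monitor = StreamMonitor()
flagged = [i for i, x in enumerate(hits) if monitor.update(x)]
print("anomalous positions:", flagged)
```

    Only two numbers per monitored quantity are kept in memory, which is the point of the "no need to save log files" remark; the model here is a running statistic rather than a human-readable rule set.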
  • 25. Document visualization
  • 26. Application: From Web Log to Web Loyalty. A study done by the Harvard Business School indicates that an increase of 5% in customer loyalty can increase profitability by 25% to as much as 80%. (Multimedia Live, 2001)
  • 27. Web search [Diagram labels: The Web, Web crawler, Indexer, Indexes, Ad indexes, Search, User]
  • 28. Search engine components
    • Spider (a.k.a. crawler/robot) – builds corpus
      • Collects web pages recursively
        • For each known URL, fetch the page, parse it, and extract new URLs
        • Repeat
      • Additional pages from direct submissions & other sources
    • The indexer – creates inverted indexes
      • Various policies wrt which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
    • Query processor – serves query results
      • Front end – query reformulation, word stemming, capitalization, optimization of Booleans, etc.
      • Back end – finds matching documents and ranks them
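
    A toy sketch of the three components working together. Everything here is hypothetical: the "web" is an in-memory dictionary standing in for HTTP fetches and HTML parsing, the index does no stemming or phrase handling, and the query processor only supports AND queries with no real ranking.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.html": ("web mining finds patterns", ["b.html", "c.html"]),
    "b.html": ("web usage mining studies server logs", ["c.html"]),
    "c.html": ("link structure of the web", ["a.html"]),
    "d.html": ("unreachable page about mining", []),
}


def crawl(seed):
    """Spider: fetch each known URL, parse it, and queue newly found URLs."""
    corpus, queue, seen = {}, deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]  # stands in for an HTTP fetch plus HTML parse
        corpus[url] = text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


def build_index(corpus):
    """Indexer: map each lower-cased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Query processor: pages containing all query words, in URL order."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


index = build_index(crawl("a.html"))
print(search(index, "web mining"))
```

    Note that `d.html` never appears in results because no crawled page links to it, which is why real engines also accept direct submissions.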
  • 29. Typical anatomy of a large-scale crawler.
  • 30. PageRank
    • Used by Google
    • Prioritize pages returned from search by looking at Web structure.
    • Importance of page is calculated based on number of pages which point to it – Backlinks .
    • Weighting is used to give more importance to backlinks coming from important pages.
  • 31. PageRank (cont’d)
    • PR(p) = c · (PR(1)/N_1 + … + PR(n)/N_n)
      • PR(i): PageRank of a page i that points to the target page p
      • N_i: number of links coming out of page i
      • c: normalization constant
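
    The formula above can be evaluated by repeated substitution until the values settle. The sketch below implements only the simplified formula from the slide (no damping factor, so it is not the full algorithm a production engine would use); the four-page graph and the function name are hypothetical, and c is realized by rescaling the ranks to sum to 1 after each pass.

```python
def pagerank(links, iterations=50):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over pages i that link to p.

    `links` maps every page to the list of pages it links to; every link
    target must itself be a key of `links`.
    """
    pages = sorted(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                # Each page splits its rank evenly over its N_i out-links.
                new_rank[target] += rank[source] / len(targets)
        total = sum(new_rank.values())
        # c = 1 / total keeps the ranks summing to 1.
        rank = {page: value / total for page, value in new_rank.items()}
    return rank


# Hypothetical four-page web: D links out but nobody links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

    Page D ends with rank 0 because it has no backlinks, while C collects rank from every other page.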
  • 32. WSM
    • Self-Similarity of Internet Traffic
    • Internet Invariant
    • Scale Free Network
  • 33.  
  • 34.
    • Self-similarity appears in measured Internet traffic but not in Poisson or ordinary telephone traffic
  • 35. Internet Invariant
    • FTP transfers, Pareto tail
    • Interarrival time of packets, Heavy-tailed
    • Connection duration, Lognormal
    • TCP connections/Web session, Heavy-tailed
    • Session duration, Pareto
    Martin J. Fischer et al., “Analyzing the Waiting Time Process in Internet Queueing Systems With the Transform Approximation Method”
  • 36. Random Networks (Erdos/Renyi, 1960)
    • Average path length L ~ ln N, small;
    • Clustering coefficient C ~0 ; C : probability that any two nodes are connected to each other, given that they are both connected to a common node ( probability that friends of friends are friends)
  • 37. Regular Networks
    • High degree of clustering: C ~1
    • Average path length L : large
  • 38. Small-World Networks
    • High degree of clustering: C~1
    • Average path length L: Small (due to shortcuts);
    • D.J.Watts and S.H. Strogatz, Nature 393, pp. 440-442 (1998)
  • 39. Random, Small-World, and Regular Networks
    • Examples of small-world networks: power grid, Internet, social networks, scientific citation networks, movie-actor networks, etc.
    Network        L        C
    Random         Small    Low
    Small-World    Small    High
    Regular        Large    High
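
    The L and C columns can be measured directly on small synthetic graphs. The sketch below builds a regular ring lattice and a rewired version in the spirit of the Watts-Strogatz construction; the sizes, rewiring probability, seed, and function names are assumptions for illustration, and the rewiring rule is a simplified variant rather than a faithful reproduction of the paper.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Regular network: each node is linked to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, p, rng):
    """With probability p, move the far end of each lattice edge to a random node."""
    new = {i: set(neighbours) for i, neighbours in adj.items()}
    for i in sorted(adj):
        for j in sorted(adj[i]):
            if j > i and rng.random() < p:  # visit each original edge once
                candidates = [t for t in new if t != i and t not in new[i]]
                if candidates:
                    t = rng.choice(candidates)
                    new[i].discard(j)
                    new[j].discard(i)
                    new[i].add(t)
                    new[t].add(i)
    return new


def clustering(adj):
    """C: average fraction of a node's neighbour pairs that are themselves linked."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        nb = sorted(neighbours)
        linked = sum(1 for a in range(k) for b in range(a + 1, k)
                     if nb[b] in adj[nb[a]])
        total += linked / (k * (k - 1) / 2)
    return total / len(adj)


def average_path_length(adj):
    """L: mean shortest-path length over all reachable ordered pairs (BFS)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(1)
regular = ring_lattice(100, 4)
small_world = rewire(regular, 0.1, rng)
for name, graph in (("regular", regular), ("small-world", small_world)):
    print(name, "L =", round(average_path_length(graph), 2),
          "C =", round(clustering(graph), 2))
```

    A handful of rewired shortcuts is enough to shorten typical paths considerably while most of the local clustering survives.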
  • 40. Complex Networks: How are they formed?
    • Growth
      • Starting with a small number of nodes, at every time step a new node with a number (m) of links is added
    • Preferential Attachment
      • Barabasi-Albert (BA) model: the probability for node i to acquire a new link grows with its current degree k_i: Π_i(k_i) ~ k_i
      • This results in an algebraic (power-law) degree distribution: P(k) ~ k^(-r)
    A. L. Barabasi and R. Albert, Science 286, 509 (1999)
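
    Growth plus preferential attachment can be simulated in a few lines. This is a sketch under assumptions: the slide does not specify the initial network, so a fully connected core of m + 1 nodes is used here, and the network size, seed, and function name are arbitrary.

```python
import random
from collections import Counter


def barabasi_albert(n, m, rng):
    """Grow a network to n nodes; each new node adds m links to distinct
    existing nodes chosen with probability proportional to their degree."""
    # Assumed starting point: a small fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in `ends` once per link end, so a uniform draw
    # from `ends` picks node i with probability k_i / (sum of all degrees).
    ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for target in sorted(targets):
            edges.append((new, target))
            ends.extend((new, target))
    return edges


rng = random.Random(0)
edges = barabasi_albert(2000, 2, rng)
degree = Counter(node for edge in edges for node in edge)
mean_degree = sum(degree.values()) / len(degree)
print("mean degree:", round(mean_degree, 2))
print("largest degrees:", [k for _, k in degree.most_common(5)])
```

    The few very large degrees next to a small mean are the hubs that a heavy-tailed degree distribution implies.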
  • 41. Consequence of Algebraic Degree Distribution
    • Statistical moments of order n do not exist for n = [r]-1, [r], …, where [r] is the smallest integer greater than r: networks have no characteristic scales (Scale-Free Networks)
    • Examples of SFN: (1) WWW, r(in)~2.1, r(out)~2.4; (2) Internet (r~2.5); (3) Network of movie actors (r~2.3); (4) Electrical power grid of western US (r~4); (5) Scientific citation network (r~3.0)
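
    Why the moments fail to exist can be seen from a short calculation. This is a sketch that assumes a pure power law P(k) ~ k^(-r) above some minimum degree and treats k as continuous; the symbol k_min is introduced here only for illustration.

```latex
\langle k^{n} \rangle
  = \int_{k_{\min}}^{\infty} k^{n} P(k)\, dk
  \;\propto\; \int_{k_{\min}}^{\infty} k^{\,n-r}\, dk ,
\qquad P(k) \sim k^{-r}.
% The integral converges only if n - r < -1, so the n-th moment
% diverges whenever n \ge r - 1.
```

    For the WWW in-degree exponent r ≈ 2.1, for example, the mean (n = 1) is finite but the second moment (n = 2) is not, so the variance gives no characteristic scale.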
  • 42. Alternative Models
    • For scale-free networks, the preferential attachment probability Π_i(k_i) ~ k_i leads to an algebraic degree distribution;
    • For random networks, the attachment probability does not depend on k_i: Π_i(k_i) = constant, which leads to an exponential degree distribution: P(k) ~ e^(-ak);
    • Many realistic networks exhibit the scale-free feature only to a certain extent. Often, algebraic and exponential distributions are observed in different ranges of k.
  • 43. How robust is the Internet?
    • SFN is robust against random attacks while vulnerable to malicious intentional attacks
    Yuhai Tu, How robust is the Internet? Nature , Vol 406, July 2000
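
    The contrast between random failures and intentional attacks can be illustrated with a small simulation. This is a rough sketch, not the analysis from the cited article: the network is a synthetic preferential-attachment graph, and the size, removal fraction, seed, and function names are assumptions.

```python
import random
from collections import Counter


def scale_free_edges(n, m, rng):
    """Preferential-attachment network (assumed core: m + 1 connected nodes)."""
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))  # degree-proportional choice
        for target in sorted(targets):
            edges.append((new, target))
            ends.extend((new, target))
    return edges


def largest_component(nodes, edges):
    """Size of the largest connected component among the surviving nodes."""
    adj = {v: [] for v in nodes}
    for a, b in edges:
        if a in adj and b in adj:
            adj[a].append(b)
            adj[b].append(a)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


rng = random.Random(0)
N, M, FRACTION = 2000, 2, 0.10
edges = scale_free_edges(N, M, rng)
degree = Counter(node for edge in edges for node in edge)
count = int(N * FRACTION)

random_failures = set(rng.sample(range(N), count))
hub_attack = {node for node, _ in degree.most_common(count)}

after_random = largest_component(set(range(N)) - random_failures, edges)
after_attack = largest_component(set(range(N)) - hub_attack, edges)
print("largest component after random failures:", after_random)
print("largest component after attacking hubs: ", after_attack)
```

    Random failures mostly hit low-degree nodes, so the network stays largely connected; removing the same number of hubs fragments it much more.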
  • 44. More topics
    • Privacy Issues In Web Mining
    • Crawling the web
    • Web graph analysis
    • Structured data extraction
    • Classification and vertical search
    • Collaborative filtering
    • Web advertising and optimization
    • Mining web logs
    • Systems Issues
  • 45. Hmm, conclusion?
    • Web-Mining should be used by everybody who offers services on the web and is not satisfied with simple access statistics!
    • The idea is to make something more out of the data already collected by your computer.
    • It is expected that Web-Mining will soon become a standard part of a typical web solution.
    Marko Grobelnik, http://www-ai.ijs.si/MarkoGrobelnik/, Institut Jožef Stefan
  • 46. Acknowledgements & References
    • Fowler, T. B., “A Short Tutorial on Fractals and Internet Traffic,” The Telecommunication Review, Volume 10, Mitretek Systems, McLean, VA, pp. 1-14, 1999.
    • Bastian Germershaus, “Integration of association rules into WUM”
    • Gao Kun, “Analysis Techniques of Discovered Patterns”
    • Mengdan Yu, “Mining E-Business Gold”
    • Stanford CS345, “Data Mining”
  • 47.
    • Thanks~~~