PhD Thesis presentation
Presentation Transcript

  • DETECTION OF DISHONEST BEHAVIORS IN ON-LINE NETWORKS USING GRAPH-BASED RANKING TECHNIQUES
     Francisco Javier Ortega Rodríguez
     Supervised by Prof. Dr. José Antonio Troyano Jiménez
  • Motivation
  • Motivation
     WWW: Web Search
       A new business model: advertisements on the web pages
       More web traffic means more visits to (or views of) the ads
     Search Engine Optimization (SEO) is born
       White Hat SEO
       Black Hat SEO: Web Spam!
  • Motivation
     Social Networks
       Reputation of users is similar to the relevance of web pages
       Higher reputation can imply some benefits
     Malicious users manipulate the TRSs
       On-line marketplaces: money
       Social news sites: slant the contents of the web site
       Simply for “trolling” (for pleasure)
  • Motivation
  • Motivation
     Hypothesis: the detection of dishonest behaviors in on-line networks can be carried out with graph-based techniques, flexible enough to include in their schemes specific information (in the form of features of the elements in a graph) about the network to be processed and the concrete task to be solved.
  • Roadmap
  • Web Spam Detection
     Web spam mechanisms try to increase the web traffic to specific web sites
       Reach the top positions of a web search engine
     Relatedness: similarity to the user query
       Changing the content of the web page
     Visibility: relevance in the collection
       Getting a high number of references
  • Web Spam Detection
     Content-based methods: self promotion
       Hidden HTML code
       Keyword stuffing
  • Web Spam Detection
     Link-based methods: mutual promotion
       Link-farms
       PR-sculpting
  • Roadmap
  • Web Spam Detection
     Relevant web spam detection methods: link-based approaches
       PageRank-based adaptations:
         Truncated PageRank [Castillo et al., 2007]
         TrustRank [Gyöngyi et al., 2004]
  • Web Spam Detection
     Relevant web spam detection methods: link-based approaches
       Pros:
         Tackle the link-based spam methods
         The ranking can be directly used as the result of a user query
       Cons:
         Do not take into account the content of the web pages
         Need human intervention in some specific parts
  • Web Spam Detection
     Relevant web spam detection methods: content-based approaches
       (figure: a database of per-page features such as size, compressibility, and average word length feeds a classifier that labels each web page as Spam or Not Spam)
  • Web Spam Detection
     Relevant web spam detection methods: content-based approaches
       Pros:
         Deal with content-based spam methods
         Binary classification methods
       Cons:
         Very slow in comparison to the link-based methods
         Based on user-specified features
         Do not take into account the topology of the web graph
  • Web Spam Detection
     Relevant web spam detection methods: hybrid approaches
       (figure: the feature database is extended with link-based metrics such as the out-links/in-links ratio and the number of in-links, alongside size, compressibility, and average word length)
  • Web Spam Detection
     Relevant web spam detection methods: hybrid approaches
       Pros:
         Combine the pros of link- and content-based methods
         Really effective in the classification of web pages
       Cons:
         Need user-specified features for both the content- and the link-based heuristics
       Opportunity:
         Do not take advantage of the global topology of the web graph
  • Roadmap
  • PolaritySpam
     Intuition: include content-based knowledge in a link-based system
       (pipeline: Database → Content Evaluation → Selection of Sources → Propagation Algorithm → Ranking)
  • PolaritySpam
     Content Evaluation
       (pipeline stage: the Content Evaluation module reads the web pages from the database)
  • PolaritySpam
     Content Evaluation
       Acquire useful knowledge from the textual content
       Content-based heuristics must be adequate for spam detection, easy to compute, and have the highest discriminative ability
       They yield an a-priori spam likelihood of a web page
  • PolaritySpam
     Content Evaluation
       Small set of heuristics [Ntoulas et al., 2006]:
         Compressibility
         Average length of words
       A high value of these metrics implies an a-priori high spam likelihood for the web page
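The two heuristics above are cheap to compute. The sketch below is an illustrative implementation only: using zlib as the compressor and the raw size ratio as "compressibility" are assumptions, not necessarily the exact feature definitions used in the thesis.

```python
import zlib

def compressibility(text: str) -> float:
    """Ratio of original size to zlib-compressed size. Highly repetitive
    (e.g. keyword-stuffed) pages compress very well, so a high ratio is
    a spam signal."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

# A keyword-stuffed page compresses far better than ordinary prose,
# so its ratio is much higher.
stuffed = "cheap pills " * 500
prose = "A sentence with varied words and no obvious repetition at all."
```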
  • PolaritySpam
     Selection of Sources
       (pipeline stage: the Selection of Sources module reads from the database)
  • PolaritySpam
     Selection of Sources
       Automatically pick a set of a-priori spam and not-spam web pages, Sources− and Sources+, respectively
       Take into account the content-based heuristics: each web page wp_i is described by its metrics M_i = {m_i1, m_i2, ..., m_ij}
  • PolaritySpam
     Selection of Sources
       Most Spamy/Not-Spamy Sources (S-NS)
       Content-based S-NS Sources (CS-NS)
       Content-based Graph Sources (C-GS)
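The S-NS idea (take the a-priori most and least spammy pages as sources) can be sketched as below. `spam_likelihood` is a hypothetical per-page score combining the content-based metrics; the real selection strategies in the thesis are more elaborate.

```python
def select_sources(pages, spam_likelihood, k):
    """Return (Sources+, Sources-): the k pages with the lowest a-priori
    spam likelihood and the k pages with the highest one. The likelihood
    map and its scale [0, 1] are illustrative assumptions."""
    ranked = sorted(pages, key=lambda p: spam_likelihood[p])
    return ranked[:k], ranked[-k:]

likelihood = {"a": 0.05, "b": 0.90, "c": 0.10, "d": 0.95, "e": 0.50}
good, bad = select_sources(list(likelihood), likelihood, 2)
# good holds the least spammy pages, bad the most spammy ones
```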
  • PolaritySpam
     Propagation algorithm
       (pipeline stage: the Propagation Algorithm produces the final Ranking)
  • PolaritySpam
     Propagation algorithm:
       PageRank-based algorithm
       Idea: propagate a-priori information from a specific set of web pages, the Sources
       A-priori scores for the Sources: e_i ≠ 0 ⇔ wp_i ∈ Sources
  • PolaritySpam
     Propagation algorithm:
       Two scores for each web page v_i:
         e_i+ ≠ 0 ⇔ wp_i ∈ Sources+ (set of a-priori non-spam web pages)
         e_i− ≠ 0 ⇔ wp_i ∈ Sources− (set of a-priori spam web pages)
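A minimal sketch of PageRank-style propagation from a source set, where the a-priori vector e is non-zero only on the Sources. This is plain personalized PageRank to illustrate the mechanism; PolaritySpam itself keeps two scores (positive and negative), and details such as dangling-node handling are simplified here.

```python
def personalized_pagerank(out_links, sources, d=0.85, iters=50):
    """Power iteration of PR(i) = (1-d)*e_i + d * sum of in-neighbor shares.
    `out_links` maps node -> list of linked nodes; `sources` get the
    a-priori mass. Dangling mass is dropped for brevity (a sketch, not
    the thesis implementation)."""
    nodes = list(out_links)
    e = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    pr = dict(e)
    for _ in range(iters):
        nxt = {n: (1 - d) * e[n] for n in nodes}
        for u, outs in out_links.items():
            if outs:
                share = d * pr[u] / len(outs)
                for v in outs:
                    nxt[v] += share
        pr = nxt
    return pr

graph = {"s": ["a", "b"], "a": ["b"], "b": ["a"], "x": ["a"]}
scores = personalized_pagerank(graph, sources={"s"})
# mass flows from the source "s" to "a" and "b"; "x", which nothing
# links to, accumulates none
```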
  • PolaritySpam
     (pipeline stage: Propagation Algorithm → Ranking)
  • PolaritySpam
     Evaluation: dataset, baseline, evaluation methods, results
  • PolaritySpam
     Evaluation: dataset
       WEBSPAM-UK2006 (Università degli Studi di Milano)
       Metrics: 98 million pages; 11,400 hosts manually labeled; 7,423 hosts labeled as spam; about 10 million web pages labeled as spam
       Processed with the Terrier IR Platform (http://terrier.org)
  • PolaritySpam
     Evaluation: baseline
       TrustRank [Gyöngyi et al., 2004]
         Link-based web spam detection method
         Personalized PageRank equation
         Propagation from a set of hand-picked web pages
  • PolaritySpam
     Evaluation: evaluation methods
       PR-Buckets: split the ranking into buckets (Bucket 1 ... Bucket N)
  • PolaritySpam
     Evaluation: evaluation methods
       PR-Buckets evaluation metric: number of spam web pages in each bucket
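The PR-Buckets metric above can be sketched in a few lines: partition the ranking (best first) into equal-size buckets and count labeled spam pages per bucket. A good anti-spam ranking pushes the spam counts toward the last buckets. The helper name and toy data are illustrative.

```python
def pr_buckets(ranking, spam_labels, n_buckets):
    """Count labeled spam pages in each of n_buckets equal slices of the
    ranking (best first); the final bucket absorbs any remainder."""
    size = len(ranking) // n_buckets
    counts = []
    for b in range(n_buckets):
        chunk = ranking[b * size:(b + 1) * size] if b < n_buckets - 1 else ranking[b * size:]
        counts.append(sum(1 for p in chunk if p in spam_labels))
    return counts

ranking = ["g1", "g2", "g3", "s1", "g4", "s2", "s3", "s4"]
counts = pr_buckets(ranking, {"s1", "s2", "s3", "s4"}, 4)
# → [0, 1, 1, 2]: most spam lands in the lower buckets
```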
  • PolaritySpam
     Evaluation: Normalized Discounted Cumulative Gain (nDCG)
       Global metric: measures the demotion of spam web pages
       Sums the “relevance” scores of not-spam web pages
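With binary relevance (not-spam pages score 1, spam pages 0, an assumption made here for illustration), nDCG rewards rankings that keep not-spam pages near the top:

```python
import math

def ndcg(ranking, not_spam, k=None):
    """nDCG with binary relevance: DCG of the given ranking divided by
    the DCG of the ideal ranking (all not-spam pages first)."""
    k = k or len(ranking)
    dcg = sum((1.0 if p in not_spam else 0.0) / math.log2(i + 2)
              for i, p in enumerate(ranking[:k]))
    n_rel = min(len(not_spam), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(n_rel))
    return dcg / idcg if idcg else 0.0

# A ranking that demotes spam to the bottom scores strictly higher.
good_first = ndcg(["g1", "g2", "s1", "s2"], {"g1", "g2"})
spam_first = ndcg(["s1", "s2", "g1", "g2"], {"g1", "g2"})
```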
  • PolaritySpam
     Evaluation: PR-Buckets results
       (chart: number of spam web pages per bucket, log scale, buckets 1–10, for TrustRank, S-NS, CS-NS, and C-GS)
  • PolaritySpam
     Evaluation: nDCG results
                    nDCG
       TrustRank    0.7381
       S-NS         0.4230
       CS-NS        0.8621
       C-GS         0.8648
  • PolaritySpam
     Evaluation: content-based heuristics
       (chart: number of spam web pages per bucket, log scale, buckets 1–10, comparing AverageLength, Compressibility, AllMetrics, TrustRank, and PolaritySpam)
  • Roadmap
  • Trust & Reputation in Social Networks
     Trust and reputation are key concepts in social networks
       Similar to the relevance of web pages in the WWW
     Reputation: assessment of the trustworthiness of a user in a social network, according to their behavior and the opinions of the other users
  • Trust & Reputation in Social Networks
     Example: on-line marketplaces
       Trustworthiness can be as decisive as the price
       Higher reputation implies more sales
       Positive and negative opinions
  • Trust & Reputation in Social Networks
     Main goal: gain high reputation
       Obtain positive feedback from the customers
       Sell some bargains and special offers
       Give negative opinions about sellers who may be competitors
       Obtain false positive opinions from other accounts (not necessarily other users)
     Dishonest behaviors!
  • Roadmap
  • Trust & Reputation in Social Networks
     TRSs in the real world: moderators
       Special set of users with specific responsibilities
       Example: Slashdot.org
         A hierarchy of moderators
         A special user, No_More_Trolls, maintains a list of known trolls
       Drawbacks: scalability, subjectivity
  • Trust & Reputation in Social Networks
     TRSs in the real world: unsupervised TRSs
       Users rate the contents of the system (and also other users)
       Scalability problem: rely on the users
       Subjectivity problem: decentralized
       Examples: Digg.com, eBay.com
       Drawback: unsupervised!
  • Trust & Reputation in Social Networks
     Transitivity of Trust and Distrust [Guha et al., 2004]
       Multiplicative distrust: the enemy of my enemy is my friend
       Additive distrust: don’t trust someone not trusted by someone you don’t trust
       Neutral distrust: don’t take into account your enemies’ opinions
  • Trust & Reputation in Social Networks
     Threats to TRSs:
       Orchestrated attacks
       Camouflage behind good behavior
       Malicious spies
       Camouflage behind judgments
  • Trust & Reputation in Social Networks
     Threats to TRSs
       Orchestrated attacks: obtaining positive opinions from other accounts (not necessarily other users)
  • Trust & Reputation in Social Networks
     Threats to TRSs
       Camouflage behind good behavior: feigning good behavior in order to obtain positive feedback from others
  • Trust & Reputation in Social Networks
     Threats to TRSs
       Malicious spies: using an “honest” account to provide positive opinions to malicious users
  • Trust & Reputation in Social Networks
     Threats to TRSs
       Camouflage behind judgments: giving negative feedback to users who may be competitors
  • Roadmap
  • PolarityTrust
     Intuition
       Compute a ranking of the users in a social network according to their trustworthiness
       Take into account both positive and negative feedback
       Graph-based ranking algorithm to obtain two scores for each node:
         PT+(v_i): positive reputation of user i
         PT−(v_i): negative reputation of user i
  • PolarityTrust
     Intuition
       Propagation algorithm for the opinions of the users
       Given a set of trustworthy users, their PT+ and PT− scores are propagated to their neighbors, and so on
  • PolarityTrust
     Algorithm
       Propagation schema of the opinions of the users
       Different behavior depending on the type of relation between the users: positive or negative
       (figure: a positive vote from a trusted node a increases PT+(b) and a negative vote from a increases PT−(c); the votes of a distrusted node d have the opposite effect, increasing PT−(e) and PT+(f))
  • PolarityTrust
     Algorithm
       The scores of the nodes influence the scores of their neighbors; e_i are the a-priori scores of the sources and d is the damping factor:
       PT+(v_i) = (1 − d)·e_i+ + d·[ Σ_{j ∈ In+(v_i)} ( p_ij / Σ_{k ∈ Out(v_j)} |p_jk| )·PT+(v_j) + Σ_{j ∈ In−(v_i)} ( |p_ij| / Σ_{k ∈ Out(v_j)} |p_jk| )·PT−(v_j) ]
       PT−(v_i) = (1 − d)·e_i− + d·[ Σ_{j ∈ In+(v_i)} ( p_ij / Σ_{k ∈ Out(v_j)} |p_jk| )·PT−(v_j) + Σ_{j ∈ In−(v_i)} ( |p_ij| / Σ_{k ∈ Out(v_j)} |p_jk| )·PT+(v_j) ]
       Direct relation with the PT+ of positively voting users; the scores of negatively voting users contribute with inverted polarity
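The propagation above can be sketched as a power iteration over a signed graph. This is a simplified illustration under stated assumptions (uniform a-priori scores on the trusted sources, no Non-Negative or Action-Reaction refinements); the function name and toy graph are hypothetical.

```python
def polaritytrust(edges, trusted, d=0.85, iters=50):
    """Two scores per node: pt_pos (trust) and pt_neg (distrust).
    `edges` is a list of (src, dst, weight) with signed weights;
    `trusted` are the a-priori trustworthy sources. A positive edge
    propagates the voter's polarities as-is; a negative edge swaps them
    (the enemy of my enemy gains positive score)."""
    nodes = {n for e in edges for n in e[:2]}
    out_norm = {n: 0.0 for n in nodes}
    for u, v, w in edges:
        out_norm[u] += abs(w)
    e_pos = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    pt_pos, pt_neg = dict(e_pos), {n: 0.0 for n in nodes}
    for _ in range(iters):
        new_pos = {n: (1 - d) * e_pos[n] for n in nodes}
        new_neg = {n: 0.0 for n in nodes}
        for u, v, w in edges:
            share = abs(w) / out_norm[u]
            if w > 0:   # positive vote: trust and distrust flow unchanged
                new_pos[v] += d * share * pt_pos[u]
                new_neg[v] += d * share * pt_neg[u]
            else:       # negative vote: polarities are swapped
                new_pos[v] += d * share * pt_neg[u]
                new_neg[v] += d * share * pt_pos[u]
        pt_pos, pt_neg = new_pos, new_neg
    return pt_pos, pt_neg

# Source "s" trusts "good" and distrusts "troll"; "good" trusts "ally".
edges = [("s", "good", 1.0), ("s", "troll", -1.0), ("good", "ally", 1.0)]
pos, neg = polaritytrust(edges, trusted={"s"})
```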
  • PolarityTrust
     Non-Negative Propagation
       Problems caused by negative opinions from malicious users
       Solution: dynamically avoid the propagation of these opinions from malicious users
       (figure: the votes of a node a with high PR− would otherwise raise PR−(b) and PR+(c))
  • PolarityTrust
     Action-Reaction Propagation
       Problems caused by dishonest voting attacks:
         Positive votes to malicious users (orchestrated attacks, malicious spies, ...)
         Negative votes to good users (camouflage behind bad judgments)
       React against bad actions (dishonest voting): penalize users who perform these actions, proportionally to the trustworthiness of the nodes being affected
  • PolarityTrust
     Action-Reaction Propagation
       Computation: relation between the number of dishonest votes and the total number of votes
       Applied after each iteration of the ranking algorithm
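The "dishonest votes over total votes" ratio could be sketched as below. This is a hypothetical simplification: it counts a vote as dishonest via a binary comparison of the target's scores, whereas the slide describes a penalty proportional to the affected node's trustworthiness; the function and data names are illustrative.

```python
def action_reaction_penalty(votes, pt_pos, pt_neg):
    """Per-voter penalty: fraction of its votes that are 'dishonest',
    i.e. a positive vote for a node currently ranked as distrusted, or
    a negative vote for a node currently ranked as trusted. Sketch only;
    the thesis weights this by the target's trust scores."""
    penalty = {}
    for voter, cast in votes.items():
        dishonest = 0
        for target, sign in cast:
            if sign > 0 and pt_neg[target] > pt_pos[target]:
                dishonest += 1   # praising a distrusted node
            elif sign < 0 and pt_pos[target] > pt_neg[target]:
                dishonest += 1   # attacking a trusted node
        penalty[voter] = dishonest / len(cast) if cast else 0.0
    return penalty

pt_pos = {"good": 0.8, "troll": 0.1}
pt_neg = {"good": 0.0, "troll": 0.6}
votes = {"spy": [("troll", +1), ("good", -1)], "honest": [("good", +1)]}
pens = action_reaction_penalty(votes, pt_pos, pt_neg)
# → the spy (who praises the troll and attacks the good user) is fully
#   penalized; the honest voter is not penalized at all
```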
  • PolarityTrust
     Complete Formulation
  • PolarityTrust
     Evaluation: datasets, baselines, results
  • PolarityTrust
     Evaluation: datasets
       Barabási & Albert graphs (preferential attachment property)
       Randomly generated attacks
       Metrics of the dataset: 10^4 nodes per graph, 10^3 malicious users, 100 malicious spies
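A minimal preferential-attachment generator of the kind used to build such synthetic graphs: each new node attaches m edges to existing nodes with probability proportional to their degree. This is an illustrative sketch (undirected edges, simple degree-weighted pool), not the exact generator from the thesis.

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow a graph from m seed nodes; each new node picks m distinct
    targets from a pool in which every node appears once per unit of
    degree (plus once initially), realizing preferential attachment."""
    rng = random.Random(seed)
    repeated = list(range(m))          # degree-weighted sampling pool
    edges = []
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        for t in chosen:
            edges.append((new, t))
            repeated.extend([new, t])  # both endpoints gain degree
    return edges

edges = barabasi_albert(50, 2)
deg = {}
for u, v in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1
# every node beyond the seed attaches exactly m = 2 edges, and the
# degree distribution is skewed: a few hubs collect most links
```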
  • PolarityTrust
     Evaluation: datasets
       Slashdot Zoo
         Graph of users in Slashdot.org with Friend and Foe relationships
         Gold standard: the list of Foes of the special user No_More_Trolls
       Metrics of the dataset: 71,500 users in total; 24% negative edges; 96 known trolls
       Source set: CmdrTaco and his friends (6 users in total)
  • PolarityTrust
     Evaluation: baselines
       EigenTrust [Kamvar et al., 2003]: does not take into account negative opinions
       Fans Minus Freaks: (number of friends − number of foes)
       Signed Spectral Ranking [Kunegis et al., 2009]
       Negative Ranking [Kunegis et al., 2009]
  • PolarityTrust
     Evaluation: results on the randomly generated datasets (nDCG)
       Threats   ET     FmF    SR     NR     PTNN   PTAR   PT
       A         0.833  0.843  0.599  0.749  0.876  0.906  0.987
       AB        0.833  0.844  0.811  0.920  0.876  0.906  0.987
       ABC       0.842  0.719  0.816  0.920  0.877  0.903  0.984
       ABCD      0.823  0.723  0.818  0.937  0.879  0.903  0.984
       ABCDE     0.753  0.777  0.877  0.933  0.966  0.862  0.982
     Threats: A: no strategies; B: orchestrated attack; C: camouflage behind good behaviors; D: malicious spies; E: camouflage behind judgments
     Methods: ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking; PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust
  • PolarityTrust
     Evaluation: results on the Slashdot Zoo dataset
              ET     FmF    SR     NR     PTNN   PTAR   PT
       nDCG   0.310  0.460  0.479  0.477  0.593  0.570  0.588
     Methods: ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking; PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust
  • PolarityTrust
     Evaluation: results when trolling Slashdot (nDCG)
       Threats   ET     FmF    SR     NR     PTNN   PTAR   PT
       A         0.310  0.460  0.479  0.477  0.593  0.570  0.588
       AB        0.308  0.460  0.478  0.477  0.593  0.570  0.588
       ABC       0.311  0.460  0.474  0.484  0.593  0.570  0.588
       ABCD      0.370  0.476  0.501  0.501  0.580  0.570  0.586
       ABCDE     0.370  0.475  0.501  0.496  0.580  0.574  0.588
     Threats: A: no strategies; B: orchestrated attack; C: camouflage behind good behaviors; D: malicious spies; E: camouflage behind judgments
     Methods: ET: EigenTrust; FmF: Fans Minus Freaks; SR: Spectral Ranking; NR: Negative Ranking; PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust
  • PolarityTrust
     Evaluation: include a set of sources of distrust
       In the Slashdot Zoo dataset:
         Sources of trust: CmdrTaco and friends
         Sources of distrust: 5 random foes of No_More_Trolls
       Many possible methods to choose the sources of distrust
  • PolarityTrust
     Evaluation: results with sources of trust and distrust (nDCG)
                 Sources of Trust        Sources of Trust & Distrust
       Threats   PTNN   PTAR   PT        PTNN   PTAR   PT
       A         0.593  0.570  0.588     0.846  0.790  0.846
       AB        0.593  0.570  0.588     0.846  0.790  0.846
       ABC       0.593  0.570  0.588     0.846  0.790  0.846
       ABCD      0.580  0.570  0.586     0.775  0.739  0.782
       ABCDE     0.580  0.574  0.588     0.774  0.741  0.781
     Threats: A: no strategies; B: orchestrated attack; C: camouflage behind good behaviors; D: malicious spies; E: camouflage behind judgments
     Methods: PTNN: Non-Negative Propagation; PTAR: Action-Reaction Propagation; PT: PolarityTrust
  • Roadmap
  • Conclusions
     Final Remarks
       Development of two systems for the detection of dishonest behaviors in on-line networks:
         Web Spam Detection: PolaritySpam
         Trust and Reputation: PolarityTrust
       Both propagate some a-priori information:
         Web Spam: the textual content of the web pages
         Trust and Reputation: the sets of trust and distrust sources
  • Conclusions
     Final Remarks: Web Spam Detection
       Unlike existing approaches, content-based knowledge is included in a link-based technique
       Unsupervised methods for the selection of sources
       Information from the sources is propagated through the network
       Two simple metrics improve state-of-the-art methods
  • Conclusions
     Final Remarks: Trust and Reputation in social networks
       Negative links improve the discriminative ability of TRSs
       Propagation strategies to deal with different attacks against a TRS:
         Non-Negative Propagation
         Action-Reaction Propagation
       Interrelated scores modeling the transitivity of trust and distrust
       Flexible enough to be adapted to different situations
  • Conclusions
     Future Work: PolaritySpam
       Applicability of more content-based metrics
       Additional methods for the selection of sources (propagation ability of each source)
       Infer negative relations between web pages according to their textual content, and apply propagation schemas similar to those in PolarityTrust
  • Conclusions
     Future Work: PolarityTrust
       Study other possible attacks: playbook sequences (omniscience of the attackers)
       Analyze the particular cases of the different social networks
       Selection of sources of trust and distrust with link-based methods
       Study other contexts with positive and negative relations: trending topics, authorities in the blogosphere
  • Conclusions
     Future Work: both techniques
       Study the parallelization of both algorithms (many works on the parallelization of PageRank; saves time and memory)
       Detection of spam on social networks: spam messages and spam user accounts
       Recommender systems
       NLP and Opinion Mining techniques in a link-based system, using the positive and negative information
  • Curriculum Vitae
     Academic and research milestones
       2006: Degree in Computer Science
       2006: Funded student in the Itálica research group
       2008: Master of Advanced Studies: “STR: A graph-based tagger generator”
       2010: Research stay at the University of Glasgow, IR Group (Dr. Iadh Ounis and Dr. Craig Macdonald)
  • Curriculum Vitae
     26 contributions to conferences and journals:
       5 JCR journals
       10 international conferences (2 CORE B, 4 CORE C)
       4 ISI Proceedings
       3 Lecture Notes in Computer Science
       3 CiteSeer Venue Impact Ratings
     Research projects
  • Curriculum Vitae
     Contributions related to the thesis
       (diagram: contributions grouped by venue type (national conferences, international conferences, JCR journals), covering TextRank for tagging, system combination methods, STR, PolarityRank, PolaritySpam, PolarityTrust, Web Spam Detection, and improving a tagger generator in IE)
  • Curriculum Vitae
     Contributions related to the thesis: TextRank for tagging, system combination methods, improving a tagger generator in IE
       TextRank como motor de aprendizaje en tareas de etiquetado, SEPLN 2006
       Bootstrapping Applied to a Corpus Generation Task, EUROCAST 2007
       Improving the Performance of a Tagger Generator in an Information Extraction Application, Journal of Universal Computer Science (2007)
  • Curriculum Vitae
     Contributions related to the thesis: STR
       STR: A Graph-based Tagging Technique, International Journal on Artificial Intelligence Tools (2011)
  • Curriculum Vitae
     Contributions related to the thesis: PolarityRank, Web Spam Detection
       A Knowledge-Rich Approach to Featured-based Opinion Extraction from Product Reviews, SMUC 2010 (at CIKM 2010)
       Combining Textual Content and Hyperlinks in Web Spam Detection, NLDB 2010
  • Curriculum Vitae
     Contributions related to the thesis: PolarityTrust, PolaritySpam
       PolarityTrust: Measuring Trust and Reputation in Social Networks, ITA 2011
       PolaritySpam: Propagating Content-based Information Through a Web Graph to Detect Web Spam, International Journal of Innovative Computing, Information and Control (2012)
  • DETECTION OF DISHONEST BEHAVIORS IN ON-LINE NETWORKS USING GRAPH-BASED RANKING TECHNIQUES
     Francisco Javier Ortega Rodríguez
     Supervised by Prof. Dr. José Antonio Troyano Jiménez