Peers store only what they need (“common good” on par with “own welfare”)
No tight control of topology/content
Support partial-match queries
Have search scope (orders of magnitude improvement over Gnutella)
Make implicit use of latent semantics
Provably good on a reasonable model
Very good on simulations
P2P search framework
Search queries are propagated on the overlay (from peer to a neighbor peer).
When a peer receives a query, it checks if it can satisfy it; decreases hop count; and forwards it to a subset of its neighbors.
Each search includes a query and a “propagation rule”, which determines which neighbors the search is propagated to.
“DHTs”: propagation rule = hash of query
“Gnutella”: propagation rule is independent of the query
Associative: propagation rules are predicates (guide rules)
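The framework above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the in-memory `overlay` and `index` dictionaries, the function names `propagate`, `flood_rule`, and `possession_rule`, and the breadth-first forwarding order are all assumptions made for the example. In particular, `possession_rule` peeks at neighbors' indexes directly, whereas a real overlay would already store only suitable neighbors per rule.

```python
from collections import deque

def flood_rule(peer, neighbors, query):
    """Gnutella-style rule: independent of the query, forward to every neighbor."""
    return list(neighbors)

def possession_rule(item, index):
    """Associative (guide) rule: forward only to neighbors that possess `item`.
    Hypothetical simplification: reads the neighbors' indexes directly."""
    def rule(peer, neighbors, query):
        return [n for n in neighbors if item in index.get(n, set())]
    return rule

def propagate(overlay, index, start, query, rule, hops):
    """Propagate `query` from `start`; return (peers holding it, peers probed)."""
    hits, seen = [], {start}
    frontier = deque([(start, hops)])
    while frontier:
        peer, ttl = frontier.popleft()
        if query in index.get(peer, set()):   # can this peer satisfy the query?
            hits.append(peer)
        if ttl == 0:                          # hop count exhausted
            continue
        for nxt in rule(peer, overlay.get(peer, []), query):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, ttl - 1))  # decrease hop count and forward
    return hits, len(seen)

# Toy overlay (hypothetical peers and items).
overlay = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": [], "p4": []}
index = {"p2": {"A"}, "p3": {"B"}, "p4": {"A", "song"}}
print(propagate(overlay, index, "p1", "song", flood_rule, 2))
print(propagate(overlay, index, "p1", "song", possession_rule("A", index), 2))
```

In this toy overlay both rules find the item at p4, but the possession rule never probes p3, which does not hold item A.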
What do we mean by “latent semantics” ?
Challenges in using latent semantics in P2P setting
Our proposal: search propagation via Possession rules
Possession rules overlays
Possession rules search strategies: Rapier , GAS
Models for “blind search” strategies (gnutella)
Analysis in the Itemsets model
More on GAS search strategy
View of P2P file sharing network
What is latent semantics?
The peer/item matrix is a “market basket” dataset, similar to buyers/items, documents/terms, Web pages/hyperlinks, and movies/viewers.
Applications for extracting patterns from market basket data: information retrieval, collaborative filtering, Web search, marketing, recommendation systems, … (clustering, search, association rules)
Selections people make are dependent:
If you buy baby formula, you are more likely to buy diapers.
If two people loved a show, they are more likely to agree on other shows.
So, for P2P search: direct queries to peers whose interests match yours
Overlay topology (“networking aspects”) must be coupled with search strategy (“Information Retrieval/Data-Mining”)
“Traditional” IR and data-mining tools are not adapted to the highly distributed P2P setting.
Similarity metrics/clustering/ranking involve matrix operations on the “market basket” data: principal component analysis (LSI), eigenvalue computations, association rules, …
Rule( O ): do you possess item O ?
Peer maintains a possession rule for each item in its index (subset if index is large)
Search strategy : a sequence of possession rules (with “hop counts”/search size limit)
Making this work:
“Network”: how to build an overlay that supports possession rules
“IR/DM”: design search strategies that use possession rules (and work!)
Possession-rules overlays
Index of P26 (Peer26):
item    Rule(item) neighbors
A       P11, P7, P3
B       P2, P6, P9
C       P13, P15, P1
D       P4, P5, P10
Rules/Items: Rule(A), Rule(B), Rule(C), Rule(D)
Example search strategy of P26:
2 hops in rule(A)
4 hops in rule(B)
6 hops in rule(C)
4 hops in rule(A)
3 hops in rule(D)
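The per-peer state in this example is small enough to write down directly. The sketch below is only one possible representation: the `Peer` class, its field names, and the helper `hops_per_rule` are hypothetical, while the neighbor lists and the strategy are taken from the P26 example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Peer:
    name: str
    # rule_neighbors[item] = known neighbors that also possess `item`
    rule_neighbors: Dict[str, List[str]] = field(default_factory=dict)

    def rules(self) -> List[str]:
        """Possession rules this peer participates in."""
        return sorted(self.rule_neighbors)

# A search strategy is a sequence of (rule, hop limit) steps.
Strategy = List[Tuple[str, int]]

def hops_per_rule(peer: Peer, strategy: Strategy) -> Dict[str, int]:
    """Total hop budget per rule; rejects rules the peer does not participate in."""
    totals: Dict[str, int] = {}
    for item, hops in strategy:
        if item not in peer.rule_neighbors:
            raise ValueError(f"{peer.name} does not participate in rule({item})")
        totals[item] = totals.get(item, 0) + hops
    return totals

p26 = Peer("P26", {
    "A": ["P11", "P7", "P3"],
    "B": ["P2", "P6", "P9"],
    "C": ["P13", "P15", "P1"],
    "D": ["P4", "P5", "P10"],
})
strategy = [("A", 2), ("B", 4), ("C", 6), ("A", 4), ("D", 3)]
print(p26.rules())
print(hops_per_rule(p26, strategy))
```

The same rule may appear more than once in a strategy, as rule(A) does here, so the helper sums hop limits per rule.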
Rules/Items: Rule(A), Rule(B), Rule(C), Rule(D)
Blind searching for O takes 13 probes
Searching with rule(O) takes 2 probes
When you find O , you often discover multiple peers that have O; when you give O , the searcher informs you of other peers with O .
Peers that have O can find other peers that have O
Coverage: The induced overlay on the peers that satisfy each rule consists of large connected components.
Small degree: Each peer participates in a limited number of rules (yet overall there is a large number of rules). For each rule it “participates” in, the peer maintains several participating neighbors.
Overlay and search boost each other (easy to find appropriate neighbors for each rule):
Network is “gnutella-like”, within each rule (… can use “super-peer” overlay within each rule !!)
To beat blind search, associative search should probe peers that are more likely to answer than “random peers”
RAPIER : Random Possession Rule – crudest strategy
GAS : Greedy Selection – refined strategy
Urand : (“gnutella”) all peers have same likelihood of being probed in each query
Prand : (“gnutella modified”) peers are probed proportionally to their index size ( RAPIER has same bias )
RAPIER – Random Possession Rule: the simplest possession-rule-based strategy
RAPIER Search strategy:
Repeat until found :
Pick a random item O from your index
Search peers that have this item (using rule( O ))
Straightforward to implement on top of a possession-rule overlay network
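The loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `rule_members` and `index_of` are hypothetical stand-ins for the overlay, each iteration probes a single random peer from the chosen rule, and a `max_probes` limit replaces the open-ended “repeat until found”.

```python
import random

def rapier_search(my_index, rule_members, index_of, target, max_probes, rng=None):
    """RAPIER sketch: repeatedly pick a random item O from the searcher's own
    index and probe a peer reached through rule(O).

    my_index     -- items held by the searching peer
    rule_members -- rule_members[O] = non-empty list of peers reachable via rule(O)
    index_of     -- index_of[peer] = set of items held by that peer
    Returns (peer that holds target or None, number of probes used).
    """
    rng = rng or random.Random()
    items = sorted(my_index)
    probes = 0
    while probes < max_probes:
        item = rng.choice(items)                # pick a random item O from the index
        peer = rng.choice(rule_members[item])   # probe one peer that has O
        probes += 1
        if target in index_of[peer]:
            return peer, probes
    return None, probes

# Toy data (hypothetical): the searcher holds only item A.
rule_members = {"A": ["p1"]}
index_of = {"p1": {"A", "X"}}
print(rapier_search({"A"}, rule_members, index_of, "X", max_probes=5))
print(rapier_search({"A"}, rule_members, index_of, "Y", max_probes=5))
```

Passing a seeded `random.Random` as `rng` makes runs repeatable, which is convenient when comparing against a blind-search baseline.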
Analysis: Itemsets Model
Items belong to “topics.” There are very many topics, but each peer selects items only from a fixed set of topics. Topic popularities can vary widely, but each peer has equal interest in each of “its” topics.
We show that
RAPIER is at least as good as Prand
RAPIER is better than Prand when peers have fewer topics
Simple model that hints at what is going on…
Data : used Client/Hostname matrix from proxy logs as peer/item matrix. Each entry, in turn, is treated as a search item.
Similarly-structured “market basket” data
Has rare items (which current P2P networks don’t support)
No universal model for market basket data
Can’t get a full index for many peers from current P2P networks… and these networks don’t reflect well on rare items.
Metric : ESS (Expected Search Size – number of peers probed till search is resolved). CDF of fraction of “searches” that have ESS below “x”.
ESS – Expected Search Size
ESS : 1/(success probability in each probe) (when probes are “independent” – not true for GAS)
Probe success probability :
Urand : fraction of peers that have the item in their index
Prand : weight of each peer is its index size divided by sum of index sizes of all peers.
Success prob: total weight of the peers that have the item (the weights of all peers sum to 1)
RAPIER: the average, over the possession rules the peer participates in, of the fraction of peers in the rule that have the item.
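The three probe success probabilities and the resulting ESS can be computed directly from a peer/item matrix. The sketch below makes several assumptions that are not in the slides: the matrix is a dictionary `index_of` from peer to item set, the Prand success probability is taken to be the total normalized weight of the peers holding the item, and the searcher is excluded from the membership of its own rules. The toy data is invented for illustration.

```python
def urand_success(index_of, item):
    """Urand: fraction of peers that have the item in their index."""
    return sum(item in idx for idx in index_of.values()) / len(index_of)

def prand_success(index_of, item):
    """Prand: peers are weighted by index size; sum the weights of holders."""
    total = sum(len(idx) for idx in index_of.values())
    return sum(len(idx) for idx in index_of.values() if item in idx) / total

def rapier_success(index_of, searcher, item):
    """RAPIER: average, over the searcher's rules, of the fraction of rule
    members that have the item (the searcher itself is excluded)."""
    fractions = []
    for rule_item in index_of[searcher]:
        members = [p for p, idx in index_of.items()
                   if p != searcher and rule_item in idx]
        if members:
            holders = sum(item in index_of[p] for p in members)
            fractions.append(holders / len(members))
    return sum(fractions) / len(fractions) if fractions else 0.0

def ess(success_prob):
    """Expected search size for independent probes: 1 / success probability."""
    return float("inf") if success_prob == 0 else 1.0 / success_prob

# Toy peer/item matrix (hypothetical); peer "s" searches for item "x".
index_of = {
    "s":  {"a", "b"},
    "p1": {"a", "x"},
    "p2": {"a"},
    "p3": {"b", "x", "y"},
    "p4": {"z"},
}
for name, p in [("Urand", urand_success(index_of, "x")),
                ("Prand", prand_success(index_of, "x")),
                ("RAPIER", rapier_success(index_of, "s", "x"))]:
    print(name, round(p, 3), round(ess(p), 2))
```

On this toy matrix the success probabilities are 2/5 for Urand, 5/9 for Prand, and 3/4 for RAPIER, so RAPIER has the smallest ESS; the numbers only illustrate the formulas and say nothing about real workloads.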
When searching by possession rules we have bias towards peers that participate in more rules/ have more items.
But, with this bias, a strategy has better chance of finding what it is looking for! So…
We show that the likelihood of being probed is proportional to number of rules you participate in.
Prand “blind search” strategy has same bias.
Thus, it is “fair” to compare Prand search with possession-rule based RAPIER
GAS …Refining RAPIER
Some rules are better than others (e.g., possession of a very popular item carries weaker information)
Unsuccessful search carries information : suppose you lost something, you think you lost it at home. You search home going through various closets and drawers and don’t find it, then you may decide to go search the office, even if you have not completed an exhaustive search at home. What happened? The posterior distribution on the item’s location had changed as a result of the search.
GAS – Greedy Strategy
[Figure: comparison of search strategies. Legend: Urand = blind search (Gnutella); Prand = Gnutella modified; RAPIER, GAS = our algorithms. Panels: rare items (present in 1% of peers); rarer items (0.1% of peers); even rarer items (0.01% of peers).]
GAS – Greedy Strategy
Idea : use the search strategy that would have optimized your search on previous queries.
Caveat: computing this optimal strategy is NP-complete
Can do: greedy approximation strategy: GAS
Initialize the “query vector” to a uniform distribution over previous selections.
Iterate the following:
Apply the possession rule that maximizes success probability with respect to the query posterior.
Update the query posterior.
Theorem : GAS is a constant factor approximation of the optimal strategy
Building GAS strategies
Take a sample of items currently in your index: D, E, F, G.
“Search” for these items in each possession rule you participate in: A, B, C.
Obtain a matrix: fraction of peers with item x in rule(y)
Rule()     Item:  D      E      F      G
rule(A)           0.03   *      0.2    *
rule(B)           *      0.04   *      0.1
rule(C)           0.1    0.2    0.03   *
GAS strategy (example)
C,C,C,A,C,C,A,C,A,C,B,B,A,C,B,B,C,A,B,B,C
GAS search of size 21:
10 probes in rule(C)
6 probes in rule(B)
5 probes in rule(A)
RAPIER search of size 21:
7 probes in rule(C)
7 probes in rule(B)
7 probes in rule(A)
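The greedy construction can be sketched on the example matrix. Several points here are assumptions rather than statements from the slides: the table is read as rule(A): D=0.03, F=0.2; rule(B): E=0.04, G=0.1; rule(C): D=0.1, E=0.2, F=0.03; the “*” entries are treated as 0; each step counts as one probe; and the posterior is updated by Bayes' rule after an unsuccessful probe. The function name `gas_strategy` is hypothetical.

```python
def gas_strategy(matrix, items, length):
    """Greedy sketch: at each step apply the rule with the highest success
    probability under the current query posterior, then update the posterior
    as if that probe had failed."""
    posterior = {x: 1.0 / len(items) for x in items}  # uniform over past selections
    sequence = []
    for _ in range(length):
        def success(rule):
            # Probability that one probe in `rule` finds the (unknown) query item.
            return sum(posterior[x] * matrix[rule].get(x, 0.0) for x in items)
        best = max(sorted(matrix), key=success)
        sequence.append(best)
        # Bayes update: items the chosen rule was likely to find become less likely.
        for x in items:
            posterior[x] *= 1.0 - matrix[best].get(x, 0.0)
        total = sum(posterior.values())
        for x in items:
            posterior[x] /= total
    return sequence

# matrix[rule][item] = fraction of peers in rule(rule) that have the item.
# Missing entries stand for "*" and are ASSUMED to be 0.
MATRIX = {
    "A": {"D": 0.03, "F": 0.2},
    "B": {"E": 0.04, "G": 0.1},
    "C": {"D": 0.1, "E": 0.2, "F": 0.03},
}
print(",".join(gas_strategy(MATRIX, ["D", "E", "F", "G"], 7)))
```

Under these assumptions the first seven choices are C, C, C, A, C, C, A, which agrees with the start of the example sequence; no claim is made about the remaining positions. The switch to rule(A) at step four shows unsuccessful probes shifting weight away from the items that rule(C) covers well.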
We proposed a general framework for associative P2P search: exploit patterns inherent in human selections to boost search, adapted to the P2P setting.
Search strategies and the overlay structure are “symbiotic” and guided/boosted by previous selections/queries.
“Common good” on par with “own welfare”: all data maintained by each peer has direct personal benefit (like Gnutella). Helping others helps you…
Possession rules:
Strategies are “approximations” to “standard” similarity metrics… that work!
Easy to find other sources of desired item (for alternative/parallel downloads)
IR-DM: association rules/collaborative filtering/Web search
P2P networks: unstructured networks; DHTs
DHTs have “symbiotic” overlay/search strategy
Caching at peers (Freenet) adapt overlay according to search
Crespo/Garcia-Molina 02 – routing indexes
The system isolates “topics” and maps queries/items to topics.
Peer knows “summary” of what can be reached thru it/each neighbor
Query keywords are used to select a neighbor who is a best match
Differences from our approach :
No connection between search and overlay topology
Uses only text/keywords. We use co-location associations between items.
CG02: tradeoff between topic divergence (all nodes ending up with a similar index “summary”) and restricted coverage (the number of peers included in each peer's summary)
neurogrid.net (Sam Joseph, U. Tokyo) “agent” text-based approach
Peers learn and remember content of other peers
Integrate text matching (of query keywords) in search strategy (use rule(O) if query keywords match O’s metadata)
Select which possession rules to participate in (e.g., using item popularity heuristic or GAS-like selection)
Search strategy gives more weight to more recent selections (they are more indicative of the next query)
Explore other types of propagation rules
P2P “communities” ?
Integrate “Recommendation Systems” in P2P ?
Some Extra Comments…
Issues with straightforward importing of IR techniques
Vector space approach
Why do we need to use several propagation rules in a search? (when searching according to “examples” in the index)
“Straight” IR vector-space approach
#neighbors = O(dimension), so we want a small dimension
Yet matrix operations, e.g., principal component analysis (LSI), are hard in our distributed setting
Yet, each peer should be able to compute the mapping for its queries and/or index
Proximity metric alone is insufficient (Need different propagation rules)
Peers are mapped to vectors, according to their index content. Queries are mapped to the vectors in the same space.
Overlay topology is correlated with distances in this vector space (bias towards closer peers)
Search propagation targets regions of the space that are “closest” to the query.
Why we need several propagation rules for the same query – “decision-tree-like” search
Propagation rule = approximate interest area
Each peer covers several interest areas; peers have different sets of interest areas.
Peer query: 80% basketball, 20% polo
“World” index: 5% basketball, 0.1% polo
All “basketball” lovers would be close matches; but need to direct search to more “polo” lovers