Addressing scalability challenges in peer-to-peer search

  • 582 views

Addressing scalability challenges in peer-to-peer search: Two-layered semantic search and cloud assistance to handle query spikes

Published in Technology, Education
Transcript

  • 1. Addressing scalability challenges in peer-to-peer search PhD seminar 4 Feb, 2014 Harisankar H, PhD scholar, DOS lab, Dept. of CSE Advisor: Prof. D. Janakiram http://harisankarh.wordpress.com
  • 2. Outline • Issues with centralized search – Can peer-to-peer search help? • Scalability challenges in peer-to-peer search • Proposed architectural extensions – Two-layered architecture for peer-to-peer concept search – Cloud-assisted approach to handle query spikes
  • 3. Centralized search scenario • Scenario – Search engines crawl available content, index it and maintain it in data centers – User queries are directed to data centers, processed internally, and results sent back – Centrally managed by a single company [Diagram labels: Content, End users, Datacenters]
  • 4. Some issues with centralized search – Privacy concerns • All user queries accessible from a single location – Centralized control • Individual companies decide what to (and what not to) index, rank, etc. – Transparency • Complete details of ranking, pre-processing, etc. are not made publicly available • Concerns of censorship and doctoring of results
  • 5. Some issues with centralized search (contd.) • Uses mostly syntactic search techniques – Based on words or multi-word phrases – Low quality of results due to the ambiguity of natural language • Issues with centralized semantic search – Difficult to capture the long tail of users' niche interests • Requires rich human-generated knowledge bases in numerous niche areas
  • 6. Peer-to-peer search approach • Edge nodes in the internet participate in providing and using the search service • Search as a collaborative service • Crawling, indexing and search distributed across the peers
  • 7. How could peer-to-peer search help? • Each user query can be sent to a different peer among millions – Obtaining query logs in a single location is difficult – Reduced privacy concerns • Distributed control across numerous peers – Avoids centralized control • Search application available with all peers – Better transparency in ranking etc. • Background knowledge of peers can be utilized for effective semantic search – Can help improve quality of results • Led to a lot of academic research in the area as well as real-world p2p search engines* * e.g., faroo.com, yacy.net; YacyPi Kickstarter project
  • 8. Realizing peer-to-peer search • Distribution of search index – Term partitioning • Responsibility for individual terms assigned to different peers – E.g., peer1 is currently responsible for the term “computer” • Term-to-peer mapping achieved through a structured overlay (e.g., a DHT) Image src: http://wwarodomfr.blogspot.in/2008/09/chord-routing.html
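The term-to-peer mapping above can be sketched by hashing each term into the overlay's identifier space. This is a minimal illustration, not the deck's actual overlay: it assumes a densely populated ring where every identifier maps directly to a live peer, and `term_to_peer`/`num_peers` are names invented for the sketch.

```python
import hashlib

def term_to_peer(term: str, num_peers: int) -> int:
    """Map a term to the peer responsible for it by hashing the term
    into the identifier space (simplified: peers fill the ring densely,
    so a modulo stands in for the DHT's successor lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    key = int.from_bytes(digest, "big")
    return key % num_peers

# The peer returned here would hold the full posting list for "computer".
peer = term_to_peer("computer", num_peers=10_000)
```

In a real DHT such as Chord, the lookup walks the overlay in O(logN) hops instead of computing the owner locally.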
  • 9. Scalability challenges in peer-to-peer search • Peers share only idle resources • Peers join/leave autonomously • No SLA • Limited individual resources lead to – Peer bandwidth bottleneck during query processing • Particularly queries involving multiple terms (index transfer between multiple peers) – Instability during query spikes • Knowledge management issues at large scale – Difficult to have consensus at large scale – Need wide understanding and have to meet the requirements of a large, diverse group
  • 10. Two-layered architecture for peer-to-peer concept search* • Peers organized as communities based on common interest • Each community maintains its own background knowledge to use in semantic search – Maintained in a distributed manner • A global layer with aggregate information to facilitate search across communities • Background knowledge bases extend from minimal universally accepted knowledge in the upper layer • Search, indexing and knowledge management proceed independently in each community *joint work with Prof. Fausto Giunchiglia and Uladzimir Kharkevich, Univ. of Trento
  • 11. Two-layered architecture for peer-to-peer concept search [Diagram: global layer holding universal knowledge (UK) and a community index; communities 1–3 below, each with its own background knowledge (BK-1..BK-3) and document index]
  • 12. Two-layered architecture • Global layer – retrieves relevant communities for query based on universal knowledge • Community layer – retrieves relevant documents for query based on background knowledge of community
  • 13. Overcoming the shortcomings of singlelayered approaches • Search can be scoped only to the relevant communities for a query – Results in less bandwidth-related issues • Two layers make knowledge management scalable and interoperable – Niche interests supported at community-level background knowledge bases – Minimal universal knowledge for interoperability • Search within community based on community’s background knowledge – Focused interest of community helps in better term-toconcept mapping
  • 14. Two-layered approach • Index partitioning – Uses partition-by-term • Posting list for each term stored in different peers – Uses a Distributed Hash Table (DHT) to realize dynamic term-to-peer mapping • O(logN) hops for each lookup • Overlay network – Communities and global layer maintained using a two-layered overlay • Based on our earlier work on computational grids* – O(logN) hops for lookup even with two layers *M.V. Reddy, A.V. Srinivas, T. Gopinath, and D. Janakiram, “Vishwa: A reconfigurable P2P middleware for Grid Computations,” in ICPP'06
  • 15. Two-layered approach • Community management – Similar to public communities in flickr, facebook etc. • Search within community – Uses Concept Search* as underlying semantic search scheme • Extends syntactic search with available knowledge to realize semantic search • Falls back to syntactic search when no knowledge is available *Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu, “Concept search”, ESWC 2009
  • 16. Two-layered approach • Knowledge representation – Term -> concept mapping – Concept hierarchy • Concept relations expressed as subsumption relations • Concepts in documents/queries extracted – by analyzing words and natural language phrases – Noun phrases translated into conjunctions of atomic concepts (complex concepts) • small-1 Λ dog-2 – Documents/queries represented as enumerated sequences of complex concepts • E.g.: 1: small-1 Λ dog-2, 2: big-1 Λ animal-3
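The representation above can be illustrated with a small sketch. The `(lemma, sense-id)` tuple encoding of an atomic concept such as small-1, and the `SENSES` toy term-to-concept map, are conventions of this sketch, not data structures from Concept Search itself.

```python
def to_complex_concept(phrase, sense_map):
    """Translate a noun phrase into a conjunction (set) of atomic
    concepts using a term -> sense-id map. An unknown word keeps a
    None sense, mirroring the fallback to plain syntactic matching."""
    return frozenset((word, sense_map.get(word)) for word in phrase.split())

SENSES = {"small": 1, "dog": 2, "big": 1, "animal": 3}  # toy term->concept map

# A document is an enumerated sequence of complex concepts.
doc = {
    1: to_complex_concept("small dog", SENSES),   # small-1 Λ dog-2
    2: to_complex_concept("big animal", SENSES),  # big-1 Λ animal-3
}
```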
  • 17. Two-layered approach • Relevance model – Documents having more specific concepts than query concepts considered relevant • E.g., poodle-1 is relevant when searching for dog-2 – Ranking done by extending the tf-idf relevance model • Incorporates term-concept and concept-concept similarities as well • Distributed knowledge maintenance – Each atomic concept indexed on the DHT with an id – The node responsible for each atomic concept id also stores the ids of • All immediately more specific atomic concepts • All concepts on the path from the atomic concept to the root
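The "more specific" relevance test above can be sketched as a walk up the subsumption hierarchy. The `HIERARCHY` edges here are a toy example invented for the sketch, not knowledge from the paper's background knowledge bases.

```python
# Subsumption edges: child concept -> its immediate more general concept.
HIERARCHY = {"poodle-1": "dog-2", "dog-2": "animal-3"}

def is_relevant(doc_concept, query_concept):
    """A document concept matches a query concept if it is the same
    concept or a more specific one (a descendant in the hierarchy)."""
    while doc_concept is not None:
        if doc_concept == query_concept:
            return True
        doc_concept = HIERARCHY.get(doc_concept)
    return False
```

For example, `is_relevant("poodle-1", "dog-2")` holds, while the more general `animal-3` does not match a query for `dog-2`.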
  • 18. Two-layered approach • Document indexing and search – Concepts mapped to peers using the DHT – Query routed to peers responsible for the query concepts and related concepts – Results from multiple peers combined to give the final results • Global search – The popularity (document frequency) of each concept indexed in the upper layer – Tf-idf extended with universal knowledge to search for communities – Combined score of doc = (score of community) * (score of doc within community)
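The global-search scoring rule on this slide (combined score = community score × within-community score) can be sketched as follows; the data shapes and the `global_rank` helper are hypothetical, chosen only to illustrate the formula.

```python
def global_rank(community_results, k=10):
    """community_results: {community: (community_score, [(doc, doc_score), ...])}
    Combined score of a doc = (score of community) * (score of doc
    within community); results are merged and ranked globally."""
    combined = [
        (doc, c_score * d_score)
        for c_score, docs in community_results.values()
        for doc, d_score in docs
    ]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)[:k]

# A document from a highly relevant community outranks a locally
# strong document from a barely relevant one.
results = global_rank({
    "linux":   (0.9, [("d1", 0.8), ("d2", 0.3)]),
    "cooking": (0.2, [("d3", 0.9)]),
})
```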
  • 19. Experiments • Single layer syntactic vs semantic: TREC ad-hoc, TREC8 (simulated with 10,000 peers) – WordNet as knowledge base • Single vs 2 layer – 18 communities (docs: categories in dMoz*) • 18*1000 = 18,000 peers simulated – UK = domain-independent concepts and relations from WordNet – BK = UK + WordNet domains + YAGO – BK mapped to communities – Queries selected as directory path to a specific subdirectory – Standard result: documents in that subdirectory *http://www.dmoz.org/
  • 20. Experiments • Tools – GATE (NLP), Lucene (search library), PeerSim (peer-to-peer system simulator) • Performance metrics – Quality • Precision@10, precision@20 • Mean average precision (MAP) – Network bandwidth • Average number of postings transferred – Response time • s-postings, s-hops
  • 21. Results (1 layer syntactic vs semantic) • Quality improved • But, cost also increased
  • 22. Results (1 layer vs 2 layer) • Quality improved • Cost decreased – 94% decrease in posting transfer for opt. case
  • 23. Two-layered approach results • Proposed approach gives better quality and performance over single-layered approaches – Performance can be further improved using optimizations like early termination • But the issue of query spikes remains
  • 24. Query spikes in peer-to-peer search • Query spikes can lead to instability – Replication/caching insufficient due to high document creation rate* *the rate of queries related to “Bin Laden” increased by 10,000 times within one hour on Google on May 1, 2011, after Operation Geronimo
  • 25. Some background • Term-partitioned search – Responsibility for a term/popular query assigned to an individual peer • Updates and queries are sent to the responsible peer, which processes them – Term -> peer mapping done using a Distributed Hash Table (DHT) [Diagram label: top-k result list of q]
  • 26. Cloud-assisted p2p search(CAPS) • Offload responsibilities of spiking queries to public cloud
  • 27. Issues in realizing CAPS • Maintaining a full index copy in the cloud is very expensive – Storage alone would cost more than 5 million dollars per month* • Approach: transfer only the relevant index portion to the cloud – Needs to be performed fast considering the effect on user experience (result quality, response time) • Effect on the desirable properties of peer-to-peer search – Privacy, transparency, decentralized control etc.
  • 28. CAPS components • Switching decision maker – Decides when to switch – Simple example: “switch when the query rate increases by X% within the last Y seconds” • Switching implementor – Switching algorithm to seamlessly transfer the index partition – Dynamic creation of cloud instances
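The example rule on this slide ("switch when the query rate increases by X% within the last Y seconds") could be sketched as a sliding-window rate monitor. `SpikeDetector` and its parameters are names invented for this sketch, not CAPS internals.

```python
from collections import deque

class SpikeDetector:
    """Flag a spike when the query rate within a sliding window
    exceeds `factor` times the tracked baseline rate."""

    def __init__(self, window=10.0, factor=2.0):
        self.window, self.factor = window, factor
        self.times = deque()   # arrival timestamps inside the window
        self.baseline = None   # smoothed normal query rate

    def observe(self, now: float) -> bool:
        """Record one query arrival; return True if it triggers a switch."""
        self.times.append(now)
        while self.times and now - self.times[0] > self.window:
            self.times.popleft()
        rate = len(self.times) / self.window
        if self.baseline is None:       # first observation seeds the baseline
            self.baseline = rate
            return False
        spike = rate > self.factor * self.baseline
        if not spike:                   # slowly track the baseline when calm
            self.baseline = 0.9 * self.baseline + 0.1 * rate
        return spike
```

On a spike the decision maker would hand control to the switching implementor; when the rate falls back, responsibility can return to the peer.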
  • 29. CAPS Switching algorithm • Ensures that result quality is not affected • Controlled bandwidth usage at peer
  • 30. Addressing additional concerns • Transparency – Index resides both among peers and in the cloud • Centralized control – Queries can be switched back to peers or to other clouds • Privacy – Only spiking queries (less revealing) are forwarded to the cloud • Cost – Cloud used only transiently for spiking queries • Cloud payment model – Anonymous keyword-based advertising model*
  • 31. CAPS Evaluation • Experimental setup – Target system consists of millions of peers – Implemented the relevant components in a realistic network • Responsible peer, preceding peers, cloud instance • Datasets – Real datasets on queries/corresponding updates (rates) are not publicly available – Used synthetic queries and updates with expected query/update rates/ratios
  • 32. Experimental setup • 6 heterogeneous workstations with 4-6 cores, 8-16GB RAM used
  • 33. Experiments • Two sets of experiments 1. Demonstrate effect of query spike with and without cloud-assistance 2. Effect of switching on user experience • Response time and result quality • Switching time
  • 34. Results-1 [Plots: with cloud assistance vs. without cloud assistance]
  • 35. Results-2 (effect of switching on user experience) • Result freshness • Response time
  • 36. Switching time
  • 37. Conclusions • Peer-to-peer search has many advantages by design compared to centralized search • But peer-to-peer search approaches have scalability issues • A two-layered approach can improve the efficiency and result quality of peer-to-peer search • Offloading queries to the cloud can be an effective method to handle query spikes – Desirable properties of p2p systems are not lost
  • 38. Publications • Janakiram Dharanipragada and Harisankar Haridas, “Stabilizing peer-to-peer systems using public cloud: A case study of peer-to-peer search”, in the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), Munich, Germany. • Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Uladzimir Kharkevich, “Two-layered architecture for peer-to-peer concept search”, in the 4th International Semantic Search Workshop at the 20th Int. World Wide Web Conference (WWW 2011), Hyderabad, India. • Harisankar Haridas, Sriram Kailasam, Prateek Dhawalia, Prateek Shrivastava, Santosh Kumar and Janakiram Dharanipragada, “Vcloud: A Peer-to-peer Video Storage-Compute Cloud”, in the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2012), Delft, The Netherlands [Poster].
  • 39. THANK YOU Questions/Suggestions harisankarh[ at ]gmail.com