Addressing scalability challenges in peer-to-peer search: Two-layered semantic search and cloud assistance to handle query spikes

  1. Addressing scalability challenges in peer-to-peer search. PhD seminar, 4 Feb 2014. Harisankar H, PhD scholar, DOS Lab, Dept. of CSE. Advisor: Prof. D. Janakiram. http://harisankarh.wordpress.com
  2. Outline
     • Issues with centralized search
       – Can peer-to-peer search help?
     • Scalability challenges in peer-to-peer search
     • Proposed architectural extensions
       – Two-layered architecture for peer-to-peer concept search
       – Cloud-assisted approach to handle query spikes
  3. Centralized search scenario
     • Search engines crawl available content, then index and maintain it in data centers
     • User queries are directed to the data centers, processed internally, and results are sent back
     • Centrally managed by a single company
  4. Some issues with centralized search
     • Privacy concerns
       – All user queries are accessible from a single location
     • Centralized control
       – Individual companies decide what to (and not to) index, how to rank, etc.
     • Transparency
       – Complete details of ranking, pre-processing, etc. are not made publicly available
       – Concerns about censorship and doctoring of results
  5. Some issues with centralized search (contd.)
     • Uses mostly syntactic search techniques
       – Based on words or multi-word phrases
       – Low quality of results due to the ambiguity of natural language
     • Issues with centralized semantic search
       – Difficult to capture the long tail of users' niche interests
       – Requires rich, human-generated knowledge bases in numerous niche areas
  6. Peer-to-peer search approach
     • Edge nodes on the internet participate in both providing and using the search service
     • Search as a collaborative service
     • Crawling, indexing, and search are distributed across the peers
  7. How could peer-to-peer search help?
     • Each user query can be sent to a different peer among millions
       – Obtaining query logs in a single location becomes difficult
       – Reduced privacy concerns
     • Control is distributed across numerous peers
       – Avoids centralized control
     • The search application is available on all peers
       – Better transparency in ranking, etc.
     • Background knowledge of peers can be utilized for effective semantic search
       – Can help improve the quality of results
     • Has led to substantial academic research in the area as well as real-world p2p search engines*
     (*e.g., faroo.com, yacy.net; the YaCyPi Kickstarter project)
  8. Realizing peer-to-peer search
     • Distribution of the search index
       – Term partitioning: responsibility for individual terms is assigned to different peers
         (e.g., peer 1 is currently responsible for the term "computer")
       – Term-to-peer mapping achieved through a structured overlay (e.g., a DHT)
     (Image src: http://wwarodomfr.blogspot.in/2008/09/chord-routing.html)
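The term-to-peer mapping described above can be sketched with a Chord-style identifier ring: both terms and peers are hashed onto the same id space, and a term is owned by the first peer whose id follows it. This is a minimal illustrative sketch (peer names and the 32-bit ring size are assumptions, not details from the deck):

```python
import hashlib
from bisect import bisect_left

def key_id(s, bits=32):
    """Hash a term or peer name onto the identifier ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** bits)

class Ring:
    """Chord-style ring: each term is owned by its successor peer."""
    def __init__(self, peers):
        self.ids = sorted(key_id(p) for p in peers)
        self.by_id = {key_id(p): p for p in peers}

    def owner(self, term):
        """Peer responsible for a term: first peer id >= hash(term), wrapping around."""
        i = bisect_left(self.ids, key_id(term)) % len(self.ids)
        return self.by_id[self.ids[i]]

ring = Ring([f"peer-{n}" for n in range(8)])
print(ring.owner("computer"))   # deterministic owner for the term
```

A real DHT resolves `owner()` with O(log N) routing hops instead of a local table, but the ownership rule is the same.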
  9. Scalability challenges in peer-to-peer search
     • Peers share only idle resources
     • Peers join/leave autonomously
     • Individual resources are limited
     • No SLA
     These lead to:
       – Peer bandwidth bottleneck during query processing
         • Particularly for queries involving multiple terms (index transfer between multiple peers)
       – Instability during query spikes
       – Knowledge management issues at large scale
         • Difficult to reach consensus at large scale
         • The knowledge needs wide understanding and must meet the requirements of a large, diverse group
  10. Two-layered architecture for peer-to-peer concept search*
     • Peers are organized into communities based on common interests
     • Each community maintains its own background knowledge for use in semantic search
       – Maintained in a distributed manner
     • A global layer with aggregate information facilitates search across communities
     • Community background knowledge bases extend a minimal, universally accepted knowledge base in the upper layer
     • Search, indexing, and knowledge management proceed independently in each community
     (*Joint work with Prof. Fausto Giunchiglia and Uladzimir Kharkevich, Univ. of Trento)
  11. Two-layered architecture for peer-to-peer concept search
     (Diagram: a global layer holding the community index and universal knowledge (UK), above communities 1-3, each with its own background knowledge (BK-1/2/3) and document index)
  12. Two-layered architecture
     • Global layer: retrieves the communities relevant to a query, based on universal knowledge
     • Community layer: retrieves the documents relevant to a query, based on the community's background knowledge
  13. Overcoming the shortcomings of single-layered approaches
     • Search can be scoped to only the communities relevant to a query
       – Results in fewer bandwidth-related issues
     • Two layers make knowledge management scalable and interoperable
       – Niche interests supported by community-level background knowledge bases
       – Minimal universal knowledge for interoperability
     • Search within a community is based on that community's background knowledge
       – The focused interest of a community helps produce a better term-to-concept mapping
  14. Two-layered approach
     • Index partitioning
       – Uses partition-by-term: the posting list for each term is stored on a different peer
       – Uses a Distributed Hash Table (DHT) to realize dynamic term-to-peer mapping
         • O(log N) hops per lookup
     • Overlay network
       – Communities and the global layer are maintained using a two-layered overlay
         • Based on our earlier work on computational grids*
       – O(log N) hops per lookup even with two layers
     (*M.V. Reddy, A.V. Srinivas, T. Gopinath, and D. Janakiram, "Vishwa: A reconfigurable P2P middleware for Grid Computations," ICPP '06)
  15. Two-layered approach
     • Community management
       – Similar to public communities on Flickr, Facebook, etc.
     • Search within a community
       – Uses Concept Search* as the underlying semantic search scheme
         • Extends syntactic search with available knowledge to realize semantic search
         • Falls back to syntactic search when no knowledge is available
     (*Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu, "Concept search", ESWC 2009)
  16. Two-layered approach
     • Knowledge representation
       – Term-to-concept mapping
       – Concept hierarchy: concept relations expressed as subsumption relations
     • Concepts in documents/queries are extracted
       – By analyzing words and natural-language phrases
       – Noun phrases are translated into conjunctions of atomic concepts (complex concepts)
         • E.g., small-1 ∧ dog-2
     • Documents/queries are represented as enumerated sequences of complex concepts
       – E.g., 1: small-1 ∧ dog-2, 2: big-1 ∧ animal-3
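The representation above can be sketched directly: a complex concept is a set of atomic concept ids, and a document is an enumerated sequence of such sets. The sense ids follow the slide's examples; the `TERM_TO_CONCEPT` table itself is an invented toy stand-in for the real term-to-concept mapping:

```python
# Toy term-to-concept mapping (the sense ids match the slide's examples)
TERM_TO_CONCEPT = {"small": "small-1", "dog": "dog-2", "big": "big-1", "animal": "animal-3"}

def phrase_to_complex_concept(phrase):
    """Translate a noun phrase into a conjunction (set) of atomic concepts."""
    return frozenset(TERM_TO_CONCEPT[w] for w in phrase.split() if w in TERM_TO_CONCEPT)

def represent(phrases):
    """A document or query: an enumerated sequence of complex concepts."""
    return {i: phrase_to_complex_concept(p) for i, p in enumerate(phrases, start=1)}

doc = represent(["small dog", "big animal"])
# doc[1] == {"small-1", "dog-2"};  doc[2] == {"big-1", "animal-3"}
```

Words with no entry in the mapping simply drop out of the conjunction, which mirrors the fallback to syntactic handling when no knowledge is available.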
  17. Two-layered approach
     • Relevance model
       – Documents containing concepts more specific than the query concepts are considered relevant
         • E.g., poodle-1 is relevant when searching for dog-2
       – Ranking done by extending the tf-idf relevance model
         • Also incorporates term-concept and concept-concept similarities
     • Distributed knowledge maintenance
       – Each atomic concept is indexed on the DHT by its id
       – The node responsible for an atomic concept id also stores the ids of
         • All immediately more specific atomic concepts
         • All concepts on the path from that atomic concept to the root
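The subsumption-based relevance test above amounts to an ancestor check in the concept hierarchy. A minimal sketch, with an invented toy hierarchy (`PARENT` maps each atomic concept to its immediately more general concept):

```python
# Toy concept hierarchy: child -> parent (subsumption), invented for illustration
PARENT = {"poodle-1": "dog-2", "dog-2": "animal-3", "cat-1": "animal-3"}

def is_more_specific(concept, query_concept):
    """True if `concept` equals or lies below `query_concept` in the hierarchy."""
    while concept is not None:
        if concept == query_concept:
            return True
        concept = PARENT.get(concept)  # walk up toward the root
    return False

assert is_more_specific("poodle-1", "dog-2")   # a poodle is a dog
assert not is_more_specific("cat-1", "dog-2")  # a cat is not a dog
```

Storing, at each concept's responsible node, the ids of its immediate children and its path to the root (as the slide describes) lets a peer answer exactly this kind of query without a global view of the hierarchy.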
  18. Two-layered approach
     • Document indexing and search
       – Concepts are mapped to peers using the DHT
       – A query is routed to the peers responsible for the query concepts and related concepts
       – Results from multiple peers are combined to give the final results
     • Global search
       – The popularity (document frequency) of each concept is indexed in the upper layer
       – tf-idf is extended with universal knowledge to search for communities
       – Combined score of a doc = (score of its community) × (score of the doc within its community)
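The score combination above is a simple product of the two layers' scores. A minimal sketch with placeholder community names and scores (the numbers are invented, not experimental values):

```python
# Placeholder scores: global layer ranks communities, community layer ranks docs
community_score = {"dogs": 0.8, "cars": 0.2}
doc_score_within = {("dogs", "doc-7"): 0.5, ("cars", "doc-3"): 0.9}

def combined_score(community, doc):
    """Global score = community relevance x within-community relevance."""
    return community_score[community] * doc_score_within[(community, doc)]

ranked = sorted(doc_score_within, key=lambda cd: combined_score(*cd), reverse=True)
# "dogs"/doc-7 scores 0.8 * 0.5 = 0.40; "cars"/doc-3 scores 0.2 * 0.9 = 0.18
```

Note that a strong document in a weakly matching community can still lose to a moderate document in a strongly matching one, which is what lets the global layer prune irrelevant communities early.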
  19. Experiments
     • Single-layer syntactic vs. semantic: TREC ad-hoc, TREC-8 (simulated with 10,000 peers)
       – WordNet as the knowledge base
     • Single layer vs. two layers: 18 communities (docs: categories in dMoz*)
       – 18 × 1,000 = 18,000 peers simulated
       – UK = domain-independent concepts and relations from WordNet
       – BK = UK + WordNet Domains + YAGO
       – BK mapped to communities
       – Queries selected as the directory path to a specific subdirectory
       – Ground-truth result: the documents in that subdirectory
     (*http://www.dmoz.org/)
  20. Experiments
     • Tools: GATE (NLP), Lucene (search library), PeerSim (peer-to-peer system simulator)
     • Performance metrics
       – Quality: precision@10, precision@20, mean average precision (MAP)
       – Network bandwidth: average number of postings transferred
       – Response time: s-postings, s-hops
  21. Results (single-layer syntactic vs. semantic)
     • Quality improved
     • But cost also increased
  22. Results (single layer vs. two layers)
     • Quality improved
     • Cost decreased
       – 94% decrease in posting transfers in the optimized case
  23. Two-layered approach: results
     • The proposed approach gives better quality and performance than single-layered approaches
       – Performance can be further improved using optimizations like early termination
     • But the issue of query spikes remains
  24. Query spikes in peer-to-peer search
     • Query spikes can lead to instability
       – Replication/caching is insufficient due to the high document creation rate*
     (*The rate of queries related to "Bin Laden" on Google increased 10,000-fold within one hour on May 1, 2011, after Operation Geronimo.)
  25. Some background
     • Term-partitioned search
       – Responsibility for a term or popular query is assigned to an individual peer
         • Updates and queries are sent to the responsible peer, which processes them
       – Term-to-peer mapping is done using a Distributed Hash Table (DHT)
     (Diagram: the responsible peer maintains the top-k result list of query q)
  26. Cloud-assisted p2p search (CAPS)
     • Offload responsibility for spiking queries to a public cloud
  27. Issues in realizing CAPS
     • Maintaining a full index copy in the cloud is very expensive
       – Storage alone would cost more than 5 million dollars per month*
     • Approach: transfer only the relevant index portion to the cloud
       – Must be performed fast, given the effect on user experience (result quality, response time)
     • Effect on the desirable properties of peer-to-peer search
       – Privacy, transparency, decentralized control, etc.
  28. CAPS components
     • Switching decision maker
       – Decides when to switch
       – Simple example: "switch when the query rate increases by X% within the last Y seconds"
     • Switching implementor
       – Switching algorithm to seamlessly transfer the index partition
       – Dynamic creation of cloud instances
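The simple switching rule quoted above ("switch when the query rate increases by X% within the last Y seconds") can be sketched as a sliding-window rate comparison. The two-window design and the default thresholds are assumptions for illustration, not the paper's implementation:

```python
from collections import deque

class SpikeDetector:
    """Switch when the query rate in the last `window_secs` seconds exceeds
    the rate in the preceding window by more than `threshold_pct` percent."""

    def __init__(self, window_secs=60.0, threshold_pct=200.0):
        self.window = window_secs
        self.threshold = threshold_pct
        self.events = deque()  # query timestamps, oldest first (spans 2 windows)

    def record(self, ts):
        """Log one query arrival; discard timestamps older than two windows."""
        self.events.append(ts)
        while self.events and self.events[0] < ts - 2 * self.window:
            self.events.popleft()

    def should_switch(self, now):
        """Compare the rate in [now - window, now] against the window before it."""
        recent = sum(1 for t in self.events if t >= now - self.window)
        previous = len(self.events) - recent
        if previous == 0:
            return False  # no baseline yet
        return recent >= previous * (1 + self.threshold / 100.0)

d = SpikeDetector(window_secs=10, threshold_pct=100)  # "switch on a 2x jump"
for t in range(5):
    d.record(float(t))            # 5 queries in the first window
for i in range(20):
    d.record(10.0 + i * 0.5)      # 20 queries in the next window: a spike
print(d.should_switch(20.0))      # the jump exceeds 100%, so switch
```

In CAPS this decision would trigger the switching implementor, which then transfers the relevant index partition and spins up cloud instances.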
  29. CAPS switching algorithm
     • Ensures that result quality is not affected
     • Controlled bandwidth usage at the peer
  30. Addressing additional concerns
     • Transparency: the index resides both among the peers and in the cloud
     • Centralized control: queries can be switched back to the peers or to other clouds
     • Privacy: only spiking queries (less revealing) are forwarded to the cloud
     • Cost: the cloud is used only transiently, for spiking queries
     • Cloud payment model: anonymous keyword-based advertising model*
  31. CAPS evaluation
     • Experimental setup
       – Target system consists of millions of peers
       – Implemented the relevant components in a realistic network: responsible peer, preceding peers, cloud instance
     • Datasets
       – Real datasets of queries and corresponding update rates are not publicly available
       – Used synthetic queries and updates with the expected query/update rates and ratio
  32. Experimental setup
     • 6 heterogeneous workstations with 4-6 cores and 8-16 GB RAM
  33. Experiments
     • Two sets of experiments:
       1. Demonstrate the effect of a query spike with and without cloud assistance
       2. Effect of switching on user experience
          • Response time and result quality
          • Switching time
  34. Results 1
     (Graphs: system behavior with cloud assistance vs. without cloud assistance)
  35. Results 2 (effect of switching on user experience)
     • Result freshness
     • Response time
  36. Switching time
  37. Conclusions
     • Peer-to-peer search has many advantages by design compared to centralized search
     • But peer-to-peer search approaches have scalability issues
     • A two-layered approach can improve the efficiency and result quality of peer-to-peer search
     • Offloading queries to the cloud can be an effective method to handle query spikes
       – The desirable properties of p2p systems are not lost
  38. Publications
     • Janakiram Dharanipragada and Harisankar Haridas, "Stabilizing peer-to-peer systems using public cloud: A case study of peer-to-peer search", 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), Munich, Germany.
     • Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Uladzimir Kharkevich, "Two-layered architecture for peer-to-peer concept search", 4th International Semantic Search Workshop, co-located with the 20th Int. World Wide Web Conference (WWW 2011), Hyderabad, India.
     • Harisankar Haridas, Sriram Kailasam, Prateek Dhawalia, Prateek Shrivastava, Santosh Kumar and Janakiram Dharanipragada, "Vcloud: A peer-to-peer video storage-compute cloud", 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2012), Delft, The Netherlands [Poster].
  39. THANK YOU. Questions/Suggestions: harisankarh[ at ]gmail.com
