A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud”


Published in: Technology

  1. A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud” [1]
     Speakers: Vasileios Komianos, Georgios Tsoumanis, Eleni Moustaka
     Supervisor: Spyridon Sioutas
     Ionian University, Dept. of Informatics, Postgraduate
     For the course: Advanced Topics in Database Systems
  2. The focus of this presentation is a distributed architecture, from now on called the System, for indexing large datasets. Hadoop, MapReduce, HBase and NoSQL databases are terms used often in this presentation, as these are the keystone technologies enabling such tasks.
  3. Why Cloud?
     • Cost
     • Device and Location Independence
     • Virtualization
     • Performance
     • Scalability
     • Infrastructure as a Service
     • Platform as a Service
     • Software as a Service
  4. Why Web Scale?
     • Google
     • Facebook
     • Wikipedia
     • Amazon
     • Internet Archive
  5. Why Distributed?
     • Huge volumes of data
     • Computational problems
     • Failure tolerance
     • Scalability
  6. What Hadoop [2] is
     It is an open-source Java framework capable of distributed processing of large data sets, using a distributed file system called HDFS [3] and the MapReduce [4] model.
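The MapReduce model mentioned on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and the word-count task is chosen only because it is the usual introductory example. In a real cluster the map and reduce calls would run in parallel on different nodes, and the framework would perform the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def run_job(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over in-memory documents."""
    emitted = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

The sketch keeps the three phases as separate functions so that the division of work is visible: only `map_phase` and `reduce_phase` would be written by the user of the framework.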
  7. Hadoop Architecture
     Hadoop consists of:
     • HDFS: NameNode, DataNodes
     • MapReduce: JobTracker, TaskTrackers
     Usually the NameNode is at the same time the JobTracker, and the DataNodes are also TaskTrackers.
  8. What HBase [5] is
     An open-source distributed data store belonging to the well-known category of NoSQL databases. HBase can store large data sets that are structured, semi-structured or unstructured, while also offering rapid query execution.
  9. HBase Architecture
     HBase runs on top of Hadoop and is modelled after Google’s Bigtable [6]. ACIDity* is sacrificed to improve performance and scalability.
     Components: HMaster, Region Servers
     *ACID: Atomicity, Consistency, Isolation and Durability
  10. HBase characteristics
      • NoSQL
      • Schema free
      • Very large tables
      • Scalable
      • Sharding
      • JSON enabled
  11. NoSQL Paradigm: MongoDB [7]
      MongoDB is an easy-to-use NoSQL database. It is free and supported by a large community, which makes it suitable when there is no previous NoSQL experience.
      > db.test.insert({id: "123", name: "Vasileios"})
      > db.test.insert({id: "123", name: "Vasileios"})
      > db.test.insert({Presentation: "NoSQL databases"})
      > db.test.find()
      { "_id" : ObjectId("4fbac827f119ef630e74638d"), "id" : "123", "name" : "Vasileios" }
      { "_id" : ObjectId("4fbac835f119ef630e74638e"), "id" : "123", "name" : "Vasileios" }
      { "_id" : ObjectId("4fbac85df119ef630e74638f"), "Presentation" : "NoSQL databases" }
      >
      Keywords: NoSQL, JSON, schema free
  12. System Architecture
      [Diagram: Datasets are loaded by the Uploader (a MapReduce task) into the Content table; the Indexer (a MapReduce task) builds the Index table; the Client API offers Get and Search.]
      The cluster consists of 1 master and 11 worker nodes, with 66 Mappers and 22 Reducers.
      The dataset is composed of 23GB of structured data, 300GB of semi-structured data and 20GB of unstructured data.
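The uploader and indexer stages on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the row-key format (`row0`, `row1`, ...), the `(attribute, value)` index key and the names `upload`, `index_map` and `build_index` are hypothetical and are not taken from the slides or from the paper [1]. Plain dictionaries stand in for the HBase content and index tables.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

Record = Dict[str, str]
IndexKey = Tuple[str, str]  # (attribute, value); an assumed layout


def upload(records: List[Record]) -> Dict[str, Record]:
    """Uploader stage: store each record in the content table under a row key."""
    return {f"row{i}": record for i, record in enumerate(records)}


def index_map(row_key: str, record: Record,
              indexed_attributes: Set[str]) -> Iterator[Tuple[IndexKey, str]]:
    """Map step of the indexer: emit ((attribute, value), row_key) pairs."""
    for attribute, value in record.items():
        if attribute in indexed_attributes:
            yield (attribute, value), row_key


def build_index(content_table: Dict[str, Record],
                indexed_attributes: Set[str]) -> Dict[IndexKey, List[str]]:
    """Reduce step of the indexer: group row keys under each index key."""
    index: Dict[IndexKey, List[str]] = defaultdict(list)
    for row_key, record in content_table.items():
        for key, emitted_row in index_map(row_key, record, indexed_attributes):
            index[key].append(emitted_row)
    return {key: sorted(rows) for key, rows in index.items()}


if __name__ == "__main__":
    content = upload([
        {"title": "hbase", "lang": "en"},
        {"title": "hadoop", "lang": "en"},
    ])
    print(build_index(content, {"lang"}))
```

Keeping the content table and the index table separate mirrors the slide: a search consults the index to find row keys, and a get then fetches the full record from the content table.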
  13. The experiment
      The purpose was to test the System’s performance under various conditions, such as:
      • several dataset sizes,
      • different dataset types,
      • a varying number of nodes,
      • different index rules.
  14. Index creation time
      The TXT dataset is the most processing-intensive to index.
  15. 5GB HTML dataset index creation time for different index rules
      [Figure: index creation time in minutes (y-axis) per iteration number (x-axis).]
      Iterations: 1) 7 indexed tags, 2) 14, 3) 19, 4) 27
  16. 5GB HTML index size for different index rules
      [Figure: index size in GB (y-axis) per iteration number (x-axis).]
      Iterations: 1) 7 tags indexed (table, li, p, b, i, u, title), 2) 14 tags, 3) 19, 4) 27
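An index rule, as used on the two slides above, selects which HTML tags contribute text to the index. The following is a minimal sketch of that idea using Python's standard `html.parser`. The class `TagTextCollector` and the function `extract` are hypothetical names, and the sketch does not reproduce the System's actual rule format or the measured values from the figures.

```python
from html.parser import HTMLParser
from typing import Dict, List, Set


class TagTextCollector(HTMLParser):
    """Collect text that appears inside a chosen set of HTML tags."""

    def __init__(self, indexed_tags: Set[str]) -> None:
        super().__init__()
        self.indexed_tags = indexed_tags
        self.open_tags: List[str] = []            # stack of open indexed tags
        self.collected: Dict[str, List[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag in self.indexed_tags:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open occurrence of this tag, if there is one.
        for position in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[position] == tag:
                del self.open_tags[position]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            # Attribute the text to the innermost indexed tag.
            self.collected.setdefault(self.open_tags[-1], []).append(text)


def extract(html: str, indexed_tags: Set[str]) -> Dict[str, List[str]]:
    """Apply one index rule (a set of tag names) to one HTML document."""
    parser = TagTextCollector(indexed_tags)
    parser.feed(html)
    parser.close()
    return parser.collected


if __name__ == "__main__":
    page = "<html><head><title>Cloud</title></head><body><p>Hello <b>big</b> data</p></body></html>"
    print(extract(page, {"title"}))
    print(extract(page, {"title", "p", "b"}))
```

A rule with more tags collects more text from the same page, so both the amount of work and the amount of indexed data grow with the number of tags in the rule.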
  17. System performance under query load
      • Client instances were run concurrently on 14 machines, sending queries to the System.
      • Types of queries: exact specific attribute, exact any attribute, range any attribute.
      • Range query loads above 140 queries/sec failed.
      • Tests were run with a load of 14 queries/sec.
      Response time per request:
      • Exact specific attribute queries: 20 ms.
      • Exact any attribute queries: 150 ms.
      • Range any attribute queries: 27 s.
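The three query types listed above can be sketched against a toy in-memory index. This is an assumption-laden illustration, not the System's client API: the index layout (a dictionary keyed by `(attribute, value)`), the function names and the lexicographic string comparison for ranges are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

IndexKey = Tuple[str, str]          # (attribute, value); an assumed layout
Index = Dict[IndexKey, List[str]]   # index key -> row keys in the content table


def exact_specific(index: Index, attribute: str, value: str) -> List[str]:
    """Exact match on one named attribute: a single key lookup."""
    return sorted(index.get((attribute, value), []))


def exact_any(index: Index, value: str) -> List[str]:
    """Exact match on any attribute: every index key must be inspected here."""
    rows: Set[str] = set()
    for (_, indexed_value), row_keys in index.items():
        if indexed_value == value:
            rows.update(row_keys)
    return sorted(rows)


def range_any(index: Index, low: str, high: str) -> List[str]:
    """Range match on any attribute, comparing values as strings (inclusive)."""
    rows: Set[str] = set()
    for (_, indexed_value), row_keys in index.items():
        if low <= indexed_value <= high:
            rows.update(row_keys)
    return sorted(rows)


if __name__ == "__main__":
    toy_index: Index = {
        ("title", "hbase"): ["row0"],
        ("tag", "hbase"): ["row2"],
        ("title", "hadoop"): ["row1"],
    }
    print(exact_specific(toy_index, "title", "hbase"))
    print(exact_any(toy_index, "hbase"))
    print(range_any(toy_index, "hadoop", "hbase"))
```

In this toy version the specific-attribute query touches one key while the other two inspect many entries. That matches the ordering of the reported response times, but the slides do not state the cause, so it should be read as an illustration only.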
  18. References
      [1] Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and Nectarios Koziris: Distributed Indexing of Web Scale Datasets for the Cloud. In MDAC ’10, April 26, 2010, Raleigh, NC, USA.
      [2] http://hadoop.apache.org/
      [3] K. V. Shvachko: HDFS Scalability: The Limits to Growth. The USENIX Magazine, v35 i2, 2010, usenix.org.
      [4] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (January 2008), 107-113.
      [5] Ankur Khetrapal, Vinay Ganesh: HBase and Hypertable for Large Scale Distributed Storage Systems. Dept. of Computer Science, Purdue University.
      [6] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.
      [7] http://www.mongodb.org