W3 S2008 Apache Heart Project Proposal Frederick Haebin Na


Highly Extensible & Accumulative RDF Table, Hadoop, RDF, Distributed

Published in: Education, Technology

  1. 1. Heart Project Proposal Distributed RDF Table & Processing Engine <ul><li>Frederick Haebin Na </li></ul><ul><li>[email_address] </li></ul><ul><li>Heart Project Group </li></ul><ul><li>2008.10.23. </li></ul>
  2. 2. <ul><li>Heart Proposal Overview </li></ul><ul><li>Goals & Objectives </li></ul><ul><li>Backgrounds </li></ul><ul><li>Benefits </li></ul><ul><li>Features </li></ul>
  3. 3. <ul><li>Heart Proposal Overview </li></ul><ul><li>Heart (Highly Extensible & Accumulative RDF Table) aims to provide a planet-scale RDF store and a set of features for processing that data in a distributed manner. Heart is based on Hadoop and HBase, and it aims to be a batch processor, or analyzer, rather than a real-time database. </li></ul><ul><li>Heart will be the heart of Web 3.0, where machines extend human-powered knowledge at a far greater rate than in Web 2.0. As the volume of semantic data grows at this rate, Heart is expected to become very useful in about a decade. Until then, Heart will play a crucial role in experimenting with niche service models. </li></ul><ul><li>Benefits </li></ul><ul><li>Massive Storage & Processor </li></ul><ul><ul><li>Highly Extensible & Accumulative Storage </li></ul></ul><ul><ul><li>Faster Loader/Query Processor/Materializer for Massive RDF Data </li></ul></ul><ul><li>RDF Data Mining Platform </li></ul><ul><ul><li>Knowledge Discovery </li></ul></ul><ul><ul><ul><li>Prediction/Classification/Association </li></ul></ul></ul><ul><li>Semantic Search Platform </li></ul><ul><ul><li>Bulk Pre & Post Processing for Semantic Search </li></ul></ul><ul><li>Features </li></ul><ul><li>Heart Data Loader </li></ul><ul><ul><li>Bulk Triples to HBase </li></ul></ul><ul><li>Heart Storage Manager </li></ul><ul><ul><li>Smart Triples Partitioning </li></ul></ul><ul><li>Heart Query Processor </li></ul><ul><ul><li>Optimized Query for Massive Data </li></ul></ul><ul><li>Heart Data Miner </li></ul><ul><ul><li>Extension to SPARQL for Data Mining </li></ul></ul><ul><li>Heart Data Materializer </li></ul><ul><ul><li>Indexing for Implicit Statements </li></ul></ul><ul><li>Relevant Projects </li></ul><ul><li>Core (Billion Triples) 1) </li></ul><ul><ul><li>Garlik JXT (9.8) </li></ul></ul><ul><ul><li>YARS2 (7) </li></ul></ul><ul><ul><li>BigOWLIM (6.7) </li></ul></ul><ul><ul><li>Jena TDB (1.7) </li></ul></ul><ul><ul><li>Virtuoso (1) </li></ul></ul><ul><li>Applications </li></ul><ul><ul><li>PowerSet – Semantic Search Engine </li></ul></ul><ul><ul><li>A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data, Newman, et al. </li></ul></ul><ul><li>1) http://esw.w3.org/topic/LargeTripleStores </li></ul>
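The "Bulk Triples to HBase" loader idea can be illustrated with a minimal sketch. It assumes a simplified N-Triples input and a subject-keyed row layout with one column per predicate; the slides do not specify the actual table design, and the names `parse_triple` and `load_rows` are hypothetical. A plain Python dictionary stands in for the HBase table, so no HBase API is used here.

```python
import re
from collections import defaultdict

# Simplified N-Triples pattern: IRI subject, IRI predicate, IRI or plain literal object.
# Real N-Triples also allows blank nodes, language tags and datatypes (not handled here).
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*")\s*\.\s*$')


def parse_triple(line):
    """Return (subject, predicate, object) or None if the line does not match."""
    m = TRIPLE_RE.match(line)
    if m is None:
        return None
    s, p, o = m.groups()
    return s, p, o[1:-1]  # strip the surrounding <> or "" from the object


def load_rows(lines):
    """Group triples into rows keyed by subject, one column per predicate.

    This row layout is an assumption made for illustration only.
    """
    rows = defaultdict(lambda: defaultdict(list))
    for line in lines:
        triple = parse_triple(line)
        if triple is None:
            continue  # skip comments, blank lines and unsupported syntax
        s, p, o = triple
        rows[s][p].append(o)
    return rows
```

In a real loader, each entry of `rows` would become one batched write to the store instead of an in-memory dictionary entry.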
  4. 4. <ul><li>Goals & Objectives </li></ul><ul><li>The goals and objectives of Heart are to provide massive RDF data storage and a batch processor for various kinds of RDF data mining. </li></ul><ul><li>Several key problems must be solved to meet the first objective, which has the highest priority over the rest. </li></ul><ul><li>Goals </li></ul><ul><ul><li>To Provide Massive RDF Data Storage & a Batch Processor for Various Kinds of RDF Data Mining </li></ul></ul><ul><li>Key Problems to Be Solved </li></ul><ul><ul><li>Would the sequential-read-centric HBase index be sufficient for the random reads/writes that joins require? </li></ul></ul><ul><ul><li>If not, how can HBase indexes be exploited, or new ones generated, to speed up processing? </li></ul></ul><ul><ul><ul><li>Which index is best suited to semantic search? </li></ul></ul></ul><ul><ul><li>How should the triples be partitioned for efficient joins? (By subject, by predicate, or grouped by named graph) </li></ul></ul><ul><li>Objectives </li></ul><ul><ul><li>Faster Massive Data Processor </li></ul></ul><ul><ul><ul><li>Loader 1) – Better than Garlik JXT </li></ul></ul></ul><ul><ul><ul><li>Query Processor 1) – Better than Garlik JXT </li></ul></ul></ul><ul><ul><li>Highly Extensible & Accumulative RDF Table </li></ul></ul><ul><ul><ul><li>Supports more than 10 billion triples over more than 3,000 computers. </li></ul></ul></ul><ul><ul><li>Extensions for Data Mining </li></ul></ul><ul><ul><ul><li>Full Support for Standard SPARQL </li></ul></ul></ul><ul><ul><ul><li>Machine Learning Extensions 2) </li></ul></ul></ul><ul><li>1) http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/ </li></ul><ul><li>2) http://www.eswc2008.org/final-pdfs-for-web-site/qpI-4.pdf </li></ul>
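The partitioning question on this slide (by subject or by predicate) can be made concrete with a small sketch. It assumes plain hash partitioning over `(subject, predicate, object)` tuples; the function names `partition_of` and `partition_triples` are hypothetical, and the proposal itself leaves the actual strategy open.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_triples(triples, by, num_partitions):
    """Assign each (s, p, o) triple to a partition by hashing one component.

    `by` is either "subject" or "predicate".
    """
    index = {"subject": 0, "predicate": 1}[by]
    parts = defaultdict(list)
    for triple in triples:
        parts[partition_of(triple[index], num_partitions)].append(triple)
    return parts
```

Partitioning by subject keeps every triple about one resource in the same partition, which suits subject-to-subject (star) joins. Partitioning by predicate instead keeps each property together, which suits scans over a single property but spreads one resource across partitions.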
  5. 5. <ul><li>Backgrounds </li></ul><ul><li>More RDF Supporting Services </li></ul><ul><li>Need for Contextual & Specific Search Results </li></ul><ul><li>Proliferation of Various RDF Schemas </li></ul><ul><li>Heart </li></ul><ul><li>Increase in RDF Data </li></ul><ul><li>Refinement of RDF Schemas </li></ul><ul><li>Need for Processing RDF Data </li></ul><ul><li>More and more services are beginning to provide and refine their RDF/RDFS-related features. At the same time, users are asking for more specific and contextual search results. Service providers, in turn, now hold both the data and the schemas needed to process RDF data for their customers’ needs. </li></ul>
  6. 6. <ul><li>Benefits </li></ul><ul><li>Heart provides three benefits: a massive RDF storage/processor, an RDF data mining platform and a semantic search platform. </li></ul><ul><li>Massive RDF Storage & Processor </li></ul><ul><li>Highly extensible and accumulative storage comes from Hadoop and HBase. </li></ul><ul><li>Faster processing of massive RDF data is made possible by the MapReduce model for distributed RDF data processing. </li></ul><ul><li>HBase-based column-oriented partitioning improves performance because fewer joins are needed. </li></ul><ul><li>RDF Data Mining Platform </li></ul><ul><li>Full Support for Standard SPARQL over Massive RDF Data </li></ul><ul><ul><li>Converts SPARQL queries to MapReduce implementations </li></ul></ul><ul><li>Machine Learning Features as SPARQL Extensions 1) </li></ul><ul><ul><li>Prediction </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Association </li></ul></ul><ul><li>Semantic Search Platform </li></ul><ul><li>Provides fundamental features for semantic search. </li></ul><ul><ul><li>Storage & Processor </li></ul></ul><ul><ul><li>Knowledge Discovery by Data Mining </li></ul></ul><ul><li>Massive RDF data can be mined to generate a semantic search index. </li></ul><ul><ul><li>Support for User Defined Index Model </li></ul></ul><ul><li>1) http://www.eswc2008.org/final-pdfs-for-web-site/qpI-4.pdf </li></ul>
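To illustrate what converting a SPARQL query to a MapReduce implementation could look like, here is a minimal in-memory sketch of a reduce-side join. It assumes the two-pattern query `SELECT ?x ?name WHERE { ?x worksFor acme . ?x name ?name }` with abbreviated identifiers, and it simulates the map, shuffle and reduce phases in plain Python rather than on Hadoop. All function names are hypothetical; this is not Heart's actual query translator.

```python
from collections import defaultdict


def map_phase(triples):
    """Emit (join key, tagged value) for every triple matching a query pattern."""
    for s, p, o in triples:
        if p == "worksFor" and o == "acme":
            yield s, ("A", None)   # matches pattern 1: ?x worksFor acme
        elif p == "name":
            yield s, ("B", o)      # matches pattern 2: ?x name ?name


def shuffle(pairs):
    """Group values by key, as the MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Join on ?x: emit (?x, ?name) only when both patterns matched for that key."""
    for x, values in groups.items():
        if any(tag == "A" for tag, _ in values):
            for tag, name in values:
                if tag == "B":
                    yield x, name


def run_query(triples):
    return sorted(reduce_phase(shuffle(map_phase(triples))))
```

Because the join variable `?x` is the subject in both patterns, a single MapReduce pass is enough. Queries that join on different variables would generally need several chained passes.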
  7. 7. <ul><li>Features </li></ul><ul><li>Heart provides five core features: data loader, storage manager, query processor, data miner and data materializer. </li></ul><ul><li>Data Loader </li></ul><ul><li>Fast Bulk Storing & Reasoning </li></ul><ul><li>Bulk Triples into HBase </li></ul><ul><li>Supports Various File Formats </li></ul><ul><li>Storage Manager </li></ul><ul><li>Smart Triples Partitioning </li></ul><ul><li>C-Store with Sequential-Read Centric Processing </li></ul><ul><ul><li>Reduce or Eliminate Random Access </li></ul></ul><ul><li>Query Processor </li></ul><ul><li>Full Standard SPARQL </li></ul><ul><ul><li>Query Conversion to MapReduce Code </li></ul></ul><ul><li>Data Miner </li></ul><ul><li>Machine Learning Extensions </li></ul><ul><ul><li>Prediction </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Association </li></ul></ul><ul><li>Data Materializer </li></ul><ul><li>Indexes for Implicit Statements </li></ul>
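The data materializer's "implicit statements" can be illustrated with a small sketch. It assumes the implicit statements are those derived from two RDFS rules (transitivity of `rdfs:subClassOf`, and type inheritance along `rdfs:subClassOf`); the slides do not say which inference rules Heart would support, and the function name `materialize` is hypothetical. The sketch runs a naive fixed-point loop in memory rather than a distributed job.

```python
def materialize(triples):
    """Return the explicit triples plus those implied by two RDFS rules.

    Rule 1: (A subClassOf B), (B subClassOf C)  =>  (A subClassOf C)
    Rule 2: (x type A), (A subClassOf B)        =>  (x type B)
    """
    facts = set(triples)
    while True:
        subclass_pairs = [(s, o) for s, p, o in facts if p == "rdfs:subClassOf"]
        new = set()
        for a, b in subclass_pairs:
            for s, p, o in facts:
                if p == "rdfs:subClassOf" and s == b:
                    new.add((a, "rdfs:subClassOf", o))   # rule 1
                if p == "rdf:type" and o == a:
                    new.add((s, "rdf:type", b))          # rule 2
        if new <= facts:
            return facts  # fixed point reached: nothing new was derived
        facts |= new
```

Storing the derived triples alongside the explicit ones means a query for all instances of a superclass needs no inference at query time, at the cost of extra storage and recomputation when the schema changes.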
  8. 8. <ul><li>Thank you. </li></ul>