3. Contributors
Working on the code
Using the code
Jorge Quiane
Zoi Kaoudi
Ioana Manolescu
François Goasdoué
Stamatis Zampetakis
Benjamin Djahandideh
4. Some details
● https://gforge.inria.fr/scm/viewvc.php/hadoop/cliquesquare/?root=xmlinthecloud
● 129 classes, 10k lines of code
Packages related to storing RDF data:
– fr.inria.oak.cliquesquare.partitioner.*
– two main packages:
● cliquesquare.partitioner.simple → useful for small experiments
● cliquesquare.partitioner.skewed → better performance for large datasets (>100 GB)
5. Storing RDF data
● Input: RDF N-Triples
● Output : files over HDFS partitioned by triple attribute (replication factor 3)
● Runs from command line now; GUI in development
● Functionality:
– take an RDF dataset (N-Triples format)
– filter duplicates
– spread the data over the HDFS nodes (custom partitioning)
– at the end, the RDF data is replicated by a factor of 3
● Aim of custom partitioning: reduce the data shuffled across the network
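The first two steps above (read N-Triples, filter duplicates) can be sketched in plain Java. This is a minimal in-memory illustration, not the CliqueSquare partitioner code: the class name NTriplesLoadSketch and both method names are hypothetical, the parser is deliberately simplified (no syntax validation), and a real job over HDFS would deduplicate in a distributed way rather than in a single set.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class NTriplesLoadSketch {

    /**
     * Splits one N-Triples line into {subject, predicate, object}.
     * Simplified: strips the trailing " ." and splits on whitespace,
     * keeping any spaces inside the object (e.g. a literal) intact.
     */
    static String[] parseLine(String line) {
        String s = line.trim();
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        String[] parts = s.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not a triple: " + line);
        }
        return parts;
    }

    /**
     * Removes exact duplicate lines, skipping blank lines and comments.
     * The first occurrence of each triple is kept, in input order.
     */
    static List<String> dedup(List<String> lines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue;
            }
            seen.add(t);
        }
        return new ArrayList<>(seen);
    }
}
```

Duplicates are detected here by exact string equality after trimming, so two spellings of the same triple (for example with different internal whitespace) would both be kept.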
6. CliqueSquare partitioning strategy
● input triples → hashed into key-value pairs
– Key: subject / predicate / object
– Value: triple
● same key → same node
● the file name specifies the type of hash key used (-S, -P or -O)
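The strategy above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the project's Hadoop code: it assumes every triple is placed once per attribute (subject, predicate, object), it uses Java's String.hashCode() modulo the node count as the hash function, and the names TriplePartitionSketch, nodeFor, fileName and the "nodeN-S" naming scheme are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of hash partitioning by triple attribute.
class TriplePartitionSketch {

    /** A parsed triple; field names are illustrative. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Equal keys always map to the same node index in [0, numNodes). */
    static int nodeFor(String key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    /** Assumed naming scheme: node index plus -S, -P or -O. */
    static String fileName(int node, char attribute) {
        return "node" + node + "-" + attribute;
    }

    /** Places each triple three times: keyed by subject, predicate and object. */
    static Map<String, List<Triple>> partition(List<Triple> triples, int numNodes) {
        Map<String, List<Triple>> files = new LinkedHashMap<>();
        for (Triple t : triples) {
            place(files, t, t.subject, 'S', numNodes);
            place(files, t, t.predicate, 'P', numNodes);
            place(files, t, t.object, 'O', numNodes);
        }
        return files;
    }

    private static void place(Map<String, List<Triple>> files, Triple t,
                              String key, char attribute, int numNodes) {
        String name = fileName(nodeFor(key, numNodes), attribute);
        files.computeIfAbsent(name, k -> new ArrayList<>()).add(t);
    }
}
```

Under these assumptions, all triples sharing a subject end up in the same -S file on the same node, so a join on that subject can be evaluated locally without shuffling data across the network.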
7. ● Dependency: Hadoop
● Project development: Maven, JUnit
● Known bugs:
– no data cleaning
● ToDo:
– branched version with value indexing for files
– requires redoing the partitioning code