Quadrupling your elephants - RDF and the Hadoop ecosystem

Rob Vesse, Software Engineer at YarcData
RDF and the Hadoop Ecosystem 
Rob Vesse 
Twitter: @RobVesse 
Email: rvesse@apache.org

- Software Engineer at YarcData (part of Cray Inc)
  - Working on big data analytics products
- Active open source contributor, primarily to RDF & SPARQL related projects
  - Apache Jena Committer and PMC Member
  - dotNetRDF Lead Developer
- Primarily interested in RDF, SPARQL and Big Data Analytics technologies

- What's missing in the Hadoop ecosystem?
- What's needed to fill the gap?
- What's already available?
  - Jena Hadoop RDF Tools
  - GraphBuilder
  - Other Projects
- Getting Involved
- Questions

Apache, the projects and their logos shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries

- No first class projects
- Some limited support in other projects
  - E.g. Giraph supports RDF by bridging through the TinkerPop stack
- Some existing external projects
  - Lots of academic proofs of concept
  - Some open source efforts, but mostly task specific
    - E.g. Infovore, targeted at creating curated Freebase and DBpedia datasets

- Need to efficiently represent RDF concepts as Writable types
  - Nodes, Triples, Quads, Graphs, Datasets, Query Results etc.
  - What's the minimum viable subset?
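
To make the idea of a Writable type concrete, here is a minimal sketch of a triple that serializes itself in the style of Hadoop's Writable contract (write to a DataOutput, read back from a DataInput). It uses only the JDK: SimpleWritable is a local stand-in for Hadoop's Writable interface, and SimpleTripleWritable with its plain String fields is a hypothetical illustration, not the Jena implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the two methods of Hadoop's Writable interface
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical triple holder; real RDF nodes are richer than plain strings
final class SimpleTripleWritable implements SimpleWritable {
    String subject = "";
    String predicate = "";
    String object = "";

    SimpleTripleWritable() { }

    SimpleTripleWritable(String s, String p, String o) {
        this.subject = s;
        this.predicate = p;
        this.object = o;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(subject);
        out.writeUTF(predicate);
        out.writeUTF(object);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        subject = in.readUTF();
        predicate = in.readUTF();
        object = in.readUTF();
    }

    // Serializes to bytes and reads the bytes back into a fresh instance
    static SimpleTripleWritable roundTrip(SimpleTripleWritable t) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            t.write(new DataOutputStream(bytes));
            SimpleTripleWritable copy = new SimpleTripleWritable();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A framework that only knows the two interface methods can move such values between map and reduce tasks, which is why the choice of which RDF concepts get a Writable form matters.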

- Need to be able to get data in and out of RDF formats
  - Without this we can't use the power of the Hadoop ecosystem to do useful work
- Lots of serializations out there:
  - RDF/XML
  - Turtle
  - NTriples
  - NQuads
  - JSON-LD
  - etc.
- Also would like to be able to produce end results as RDF

- Map/Reduce building blocks
  - Common operations e.g. splitting
  - Enable developers to focus on their applications
- User friendly tooling
  - i.e. non-programmer tools

- Set of modules that are part of the Apache Jena project
  - Originally developed at Cray and donated to the project earlier this year
- Experimental modules on the hadoop-rdf branch of our
- Currently only available as development SNAPSHOT releases
  - Group ID: org.apache.jena
  - Artifact IDs:
    - jena-hadoop-rdf-common
    - jena-hadoop-rdf-io
    - jena-hadoop-rdf-mapreduce
  - Latest Version: 0.9.0-SNAPSHOT
- Aims to fulfill all the basic requirements for enabling RDF on Hadoop
- Built against Hadoop Map/Reduce 2.x APIs

- Provides the Writable types for RDF primitives
  - NodeWritable
  - TripleWritable
  - QuadWritable
  - NodeTupleWritable
- All backed by RDF Thrift
  - A compact binary serialization for RDF using Apache Thrift
  - See http://afs.github.io/rdf-thrift/
  - Extremely efficient to serialize and deserialize
  - Allows for efficient WritableComparator implementations that perform binary comparisons
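
The benefit of binary comparison is that keys can be sorted during the shuffle without deserializing them. The sketch below shows the general technique: comparing two serialized records byte by byte, using the same parameter shape as the raw compare method in Hadoop's RawComparator interface. RawBytesComparator is a hypothetical JDK-only stand-in, and the unsigned lexicographic ordering is an assumption for illustration; the ordering the Jena comparators actually use is not stated in the slides.

```java
// Hypothetical stand-in with the same parameter shape as Hadoop's raw compare method
final class RawBytesComparator {

    // Compares two serialized records in place: b1[s1 .. s1+l1) against b2[s2 .. s2+l2)
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            // Mask to treat each byte as unsigned (0..255)
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal, so the shorter record sorts first
        return l1 - l2;
    }
}
```

No objects are allocated per comparison, which matters when a sort performs a very large number of comparisons.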

- Provides InputFormat and OutputFormat implementations
  - Supports most formats that Jena supports
  - Designed to be extendable with new formats
- Will split and parallelize inputs where the RDF serialization is amenable to this
- Also transparently handles compressed inputs and outputs
  - Note that compression blocks splitting
  - i.e. trade off between IO and parallelism
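
A line-oriented serialization such as NTriples is amenable to splitting because a reader dropped at an arbitrary byte offset can resynchronize at the next newline. The sketch below illustrates one common convention, in which each line belongs to the split containing its first byte. LineSplits and readSplit are hypothetical names, and this is not necessarily how the Jena input formats assign lines to splits. A compressed stream generally offers no such resynchronization point, which is why compression blocks splitting.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineSplits {

    // Returns the lines whose first byte lies in [start, end) of the given data
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // A split that begins mid-line leaves that line to the previous split
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step over the newline
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            // The last line of a split may extend past 'end'; it is still read in full
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }
}
```

With this rule, adjacent splits can be processed in parallel and every line is read exactly once, wherever the split boundaries fall.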

- Various reusable building block Mapper and Reducer implementations:
  - Counting
  - Filtering
  - Grouping
  - Splitting
  - Transforming
- Can be used as-is to do some basic Hadoop tasks or used as building blocks for more complex tasks

- For NTriples inputs, compared performance of a Text based node count versus an RDF based node count
- Performance as good (within 10%) and sometimes significantly better
  - Heavily dataset dependent
  - Varies considerably with cluster setup
  - Also depends on how the input is processed
  - YMMV!
- For other RDF formats you would struggle to implement this at all
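
For a sense of what a text based node count involves, the sketch below counts nodes by splitting NTriples lines as plain strings. How the text based variant in the comparison was actually written is not stated, so TextNodeCount is a hypothetical, JDK-only illustration. It depends on NTriples holding one triple per line, which is why the same approach would not carry over to formats such as Turtle or RDF/XML.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TextNodeCount {

    // Counts how often each node string occurs across the given NTriples lines
    static Map<String, Long> count(List<String> ntriplesLines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : ntriplesLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line or comment
            }
            // Subject and predicate contain no spaces; the object (which may be
            // a literal containing spaces) is whatever remains
            String[] parts = trimmed.split("\\s+", 3);
            if (parts.length < 3) {
                continue; // malformed line, skipped in this sketch
            }
            String object = parts[2];
            // Drop the '.' that terminates the triple
            if (object.endsWith(".")) {
                object = object.substring(0, object.length() - 1).trim();
            }
            counts.merge(parts[0], 1L, Long::sum);
            counts.merge(parts[1], 1L, Long::sum);
            counts.merge(object, 1L, Long::sum);
        }
        return counts;
    }
}
```

The string handling does no validation and treats nodes as opaque text, whereas an RDF based count works on parsed nodes.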

- Originally developed by Intel
  - Some contributions by Cray - awaiting merging at time of writing
- Open source under Apache License
  - https://github.com/01org/graphbuilder/tree/2.0.alpha
  - 2.0.alpha is the Pig based branch
- Allows data to be transformed into graphs using Pig scripts
  - Provides set of Pig UDFs for translating data to graph formats
  - Supports both property graphs and RDF graphs

-- propertyGraph is assumed to have been loaded earlier in the script
-- Declare our mappings
propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*,
    [ 'idBase' # 'http://example.org/instances/',
      'base' # 'http://example.org/ontology/',
      'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
      'propertyMap' # [ 'type' # 'a',
                        'name' # 'foaf:name',
                        'age' # 'foaf:age' ],
      'idProperty' # 'id' ]);
-- Convert to NTriples
rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
-- Write out NTriples
STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();

- Uses a declarative mapping based on Pig primitives
  - Maps and Tuples
- Have to be explicitly joined to the data because Pig UDF constructors only accept String arguments
  - Has some benefits e.g. conditional mappings
- RDF Mappings operate on Property Graphs
  - Requires original data to be mapped to a property graph first
  - Direct mapping to RDF is a future enhancement that has yet to be implemented

- Infovore - Paul Houle
  - https://github.com/paulhoule/infovore/wiki
  - Cleaned and curated Freebase datasets processed with Hadoop
- CumulusRDF - Institute of Applied Informatics and Formal Description Methods
  - https://code.google.com/p/cumulusrdf/
  - RDF store backed by Apache Cassandra

- Please start playing with these projects
- Please interact with the community:
  - dev@jena.apache.org
  - What works?
  - What is broken?
  - What is missing?
- Contribute
  - Apache projects are ultimately driven by the community
  - If there's a feature you want please suggest it
  - Or better still contribute it yourself!

Questions?
Personal Email: rvesse@apache.org
Jena Mailing List: dev@jena.apache.org

> bin/hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar \
    org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output \
    --input-type triples /user/input
- --node-count requests the Node Count statistics be calculated
- Assumes mixed quads and triples input if no --input-type specified
  - Using this for triples only data can skew statistics
  - e.g. can result in high node counts for default graph node
  - Hence we explicitly specify input as triples

> ./pig -x local examples/property_graphs_and_rdf.pig
> cat /tmp/rdf_triples/part-m-00000
- Running in local mode for this demo
- Output goes to /tmp/rdf_triples
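
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
// NodeWritable, TripleWritable and AbstractNodeTupleWritable are the Writable types from the
// Jena Hadoop RDF modules; Triple is Jena's triple class (their imports are omitted here)

// Emits (node, 1) for every node of every tuple it sees
public abstract class AbstractNodeTupleNodeCountMapper<TKey, TValue, T extends AbstractNodeTupleWritable<TValue>>
        extends Mapper<TKey, T, NodeWritable, LongWritable> {

    // Reused for every output record rather than allocating a new LongWritable each time
    private LongWritable initialCount = new LongWritable(1);

    @Override
    protected void map(TKey key, T value, Context context) throws IOException, InterruptedException {
        NodeWritable[] ns = this.getNodes(value);
        for (NodeWritable n : ns) {
            context.write(n, this.initialCount);
        }
    }

    // Extracts the nodes to be counted from a tuple
    protected abstract NodeWritable[] getNodes(T tuple);
}

// Triple specific implementation (a separate source file): counts subject, predicate and object
public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> {

    @Override
    protected NodeWritable[] getNodes(TripleWritable tuple) {
        Triple t = tuple.get();
        return new NodeWritable[] { new NodeWritable(t.getSubject()), new NodeWritable(t.getPredicate()),
                new NodeWritable(t.getObject()) };
    }
}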

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the per-node counts emitted by the mapper
public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> {

    @Override
    protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}

// Fragment from a job factory method; config, inputPaths and outputPath are assumed to be
// supplied by the enclosing method
Job job = Job.getInstance(config);
job.setJarByClass(JobFactory.class);
job.setJobName("RDF Triples Node Usage Count");

// Map/Reduce classes
job.setMapperClass(TripleNodeCountMapper.class);
job.setMapOutputKeyClass(NodeWritable.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(NodeCountReducer.class);

// Input and Output
job.setInputFormatClass(TriplesInputFormat.class);
job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
FileInputFormat.setInputPaths(job, StringUtils.arrayToString(inputPaths));
FileOutputFormat.setOutputPath(job, new Path(outputPath));

return job;

- https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
1 of 40

Recommended

Sempala - Interactive SPARQL Query Processing on Hadoop by
Sempala - Interactive SPARQL Query Processing on HadoopSempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on HadoopAlexander Schätzle
1.6K views21 slides
Apache Jena Elephas and Friends by
Apache Jena Elephas and FriendsApache Jena Elephas and Friends
Apache Jena Elephas and FriendsRob Vesse
2.4K views41 slides
Practical SPARQL Benchmarking Revisited by
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedRob Vesse
1.1K views35 slides
Semantic Integration with Apache Jena and Stanbol by
Semantic Integration with Apache Jena and StanbolSemantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and StanbolAll Things Open
3.5K views27 slides
Spark r under the hood with Hossein Falaki by
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiDatabricks
1.3K views27 slides
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass... by
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
4.8K views42 slides

More Related Content

What's hot

Apache Spark MLlib 2.0 Preview: Data Science and Production by
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
13.9K views22 slides
Pandas UDF and Python Type Hint in Apache Spark 3.0 by
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Databricks
525 views38 slides
Scalable Data Science with SparkR by
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkRDataWorks Summit
2.9K views53 slides
Performant data processing with PySpark, SparkR and DataFrame API by
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
3.4K views26 slides
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph by
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache GiraphAvery Ching
2.1K views32 slides
Keeping Spark on Track: Productionizing Spark for ETL by
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
4.2K views33 slides

What's hot(20)

Apache Spark MLlib 2.0 Preview: Data Science and Production by Databricks
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks13.9K views
Pandas UDF and Python Type Hint in Apache Spark 3.0 by Databricks
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
Databricks525 views
Scalable Data Science with SparkR by DataWorks Summit
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkR
DataWorks Summit2.9K views
Performant data processing with PySpark, SparkR and DataFrame API by Ryuji Tamagawa
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa3.4K views
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph by Avery Ching
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
Avery Ching2.1K views
Keeping Spark on Track: Productionizing Spark for ETL by Databricks
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks4.2K views
Learn Apache Spark: A Comprehensive Guide by Whizlabs
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs591 views
Sparkly Notebook: Interactive Analysis and Visualization with Spark by felixcss
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss29.8K views
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit... by Databricks
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
Databricks1.3K views
Introducing Apache Giraph for Large Scale Graph Processing by sscdotopen
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
sscdotopen24.5K views
Introduction of the Design of A High-level Language over MapReduce -- The Pig... by Yu Liu
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Yu Liu484 views
Processing edges on apache giraph by DataWorks Summit
Processing edges on apache giraphProcessing edges on apache giraph
Processing edges on apache giraph
DataWorks Summit21.2K views
Apache Spark MLlib - Random Foreset and Desicion Trees by Tuhin Mahmud
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud1.5K views
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung by Spark Summit
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark Summit2.2K views
Introduction to Apache Spark Developer Training by Cloudera, Inc.
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.31.7K views
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr... by Databricks
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
Databricks9.7K views
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python by Christian Perone
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone9.5K views

Similar to Quadrupling your elephants - RDF and the Hadoop ecosystem

May 29, 2014 Toronto Hadoop User Group - Micro ETL by
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
1K views43 slides
Unit V.pdf by
Unit V.pdfUnit V.pdf
Unit V.pdfKennyPratheepKumar
2 views13 slides
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive by
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
3.3K views63 slides
spark_v1_2 by
spark_v1_2spark_v1_2
spark_v1_2Frank Schroeter
340 views8 slides
Apache spark - History and market overview by
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
3.5K views26 slides
Hadoop pig by
Hadoop pigHadoop pig
Hadoop pigSean Murphy
5.6K views54 slides

Similar to Quadrupling your elephants - RDF and the Hadoop ecosystem(20)

May 29, 2014 Toronto Hadoop User Group - Micro ETL by Adam Muise
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
Adam Muise1K views
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive by Sachin Aggarwal
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal3.3K views
Apache spark - History and market overview by Martin Zapletal
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
Martin Zapletal3.5K views
Boston Spark Meetup event Slides Update by vithakur
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur279 views
Overview of big data & hadoop v1 by Thanh Nguyen
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
Thanh Nguyen1K views
Hadoop MapReduce Fundamentals by Lynn Langit
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit133.9K views
Apache Spark Introduction @ University College London by Vitthal Gogate
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
Vitthal Gogate1.7K views
Fast Data Analytics with Spark and Python by Benjamin Bengfort
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort30.3K views
Introduction to Spark Internals by Pietro Michiardi
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi19.2K views
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro... by Anne Nicolas
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...
Anne Nicolas696 views
Presentation distro recipes-2013 by olberger
Presentation distro recipes-2013Presentation distro recipes-2013
Presentation distro recipes-2013
olberger1.7K views
Hadoop interview questions by Kalyan Hadoop
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
Kalyan Hadoop1.1K views
Geek Night - Functional Data Processing using Spark and Scala by Atif Akhtar
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
Atif Akhtar216 views
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15... by spinningmatt
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
spinningmatt3.6K views

More from Rob Vesse

Challenges and patterns for semantics at scale by
Challenges and patterns for semantics at scaleChallenges and patterns for semantics at scale
Challenges and patterns for semantics at scaleRob Vesse
390 views15 slides
Introducing JDBC for SPARQL by
Introducing JDBC for SPARQLIntroducing JDBC for SPARQL
Introducing JDBC for SPARQLRob Vesse
2.4K views27 slides
Practical SPARQL Benchmarking by
Practical SPARQL BenchmarkingPractical SPARQL Benchmarking
Practical SPARQL BenchmarkingRob Vesse
2.8K views13 slides
Everyday Tools for the Semantic Web Developer by
Everyday Tools for the Semantic Web DeveloperEveryday Tools for the Semantic Web Developer
Everyday Tools for the Semantic Web DeveloperRob Vesse
1.2K views10 slides
Everyday Tools for the Semantic Web Developer by
Everyday Tools for the Semantic Web DeveloperEveryday Tools for the Semantic Web Developer
Everyday Tools for the Semantic Web DeveloperRob Vesse
2.6K views19 slides
dotNetRDF - A Semantic Web/RDF Library for .Net Developers by
dotNetRDF - A Semantic Web/RDF Library for .Net DevelopersdotNetRDF - A Semantic Web/RDF Library for .Net Developers
dotNetRDF - A Semantic Web/RDF Library for .Net DevelopersRob Vesse
2.8K views10 slides

More from Rob Vesse(6)

Challenges and patterns for semantics at scale by Rob Vesse
Challenges and patterns for semantics at scaleChallenges and patterns for semantics at scale
Challenges and patterns for semantics at scale
Rob Vesse390 views
Introducing JDBC for SPARQL by Rob Vesse
Introducing JDBC for SPARQLIntroducing JDBC for SPARQL
Introducing JDBC for SPARQL
Rob Vesse2.4K views
Practical SPARQL Benchmarking by Rob Vesse
Practical SPARQL BenchmarkingPractical SPARQL Benchmarking
Practical SPARQL Benchmarking
Rob Vesse2.8K views
Everyday Tools for the Semantic Web Developer by Rob Vesse
Everyday Tools for the Semantic Web DeveloperEveryday Tools for the Semantic Web Developer
Everyday Tools for the Semantic Web Developer
Rob Vesse1.2K views
Everyday Tools for the Semantic Web Developer by Rob Vesse
Everyday Tools for the Semantic Web DeveloperEveryday Tools for the Semantic Web Developer
Everyday Tools for the Semantic Web Developer
Rob Vesse2.6K views
dotNetRDF - A Semantic Web/RDF Library for .Net Developers by Rob Vesse
dotNetRDF - A Semantic Web/RDF Library for .Net DevelopersdotNetRDF - A Semantic Web/RDF Library for .Net Developers
dotNetRDF - A Semantic Web/RDF Library for .Net Developers
Rob Vesse2.8K views

Recently uploaded

DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut... by
DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...
DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...Deltares
7 views28 slides
Copilot Prompting Toolkit_All Resources.pdf by
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdfRiccardo Zamana
8 views4 slides
Unleash The Monkeys by
Unleash The MonkeysUnleash The Monkeys
Unleash The MonkeysJacob Duijzer
7 views28 slides
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ... by
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...Donato Onofri
795 views34 slides
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J... by
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...Deltares
9 views24 slides
The Era of Large Language Models.pptx by
The Era of Large Language Models.pptxThe Era of Large Language Models.pptx
The Era of Large Language Models.pptxAbdulVahedShaik
5 views9 slides

Recently uploaded(20)

DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut... by Deltares
DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...
DSD-INT 2023 Machine learning in hydraulic engineering - Exploring unseen fut...
Deltares7 views
Copilot Prompting Toolkit_All Resources.pdf by Riccardo Zamana
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdf
Riccardo Zamana8 views
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ... by Donato Onofri
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...
Unmasking the Dark Art of Vectored Exception Handling: Bypassing XDR and EDR ...
Donato Onofri795 views
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J... by Deltares
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...
Deltares9 views
DSD-INT 2023 The Danube Hazardous Substances Model - Kovacs by Deltares
DSD-INT 2023 The Danube Hazardous Substances Model - KovacsDSD-INT 2023 The Danube Hazardous Substances Model - Kovacs
DSD-INT 2023 The Danube Hazardous Substances Model - Kovacs
Deltares8 views
Generic or specific? Making sensible software design decisions by Bert Jan Schrijver
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
360 graden fabriek by info33492
360 graden fabriek360 graden fabriek
360 graden fabriek
info3349237 views
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t... by Deltares
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...
DSD-INT 2023 Thermobaricity in 3D DCSM-FM - taking pressure into account in t...
Deltares9 views
DSD-INT 2023 Exploring flash flood hazard reduction in arid regions using a h... by Deltares
DSD-INT 2023 Exploring flash flood hazard reduction in arid regions using a h...DSD-INT 2023 Exploring flash flood hazard reduction in arid regions using a h...
DSD-INT 2023 Exploring flash flood hazard reduction in arid regions using a h...
Deltares5 views
BushraDBR: An Automatic Approach to Retrieving Duplicate Bug Reports by Ra'Fat Al-Msie'deen
BushraDBR: An Automatic Approach to Retrieving Duplicate Bug ReportsBushraDBR: An Automatic Approach to Retrieving Duplicate Bug Reports
BushraDBR: An Automatic Approach to Retrieving Duplicate Bug Reports
Headless JS UG Presentation.pptx by Jack Spektor
Headless JS UG Presentation.pptxHeadless JS UG Presentation.pptx
Headless JS UG Presentation.pptx
Jack Spektor7 views
DSD-INT 2023 European Digital Twin Ocean and Delft3D FM - Dols by Deltares
DSD-INT 2023 European Digital Twin Ocean and Delft3D FM - DolsDSD-INT 2023 European Digital Twin Ocean and Delft3D FM - Dols
DSD-INT 2023 European Digital Twin Ocean and Delft3D FM - Dols
Deltares7 views
MariaDB stored procedures and why they should be improved by Federico Razzoli
MariaDB stored procedures and why they should be improvedMariaDB stored procedures and why they should be improved
MariaDB stored procedures and why they should be improved
DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ... by Deltares
DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ...DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ...
DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ...
Deltares10 views
Fleet Management Software in India by Fleetable
Fleet Management Software in India Fleet Management Software in India
Fleet Management Software in India
Fleetable11 views

Quadrupling your elephants - RDF and the Hadoop ecosystem

  • 1. 1 RDF and the Hadoop Ecosystem Rob Vesse Twitter: @RobVesse Email: rvesse@apache.org
  • 2. 2  Software Engineer at YarcData (part of Cray Inc)  Working on big data analytics products  Active open source contributor primarily to RDF & SPARQL related projects  Apache Jena Committer and PMC Member  dotNetRDF Lead Developer  Primarily interested in RDF, SPARQL and Big Data Analytics technologies
  • 3. 3 What's missing in the Hadoop ecosystem? What's needed to fill the gap? What's already available?  Jena Hadoop RDF Tools  GraphBuilder  Other Projects  Getting Involved  Questions
  • 4. 4
  • 5. 5 Apache, the projects and their logo shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
  • 6. 6  No first class projects  Some limited support in other projects  E.g. Giraph supports RDF by bridging through the Tinkerpop stack  Some existing external projects  Lots of academic proof of concepts  Some open source efforts but mostly task specific  E.g. Infovore targeted at creating curated Freebase and DBPedia datasets
  • 7. 7
  • 8. 8  Need to efficiently represent RDF concepts as Writable types  Nodes, Triples, Quads, Graphs, Datasets, Query Results etc What's the minimum viable subset?
  • 9. 9  Need to be able to get data in and out of RDF formats Without this we can't use the power of the Hadoop ecosystem to do useful work  Lots of serializations out there:  RDF/XML  Turtle  NTriples  NQuads  JSON-LD  etc  Also would like to be able to produce end results as RDF
  • 10. 1 0 Map/Reduce building blocks  Common operations e.g. splitting  Enable developers to focus on their applications  User Friendly tooling  i.e. non-programmer tools
  • 11. 1 1
  • 12. 1 2 CC BY-SA 3.0 Wikimedia Commons
  • 13. 1 3  Set of modules part of the Apache Jena project  Originally developed at Cray and donated to the project earlier this year  Experimental modules on the hadoop-rdf branch of our  Currently only available as development SNAPSHOT releases  Group ID: org.apache.jena  Artifact IDs:  jena-hadoop-rdf-common  jena-hadoop-rdf-io  jena-hadoop-rdf-mapreduce  Latest Version: 0.9.0-SNAPSHOT  Aims to fulfill all the basic requirements for enabling RDF on Hadoop  Built against Hadoop Map/Reduce 2.x APIs
  • 14. 1 4  Provides the Writable types for RDF primitives  NodeWritable  TripleWritable  QuadWritable  NodeTupleWritable  All backed by RDF Thrift  A compact binary serialization for RDF using Apache Thrift  See http://afs.github.io/rdf-thrift/  Extremely efficient to serialize and deserialize  Allows for efficient WritableComparator implementations that perform binary comparisons
  • 15. Provides InputFormat and OutputFormat implementations
      • Supports most formats that Jena supports
      • Designed to be extendable with new formats
      • Will split and parallelize inputs where the RDF serialization is amenable to this
      • Also transparently handles compressed inputs and outputs
      • Note that compression blocks splitting, i.e. there is a trade-off between IO and parallelism
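To make the IO versus parallelism trade-off concrete, here is a rough sketch under stated assumptions: a 128 MiB split size, a 10:1 compression ratio and a codec that cannot be split (as is the case with gzip). `SplitCountSketch` and `splitCount` are hypothetical names, and the formula is a deliberate simplification rather than Hadoop's actual split calculation.

```java
/** Simplified model of how many map tasks a single input file can feed. */
final class SplitCountSketch {

    /**
     * Number of input splits for one file under a simplified model:
     * a splittable file yields ceil(size / splitSize) splits, while a file
     * that cannot be split is read by a single task.
     */
    static long splitCount(long fileSizeBytes, long splitSizeBytes, boolean splittable) {
        if (!splittable) {
            return 1;
        }
        // Integer ceiling division
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long splitSize = 128L * mib;        // assumed split size
        long plain = 10L * 1024L * mib;     // 10 GiB of uncompressed NTriples
        long gzipped = plain / 10;          // assumed 10:1 compression ratio

        System.out.println("plain splits:   " + splitCount(plain, splitSize, true));
        System.out.println("gzipped splits: " + splitCount(gzipped, splitSize, false));
    }
}
```

Under these assumptions the compressed file needs a tenth of the IO but is processed by one task, while the uncompressed file can be spread across many tasks.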
  • 16. Various reusable building block Mapper and Reducer implementations:
      • Counting, Filtering, Grouping, Splitting, Transforming
      • Can be used as-is to do some basic Hadoop tasks or used as building blocks for more complex tasks
  • 18. For NTriples inputs, compared the performance of a Text-based node count versus an RDF-based node count
      • Performance as good (within 10%) and sometimes significantly better
      • Heavily dataset dependent
      • Varies considerably with cluster setup
      • Also depends on how the input is processed
      • YMMV!
      • For other RDF formats you would struggle to implement this at all
  • 19. Originally developed by Intel
      • Some contributions by Cray - awaiting merging at time of writing
      • Open source under Apache License
      • https://github.com/01org/graphbuilder/tree/2.0.alpha
      • 2.0.alpha is the Pig based branch
      • Allows data to be transformed into graphs using Pig scripts
      • Provides set of Pig UDFs for translating data to graph formats
      • Supports both property graphs and RDF graphs
  • 20. -- Declare our mappings
        propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*, [
            'idBase' # 'http://example.org/instances/',
            'base' # 'http://example.org/ontology/',
            'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
            'propertyMap' # [ 'type' # 'a', 'name' # 'foaf:name', 'age' # 'foaf:age' ],
            'idProperty' # 'id' ]);
        -- Convert to NTriples
        rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
        -- Write out NTriples
        STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
  • 21. Uses a declarative mapping based on Pig primitives
      • Maps and Tuples
      • Have to be explicitly joined to the data because Pig UDFs can only be called with String arguments
      • Has some benefits, e.g. conditional mappings
      • RDF Mappings operate on Property Graphs
      • Requires original data to be mapped to a property graph first
      • Direct mapping to RDF is a future enhancement that has yet to be implemented
  • 23. Infovore - Paul Houle
      • https://github.com/paulhoule/infovore/wiki
      • Cleaned and curated Freebase datasets processed with Hadoop
      • CumulusRDF - Institute of Applied Informatics and Formal Description Methods
      • https://code.google.com/p/cumulusrdf/
      • RDF store backed by Apache Cassandra
  • 24. Please start playing with these projects
      • Please interact with the community: dev@jena.apache.org
      • What works? What is broken? What is missing?
      • Contribute: Apache projects are ultimately driven by the community
      • If there's a feature you want, please suggest it
      • Or better still, contribute it yourself!
  • 25. Questions?
      • Personal Email: rvesse@apache.org
      • Jena Mailing List: dev@jena.apache.org
  • 27. > bin/hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output --input-type triples /user/input
      • --node-count requests the Node Count statistics be calculated
      • Assumes mixed quads and triples input if no --input-type specified
      • Using this for triples only data can skew statistics, e.g. can result in high node counts for default graph node
      • Hence we explicitly specify input as triples
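A minimal sketch of why the default graph node can end up with an inflated count, assuming that a triple read as a quad is assigned to a default graph whose node is then counted like any other. The class name, the `nodeCounts` helper and the `DEFAULT_GRAPH` label are hypothetical and are not taken from the Jena code.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrates how counting triples as quads adds one graph node per triple. */
final class DefaultGraphSkewSketch {

    /** Hypothetical label for the default graph (assumption, not Jena's value). */
    static final String DEFAULT_GRAPH = "<urn:example:default-graph>";

    /**
     * Counts how often each node is used. When asQuads is true every triple
     * is treated as a quad in the default graph, so the graph node is counted
     * once per triple in addition to subject, predicate and object.
     */
    static Map<String, Long> nodeCounts(String[][] triples, boolean asQuads) {
        Map<String, Long> counts = new HashMap<>();
        for (String[] triple : triples) {
            for (String node : triple) {
                counts.merge(node, 1L, Long::sum);
            }
            if (asQuads) {
                counts.merge(DEFAULT_GRAPH, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] triples = {
            { ":alice", "foaf:knows", ":bob" },
            { ":bob", "foaf:knows", ":carol" },
            { ":carol", "foaf:knows", ":alice" }
        };
        System.out.println("as triples: " + nodeCounts(triples, false));
        System.out.println("as quads:   " + nodeCounts(triples, true));
    }
}
```

In the quads run the default graph label is counted once per input triple, so on a large triples-only dataset it would dominate the statistics.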
  • 33. > ./pig -x local examples/property_graphs_and_rdf.pig
        > cat /tmp/rdf_triples/part-m-00000
      • Running in local mode for this demo
      • Output goes to /tmp/rdf_triples
  • 37. public abstract class AbstractNodeTupleNodeCountMapper<TKey, TValue, T extends AbstractNodeTupleWritable<TValue>>
                extends Mapper<TKey, T, NodeWritable, LongWritable> {

            private LongWritable initialCount = new LongWritable(1);

            @Override
            protected void map(TKey key, T value, Context context) throws IOException, InterruptedException {
                NodeWritable[] ns = this.getNodes(value);
                for (NodeWritable n : ns) {
                    context.write(n, this.initialCount);
                }
            }

            protected abstract NodeWritable[] getNodes(T tuple);
        }

        public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> {

            @Override
            protected NodeWritable[] getNodes(TripleWritable tuple) {
                Triple t = tuple.get();
                return new NodeWritable[] {
                    new NodeWritable(t.getSubject()),
                    new NodeWritable(t.getPredicate()),
                    new NodeWritable(t.getObject())
                };
            }
        }
  • 38. public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> {

            @Override
            protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long count = 0;
                Iterator<LongWritable> iter = values.iterator();
                while (iter.hasNext()) {
                    count += iter.next().get();
                }
                context.write(key, new LongWritable(count));
            }
        }
  • 39. Job job = Job.getInstance(config);
        job.setJarByClass(JobFactory.class);
        job.setJobName("RDF Triples Node Usage Count");

        // Map/Reduce classes
        job.setMapperClass(TripleNodeCountMapper.class);
        job.setMapOutputKeyClass(NodeWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setReducerClass(NodeCountReducer.class);

        // Input and Output
        job.setInputFormatClass(TriplesInputFormat.class);
        job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
        FileInputFormat.setInputPaths(job, StringUtils.arrayToString(inputPaths));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        return job;
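The mapper, reducer and job setup above follow the classic Word Count shape. As a rough illustration of the same data flow without a cluster, the sketch below runs map, group-by-key and reduce in memory using only the Java standard library. It uses plain strings instead of the Jena Writable types, and `InMemoryNodeCountSketch` and its methods are hypothetical names, so treat it as a model of the idea rather than of the real modules.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory model of the node count job: map, group by key, reduce. */
final class InMemoryNodeCountSketch {

    /** "Map" step: emit (node, 1) for the subject, predicate and object. */
    static List<Map.Entry<String, Long>> map(String[] triple) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String node : triple) {
            out.add(new AbstractMap.SimpleEntry<String, Long>(node, 1L));
        }
        return out;
    }

    /** "Reduce" step: sum the counts collected for one node. */
    static long reduce(List<Long> values) {
        long count = 0;
        for (long v : values) {
            count += v;
        }
        return count;
    }

    /** Runs map, a sorted group-by-key standing in for the shuffle, then reduce. */
    static Map<String, Long> run(List<String[]> triples) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String[] triple : triples) {
            for (Map.Entry<String, Long> kv : map(triple)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> triples = new ArrayList<>();
        triples.add(new String[] { ":alice", "foaf:knows", ":bob" });
        triples.add(new String[] { ":bob", "foaf:knows", ":alice" });
        triples.add(new String[] { ":alice", "foaf:name", "\"Alice\"" });
        run(triples).forEach((node, count) -> System.out.println(node + "\t" + count));
    }
}
```

In the real job the grouping is done by Hadoop's shuffle across the cluster, and the keys are NodeWritable instances rather than strings.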

Editor's Notes

  1. Tons of active projects: Accumulo, Ambari, Avro, Cassandra, Chukwa, Giraph, Hama, HBase, Hive, Mahout, Pig, Spark, Tez, ZooKeeper. And those are just off the top of my head (and ignoring Incubating projects). However, they are mostly focused on traditional data sources, e.g. logs, relational databases, unstructured data.
  2. Highlight benefit of WritableComparator - significant speed up in reduce phase
  3. The project also provides a demo JAR which shows how to use the building blocks to perform common Hadoop tasks on RDF. Node Count is essentially the RDF counterpart of Word Count, the "Hello World" of Hadoop programming.
  4. Mention that Intel may not have yet merged our pull request that adds the declarative mapping approach
  5. ~20 lines of code (fewer if you remove unnecessary formatting)
  6. 11 lines of code