Apache Jena Elephas and Friends

Rob Vesse
Rob VesseSoftware Engineer at YarcData
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas
and Friends
RDF and the Hadoop Ecosystem
Rob Vesse
Twitter: @RobVesse
Email: rvesse@gmail.com
C O M P U T E | S T O R E | A N A L Y Z E
About Me
● Software Engineer at Cray Inc
● Working on:
● RDF and SPARQL
● Big Data Analytics
● Active open source contributor
● Apache Jena
● dotNetRDF
● Minor contributions to other Apache projects
● Assorted other bits and pieces on my GitHub and BitBucket
● Primarily interested in intersection of RDF/SPARQL world
with rest of Big Data world
C O M P U T E | S T O R E | A N A L Y Z E
Talk Overview
● What's missing in the Hadoop ecosystem?
● What's already available?
● Apache Jena Elephas
● Intel Graph Builder
● Other interesting projects
● Getting Involved
● Questions
C O M P U T E | S T O R E | A N A L Y Z E
What's missing in the Hadoop
ecosystem?
Apache, the projects and their logo shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
C O M P U T E | S T O R E | A N A L Y Z E
Where's RDF?
● No first class projects
● Some very limited support in other projects
● Giraph can support RDF by bridging through the Tinkerpop 2
stack
● Few existing projects
● Mostly academic proofs of concept (POC)
● Some open source efforts but often task specific
● e.g. Infovore targeted at creating curated Freebase and DBPedia
datasets
C O M P U T E | S T O R E | A N A L Y Z E
What's needed for RDF?
● Minimum Viable Product
● Standard Writable implementations for primitives
● Input and Output support
● Would be nice to have:
● Tools for translating data to and from RDF
● Integration with the common analytic frameworks
● e.g. Spark, Giraph, Hive, Pig
C O M P U T E | S T O R E | A N A L Y Z E
What's already available?
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Background
● Started as a POC at Cray
● Donated to the Apache Jena
project 1st April 2014
● JENA-666
● Originally known as Hadoop
RDF Tools
● Renamed to Elephas in
December 2014
● Name was suggested by
Claude Warren
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - What is it?
● Set of modules part of the Apache Jena project
● Currently only developer SNAPSHOT builds available
● Will be included as part of upcoming Jena 2.13.0 release
● Aims to fulfill all the basic requirements for enabling RDF
on Hadoop
● Built against Hadoop 2.x APIs
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - How do I use it?
● Read the documentation
● http://jena.apache.org/documentation/hadoop/
● Add appropriate Maven dependencies to your code
● http://jena.apache.org/documentation/hadoop/artifacts.html
● Will also need to declare relevant Hadoop dependencies as
"provided"
● Use the APIs as-is for basic tasks or use as starting point
for more complex applications
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Common API
● Provides Writable types for the RDF primitives
● NodeWritable
● TripleWritable
● QuadWritable
● NodeTupleWritable
● An arbitrarily sized tuples of RDF terms
● Backed by RDF Thrift
● A compact binary serialization for RDF using Apache Thrift
● See http://afs.github.io/rdf-thrift/
● Extremely efficient to serialize and de-serialize
● Allows for efficient WritableComparator implementations that
perform comparisons directly on the binary forms
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - IO API
● Provides Hadoop InputFormat and OutputFormat
implementations for RDF
● Covers all RDF serializations Jena supports
● Easily extended with custom formats
● Splits and parallelizes processing of input where the RDF
serialization allows it
● Blank Nodes can be awkward
● Transparently handles compressed IO
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Blank Nodes
● Blank Nodes can be
problematic
● Need to consistently assign
IDs in parallel
● However you will typically
produce multiple
intermediate output files in
multi-job workflows
● Thus need to allow for
document versus globally
scoped IDs
● Configuration setting
controls this
● See documentation for
more information
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Map/Reduce API
● Various reusable basic Mapper and Reducer
implementations
● Covers common tasks:
● Counting
● Filtering
● Grouping
● Splitting
● Transformation
● Mostly intended for use as a starting point
● Some of these are bundled into a RDF stats demo
application
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Example Job
● Node Count (aka word count for RDF)
● All the classes referenced (bar Example.class) are provided by Elephas
Job job = Job.getInstance(config);
job.setJarByClass(Example.class);
job.setJobName("RDF Triples Node Usage Count");
// Map/Reduce classes
job.setMapperClass(TripleNodeCountMapper.class);
job.setMapOutputKeyClass(NodeWritable.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(NodeCountReducer.class);
// Input and Output
job.setInputFormatClass(NTriplesInputFormat.class);
job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
FileInputFormat.setInputPath(job, new Path("/inputs/rdf"));
FileOutputFormat.setOutputPath(job, new Path("/outputs/rdf"));
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Node Count
Demo
See end of slide deck for steps to run the demo
and screenshots
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Performance Notes
● For NTriples inputs we compared performance of a Text
based node count versus RDF based node count
● Performance typically as good (within 10%) and
sometimes significantly better
● Heavily dataset dependent
● Varies considerably with cluster setup
● Also depends on how the input is processed
● Be aware YMMV!
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - What is it?
● Tools for transforming/creating large graphs
● Developed by Intel
● Cray has some proposed improvements that are awaiting
merging at time of writing
● Open source under Apache License
● https://github.com/01org/graphbuilder/tree/2.0.alpha
● 2.0.alpha is the preferred branch
● See https://github.com/cray/graphbuilder for the version
discussed here
● Allows graphs to be created/transformed from arbitrary
data sources using Apache Pig
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - How do I use it?
● REGISTER the Graph Builder JAR in your Pig script
● May optionally want to IMPORT the pig/graphbuilder.pig
script which aliases some of the provided UDFs
● LOAD your data
● Use the provided UDFs to generate a graph
● Can create both property graphs and RDF
● Currently data must be mapped to a property graph and then
into RDF
● STORE the resulting graph
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - How it works?
● Uses a declarative mapping based on Pig primitives
● Has to be explicitly joined to the data
● Limitation of Pig UDFs
● RDF mappings operate on property graphs
● Must map data to a property graph first
● Direct mapping to RDF is a possible future enhancement
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - Pig Script Example
https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
-- Rest of script omitted for brevity
-- Declare our mappings
propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*,
[ 'idBase' # 'http://example.org/instances/',
'base' # 'http://example.org/ontology/',
'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
'propertyMap' # [ 'type' # 'a',
'name' # 'foaf:name',
'age' # 'foaf:age' ],
'uriProperties' # ( 'type' ),
'idProperty' # 'id' ]);
-- Convert to NTriples
rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
-- Write out NTriples
STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - RDF
Generation Demo
See end of slide deck for steps to run the demo
and screenshots
C O M P U T E | S T O R E | A N A L Y Z E
Other Projects - Infovore
● Framework developed by Paul Houle
● Open source on GitHub
● https://github.com/paulhoule/infovore/wiki
● Apache License 2.0
● Produces a cleaned and curated Freebase dataset using
Hadoop for the processing
● Designed to be easily self-deployed on Amazon EC2
● Also some related projects for working with Wikipedia
● https://github.com/paulhoule/telepath
● Currently unclear what direction these projects will take
after the Freebase shutdown at end of March this year
C O M P U T E | S T O R E | A N A L Y Z E
Other Projects - CumulusRDF
● Academic project from Institute of Applied Informatics
and Formal Description Methods
● https://code.google.com/p/cumulusrdf/
● RDF store backed by Apache Cassandra
● Reasonable performance compared to native RDF stores
● See NoSQL Databases for RDF: An Empirical Evaluation
● Philippe Cudŕe-Mauroux et al
● http://exascale.info/sites/default/files/nosqlrdf.pdf
● Reasonably active development
C O M P U T E | S T O R E | A N A L Y Z E
Getting Involved
C O M P U T E | S T O R E | A N A L Y Z E
How to contribute
● Please download and try out these projects
● Interact with the communities and developers involved
● What works?
● What is broken?
● What is missing?
● How could the documentation be better?
● Contribute
● Open source ultimately lives or dies with community
participation
● If there's a missing feature then suggest it
● Or better still contribute it yourself!
C O M P U T E | S T O R E | A N A L Y Z E
Questions?
Personal Email: rvesse@gmail.com
Apache Jena User List: users@jena.apache.org
These slides will be posted to my SlideShare:
http://www.slideshare.net/RobVesse
C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas - Node Count
Demo
C O M P U T E | S T O R E | A N A L Y Z E
Environment Pre-requisites
● Hadoop 2.x cluster
● Assumes hadoop command is on your PATH
● Download the latest JAR file
● Or build youself from source
● jena-hadoop-rdf-stats-VERSION-hadoop-job.jar
● Upload some RDF data to a HDFS folder
C O M P U T E | S T O R E | A N A L Y Z E
Run the Demo
● --node-count requests the Node Count statistics be calculated
● Assumes mixed quads and triples input if no --input-type specified
● Using this for triples only data can skew statistics
● e.g. can result in high node counts for default graph node
● Hence we explicitly specify input as triples
> hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar
org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output --
input-type triples /user/input
Apache Jena Elephas and Friends
Apache Jena Elephas and Friends
Apache Jena Elephas and Friends
Apache Jena Elephas and Friends
Apache Jena Elephas and Friends
C O M P U T E | S T O R E | A N A L Y Z E
Intel Graph Builder - RDF
Generation Demo
C O M P U T E | S T O R E | A N A L Y Z E
Environment Pre-requisites
● Pig 0.12
● Should work with higher but not tested
● Assumes pig command is on your PATH
● Clone the Cray version of the Graph Builder code
● https://github.com/cray/graphbuilder
C O M P U T E | S T O R E | A N A L Y Z E
Run the Demo
● Running Pig in local mode for simplicity
● Output goes to /tmp/rdf_triples/
> pig -x local examples/property_graphs_and_rdf.pig
> cat /tmp/rdf_triples/part-m-00000
Apache Jena Elephas and Friends
Apache Jena Elephas and Friends
1 of 41

Recommended

Semantic Integration with Apache Jena and Stanbol by
Semantic Integration with Apache Jena and StanbolSemantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and StanbolAll Things Open
3.5K views27 slides
Practical SPARQL Benchmarking Revisited by
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedRob Vesse
1.1K views35 slides
Quadrupling your elephants - RDF and the Hadoop ecosystem by
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
5K views40 slides
Sempala - Interactive SPARQL Query Processing on Hadoop by
Sempala - Interactive SPARQL Query Processing on HadoopSempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on HadoopAlexander Schätzle
1.6K views21 slides
Linking the world with Python and Semantics by
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and SemanticsTatiana Al-Chueyr
14.4K views84 slides
WebTech Tutorial Querying DBPedia by
WebTech Tutorial Querying DBPediaWebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPediaKatrien Verbert
8.2K views41 slides

More Related Content

What's hot

Why Scala Is Taking Over the Big Data World by
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldDean Wampler
41.7K views89 slides
Apache spark-melbourne-april-2015-meetup by
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
1.1K views68 slides
Pandas UDF and Python Type Hint in Apache Spark 3.0 by
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Databricks
527 views38 slides
Debugging Apache Spark - Scala & Python super happy fun times 2017 by
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017Holden Karau
881 views47 slides
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python by
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
9.5K views72 slides
Holden Karau - Spark ML for Custom Models by
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
2.2K views14 slides

What's hot(20)

Why Scala Is Taking Over the Big Data World by Dean Wampler
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
Dean Wampler41.7K views
Apache spark-melbourne-april-2015-meetup by Ned Shawa
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa1.1K views
Pandas UDF and Python Type Hint in Apache Spark 3.0 by Databricks
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
Databricks527 views
Debugging Apache Spark - Scala & Python super happy fun times 2017 by Holden Karau
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017
Holden Karau881 views
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python by Christian Perone
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone9.5K views
Holden Karau - Spark ML for Custom Models by sparktc
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
sparktc2.2K views
Learn Apache Spark: A Comprehensive Guide by Whizlabs
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs592 views
SPARQL Cheat Sheet by LeeFeigenbaum
SPARQL Cheat SheetSPARQL Cheat Sheet
SPARQL Cheat Sheet
LeeFeigenbaum92.4K views
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015 by Holden Karau
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Holden Karau4.6K views
SPARQL 1.1 Update (2013-03-05) by andyseaborne
SPARQL 1.1 Update (2013-03-05)SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)
andyseaborne4.5K views
Getting started with Apache Spark in Python - PyLadies Toronto 2016 by Holden Karau
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau222 views
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15... by spinningmatt
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
spinningmatt3.6K views
Scaling with apache spark (a lesson in unintended consequences) strange loo... by Holden Karau
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Holden Karau194 views
Apache Spark Super Happy Funtimes - CHUG 2016 by Holden Karau
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
Holden Karau149 views
Keeping Spark on Track: Productionizing Spark for ETL by Databricks
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks4.2K views
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro... by Spark Summit
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit18.2K views
Semantic web meetup – sparql tutorial by AdonisDamian
Semantic web meetup – sparql tutorialSemantic web meetup – sparql tutorial
Semantic web meetup – sparql tutorial
AdonisDamian6.2K views
Debugging PySpark - PyCon US 2018 by Holden Karau
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
Holden Karau286 views
Introduction to and Extending Spark ML by Holden Karau
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
Holden Karau298 views

Similar to Apache Jena Elephas and Friends

New Analytics Toolbox DevNexus 2015 by
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
1.4K views131 slides
Big Data Beyond the JVM - Strata San Jose 2018 by
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
689 views47 slides
Apache PIG by
Apache PIGApache PIG
Apache PIGAnuja Gunale
205 views48 slides
Making the big data ecosystem work together with Python & Apache Arrow, Apach... by
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
1.5K views64 slides
Making the big data ecosystem work together with python apache arrow, spark,... by
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...Holden Karau
86 views64 slides
Big data beyond the JVM - DDTX 2018 by
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
155 views47 slides

Similar to Apache Jena Elephas and Friends(20)

Big Data Beyond the JVM - Strata San Jose 2018 by Holden Karau
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau689 views
Making the big data ecosystem work together with Python & Apache Arrow, Apach... by Holden Karau
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Holden Karau1.5K views
Making the big data ecosystem work together with python apache arrow, spark,... by Holden Karau
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...
Holden Karau86 views
Big data beyond the JVM - DDTX 2018 by Holden Karau
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
Holden Karau155 views
Apache spark - History and market overview by Martin Zapletal
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
Martin Zapletal3.5K views
Challenges and patterns for semantics at scale by Rob Vesse
Challenges and patterns for semantics at scaleChallenges and patterns for semantics at scale
Challenges and patterns for semantics at scale
Rob Vesse390 views
Introduction to Apache Beam by Knoldus Inc.
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
Knoldus Inc.117 views
Slides semantic web and Drupal 7 NYCCamp 2012 by scorlosquet
Slides semantic web and Drupal 7 NYCCamp 2012Slides semantic web and Drupal 7 NYCCamp 2012
Slides semantic web and Drupal 7 NYCCamp 2012
scorlosquet1.6K views
Introduction to Spark Datasets - Functional and relational together at last by Holden Karau
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau640 views
Apache Spark: Lightning Fast Cluster Computing by All Things Open
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
All Things Open757 views
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ... by DataStax Academy
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
DataStax Academy5.6K views
Are general purpose big data systems eating the world? by Holden Karau
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
Holden Karau124 views
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop by Neo4j
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Neo4j188 views
Sharing (or stealing) the jewels of python with big data & the jvm (1) by Holden Karau
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Holden Karau53 views
The Semantic Web and Drupal 7 - Loja 2013 by scorlosquet
The Semantic Web and Drupal 7 - Loja 2013The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013
scorlosquet1.9K views

Recently uploaded

Ransomware is Knocking your Door_Final.pdf by
Ransomware is Knocking your Door_Final.pdfRansomware is Knocking your Door_Final.pdf
Ransomware is Knocking your Door_Final.pdfSecurity Bootcamp
96 views46 slides
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITShapeBlue
206 views8 slides
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueShapeBlue
138 views15 slides
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...ShapeBlue
180 views18 slides
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And... by
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...ShapeBlue
106 views12 slides
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPoolShapeBlue
123 views10 slides

Recently uploaded(20)

Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue206 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue138 views
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue180 views
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And... by ShapeBlue
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
ShapeBlue106 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue123 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue198 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue147 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue159 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue186 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue139 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE79 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash158 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue166 views
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue132 views
The Role of Patterns in the Era of Large Language Models by Yunyao Li
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
Yunyao Li85 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue221 views
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue130 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10139 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue135 views

Apache Jena Elephas and Friends

  • 1. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas and Friends RDF and the Hadoop Ecosystem Rob Vesse Twitter: @RobVesse Email: rvesse@gmail.com
  • 2. C O M P U T E | S T O R E | A N A L Y Z E About Me ● Software Engineer at Cray Inc ● Working on: ● RDF and SPARQL ● Big Data Analytics ● Active open source contributor ● Apache Jena ● dotNetRDF ● Minor contributions to other Apache projects ● Assorted other bits and pieces on my GitHub and BitBucket ● Primarily interested in intersection of RDF/SPARQL world with rest of Big Data world
  • 3. C O M P U T E | S T O R E | A N A L Y Z E Talk Overview ● What's missing in the Hadoop ecosystem? ● What's already available? ● Apache Jena Elephas ● Intel Graph Builder ● Other interesting projects ● Getting Involved ● Questions
  • 4. C O M P U T E | S T O R E | A N A L Y Z E What's missing in the Hadoop ecosystem?
  • 5. Apache, the projects and their logo shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
  • 6. C O M P U T E | S T O R E | A N A L Y Z E Where's RDF? ● No first class projects ● Some very limited support in other projects ● Giraph can support RDF by bridging through the Tinkerpop 2 stack ● Few existing projects ● Mostly academic proofs of concept (POC) ● Some open source efforts but often task specific ● e.g. Infovore targeted at creating curated Freebase and DBPedia datasets
  • 7. C O M P U T E | S T O R E | A N A L Y Z E What's needed for RDF? ● Minimum Viable Product ● Standard Writable implementations for primitives ● Input and Output support ● Would be nice to have: ● Tools for translating data to and from RDF ● Integration with the common analytic frameworks ● e.g. Spark, Giraph, Hive, Pig
  • 8. C O M P U T E | S T O R E | A N A L Y Z E What's already available?
  • 9. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Background ● Started as a POC at Cray ● Donated to the Apache Jena project 1st April 2014 ● JENA-666 ● Originally known as Hadoop RDF Tools ● Renamed to Elephas in December 2014 ● Name was suggested by Claude Warren
  • 10. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - What is it? ● Set of modules part of the Apache Jena project ● Currently only developer SNAPSHOT builds available ● Will be included as part of upcoming Jena 2.13.0 release ● Aims to fulfill all the basic requirements for enabling RDF on Hadoop ● Built against Hadoop 2.x APIs
  • 11. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - How do I use it? ● Read the documentation ● http://jena.apache.org/documentation/hadoop/ ● Add appropriate Maven dependencies to your code ● http://jena.apache.org/documentation/hadoop/artifacts.html ● Will also need to declare relevant Hadoop dependencies as "provided" ● Use the APIs as-is for basic tasks or use as starting point for more complex applications
  • 12. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Common API ● Provides Writable types for the RDF primitives ● NodeWritable ● TripleWritable ● QuadWritable ● NodeTupleWritable ● An arbitrarily sized tuples of RDF terms ● Backed by RDF Thrift ● A compact binary serialization for RDF using Apache Thrift ● See http://afs.github.io/rdf-thrift/ ● Extremely efficient to serialize and de-serialize ● Allows for efficient WritableComparator implementations that perform comparisons directly on the binary forms
  • 13. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - IO API ● Provides Hadoop InputFormat and OutputFormat implementations for RDF ● Covers all RDF serializations Jena supports ● Easily extended with custom formats ● Splits and parallelizes processing of input where the RDF serialization allows it ● Blank Nodes can be awkward ● Transparently handles compressed IO
  • 14. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Blank Nodes ● Blank Nodes can be problematic ● Need to consistently assign IDs in parallel ● However you will typically produce multiple intermediate output files in multi-job workflows ● Thus need to allow for document versus globally scoped IDs ● Configuration setting controls this ● See documentation for more information
  • 15. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Map/Reduce API ● Various reusable basic Mapper and Reducer implementations ● Covers common tasks: ● Counting ● Filtering ● Grouping ● Splitting ● Transformation ● Mostly intended for use as a starting point ● Some of these are bundled into a RDF stats demo application
  • 16. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Example Job ● Node Count (aka word count for RDF) ● All the classes referenced (bar Example.class) are provided by Elephas Job job = Job.getInstance(config); job.setJarByClass(Example.class); job.setJobName("RDF Triples Node Usage Count"); // Map/Reduce classes job.setMapperClass(TripleNodeCountMapper.class); job.setMapOutputKeyClass(NodeWritable.class); job.setMapOutputValueClass(LongWritable.class); job.setReducerClass(NodeCountReducer.class); // Input and Output job.setInputFormatClass(NTriplesInputFormat.class); job.setOutputFormatClass(NTriplesNodeOutputFormat.class); FileInputFormat.setInputPath(job, new Path("/inputs/rdf")); FileOutputFormat.setOutputPath(job, new Path("/outputs/rdf"));
  • 17. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Node Count Demo See end of slide deck for steps to run the demo and screenshots
  • 18. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Performance Notes ● For NTriples inputs we compared performance of a Text based node count versus RDF based node count ● Performance typically as good (within 10%) and sometimes significantly better ● Heavily dataset dependent ● Varies considerably with cluster setup ● Also depends on how the input is processed ● Be aware YMMV!
  • 19. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - What is it? ● Tools for transforming/creating large graphs ● Developed by Intel ● Cray has some proposed improvements that are awaiting merging at time of writing ● Open source under Apache License ● https://github.com/01org/graphbuilder/tree/2.0.alpha ● 2.0.alpha is the preferred branch ● See https://github.com/cray/graphbuilder for the version discussed here ● Allows graphs to be created/transformed from arbitrary data sources using Apache Pig
  • 20. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - How do I use it? ● REGISTER the Graph Builder JAR in your Pig script ● May optionally want to IMPORT the pig/graphbuilder.pig script which aliases some of the provided UDFs ● LOAD your data ● Use the provided UDFs to generate a graph ● Can create both property graphs and RDF ● Currently data must be mapped to a property graph and then into RDF ● STORE the resulting graph
  • 21. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - How it works? ● Uses a declarative mapping based on Pig primitives ● Has to be explicitly joined to the data ● Limitation of Pig UDFs ● RDF mappings operate on property graphs ● Must map data to a property graph first ● Direct mapping to RDF is a possible future enhancement
  • 22. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - Pig Script Example https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig -- Rest of script omitted for brevity -- Declare our mappings propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*, [ 'idBase' # 'http://example.org/instances/', 'base' # 'http://example.org/ontology/', 'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ], 'propertyMap' # [ 'type' # 'a', 'name' # 'foaf:name', 'age' # 'foaf:age' ], 'uriProperties' # ( 'type' ), 'idProperty' # 'id' ]); -- Convert to NTriples rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*)); -- Write out NTriples STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
  • 23. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - RDF Generation Demo See end of slide deck for steps to run the demo and screenshots
  • 24. C O M P U T E | S T O R E | A N A L Y Z E Other Projects - Infovore ● Framework developed by Paul Houle ● Open source on GitHub ● https://github.com/paulhoule/infovore/wiki ● Apache License 2.0 ● Produces a cleaned and curated Freebase dataset using Hadoop for the processing ● Designed to be easily self-deployed on Amazon EC2 ● Also some related projects for working with Wikipedia ● https://github.com/paulhoule/telepath ● Currently unclear what direction these projects will take after the Freebase shutdown at end of March this year
  • 25. C O M P U T E | S T O R E | A N A L Y Z E Other Projects - CumulusRDF ● Academic project from Institute of Applied Informatics and Formal Description Methods ● https://code.google.com/p/cumulusrdf/ ● RDF store backed by Apache Cassandra ● Reasonable performance compared to native RDF stores ● See NoSQL Databases for RDF: An Empirical Evaluation ● Philippe Cudŕe-Mauroux et al ● http://exascale.info/sites/default/files/nosqlrdf.pdf ● Reasonably active development
  • 26. C O M P U T E | S T O R E | A N A L Y Z E Getting Involved
  • 27. C O M P U T E | S T O R E | A N A L Y Z E How to contribute ● Please download and try out these projects ● Interact with the communities and developers involved ● What works? ● What is broken? ● What is missing? ● How could the documentation be better? ● Contribute ● Open source ultimately lives or dies with community participation ● If there's a missing feature then suggest it ● Or better still contribute it yourself!
  • 28. C O M P U T E | S T O R E | A N A L Y Z E Questions? Personal Email: rvesse@gmail.com Apache Jena User List: users@jena.apache.org These slides will be posted to my SlideShare: http://www.slideshare.net/RobVesse
  • 29. C O M P U T E | S T O R E | A N A L Y Z E Apache Jena Elephas - Node Count Demo
  • 30. C O M P U T E | S T O R E | A N A L Y Z E Environment Pre-requisites ● Hadoop 2.x cluster ● Assumes hadoop command is on your PATH ● Download the latest JAR file ● Or build youself from source ● jena-hadoop-rdf-stats-VERSION-hadoop-job.jar ● Upload some RDF data to a HDFS folder
  • 31. C O M P U T E | S T O R E | A N A L Y Z E Run the Demo ● --node-count requests the Node Count statistics be calculated ● Assumes mixed quads and triples input if no --input-type specified ● Using this for triples only data can skew statistics ● e.g. can result in high node counts for default graph node ● Hence we explicitly specify input as triples > hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output -- input-type triples /user/input
  • 37. C O M P U T E | S T O R E | A N A L Y Z E Intel Graph Builder - RDF Generation Demo
  • 38. C O M P U T E | S T O R E | A N A L Y Z E Environment Pre-requisites ● Pig 0.12 ● Should work with higher but not tested ● Assumes pig command is on your PATH ● Clone the Cray version of the Graph Builder code ● https://github.com/cray/graphbuilder
  • 39. C O M P U T E | S T O R E | A N A L Y Z E Run the Demo ● Running Pig in local mode for simplicity ● Output goes to /tmp/rdf_triples/ > pig -x local examples/property_graphs_and_rdf.pig > cat /tmp/rdf_triples/part-m-00000