C O M P U T E | S T O R E | A N A L Y Z E
Apache Jena Elephas
and Friends
RDF and the Hadoop Ecosystem
Rob Vesse
Twitter: @RobVesse
Email: rvesse@gmail.com
About Me
● Software Engineer at Cray Inc
● Working on:
● RDF and SPARQL
● Big Data Analytics
● Active open source contributor
● Apache Jena
● dotNetRDF
● Minor contributions to other Apache projects
● Assorted other bits and pieces on my GitHub and BitBucket
● Primarily interested in the intersection of the RDF/SPARQL world
with the rest of the Big Data world
Talk Overview
● What's missing in the Hadoop ecosystem?
● What's already available?
● Apache Jena Elephas
● Intel Graph Builder
● Other interesting projects
● Getting Involved
● Questions
What's missing in the Hadoop
ecosystem?
Apache, the projects and their logo shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
Where's RDF?
● No first class projects
● Some very limited support in other projects
● Giraph can support RDF by bridging through the TinkerPop 2
stack
● Few existing projects
● Mostly academic proofs of concept (POC)
● Some open source efforts but often task specific
● e.g. Infovore targeted at creating curated Freebase and DBpedia
datasets
What's needed for RDF?
● Minimum Viable Product
● Standard Writable implementations for primitives
● Input and Output support
● Would be nice to have:
● Tools for translating data to and from RDF
● Integration with the common analytic frameworks
● e.g. Spark, Giraph, Hive, Pig
What's already available?
Apache Jena Elephas - Background
● Started as a POC at Cray
● Donated to the Apache Jena
project 1st April 2014
● JENA-666
● Originally known as Hadoop
RDF Tools
● Renamed to Elephas in
December 2014
● Name was suggested by
Claude Warren
Apache Jena Elephas - What is it?
● Set of modules that are part of the Apache Jena project
● Currently only developer SNAPSHOT builds available
● Will be included as part of upcoming Jena 2.13.0 release
● Aims to fulfill all the basic requirements for enabling RDF
on Hadoop
● Built against Hadoop 2.x APIs
Apache Jena Elephas - How do I use it?
● Read the documentation
● http://jena.apache.org/documentation/hadoop/
● Add appropriate Maven dependencies to your code
● http://jena.apache.org/documentation/hadoop/artifacts.html
● Will also need to declare relevant Hadoop dependencies as
"provided"
● Use the APIs as-is for basic tasks or use as starting point
for more complex applications
Apache Jena Elephas - Common API
● Provides Writable types for the RDF primitives
● NodeWritable
● TripleWritable
● QuadWritable
● NodeTupleWritable
● An arbitrarily sized tuple of RDF terms
● Backed by RDF Thrift
● A compact binary serialization for RDF using Apache Thrift
● See http://afs.github.io/rdf-thrift/
● Extremely efficient to serialize and de-serialize
● Allows for efficient WritableComparator implementations that
perform comparisons directly on the binary forms
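To illustrate the last point, the sketch below orders serialized terms byte by byte without ever rebuilding the terms themselves. It is not Elephas code: the class name RawTermComparator is hypothetical, and UTF-8 bytes stand in for the RDF Thrift encoding purely to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration, not part of Elephas.
final class RawTermComparator implements Comparator<byte[]> {

    // Stand-in serializer (assumption): the real Writable types use RDF Thrift.
    static byte[] serialize(String term) {
        return term.getBytes(StandardCharsets.UTF_8);
    }

    // Compares two serialized terms lexicographically as unsigned bytes,
    // without deserializing either of them.
    @Override
    public int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        byte[][] terms = {
            serialize("<http://example.org/b>"),
            serialize("<http://example.org/a>"),
            serialize("\"literal\"")
        };
        // Sorting only ever touches the raw bytes.
        Arrays.sort(terms, new RawTermComparator());
        for (byte[] t : terms) {
            System.out.println(new String(t, StandardCharsets.UTF_8));
        }
    }
}
```

The saving comes from skipping object creation during the sort and shuffle phase, where a comparator may be called many times per record.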
Apache Jena Elephas - IO API
● Provides Hadoop InputFormat and OutputFormat
implementations for RDF
● Covers all RDF serializations Jena supports
● Easily extended with custom formats
● Splits and parallelizes processing of input where the RDF
serialization allows it
● Blank Nodes can be awkward
● Transparently handles compressed IO
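The point about splitting where the serialization allows it can be pictured for a line-based format such as NTriples, where every line is a complete statement. The sketch below is a simplified illustration and not the Elephas implementation: the class LineSplitSketch and its method are hypothetical, and the rule used (a split owns every line whose first byte falls inside its byte range) is an assumption chosen for clarity.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of line-aligned input splits; not Elephas code.
final class LineSplitSketch {

    // Returns the lines whose first byte falls in the byte range [start, end).
    static List<String> linesForSplit(byte[] data, int start, int end) {
        int pos = 0;
        if (start > 0) {
            // Skip forward to the first line that begins at or after 'start'.
            int i = start - 1;
            while (i < data.length && data[i] != '\n') {
                i++;
            }
            pos = Math.min(i + 1, data.length);
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            // Read to the end of the line, even if that runs past 'end'.
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            lines.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String doc = "<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n"
                   + "<http://example.org/b> <http://example.org/p> <http://example.org/c> .\n";
        byte[] data = doc.getBytes(StandardCharsets.UTF_8);
        int mid = data.length / 3; // deliberately falls in the middle of a line
        System.out.println(linesForSplit(data, 0, mid));
        System.out.println(linesForSplit(data, mid, data.length));
    }
}
```

Because each line is assigned to exactly one split, independent tasks can parse their ranges in parallel. Formats whose statements depend on earlier context, such as prefix declarations, cannot be cut up this simply.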
Apache Jena Elephas - Blank Nodes
● Blank Nodes can be
problematic
● Need to consistently assign
IDs in parallel
● However you will typically
produce multiple
intermediate output files in
multi-job workflows
● Thus need to allow for
document versus globally
scoped IDs
● Configuration setting
controls this
● See documentation for
more information
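One way to picture the two scopes is to derive IDs by hashing. The sketch below is an assumption made for illustration, not the mechanism Elephas actually uses: the class BlankNodeIdSketch, its method names, and the job seed and file path inputs are all hypothetical. See the documentation for the real configuration setting.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical illustration of blank node ID scoping; not Elephas code.
final class BlankNodeIdSketch {

    // Document scope: the same label in the same file always maps to the same ID,
    // whichever parallel task reads it, but the ID differs between files.
    static String documentScoped(String jobSeed, String filePath, String label) {
        String key = jobSeed + "|" + filePath + "|" + label;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Global scope: the label alone identifies the node, so an ID written to one
    // intermediate file still matches when a later job reads it from another file.
    static String globallyScoped(String label) {
        return label;
    }

    public static void main(String[] args) {
        // Two tasks reading different parts of the same file agree on the ID.
        System.out.println(documentScoped("job-1", "/inputs/a.nt", "b0"));
        System.out.println(documentScoped("job-1", "/inputs/a.nt", "b0"));
        // The same label in a different file is a different node.
        System.out.println(documentScoped("job-1", "/inputs/b.nt", "b0"));
        System.out.println(globallyScoped("b0"));
    }
}
```

Document scope matches normal RDF semantics for independent input files. Global scope suits intermediate outputs, where one logical graph has been spread over many part files.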
Apache Jena Elephas - Map/Reduce API
● Various reusable basic Mapper and Reducer
implementations
● Covers common tasks:
● Counting
● Filtering
● Grouping
● Splitting
● Transformation
● Mostly intended for use as a starting point
● Some of these are bundled into an RDF stats demo
application
Apache Jena Elephas - Example Job
● Node Count (aka word count for RDF)
● All the classes referenced (bar Example.class) are provided by Elephas
Job job = Job.getInstance(config);
job.setJarByClass(Example.class);
job.setJobName("RDF Triples Node Usage Count");
// Map/Reduce classes
job.setMapperClass(TripleNodeCountMapper.class);
job.setMapOutputKeyClass(NodeWritable.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(NodeCountReducer.class);
// Input and Output
job.setInputFormatClass(NTriplesInputFormat.class);
job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path("/inputs/rdf"));
FileOutputFormat.setOutputPath(job, new Path("/outputs/rdf"));
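For readers without a cluster to hand, the plain-Java sketch below mimics what the job above computes: each triple contributes one occurrence of its subject, predicate and object, and the occurrences are summed per node. It is not the Elephas implementation. The class NodeCountSketch is hypothetical, and its parser is deliberately naive, assuming single spaces between terms and no whitespace inside the subject or predicate.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the node count job; not Elephas code.
final class NodeCountSketch {

    // "Map" step: split one simple NTriples line into subject, predicate, object.
    // Naive on purpose: expects single spaces between terms and a trailing ".".
    static String[] nodes(String line) {
        String body = line.trim();
        body = body.substring(0, body.length() - 1).trim(); // drop the final "."
        return body.split(" ", 3); // limit 3 keeps literals with spaces intact
    }

    // "Reduce" step: sum the occurrences of every node.
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String node : nodes(line)) {
                counts.merge(node, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .",
            "<http://example.org/b> <http://xmlns.com/foaf/0.1/name> \"Bob Smith\" .");
        count(lines).forEach((node, n) -> System.out.println(node + "\t" + n));
    }
}
```

In the real job the map and reduce steps run on different machines and the counts are keyed by NodeWritable rather than String.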
Apache Jena Elephas - Node Count
Demo
See end of slide deck for steps to run the demo
and screenshots
Apache Jena Elephas - Performance Notes
● For NTriples inputs we compared performance of a Text
based node count versus RDF based node count
● Performance typically as good (within 10%) and
sometimes significantly better
● Heavily dataset dependent
● Varies considerably with cluster setup
● Also depends on how the input is processed
● Be aware YMMV!
Intel Graph Builder - What is it?
● Tools for transforming/creating large graphs
● Developed by Intel
● Cray has some proposed improvements that are awaiting
merging at time of writing
● Open source under Apache License
● https://github.com/01org/graphbuilder/tree/2.0.alpha
● 2.0.alpha is the preferred branch
● See https://github.com/cray/graphbuilder for the version
discussed here
● Allows graphs to be created/transformed from arbitrary
data sources using Apache Pig
Intel Graph Builder - How do I use it?
● REGISTER the Graph Builder JAR in your Pig script
● May optionally want to IMPORT the pig/graphbuilder.pig
script which aliases some of the provided UDFs
● LOAD your data
● Use the provided UDFs to generate a graph
● Can create both property graphs and RDF
● Currently data must be mapped to a property graph and then
into RDF
● STORE the resulting graph
Intel Graph Builder - How does it work?
● Uses a declarative mapping based on Pig primitives
● Has to be explicitly joined to the data
● Limitation of Pig UDFs
● RDF mappings operate on property graphs
● Must map data to a property graph first
● Direct mapping to RDF is a possible future enhancement
Intel Graph Builder - Pig Script Example
https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
-- Rest of script omitted for brevity
-- Declare our mappings
propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*,
[ 'idBase' # 'http://example.org/instances/',
'base' # 'http://example.org/ontology/',
'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
'propertyMap' # [ 'type' # 'a',
'name' # 'foaf:name',
'age' # 'foaf:age' ],
'uriProperties' # ( 'type' ),
'idProperty' # 'id' ]);
-- Convert to NTriples
rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
-- Write out NTriples
STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
Intel Graph Builder - RDF
Generation Demo
See end of slide deck for steps to run the demo
and screenshots
Other Projects - Infovore
● Framework developed by Paul Houle
● Open source on GitHub
● https://github.com/paulhoule/infovore/wiki
● Apache License 2.0
● Produces a cleaned and curated Freebase dataset using
Hadoop for the processing
● Designed to be easily self-deployed on Amazon EC2
● Also some related projects for working with Wikipedia
● https://github.com/paulhoule/telepath
● Currently unclear what direction these projects will take
after the Freebase shutdown at end of March this year
Other Projects - CumulusRDF
● Academic project from Institute of Applied Informatics
and Formal Description Methods
● https://code.google.com/p/cumulusrdf/
● RDF store backed by Apache Cassandra
● Reasonable performance compared to native RDF stores
● See NoSQL Databases for RDF: An Empirical Evaluation
● Philippe Cudré-Mauroux et al
● http://exascale.info/sites/default/files/nosqlrdf.pdf
● Reasonably active development
Getting Involved
How to contribute
● Please download and try out these projects
● Interact with the communities and developers involved
● What works?
● What is broken?
● What is missing?
● How could the documentation be better?
● Contribute
● Open source ultimately lives or dies with community
participation
● If there's a missing feature then suggest it
● Or better still contribute it yourself!
Questions?
Personal Email: rvesse@gmail.com
Apache Jena User List: users@jena.apache.org
These slides will be posted to my SlideShare:
http://www.slideshare.net/RobVesse
Apache Jena Elephas - Node Count
Demo
Environment Pre-requisites
● Hadoop 2.x cluster
● Assumes hadoop command is on your PATH
● Download the latest JAR file
● Or build it yourself from source
● jena-hadoop-rdf-stats-VERSION-hadoop-job.jar
● Upload some RDF data to an HDFS folder
Run the Demo
● --node-count requests the Node Count statistics be calculated
● Assumes mixed quads and triples input if no --input-type specified
● Using this for triples only data can skew statistics
● e.g. can result in high node counts for default graph node
● Hence we explicitly specify input as triples
> hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar \
    org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output \
    --input-type triples /user/input
Intel Graph Builder - RDF
Generation Demo
Environment Pre-requisites
● Pig 0.12
● Should work with later versions, but this has not been tested
● Assumes pig command is on your PATH
● Clone the Cray version of the Graph Builder code
● https://github.com/cray/graphbuilder
Run the Demo
● Running Pig in local mode for simplicity
● Output goes to /tmp/rdf_triples/
> pig -x local examples/property_graphs_and_rdf.pig
> cat /tmp/rdf_triples/part-m-00000
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 

Apache Jena Elephas and Friends

  • 1. C O M P U T E | S T O R E | A N A L Y Z E
      Apache Jena Elephas and Friends
      RDF and the Hadoop Ecosystem
      Rob Vesse
      Twitter: @RobVesse
      Email: rvesse@gmail.com
  • 2. About Me
      ● Software Engineer at Cray Inc
      ● Working on:
      ● RDF and SPARQL
      ● Big Data Analytics
      ● Active open source contributor
      ● Apache Jena
      ● dotNetRDF
      ● Minor contributions to other Apache projects
      ● Assorted other bits and pieces on my GitHub and BitBucket
      ● Primarily interested in intersection of RDF/SPARQL world with rest of Big Data world
  • 3. Talk Overview
      ● What's missing in the Hadoop ecosystem?
      ● What's already available?
      ● Apache Jena Elephas
      ● Intel Graph Builder
      ● Other interesting projects
      ● Getting Involved
      ● Questions
  • 4. What's missing in the Hadoop ecosystem?
  • 5. Apache, the projects and their logo shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
  • 6. Where's RDF?
      ● No first class projects
      ● Some very limited support in other projects
      ● Giraph can support RDF by bridging through the Tinkerpop 2 stack
      ● Few existing projects
      ● Mostly academic proofs of concept (POC)
      ● Some open source efforts but often task specific
      ● e.g. Infovore targeted at creating curated Freebase and DBPedia datasets
  • 7. What's needed for RDF?
      ● Minimum Viable Product
      ● Standard Writable implementations for primitives
      ● Input and Output support
      ● Would be nice to have:
      ● Tools for translating data to and from RDF
      ● Integration with the common analytic frameworks
      ● e.g. Spark, Giraph, Hive, Pig
  • 8. What's already available?
  • 9. Apache Jena Elephas - Background
      ● Started as a POC at Cray
      ● Donated to the Apache Jena project 1st April 2014
      ● JENA-666
      ● Originally known as Hadoop RDF Tools
      ● Renamed to Elephas in December 2014
      ● Name was suggested by Claude Warren
  • 10. Apache Jena Elephas - What is it?
      ● Set of modules that are part of the Apache Jena project
      ● Currently only developer SNAPSHOT builds available
      ● Will be included as part of upcoming Jena 2.13.0 release
      ● Aims to fulfill all the basic requirements for enabling RDF on Hadoop
      ● Built against Hadoop 2.x APIs
  • 11. Apache Jena Elephas - How do I use it?
      ● Read the documentation
      ● http://jena.apache.org/documentation/hadoop/
      ● Add appropriate Maven dependencies to your code
      ● http://jena.apache.org/documentation/hadoop/artifacts.html
      ● Will also need to declare relevant Hadoop dependencies as "provided"
      ● Use the APIs as-is for basic tasks or use as a starting point for more complex applications
  • 12. Apache Jena Elephas - Common API
      ● Provides Writable types for the RDF primitives
      ● NodeWritable
      ● TripleWritable
      ● QuadWritable
      ● NodeTupleWritable
      ● Arbitrarily sized tuples of RDF terms
      ● Backed by RDF Thrift
      ● A compact binary serialization for RDF using Apache Thrift
      ● See http://afs.github.io/rdf-thrift/
      ● Extremely efficient to serialize and de-serialize
      ● Allows for efficient WritableComparator implementations that perform comparisons directly on the binary forms
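The last point on the Common API slide, comparing terms directly on their binary forms, can be illustrated with a minimal sketch. This is not the Elephas comparator: `RawTermComparator` and its `encode` helper are hypothetical names, and plain UTF-8 bytes stand in for the RDF Thrift encoding. The idea shown is only that a sort order can be computed from the serialized bytes without deserializing either term.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Elephas: orders two serialized terms
// byte by byte, without turning them back into objects first.
final class RawTermComparator {

    // Lexicographic comparison of the raw bytes, treating each byte as unsigned.
    static int compare(byte[] a, byte[] b) {
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        // One array is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length, b.length);
    }

    // Assumption: UTF-8 text stands in for a real binary encoding such as RDF Thrift.
    static byte[] encode(String term) {
        return term.getBytes(StandardCharsets.UTF_8);
    }
}
```

The resulting order is a byte order rather than any RDF-level ordering, which is sufficient when the only requirement is that equal terms sort together consistently.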
  • 13. Apache Jena Elephas - IO API
      ● Provides Hadoop InputFormat and OutputFormat implementations for RDF
      ● Covers all RDF serializations Jena supports
      ● Easily extended with custom formats
      ● Splits and parallelizes processing of input where the RDF serialization allows it
      ● Blank Nodes can be awkward
      ● Transparently handles compressed IO
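To make the "where the RDF serialization allows it" point concrete: a line-based format such as NTriples can be cut at any line boundary, because every line is a complete statement. The sketch below is an illustration under that assumption only. `LineSplitSketch` is a hypothetical name, and real Hadoop input formats handle split boundaries inside their record readers rather than by precomputing offsets like this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Elephas code: choose split start offsets
// so that no line (and therefore no statement) is cut in half.
final class LineSplitSketch {

    // Returns the start offset of each split. Splits are at least targetSize
    // bytes long and are extended forward to end just after a newline.
    static List<Integer> splitStarts(byte[] data, int targetSize) {
        if (targetSize <= 0) {
            throw new IllegalArgumentException("targetSize must be positive");
        }
        List<Integer> starts = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            starts.add(pos);
            int next = pos + targetSize;
            // Move the boundary forward until the previous byte is a newline.
            while (next < data.length && data[next - 1] != '\n') {
                next++;
            }
            pos = next;
        }
        return starts;
    }
}
```

Formats whose statements span multiple lines or depend on earlier prefix declarations cannot be divided this simply, which is why splitting depends on the serialization.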
  • 14. Apache Jena Elephas - Blank Nodes
      ● Blank Nodes can be problematic
      ● Need to consistently assign IDs in parallel
      ● However you will typically produce multiple intermediate output files in multi-job workflows
      ● Thus need to allow for document versus globally scoped IDs
      ● Configuration setting controls this
      ● See documentation for more information
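The difference between document scoped and globally scoped blank node IDs can be sketched as follows. This is an assumption-laden illustration, not how Elephas implements it: `BlankNodeIdSketch` and both method names are hypothetical, and deriving IDs from a name-based UUID is simply one way to let parallel tasks agree on an ID without coordinating.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical illustration, not Elephas code: derive blank node IDs
// deterministically so that independent tasks produce the same ID.
final class BlankNodeIdSketch {

    // Document scope: the same label in two different files gives two different IDs.
    static String documentScoped(String filePath, String label) {
        return derive(filePath + "\u0000" + label);
    }

    // Global scope: the label alone decides the ID, so a node written across
    // several intermediate files keeps its identity when read back.
    static String globallyScoped(String label) {
        return derive(label);
    }

    private static String derive(String seed) {
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

Document scope suits original input files, where labels are only meaningful within one file. Global scope suits intermediate outputs of a multi-job workflow, where one blank node may appear in several part files.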
  • 15. Apache Jena Elephas - Map/Reduce API
      ● Various reusable basic Mapper and Reducer implementations
      ● Covers common tasks:
      ● Counting
      ● Filtering
      ● Grouping
      ● Splitting
      ● Transformation
      ● Mostly intended for use as a starting point
      ● Some of these are bundled into a RDF stats demo application
  • 16. Apache Jena Elephas - Example Job
      ● Node Count (aka word count for RDF)
      ● All the classes referenced (bar Example.class) are provided by Elephas
      Job job = Job.getInstance(config);
      job.setJarByClass(Example.class);
      job.setJobName("RDF Triples Node Usage Count");
      // Map/Reduce classes
      job.setMapperClass(TripleNodeCountMapper.class);
      job.setMapOutputKeyClass(NodeWritable.class);
      job.setMapOutputValueClass(LongWritable.class);
      job.setReducerClass(NodeCountReducer.class);
      // Input and Output
      job.setInputFormatClass(NTriplesInputFormat.class);
      job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
      FileInputFormat.setInputPaths(job, new Path("/inputs/rdf"));
      FileOutputFormat.setOutputPath(job, new Path("/outputs/rdf"));
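For readers without a cluster to hand, the computation performed by the node count job can be sketched in plain Java. This is only an in-memory illustration of the logic, with no Hadoop or Elephas classes involved. `NodeCountSketch` is a hypothetical name, and each triple is assumed to be already parsed into a three-element array of term strings.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory equivalent of the node count job.
final class NodeCountSketch {

    // Each triple is a String[3] holding subject, predicate and object.
    static Map<String, Long> countNodes(List<String[]> triples) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] triple : triples) {
            for (String node : triple) {
                // The "map" step would emit (node, 1); the "reduce" step sums per node.
                counts.merge(node, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

In the real job the summing happens in the reducer after Hadoop has grouped the emitted pairs by node, which is what allows the work to spread across machines.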
  • 17. Apache Jena Elephas - Node Count Demo
      See end of slide deck for steps to run the demo and screenshots
  • 18. Apache Jena Elephas - Performance Notes
      ● For NTriples inputs we compared performance of a Text based node count versus RDF based node count
      ● Performance typically as good (within 10%) and sometimes significantly better
      ● Heavily dataset dependent
      ● Varies considerably with cluster setup
      ● Also depends on how the input is processed
      ● Be aware YMMV!
  • 19. Intel Graph Builder - What is it?
      ● Tools for transforming/creating large graphs
      ● Developed by Intel
      ● Cray has some proposed improvements that are awaiting merging at time of writing
      ● Open source under Apache License
      ● https://github.com/01org/graphbuilder/tree/2.0.alpha
      ● 2.0.alpha is the preferred branch
      ● See https://github.com/cray/graphbuilder for the version discussed here
      ● Allows graphs to be created/transformed from arbitrary data sources using Apache Pig
  • 20. Intel Graph Builder - How do I use it?
      ● REGISTER the Graph Builder JAR in your Pig script
      ● May optionally want to IMPORT the pig/graphbuilder.pig script which aliases some of the provided UDFs
      ● LOAD your data
      ● Use the provided UDFs to generate a graph
      ● Can create both property graphs and RDF
      ● Currently data must be mapped to a property graph and then into RDF
      ● STORE the resulting graph
  • 21. Intel Graph Builder - How it works?
      ● Uses a declarative mapping based on Pig primitives
      ● Has to be explicitly joined to the data
      ● Limitation of Pig UDFs
      ● RDF mappings operate on property graphs
      ● Must map data to a property graph first
      ● Direct mapping to RDF is a possible future enhancement
  • 22. Intel Graph Builder - Pig Script Example
      https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
      -- Rest of script omitted for brevity
      -- Declare our mappings
      propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*, [
          'idBase' # 'http://example.org/instances/',
          'base' # 'http://example.org/ontology/',
          'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
          'propertyMap' # [ 'type' # 'a', 'name' # 'foaf:name', 'age' # 'foaf:age' ],
          'uriProperties' # ( 'type' ),
          'idProperty' # 'id' ]);
      -- Convert to NTriples
      rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
      -- Write out NTriples
      STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
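One plausible reading of the mapping keys in the Pig script is sketched below in plain Java. Everything here is an assumption made for illustration: `PropertyGraphToRdfSketch` is a hypothetical name, the interpretation of `idBase`, `base`, `propertyMap`, `uriProperties` and `idProperty` is inferred from their names alone, and Graph Builder's actual behaviour may differ. Prefix expansion (for example `foaf:name` or `a`) and literal escaping are left out, so the property map here takes full predicate IRIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of applying a declarative mapping to one vertex.
// Not Graph Builder code; the key semantics are assumed, not confirmed.
final class PropertyGraphToRdfSketch {

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    static List<String> toNTriples(Map<String, String> vertex, String idBase, String base,
            String idProperty, Map<String, String> propertyMap, Set<String> uriProperties) {
        List<String> lines = new ArrayList<>();
        // Assumed: the subject IRI is idBase followed by the value of the ID property.
        String subject = "<" + idBase + vertex.get(idProperty) + ">";
        for (Map.Entry<String, String> property : vertex.entrySet()) {
            String predicate = propertyMap.get(property.getKey());
            if (predicate == null) {
                continue; // unmapped properties, including the ID itself, produce no triple
            }
            // Assumed: URI properties resolve against base, everything else is a plain literal.
            String object = uriProperties.contains(property.getKey())
                    ? "<" + base + property.getValue() + ">"
                    : "\"" + property.getValue() + "\"";
            lines.add(subject + " <" + predicate + "> " + object + " .");
        }
        return lines;
    }
}
```

Under these assumptions a vertex with `id` 1, `type` Person and `name` Alice would yield one `rdf:type` triple pointing at an ontology IRI and one `foaf:name` triple with a literal value.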
  • 23. Intel Graph Builder - RDF Generation Demo
      See end of slide deck for steps to run the demo and screenshots
  • 24. Other Projects - Infovore
      ● Framework developed by Paul Houle
      ● Open source on GitHub
      ● https://github.com/paulhoule/infovore/wiki
      ● Apache License 2.0
      ● Produces a cleaned and curated Freebase dataset using Hadoop for the processing
      ● Designed to be easily self-deployed on Amazon EC2
      ● Also some related projects for working with Wikipedia
      ● https://github.com/paulhoule/telepath
      ● Currently unclear what direction these projects will take after the Freebase shutdown at end of March this year
  • 25. Other Projects - CumulusRDF
      ● Academic project from Institute of Applied Informatics and Formal Description Methods
      ● https://code.google.com/p/cumulusrdf/
      ● RDF store backed by Apache Cassandra
      ● Reasonable performance compared to native RDF stores
      ● See NoSQL Databases for RDF: An Empirical Evaluation
      ● Philippe Cudré-Mauroux et al
      ● http://exascale.info/sites/default/files/nosqlrdf.pdf
      ● Reasonably active development
  • 26. Getting Involved
  • 27. How to contribute
      ● Please download and try out these projects
      ● Interact with the communities and developers involved
      ● What works?
      ● What is broken?
      ● What is missing?
      ● How could the documentation be better?
      ● Contribute
      ● Open source ultimately lives or dies with community participation
      ● If there's a missing feature then suggest it
      ● Or better still contribute it yourself!
  • 28. Questions?
      Personal Email: rvesse@gmail.com
      Apache Jena User List: users@jena.apache.org
      These slides will be posted to my SlideShare: http://www.slideshare.net/RobVesse
  • 29. Apache Jena Elephas - Node Count Demo
  • 30. Environment Pre-requisites
      ● Hadoop 2.x cluster
      ● Assumes hadoop command is on your PATH
      ● Download the latest JAR file
      ● Or build it yourself from source
      ● jena-hadoop-rdf-stats-VERSION-hadoop-job.jar
      ● Upload some RDF data to a HDFS folder
  • 31. Run the Demo
      ● --node-count requests the Node Count statistics be calculated
      ● Assumes mixed quads and triples input if no --input-type specified
      ● Using this for triples only data can skew statistics
      ● e.g. can result in high node counts for default graph node
      ● Hence we explicitly specify input as triples
      > hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output --input-type triples /user/input
  • 37. Intel Graph Builder - RDF Generation Demo
  • 38. Environment Pre-requisites
      ● Pig 0.12
      ● Should work with higher but not tested
      ● Assumes pig command is on your PATH
      ● Clone the Cray version of the Graph Builder code
      ● https://github.com/cray/graphbuilder
  • 39. Run the Demo
      ● Running Pig in local mode for simplicity
      ● Output goes to /tmp/rdf_triples/
      > pig -x local examples/property_graphs_and_rdf.pig
      > cat /tmp/rdf_triples/part-m-00000