ONTOLOGY2 PRODUCT LINE
SUMMER 2014
PAUL A HOULE
CONFIDENTIAL EARLY DRAFT
PRODUCT AND SERVICE CATEGORIES
• Software
• Data Products
• Services
OPEN-SOURCE SOFTWARE
Centipede
Infovore
Telepath
RDFeasy
birthday_machine
CENTIPEDE:
EASY COMMAND LINE APPLICATIONS IN JAVA
INFOVORE
LARGE-SCALE RDF PROCESSING IN THE CLOUD
HARUHI
CLUSTER CONTROLLER
EXECUTES HADOOP JOBS IN AWS OR LOCAL CLUSTER
BAKEMONO
MULTIPLE MAP/REDUCE APPLICATIONS PACKAGED IN SUPER JAR
AMAZON ELASTIC MAP/REDUCE
Amazon S3 (Permanent Storage)
freebaseRDFPrefilter removes…
Wasteful Facts
• 120M+ copies of the “a” predicate
• 60M+ access control statements
Violent and Dangerous facts
ns:common.topic ns:type.type.instance ?o .
Is repeated 30M times, and if you group on ?s and keep
them in memory…
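A minimal sketch of the streaming idea behind this kind of prefilter, not Infovore's actual code. It assumes tab-separated subject/predicate/object lines, and the entry in REJECTED_PREDICATES is a made-up placeholder, since the real reject list is not given here. Each line is judged on its own, so nothing is grouped on ?s or held in memory.

```python
from typing import Iterable, Iterator

# Hypothetical reject list; the predicates the real prefilter drops are not listed here.
REJECTED_PREDICATES = {"ns:example.access_control"}

# The heavily repeated pattern named above, as (subject, predicate).
REPEATED_PATTERN = ("ns:common.topic", "ns:type.type.instance")


def keep(line: str) -> bool:
    """Return True if a tab-separated triple line should pass the filter."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        # Malformed lines pass through; validation is treated as a separate step.
        return True
    subject, predicate = parts[0], parts[1]
    if predicate in REJECTED_PREDICATES:
        return False
    if (subject, predicate) == REPEATED_PATTERN:
        return False
    return True


def prefilter(lines: Iterable[str]) -> Iterator[str]:
    """Stream lines through the filter one at a time, in constant memory."""
    return (line for line in lines if keep(line))
```

Because `prefilter` is a generator, it can wrap an open file or a mapper's input stream without loading the data set.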
PARALLEL SUPER EYEBALL III
[Diagram: “triples” in; valid triples and junk out]
Currently, 250,000 or so triples in Freebase are rejected by PSE3
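The valid/junk split can be sketched as below. This assumes N-Triples input and uses a deliberately simplified regular expression as a stand-in for a real RDF parser; the names are illustrative and this is not the PSE3 implementation. Blank nodes and some escape rules are not handled.

```python
import re
from typing import Iterable, List, Tuple

# Simplified term patterns (an assumption, looser than the N-Triples grammar).
IRI = r'<[^<>"\s]+>'
LITERAL = r'"(?:[^"\\\n]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^' + IRI + r')?'
TRIPLE = re.compile(
    r'^(' + IRI + r')\s+(' + IRI + r')\s+(' + IRI + '|' + LITERAL + r')\s*\.\s*$'
)


def split_valid_junk(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition lines into (valid triples, junk) using the simplified pattern."""
    valid: List[str] = []
    junk: List[str] = []
    for line in lines:
        (valid if TRIPLE.match(line) else junk).append(line)
    return valid, junk
```

Keeping the rejected lines in their own output, rather than discarding them, makes it possible to count and inspect what was thrown away.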
TELEPATH
SCALABLE PROCESSING OF WIKIMEDIA LOGS
EXTENDS INFOVORE
PROCESSES 4TB OF WEB LOGS
DATA PRODUCTS
:BaseKB Family
:SubjectiveEye Family
:BASEKB
FREEBASE IN STANDARD COMPATIBLE RDF
[Diagram: Freebase RDF, processed by Infovore Software, yields :BaseKB]
:BASEKB RELEASE SCHEDULE
2014-02-24
2014-03-02
2014-03-09
:BASEKB GOLD (PERMANENT)
:BASEKB NOW – UPDATED WEEKLY
SIEVE3
literal facts (e.g. ?s ?p 55 .)
?s :a ?p .
?s ?p ns:some_topic .
?s rdfs:label ?o .
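One way to picture a sieve over patterns like these is an ordered rule list in which the first matching rule decides where a triple goes. The sketch below is an assumption about the general technique, not Sieve3's code: the rule order, the output names, and the simplified literal test are all chosen for illustration.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def is_literal(term: str) -> bool:
    """Treat quoted strings and bare numbers as literals (simplified)."""
    if term.startswith('"'):
        return True
    try:
        float(term)
        return True
    except ValueError:
        return False


# Ordered rules; specific predicates come before the general patterns (assumed order).
RULES: List[Tuple[str, Callable[[str, str, str], bool]]] = [
    ("label", lambda s, p, o: p == "rdfs:label"),
    ("a", lambda s, p, o: p in ("a", "rdf:type")),
    ("literal", lambda s, p, o: is_literal(o)),
    ("links", lambda s, p, o: o.startswith("ns:")),
]


def route(triple: Triple) -> str:
    """Return the name of the first rule that matches, or "other"."""
    for name, matches in RULES:
        if matches(*triple):
            return name
    return "other"
```

Order matters here: a label is also a literal fact, so the label rule has to run before the general literal rule for labels to land in their own output.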
HORIZONTAL DIVISIONS OF FREEBASE
PERCENTAGE OF COMPRESSED FILE SIZE
a              5%
description   18%
key           11%
keyNs         13%
label          6%
name           6%
notability     0%
nfp            0%
text           8%
web            6%
links         20%
other          7%
:BASEKB AVAILABILITY
BASEKB NOW
Created weekly, published in a requester-pays bucket in AWS S3
BASEKB GOLD
Published quarterly or so
Free download with BitTorrent
Available pre-loaded into a triple store with RDFeasy
:SUBJECTIVEEYE
RAW DATA
pagecounts-20140101-000000.gz
6,460,092 records per hour
65,743+ hours of data
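The raw files follow Wikimedia's hourly pagecounts layout: the hour is encoded in the file name, and each record has four space-separated fields (project code, page title, request count, bytes returned). A small parsing sketch, with illustrative function and field names:

```python
import re
from datetime import datetime
from typing import NamedTuple


class PageCount(NamedTuple):
    project: str      # project code, e.g. "en"
    title: str        # page title as it appears in the log
    requests: int     # number of requests in that hour
    bytes_sent: int   # total bytes returned


FILENAME = re.compile(r"^pagecounts-(\d{8})-(\d{6})\.gz$")


def parse_filename(name: str) -> datetime:
    """Extract the hour covered by a pagecounts file from its name."""
    m = FILENAME.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")


def parse_record(line: str) -> PageCount:
    """Parse one space-separated record; raises ValueError on malformed lines."""
    project, title, requests, bytes_sent = line.rstrip("\n").split(" ")
    return PageCount(project, title, int(requests), int(bytes_sent))
```

At this volume a production job would likely catch the `ValueError` and count bad records instead of stopping.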
:SUBJECTIVEEYE
PAGECOUNT DATA MIRROR
AMAZON S3
4 TB – one month to transfer to S3
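For scale, the sustained transfer rate implied by these figures can be worked out directly. The sketch assumes decimal terabytes (10^12 bytes) and a 30-day month, neither of which is stated above.

```python
def sustained_rate_mbit_per_s(total_bytes: float, days: float) -> float:
    """Average rate in megabits per second needed to move total_bytes in the given days."""
    seconds = days * 24 * 60 * 60
    return total_bytes * 8 / seconds / 1e6


rate = sustained_rate_mbit_per_s(4e12, 30)
print(f"{rate:.1f} Mbit/s")  # prints "12.3 Mbit/s"
```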
:SUBJECTIVEEYE
PRODUCTION PROCESS
:SUBJECTIVEEYE 3D AND 4D
3D: Time-Averaged
4D: Time-Dependent
:SUBJECTIVEEYE3D
RDFEASY
WHAT’S IN THE BOX
Hardware
Amazon Web Services R3 series
Intel Xeon E5-2670 v2
Hardware Virtualization
15-244 GB RAM
32-640 GB direct-attach SSD
Software
Ubuntu Linux
OpenLink Virtuoso Open Source Edition
RDFeasy scripts
Data
Baked into an AMI that satisfies requirements for AWS marketplace
RDFEASY ZERO
EMPTY DATABASE
one-click
10 minutes
$0.45/hr
SPARQL 1.1 Triple Store
High-performance Software and Hardware
Runbook Documentation
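A SPARQL 1.1 store can be queried over HTTP by passing the query text in a `query` parameter. The sketch below only builds the request. The endpoint URL is an assumption (Virtuoso's usual default of port 8890 and path /sparql); the address an RDFeasy instance actually exposes may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Assumed endpoint; replace with the address of the running instance.
ENDPOINT = "http://localhost:8890/sparql"

request = build_query_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
# To execute against a live store: urllib.request.urlopen(request)
```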
RDFEASY
LOADING DATA AND SNAPSHOT
[Diagram: empty database + RDF files → full database → Amazon Machine Image]
Loads 1.3 billion triples (20 GB of compressed data) in under 4 hours, at a cost under $5
RDFEASY PRODUCTS
ZERO
empty database
BASEKB GOLD COMPACT
770M facts from Freebase (minus repetitive facts and full-text descriptions)
BASEKB GOLD COMPLETE
1.3G facts from Freebase
DBPEDIA EXPERIENCE
400M facts from DBpedia 3.9
ADDITIONAL DATA SETS
OOKABOO RDF DATA DUMP
DBPEDIA PAGE RANK SCORES FOR :BASEKB
IMAGE COLLECTION WEB SITES
animalphotos.info
carpictures.cc
ny-pictures.com
ookaboo.com
Thanks: Javier Lastras, Eric Castro, Zero One, Heurig

Ontology2 Platform Evolution