The Virtual Repository

Fabio Simeoni (FAO)
the virtual repository
standards-based import and publication
Monday, 17 June 13

outline
• about data import and publication
• why it is a problem
• how it can be simplified
• the virtual repository
• where we are
• where we are going

context
• there is an app
- manages data of some type: adds some value to it
• there is data out there
- quite a lot: waiting to be managed
• there are places out there
- quite a few: waiting to disseminate the added value
- repositories: specialised network services
• the app wants to reach out
- with first-class import and publication facilities

what we mean by...
• data import
- pull data from some “source”
- transform it, store it, and use it for app-local purposes
- this is no real-time, fine-grained access
• data publication
- transform data for dissemination purposes
- push it to some “sink”
- this is no real-time, fine-grained update
• “coarse” I/O

scope illustrated
[diagram: the app, with its internal model and transforms, importing from and publishing to repositories (repo 1, repo 2)]

the “average joe”
• import = file upload
- users are the sources: they have the data
- just one use case: what about data in repositories?
- should users discover it and retrieve it on behalf of the app?
• publication = export to file
- users are the sinks: they use the data
- just one use case: what about other consumers?
- should users disseminate data on behalf of repositories?

“average joe” illustrated
[diagram: the app, with its internal model and transforms, exchanging files with users via upload and download]

(fancier variations)
• URI resolution
- users provide URIs, app resolves them
- a step forward, but onus of discovery remains on users
- repositories not ‘on the Web’ still out of the picture
• no publication, app disseminates
- doubles as a repository service
- two different missions/roles/competencies
- require different models, designs, technologies
- would rather integrate specialised solutions in infra

imagine this
• users browse all data ‘nearby’ the app
- metadata describes contents, provenance, size ...
• users pick what data to import
- providing directives on how the app should convert it
• users browse repositories ‘nearby’ the app
- metadata describes location, policy, formats, ...
• users pick where to publish
- providing directives on how the app should convert for it

imagine this
[mock-up: an IMPORT screen listing assets by name, version, origin, ...; the user chooses an asset and customises its import. A PUBLISH screen listing repositories by name, ...; the user chooses a repository and customises publication.]

why don’t we see it
• it’s not simple
- many sources/sinks, APIs, formats, transforms
- difficult to paper over differences for users
- difficult to handle distributed interactions properly
- overall, a non-trivial interoperability problem
• it’s not cost-effective
- it’s not the core business of the app
- core business is to manage, not I/O

wrong assumptions
• costs should fall entirely on the app
- to bridge across many formats and APIs over the network
• repositories can’t help
- yet their core business is precisely to disseminate
• tools can’t help
- yet the same problem recurs in many apps

different assumptions
• users are there to choose
- what to import, where to publish: it’s their privilege
• app is there to map
- to/from internal model: it’s its job
• repositories are there to ingest and disseminate
- should make it easy to publish and import: it’s their mission
• tools should provide the glue
- factor out common tasks in reusable solutions: it’s well in their scope

virtual repository
• a client library, a Jar
- helps the app build first-class import/publication facilities
• materialises an imaginary repository
- client API to discover, retrieve and publish data
• tailored to app
- contains/takes what app can transform (not other way around)
• seemingly local
- as if the data was right there, no ‘network-awareness’

virtual repository
• a view over real repositories
- defines the ‘data hood’ of the app
• modular
- built out of repository-specific plugins
- plugins implement SPI in their own Jars
- app cherry-picks plugins and deploys Jars
• network-aware
- e.g. parallel data discovery
- e.g. timed out retrieval and updates
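The network-aware side can be sketched as follows: discovery fans out to all plugins in parallel, and each call is bounded by a timeout. This is a minimal sketch under assumptions. `RepositoryPlugin`, `discoverAll`, and the use of plain strings as asset descriptions are hypothetical names chosen for illustration, not the library's actual SPI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class ParallelDiscoverySketch {

    // Hypothetical SPI: one implementation per real repository, deployed in its own Jar.
    interface RepositoryPlugin {
        List<String> discover() throws Exception; // a remote call in a real plugin
    }

    // Queries all plugins in parallel; plugins that fail or exceed the timeout are skipped.
    static List<String> discoverAll(List<RepositoryPlugin> plugins, long timeoutMillis) {
        List<String> found = new ArrayList<>();
        if (plugins.isEmpty()) return found;
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (RepositoryPlugin plugin : plugins) tasks.add(plugin::discover);
        ExecutorService pool = Executors.newFixedThreadPool(plugins.size());
        try {
            // invokeAll cancels whatever has not completed when the timeout expires
            List<Future<List<String>>> results =
                pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<List<String>> result : results) {
                if (result.isCancelled()) continue; // timed out
                try {
                    found.addAll(result.get());
                } catch (ExecutionException e) {
                    // this plugin failed: ignore it, keep the others
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdownNow();
        }
        return found;
    }

    public static void main(String[] args) {
        RepositoryPlugin fast = () -> Arrays.asList("codelist-a", "codelist-b");
        RepositoryPlugin slow = () -> { Thread.sleep(5_000); return Arrays.asList("late"); };
        // the slow plugin is cut off after 500 ms; only the fast plugin contributes
        System.out.println(discoverAll(Arrays.asList(fast, slow), 500));
    }
}
```

The app sees a single local call; slow or failing repositories degrade the result rather than block the screen.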

virtual repository
• defines “standard” rules of exchange
- the formats of the data types, the APIs of the formats
• app transforms standards
- no custom work, fewer transformations
• plugins take/return standards
- do the custom work, as per repository mission
• standards-based rendezvous
- app and plugins sync on data
- ignore each other otherwise: technologies in the back seat

virtual repository illustrated
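[diagram: client-side, the app discovers, imports and publishes through the virtual repo API, exchanging "standard" data; plugins implement the SPI and connect the virtual repo to the server-side repos that make up the data hood]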

a use case
• app manages code lists
- SDMX is a standard for code lists
- app implements internal ⇿ SDMX
• some repos disseminate code lists
- e.g. triple-store as SKOS, RDBMS as custom CSV
- plugins implement SKOS ⇿ SDMX, CSV ⇾ SDMX
• some flows are enabled
- DB ⇾ DB plugin ⇾ SDMX ⇾ app
- TS ⇾ TS plugin ⇾ SDMX ⇾ app
- DB ⇾ DB plugin ⇾ SDMX ⇾ app ⇾ SDMX ⇾ TS plugin ⇾ TS
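The flows above are compositions of two kinds of transform: one written by the plugin, one written by the app, meeting on the standard. A minimal sketch, assuming stand-in types: `SdmxCodes` is a hypothetical placeholder for the real SDMX API, the "code,name" CSV layout is invented for the example, and the internal model is reduced to a list of labels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class CodelistFlowSketch {

    // Stand-in for the standard SDMX API: code -> name (hypothetical, not CodelistBean)
    static final class SdmxCodes {
        final Map<String, String> codes = new LinkedHashMap<>();
    }

    // Plugin side: custom CSV rows of the form "code,name" -> SDMX
    static final Function<List<String>, SdmxCodes> CSV_TO_SDMX = rows -> {
        SdmxCodes sdmx = new SdmxCodes();
        for (String row : rows) {
            String[] cells = row.split(",", 2);
            sdmx.codes.put(cells[0].trim(), cells[1].trim());
        }
        return sdmx;
    };

    // App side: SDMX -> internal model (here simply a list of labels)
    static final Function<SdmxCodes, List<String>> SDMX_TO_INTERNAL = sdmx -> {
        List<String> internal = new ArrayList<>();
        sdmx.codes.forEach((code, name) -> internal.add(code + ": " + name));
        return internal;
    };

    // The flow DB -> DB plugin -> SDMX -> app is the composition of the two
    static final Function<List<String>, List<String>> DB_TO_APP =
        CSV_TO_SDMX.andThen(SDMX_TO_INTERNAL);

    public static void main(String[] args) {
        // prints [TUN: Tuna, COD: Cod]
        System.out.println(DB_TO_APP.apply(Arrays.asList("TUN, Tuna", "COD, Cod")));
    }
}
```

The app writes only `SDMX_TO_INTERNAL`; every repository whose plugin produces SDMX becomes reachable without further app-side work.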

what we expect
• for apps
- one or two transforms reach the ‘data hood’
- no network awareness: easy coding
- no dependency on repos, including legacy ones: data before technologies
• for repositories
- an API for Java clients
- a low-cost one: plugins are easy
- no dependencies on clients: handle evolution in one place
• net gains
- max results, least effort
- loose coupling

minimal client API
• AssetType
- what can be exchanged: just a named standard
• Asset
- a description of what is exchanged: a named instance of an AssetType
- bound to RepositoryService that has it/can take it
- specialised: SdmxAsset, SdmxCodelist, CsvAsset, CsvCodelist, ...
- well-known properties induced by type, arbitrary ones specific to instance
• VirtualRepository
- what mediates the exchange of Assets
- can discover Assets given AssetTypes
- can retrieve/publish their content in one or more standard APIs
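The three concepts above might translate into interfaces along these lines. The method names are inferred from the code on the following slides; the exact signatures, the `Services` interface, and `Asset.type()` are assumptions made for this sketch, not the library's published API.

```java
import java.util.Collection;

final class ClientApiSketch {

    // what can be exchanged: just a named standard
    interface AssetType { String name(); }

    interface Property { String name(); Object value(); String description(); }

    // a real repository, reached through its plugin
    interface RepositoryService {
        String name();
        Collection<Property> properties();
        boolean publishes(AssetType type);
    }

    // a description of what is exchanged
    interface Asset {
        String id();
        String name();
        AssetType type();             // assumed accessor, not shown on the slides
        RepositoryService service();  // the service that has it or can take it
        Collection<Property> properties();
    }

    // assumed shape of what repo.services() returns
    interface Services { RepositoryService lookup(String name); }

    // mediates the exchange; iterating yields the assets discovered so far
    interface VirtualRepository extends Iterable<Asset> {
        int discover(AssetType... types);
        Asset lookup(String id);
        <A> A retrieve(Asset asset, Class<A> api);
        void publish(Asset asset, Object content);
        Collection<RepositoryService> sinks(AssetType... types);
        Services services();
    }

    public static void main(String[] args) {
        AssetType sdmx = () -> "sdmx/codelist"; // the type name is illustrative
        System.out.println(sdmx.name());
    }
}
```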

asset discovery
//somewhere in the app
VirtualRepository repo = …; //factories, injection, new()

//elsewhere: discovery is a remote operation
int discovered = repo.discover(SdmxCodelist.type, CsvCodelist.type);

//elsewhere: build discovery screen for users
for (Asset codelist : repo) {
    …
    …codelist.id()…
    …codelist.name()…
    …codelist.service().name()…
    for (Property p : codelist.properties()) {
        …p.name()…
        …p.value()…
        …p.description()…
    }
    …
}

asset retrieval
//user has chosen an asset
String codelistId = …;

//retrieve metadata previously discovered
Asset asset = repo.lookup(codelistId);

//DISCLAIMER: there are more elegant ways to dispatch!!!
if (asset instanceof SdmxCodelist) {
    //a remote operation: CodelistBean is a standard API for SDMX
    CodelistBean codelist = repo.retrieve(asset, CodelistBean.class);
    importFromSdmx(codelist); //app’s transform
}
else if (asset instanceof CsvCodelist) {
    //a remote operation: Table is a standard API for CSV
    Table table = repo.retrieve(asset, Table.class);
    importFromCsv((CsvCodelist) asset, table); //app’s transform
}
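One of the "more elegant ways to dispatch" could be a registry that maps each asset class to its importer, so that supporting a new standard means one more registration rather than one more `instanceof` branch. A minimal sketch, assuming simplified stand-ins for `Asset`, `SdmxCodelist` and `CsvCodelist`; `ImportDispatchSketch`, `register` and `importAsset` are hypothetical names, and the importers return a string where a real one would call `repo.retrieve(...)` and the app's transform.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class ImportDispatchSketch {

    // Minimal stand-ins for the asset classes on the slide (hypothetical)
    static class Asset { final String id; Asset(String id) { this.id = id; } }
    static final class SdmxCodelist extends Asset { SdmxCodelist(String id) { super(id); } }
    static final class CsvCodelist extends Asset { CsvCodelist(String id) { super(id); } }

    // One importer per asset class, registered once instead of chained instanceof tests
    private final Map<Class<? extends Asset>, Function<Asset, String>> importers =
        new LinkedHashMap<>();

    <T extends Asset> ImportDispatchSketch register(Class<T> type, Function<T, String> importer) {
        importers.put(type, asset -> importer.apply(type.cast(asset)));
        return this;
    }

    // Finds the first registered class the asset is an instance of and runs its importer
    String importAsset(Asset asset) {
        for (Map.Entry<Class<? extends Asset>, Function<Asset, String>> entry : importers.entrySet()) {
            if (entry.getKey().isInstance(asset)) return entry.getValue().apply(asset);
        }
        throw new IllegalArgumentException("no importer for " + asset.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        ImportDispatchSketch dispatcher = new ImportDispatchSketch()
            .register(SdmxCodelist.class, a -> "imported " + a.id + " via SDMX")
            .register(CsvCodelist.class, a -> "imported " + a.id + " via CSV");
        // prints: imported species via CSV
        System.out.println(dispatcher.importAsset(new CsvCodelist("species")));
    }
}
```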

asset publication (1)
//build publication screen for users
Collection<RepositoryService> sinks =
    repo.sinks(SdmxCodelist.type, CsvCodelist.type);

//retrieve metadata previously discovered
for (RepositoryService sink : sinks) {
    …sink.name()…
    for (Property p : sink.properties()) {
        …p.name()…
        …p.value()…
        …p.description()…
    }
}

//elsewhere: user has chosen an asset
String codelistId = …;
MyCodelist codelist = …codelistId…; //app retrieves it

//elsewhere: user has chosen a repository
String sinkId = …;
RepositoryService sink = repo.services().lookup(sinkId); //app retrieves it

asset publication (2)
if (sink.publishes(SdmxCodelist.type)) {
    SdmxCodelist asset = new SdmxCodelist(...sink...);
    CodelistBean sdmxStream = publishToSdmx(codelist); //app’s transform
    //publication is a remote operation
    repo.publish(asset, sdmxStream);
}
else if (sink.publishes(CsvCodelist.type)) {
    CsvCodelist asset = new CsvCodelist(...sink...);
    Table table = publishToCsv(codelist); //app’s transform
    repo.publish(asset, table);
}

where are we
• virtual-repository-1.0.0
- out end of the month, snapshots in gcube-snapshots
• virtual-sdmx-registry-1.0.0
- plugin for one or more SDMX registries
- including iMarine’s (uses CNR’s library)
• virtual-semantic-repository-1.0.0
- plugin for FAO’s triple-store of reference data
• virtual-rtms-1.0.0
- plugin for FAO’s Figis RDBMS
• quick turnaround
- one month of development, part-time (3 devs)

where are we
• the approach is viable
- Cotrix integration: expected benefits delivered at expected costs
- plugin development: expected costs, 3-4 days full-time
- but needs supervision: new standards require new releases
• we have learned a thing or two
- e.g. SDMX is self-describing and flexible, but of bounded expressiveness
- e.g. CSV is less self-describing and regular, but unbounded in principle
• we have much more to learn still
- can we stand production?
- can we move outside reference data and into ‘big data’?
- can we scale when many plugins flog the app’s classpath?
- what range of apps can we really support?

where we are going
• grow the ‘data hood’
- more standards (including non-reference data)
- more repositories (i.e. more plugins)
- on demand
• grow the apps
- the new TimeSeries?
- AssetExplorer?
- built entirely and solely on VR plus all known plugins
- browse the ‘data hood’ to download in required format
- put those transforms to practical use
- killer app for VR
