Apache MetaModel - unified access to all your data points

2015- © GMC - 1
with Apache MetaModel
Unified access to all your
data points

2015- © GMC - 2
Who am I?
Kasper Sørensen, dad, geek, guitarist …
@kaspersor
Long-time developer and
PMC member of:
Founder also of another
nice open source project:
Principal Software Engineer @

2015- © GMC - 3
Session agenda
- Introduction to Apache MetaModel
- Use case: Query composition
- The role of Apache MetaModel in Big Data
- (New) architectural possibilities with Apache MetaModel

Introduction
to Apache MetaModel

2015- © GMC - 5
The Apache journey
2011-xx: MetaModel founded (outside of Apache)
2013-06: Incubation of Apache MetaModel starts
2014-12: Apache MetaModel is officially an Apache TLP
2015-08: Latest stable release, 4.3.6
2015-09: Latest RC, 4.4.0-RC1
- We're still a small project, just 10 committers (plus some
extra contributors), so we have room for you!

2015- © GMC - 6
Helicopter view
You can look at Apache MetaModel by it's formal description:
"… a uniform connector and query API
to many very different datastore types …"
But let's start with a problem to solve ...

2015- © GMC - 7
A problem
How to process multiple data sources while:
- Staying agnostic to the database engine.
- Respecting the metadata of the underlying database.
- Not repeating yourself.
- Avoiding fragility towards metadata changes.
- Mutating the data structure when needed.
We used to rely on ORM frameworks to handle the bulk of these ...

2015- © GMC - 8
ORM - Queries via domain models
public class Person {
public String getName() {...}
}
ORM.query(Person.class).where(Person::getName).eq("John Doe");

2015- © GMC - 9
Why an ORM might not work
Requires a domain model
- While many data-oriented apps are agnostic to a domain model.
- Sometimes the data itself is the domain.
Model cannot change at runtime
- Usually needed especially if your app creates new tables/entities
Type-safety through metadata assumptions
- This is quite common, yet it might be a problem that the model is statically
mapped and cannot survive e.g. column renames etc.

2015- © GMC - 10
Alternative to ORM: Use JDBC
Metadata discoverable via java.sql.DatabaseMetaData.
Queries can be assembled safely with a bit of String magic:
"SELECT " + columnNames + " FROM " + tableName + …
...

2015- © GMC - 11
Alternative to ORM: Use JDBC
Wrong!
It turns out that …
- not all databases have the same SQL "dialect".
- they also don't all implement DatabaseMetaData the same way.
- you cannot use this on much else than relational databases that use SQL.
- What about NoSQL, various file formats, web services etc.

2015- © GMC - 12
MetaModel - Queries via metadata model
DataContext dc = …
Table[] tables = dc.getDefaultSchema().getTables();
Table table = tables[0];
Column[] primaryKeys = table.getPrimaryKeys();
dc.query().from(table).selectAll().where(primaryKeys[0]).eq(42);
dc.query().from("person").selectAll().where("id").eq(42);

2015- © GMC - 13
MetaModel updates
UpdateableDataContext dc = …
dc.executeUpdate(new UpdateScript() {
// multiple updates go here - transactional characteristics (ACID, synchronization etc.)
// as per the capabilities of the concrete DataContext implementation.
});
dc.executeUpdate(new BatchUpdateScript() {
// multiple "batch/bulk" (non atomic) updates go here
});
// if I only want to do a single update, there are convenience classes for this
InsertInto insert = new InsertInto(table).value(nameColumn, "John Doe");
dc.executeUpdate(insert);

2015- © GMC - 14
MetaModel - Connectivity
Sometimes we have SQL, sometimes we have another native query engine (e.g.
Lucene) and sometimes we use MetaModel's own query engine.
(A few more connectors available via the MetaModel-extras (LGPL) project)

2015- © GMC - 15
Tables, Columns and SQL
– that doesn't sound right.
This is one of our trade-offs to make life easier!
We map NoSQL models (doc, key/value etc.) to a table-based model.
As a user you have some choices:
• Supply your own mapping (non-dynamic).
• Use schema inference provided by Apache MetaModel.
• Provide schema inference instructions
(e.g. turn certain key/value pairs into separate tables).

2015- © GMC - 16
Other key characteristic about MetaModel
• Non-intrusive, you can use it when you want to.
• It's just a library – no need for a running a separate service.
• Easy to test, stub and mock. Inject a replacement DataContext
instead of the real one (typically PojoDataContext).
• Easy to implement a new DataContext for your curious
database or data format – reuse our abstract
implementations.

2015- © GMC - 18
MetaModel schema model

2015- © GMC - 19
MetaModel query and schema model

2015- © GMC - 20
Query representation for developers?
"SELECT * FROM foo"
Query as a string
- Easy to write.
- Easy to read.
- Prone to parsing errors.
- Prone to typos etc.
- Fails at runtime.
Query q = new Query();
q.from(table).selectAll();
Query as an object
- More involved code to
read and write.
- Lends itself to inspection
and mutation.
- Fails at compile time.

2015- © GMC - 21
This STATUS=DELIVERED
filter never actually executes, it
just updates the query on the
ORDERS table.
Context-based Query optimization
You might not always need to
pass data around between
components …
Sometimes you can just pass
the query around!

The role of Apache MetaModel
in Big Data

2015- © GMC - 23
So how does this all
relate to Big Data?

2015- © GMC - 24
Big Data and the need for metadata
Variety
- Not only structured data
- Social
- Sensors
- Many new sources, but
also the “old”:
• Relational
• NoSQL
• Files
• Cloud

Volume
Velocity
Variety
Veracity ?

2015- © GMC - 29
Query language examples
SELECT * FROM customers
WHERE country_code = 'GB'
OR country_code IS NULL
db.customers.find({
$or: [
{"country.code": "GB"},
{"country.code": {$exists: false}}
]
})
for (line : customers.csv) {
values = parse(line);
country = values[country_index];
if (country == null || "GB".equals(country) {
emit(line);
}
}
SQL
CSV
MongoDB

2015- © GMC - 30
Query language examples
dataContext.query()
.from(customers)
.selectAll()
.where(countryCode).eq("GB")
.or(countryCode).isNull();
Any datastore

2015- © GMC - 31
Make it easy to ingest data in your lake

2015- © GMC - 32
Metadata to enable automation
Big Data Variety and Veracity means we will be handling:
● Different physical formats of the same data
● Different query engines
● Different quality-levels of data
How can we automate data ingestion in such a landscape?

2015- © GMC - 33
We need a uniform metamodel for all the datastores.
And enough metadata to infer the ingestion transformations needed.
We need a uniform query API based on the metamodel.
We need our ingestion target to be aware of the ingestion sources.

2015- © GMC - 34
We need a uniform metamodel for all the datastores.
And enough metadata to infer the ingestion transformations needed.
We need a uniform query API based on the metamodel.
We need our ingestion target to be aware of the ingestion sources.
As an industry we lack more
elaborate metadata to support this!

2015- © GMC - 35
Traditional view on metadata
user
id (pk) CHAR(32) java.lang.String not null
username VARCHAR(64) java.lang.String not null, unique
real_name VARCHAR(256) java.lang.String not null
address VARCHAR(256) java.lang.String nullable
age INT int nullable

2015- © GMC - 36
Traditional view on metadata
user
customer
id (pk) BIGINT long not null
firstname VARCHAR(128) java.lang.String not null
lastname VARCHAR(128) java.lang.String not null
street VARCHAR(64) java.lang.String not null
house number INT int not null

2015- © GMC - 37
A more elaborate metadata view
user
customer
id (pk) BIGINT long not null
firstname VARCHAR(128) java.lang.String not null
lastname VARCHAR(128) java.lang.String not null
street VARCHAR(64) java.lang.String not null
house number INT int not null
32 char UUID
Person full name
Address (unstructured)
Person first name
Person last name
Address part - street
Same real-
world
entity?
Address part - house number
Numeric nominal variable
Person age
Numeric ratio variable

2015- © GMC - 38
Elaborate metadata and querying
SELECT (columns with header 'name') FROM (customer and user)
SELECT firstname, lastname FROM customer
SELECT real_name FROM user
SELECT EXTRACT(full name FROM (name columns))
FROM (tables containing person names)
SELECT (person name columns) FROM (customer and user)
Static
Manual
Dynamic
Automated
(or supervised)

2015- © GMC - 39
SELECT street, house_number FROM customer
SELECT address FROM user
SELECT EXTRACT(street, hno, city, zip) FROM (Address part columns))
FROM (tables containing addresses)
SELECT (Address part columns) FROM (customer and user)
Static
Manual
Dynamic
Automated
(or supervised)

2015- © GMC - 40
Individual/specific DB connectors
JDBC
Apache MetaModel (today)
Apache MetaModel (future)
Static
Manual
Dynamic
Automated
(or supervised)

(New) architectural possibilities
with Apache MetaModel

2015- © GMC - 44
What about speed?
Performance
- Performance is important in
development of MetaModel.
- But not in favor of uniformness.
- In some cases the metadata may
benefit performance by
automatically tweaking query
parameters (e.g. fetching
strategies).
- We usually expose the native
connector objects too.
Time to market
- MetaModel makes it easy to cover
all the data sources with the same
codebase.
- Typical 80/20 trade-off scenario.
- Avoid premature optimization.

2015- © GMC - 45
The future of Apache MetaModel?
Stuff currently being built (version 4.4.0):
• Pluggable operators in queries
• Pluggable functions in queries
• Phase-out Java 6 support
Ideas being prototyped:
• Elaborate metadata service
• Spark module (Turn a Query into a RDD or DataFrame)
• More connectors - Apache Solr, Neo4j, Couchbase, Riak

Let's see some code
Query a remote CSV file
https://github.com/kaspersorensen/ApacheBigDataMetaModelExample

Apache MetaModel - unified access to all your data points

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Apache MetaModel - unified access to all your data points

Similar to Apache MetaModel - unified access to all your data points (20)

Recently uploaded

Recently uploaded (20)

Apache MetaModel - unified access to all your data points