zStore

A Tale of Building a Library in Scala
Yaakov Breuer
Photo License: “CC0 Public Domain” https://creativecommons.org/publicdomain/zero/1.0/
{facebook,github}.com/bryaakov

Agenda
• Background
• The Problem
• Design Goals
• The Solution
• Caching Layer

Background: “CM-Well”
• Distributed high-scale data warehouse
• Combines Big Data with Linked Data
Linked Data is a way of modeling the world as
a graph with typed edges, for example:
Yaakov --worksAt--> TR
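To make the "graph with typed edges" idea concrete, here is a minimal Scala sketch. The `Triple` case class and the `follow` helper are hypothetical names used only for illustration; they are not CM-Well's data model.

```scala
// Hypothetical model: one typed edge (a "triple") in a linked-data graph.
final case class Triple(subject: String, predicate: String, obj: String)

object LinkedDataSketch {
  // The single edge from the slide: Yaakov --worksAt--> TR
  val edges: List[Triple] = List(
    Triple("Yaakov", "worksAt", "TR")
  )

  // All targets reachable from `subject` through edges of type `predicate`.
  def follow(subject: String, predicate: String): List[String] =
    edges.collect { case Triple(`subject`, `predicate`, target) => target }
}
```

Following the `worksAt` edge from `Yaakov` yields the one node it points to; any other subject or edge type yields an empty list.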

• Was created about 9 years ago
• Holds 4B objects in Production
• Was open-sourced a year ago
  • Usually not an easy task in large corporations
  • Keeping us in shape

[Architecture diagram: Play, Kafka, three BG Process instances, Akka Streams, Elasticsearch, Cassandra]

Example 1: Read/Query
$ curl localhost:9000/_sparql --data-binary '
SELECT ?comp WHERE {
<http://example.org/yaakov>
<http://example.org/ont/worksAt> ?comp . }'
------------------------------------------
| comp                                   |
|========================================|
| <http://permid.org/1-4295861160>       |
------------------------------------------

Example 1: Read/Query (what actually happens)
HTTP GET is received:
• Translate payload to a case class: SparqlRequest
• Query Elasticsearch
• Fetch data from Cassandra
• Return human-readable response

Example 2: Data ingest
$ curl -X PUT 'localhost:9000/_in?format=n3' --data-binary '
@prefix example: <http://example.org/ont/> .
<http://example.org/yaakov>
example:worksAt <http://permid.org/1-4295861160> .'
{"success":true}

Example 2: Data ingest (what actually happens)
HTTP PUT/POST is received:
• Data is parsed
• Kafka messages are produced
• The user gets 200 OK
• (Eventually) Kafka messages are consumed
• Data is persisted in Cassandra
• Data is indexed in Elasticsearch

The Problem
• Normally, we store documents / objects
• We do support large files as objects
• Kafka messages should be small *
• “Any problem in computer science can be
solved with another layer of indirection”
(David Wheeler)
* https://kafka.apache.org/documentation/#configuration

Design Goals
We wanted…
• A key/value store
• Distributed and Persisted
• put/get API
• To keep it simple
• An in-process solution

Why re-invent the wheel?
(The everlasting trade-off…)
It seems twitter/util has a util-cache module
  • Might be a good fit
  • No persistence
  • Twitter Futures
Other options?

The zStore Trait
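import scala.concurrent.Future

trait ZStore {
  // Store a value under the given key.
  def put(uzid: String, value: Array[Byte]): Future[Unit]
  // Store a value under the given key with a time-to-live in seconds.
  def put(uzid: String, value: Array[Byte], secondsToLive: Int): Future[Unit]
  // Fetch the value stored under the given key.
  def get(uzid: String): Future[Array[Byte]]
  // Delete the value stored under the given key.
  def remove(uzid: String): Future[Unit]
}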

Implementations
• ZStoreImpl – uses Cassandra
• ZStoreMem – in memory (for testing)
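The slides only name the in-memory implementation, so the following is a minimal sketch of what such a test store could look like, not the actual ZStoreMem. The class name `InMemoryZStore`, the injectable `now` clock, and the choice to fail the Future with `NoSuchElementException` on a missing or expired key are all assumptions made for illustration.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.Future

object InMemorySketch {
  // Same shape as the ZStore trait from the slides, repeated so the sketch compiles alone.
  trait ZStore {
    def put(uzid: String, value: Array[Byte]): Future[Unit]
    def put(uzid: String, value: Array[Byte], secondsToLive: Int): Future[Unit]
    def get(uzid: String): Future[Array[Byte]]
    def remove(uzid: String): Future[Unit]
  }

  // Hypothetical in-memory store; `now` is injectable so tests can control time.
  class InMemoryZStore(now: () => Long = () => System.currentTimeMillis()) extends ZStore {
    // Each entry holds the value plus an optional expiry timestamp in milliseconds.
    private val data = TrieMap.empty[String, (Array[Byte], Option[Long])]

    def put(uzid: String, value: Array[Byte]): Future[Unit] = {
      data.put(uzid, (value, None))
      Future.successful(())
    }

    def put(uzid: String, value: Array[Byte], secondsToLive: Int): Future[Unit] = {
      data.put(uzid, (value, Some(now() + secondsToLive * 1000L)))
      Future.successful(())
    }

    // Assumption: a missing or expired key fails the Future.
    def get(uzid: String): Future[Array[Byte]] =
      data.get(uzid) match {
        case Some((value, expiry)) if expiry.forall(_ > now()) =>
          Future.successful(value)
        case _ =>
          data.remove(uzid) // drop the expired entry, if there is one
          Future.failed(new NoSuchElementException(uzid))
      }

    def remove(uzid: String): Future[Unit] = {
      data.remove(uzid)
      Future.successful(())
    }
  }
}
```

Passing the clock as a function keeps TTL behaviour testable without sleeping.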

Usages
• Main usage – large files
  • Key is hash(content)
• Keeping internal state (e.g. Kafka offsets)
• Caching
  • Using TTL
  • Using Memoization
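"Key is hash(content)" can be sketched as follows. The slides do not say which hash function is used, so SHA-256 and the hex encoding here are assumptions, and `ContentKey.uzidOf` is a hypothetical helper name.

```scala
import java.security.MessageDigest

object ContentKey {
  // Hex-encoded SHA-256 of the content (the hash function is an assumption).
  def uzidOf(content: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256")
      .digest(content)
      .map(b => f"${b & 0xff}%02x")
      .mkString
}
```

Deriving the key from the content means identical files map to the same key, so storing the same large file twice does not create a second entry.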

Next Level: From zStore to zCache
• zStore has a String => Future[Array[Byte]] API
• We need to generalize it to K => Future[V]
• And let’s use memoization

Memoize
• Traditionally, “memoize” is a function that takes
a function and returns a new function with the same
signature that caches results.
• So you can simply wrap an existing heavy-lifting
method with it; no need to refactor.
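As an illustration of the traditional form, here is a minimal in-process sketch for functions of type `K => Future[V]`. It keeps results in a local map only (no zStore, no TTL), and the object name `SimpleMemoize` is hypothetical.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.Future

object SimpleMemoize {
  // Wraps `f` with an in-process cache: same signature in, same signature out.
  def memoize[K, V](f: K => Future[V]): K => Future[V] = {
    val cache = TrieMap.empty[K, Future[V]]
    // The Future itself is cached, so a failed result is remembered as well.
    key => cache.getOrElseUpdate(key, f(key))
  }
}
```

Because the wrapper has the same type as the wrapped function, callers can switch from `f` to `memoize(f)` without any other change.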

Memoize Example
Reminder – HTTP GET is received:
• Translate payload to a case class: SparqlRequest
• Query Elasticsearch
• Fetch data from Cassandra
• Return human-readable response

Memoize Example
def handleHttpGet = {
  val request: SparqlRequest = ???
  val response = execute(request)
  response.map(Ok.apply) // 200 OK
}

Memoize Example
val cachedExecute = memoize(execute)
def handleHttpGet = {
  val request: SparqlRequest = ???
  val response = cachedExecute(request)
  response.map(Ok.apply) // 200 OK
}

zCache.memoize
When used, it will have to do the following:
Given a key,
• Get from zStore (with retries)
• If exists, return value
• Else:
  • Evaluate the “task”
  • Put results in zStore (with TTL)
  • Return value

zCache.memoize
In order to achieve that, we are going to need:
Given a key, convert from key: K to uzid: String
• Get from zStore (with retries)
• If exists, return value, but map from Array[Byte] to V
• Else:
  • Evaluate the “task”
  • Put results in zStore (with TTL),
    but map from V to Array[Byte]
  • Return value

zCache: Constructing memoize API
def memoize[K, V](task: K => Future[V])
  : K => Future[V] = ???

def memoize[K, V](task: K => Future[V])(
    // how to turn a key into a uzid, and values to and from bytes
    digest: K => String,
    deserializer: Array[Byte] => V,
    serializer: V => Array[Byte],
    // decides whether a result should be cached; caches everything by default
    isCachable: V => Boolean = (_: V) => true
)(
    // TTL for cached entries, and retry settings for reading from zStore
    ttlSeconds: Int = 10,
    pollingMaxRetries: Int = 5,
    pollingInterval: Int = 1
)(
    implicit ec: ExecutionContext
): K => Future[V] = ???
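The slides leave the body as `???`. A minimal sketch of how the steps listed earlier could fit behind this signature is shown below; it is not the library's implementation. Assumptions: the hypothetical `ZCache` class receives the store in its constructor, a missing key fails `get` with `NoSuchElementException`, and the polling/retry parameters are left out because the slides do not describe their exact semantics.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ZCacheSketch {
  // Minimal subset of the zStore API that this sketch needs.
  trait ZStore {
    def put(uzid: String, value: Array[Byte], secondsToLive: Int): Future[Unit]
    def get(uzid: String): Future[Array[Byte]]
  }

  // Hypothetical wrapper class; the real zCache may be structured differently.
  class ZCache(store: ZStore) {
    def memoize[K, V](task: K => Future[V])(
        digest: K => String,
        deserializer: Array[Byte] => V,
        serializer: V => Array[Byte],
        isCachable: V => Boolean = (_: V) => true
    )(ttlSeconds: Int = 10)(implicit ec: ExecutionContext): K => Future[V] = { key =>
      val uzid = digest(key) // convert key: K to uzid: String
      store
        .get(uzid)
        .map(bytes => Option(bytes))
        // Assumption: a cache miss surfaces as NoSuchElementException.
        .recover { case _: NoSuchElementException => None }
        .flatMap {
          // Hit: map Array[Byte] back to V.
          case Some(bytes) => Future.successful(deserializer(bytes))
          // Miss: evaluate the task, store the result with a TTL, return it.
          case None =>
            task(key).flatMap { value =>
              if (isCachable(value))
                store.put(uzid, serializer(value), ttlSeconds).map(_ => value)
              else Future.successful(value)
            }
        }
    }
  }
}
```

Keeping `digest`, `serializer`, and `deserializer` as plain function parameters means the cache needs no knowledge of `K` or `V` beyond what the caller supplies.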

FAQ: Can I reuse this library?
• Probably not as-is…
• You’re welcome to be inspired

• Contacts:
facebook.com/bryaakov
github.com/bryaakov
• Questions?
