Hugfr SPARK & RIAK -20160114_hug_france

SPARK & RIAK
INTRODUCTION TO THE SPARK-RIAK-CONNECTOR
LATERAL THOUGHTS

Me, Myself & I
Associate at LateralThoughts.com

Scala, Java, Python Developer

Data Engineer @ Axa & Carrefour

Apache Spark Trainer with Databricks

And the Other One …
Director Sales @ Basho Technologies

(Basho make Riak)
Formerly at MySQL France

Co-Founder of MariaDB

Funny Accent

Quick Introduction …
2011 Creators of Riak
Riak KV: NoSQL key value database
Riak S2: Large Object Storage
2015 New Products
Basho Data Platform: Integrated NoSQL databases, caching, in-memory analytics, and search
Riak TS: NoSQL Time Series database
120+ employees

Global Offices

Seattle (HQ), Washington DC, London, Paris, Tokyo

300+ Enterprise customers, 1/3 of the Fortune 50

PRIORITIZED NEEDS
High Availability - Critical Data
High Scale - Heavy Reads & Writes
Geo Locality - Multiple Data Centers
Operational Simplicity - Resources Don’t Scale as Clusters
Data Accuracy - Write Conflict Options
RIAK S2 USE CASES
Large Object Store
Content Distribution
Web & Cloud Services
Active Archives
RIAK KV USE CASES
User Data
Session Data
Profile Data
Real-time Data
Log Data
RIAK TS USE CASES
IoT/Devices
Financial/Economic
Scientific Observations
Log Data

The Evolution of NoSQL
Unstructured Data Platforms
Multi-Model Solutions
Point Solutions

Spark & Riak
Disclaimer: the following presentation uses:

Spark v1.5.2

Spark-Riak-Connector v1.1.0

Pre-Requisites
To use the Spark Riak Connector, as of now, you need to build it yourself:

Clone https://github.com/basho/spark-riak-connector

`git checkout v1.1.0`

`mvn clean install`

Reading from Riak
Connect to a Riak KV Cluster from Spark

Query it:

Full Scan

Using Keys

Using secondary indexes (2i)

Loading data from Riak
On your SparkContext, you can use:

riakBucket[V](bucketName: String): RiakRDD[V]
riakBucket[V](bucketName: String, bucketType: String): RiakRDD[V]
riakBucket[K, V](bucketName: String, convert: (Location, RiakObject) => (K, V)): RiakRDD[(K, V)]
…

Implicits that will give you the riak* methods
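
As an illustration of the methods listed here, this is a minimal sketch of a full scan and a lookup by keys. It assumes the connector built in the pre-requisites step is on the classpath, that a Riak KV node accepts protocol buffers connections on 127.0.0.1:8087, and that values are read as raw strings. The bucket name `users` and the keys are hypothetical, and the `queryAll` / `queryBucketKeys` calls and the `spark.riak.connection.host` setting are taken from the connector's documentation as I understand it, so check them against the v1.1.0 sources before relying on them.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Implicits that add the riak* methods to SparkContext
import com.basho.riak.spark._

object RiakReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("riak-read-sketch")
      .setMaster("local[*]")
      // Assumption: a local Riak KV node on the default protocol buffers port
      .set("spark.riak.connection.host", "127.0.0.1:8087")
    val sc = new SparkContext(conf)

    // Full scan of a bucket ("users" is a hypothetical bucket name)
    val allValues = sc.riakBucket[String]("users").queryAll()

    // Lookup restricted to explicit keys (hypothetical keys)
    val someValues = sc.riakBucket[String]("users").queryBucketKeys("alice", "bob")

    println(s"full scan: ${allValues.count()} values")
    println(s"by keys: ${someValues.count()} values")

    sc.stop()
  }
}
```

A full scan touches every key in the bucket, so on a large cluster the key-based or index-based queries are usually the better starting point.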

Reading from Riak
Using case classes
Using Secondary Indexes
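
The two approaches named above can be combined, roughly as in the following sketch. It assumes the stored values are JSON objects whose fields match a hypothetical `User` case class, and that the objects carry a hypothetical integer secondary index named `age`. The bucket name, the index name, and the connection address are illustrative, and the `query2iRange` call should be checked against the connector version in use.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.basho.riak.spark._

// Hypothetical shape of the JSON values stored in the bucket
case class User(name: String, age: Int)

object RiakCaseClass2iSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("riak-2i-sketch")
      .setMaster("local[*]")
      // Assumption: a local Riak KV node on the default protocol buffers port
      .set("spark.riak.connection.host", "127.0.0.1:8087")
    val sc = new SparkContext(conf)

    // Values are mapped to the case class; the range query runs against
    // the hypothetical integer secondary index "age"
    val selected = sc.riakBucket[User]("users").query2iRange("age", 18L, 65L)

    selected.map(_.name).collect().foreach(println)

    sc.stop()
  }
}
```

Typing the RDD with a case class keeps the rest of the job free of manual JSON parsing, while the 2i range limits what is fetched from the cluster in the first place.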

Spark Riak Connector - Roadmap
Better Integration with Riak TS

Enhanced DataFrames - based on Riak TS Schema APIs

Server-side aggregations and grouping - using TS SQL commands

Speed

Data Locality (partition RDDs according to replication in the cluster) - launch Spark executors on the same nodes where the data resides.

Better mapping from vnodes to Spark workers using coverage plan

Better support for Riak data types (CRDT) and Search queries

Today this requires using the Java Riak client APIs

Spark Streaming

Provide example and sample integration with Apache Kafka

Improve reliability using Riak for checkpoints and WAL

Add examples and documentation for Python support

Thank you
@ogirardot

o.girardot@lateral-thoughts.com

https://github.com/ogirardot/spark-riak-example

https://speakerdeck.com/ogirardot/spark-and-riak-introduction-to-the-spark-riak-connector

@mcarney23

michael.carney@basho.com

fr.basho.com
