HUG France Feb 2016 - Migrating structured data between Hadoop and RDBMS, by Louis Rabiet (Squid Solutions)

February 16th, 2016
louis.rabiet@squidsolutions.com
Migrating structured data between Hadoop and RDBMS

Who am I?
• Full Stack engineer at Squid Solutions.
• Specialised in Big data.
• Fun fact: I sleep alone in my tent on top of the world's highest mountains.

What do I do?
• Develop an analytics toolbox.
• No setup. No SQL. No compromise.
• Generate SQL with a REST API.
It is open source!
https://github.com/openbouquet

Today's topic
• You need scalability?
• You need a machine learning toolbox?
Hadoop is the solution.
• But you still need structured data?
Our tool provides a solution.
=> We need both!

What does that mean?
• Create a dataset in Bouquet
• Send the dataset to Spark
• Enrich it inside Spark
• Re-inject it into the original database

How do we do it?
[Diagram: User input, Bouquet, Relational DB, Spark]

How does it work?
[Diagram: Bouquet, Relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The user selects the data. Bouquet generates the corresponding SQL code.
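To make this step concrete, here is a minimal sketch of turning a user selection into SQL. It is not the Bouquet API: the function `build_query`, its parameters, and the table name `artist` are all hypothetical, chosen only to match the ArtistGender example used later in the talk.

```python
def build_query(table, dimensions, metrics):
    """Build a GROUP BY query from a selection of dimensions and metrics.

    `metrics` maps an output alias to an (aggregate, column) pair.
    Every name here is hypothetical; this is not how Bouquet is implemented.
    """
    allowed = {"COUNT", "SUM", "AVG", "MIN", "MAX"}
    select = list(dimensions)
    for alias, (aggregate, column) in metrics.items():
        if aggregate.upper() not in allowed:
            raise ValueError(f"unsupported aggregate: {aggregate}")
        select.append(f"{aggregate.upper()}({column}) AS {alias}")
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if dimensions:
        # Every selected dimension becomes a grouping column.
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql


# Hypothetical selection: number of artists per gender.
query = build_query("artist", ["gender"], {"count": ("COUNT", "*")})
print(query)
```

A real generator would also have to quote identifiers and handle SQL dialect differences, which this sketch leaves out.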

How does it work?
[Diagram: Bouquet, Relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The data is read from the SQL database.
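As an illustration of the read step, the sketch below fetches a result set in fixed-size batches instead of row by row. It uses Python's DB-API with an in-memory SQLite database as a stand-in for the JDBC connection; the helper `read_batches`, the table `artist`, and its rows are assumptions made for the example.

```python
import sqlite3


def read_batches(connection, sql, batch_size=1000):
    """Yield lists of rows, each holding at most `batch_size` rows."""
    cursor = connection.cursor()
    cursor.execute(sql)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# In-memory SQLite stands in for the relational database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO artist VALUES (?, ?)",
    [("a", "female"), ("b", "male"), ("c", "female")],
)

sql = (
    "SELECT gender, COUNT(*) AS count FROM artist "
    "GROUP BY gender ORDER BY gender"
)
batches = list(read_batches(conn, sql, batch_size=1))
print(batches)
```

Each batch could then be handed to the next stage as a unit, which keeps memory use bounded on large result sets.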

How does it work?
[Diagram: Bouquet, Relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The BI tool creates an Avro schema and sends the data to Kafka.
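One way to picture the schema creation is as a mapping from the SQL column types of the result set to Avro primitive types. The sketch below assumes such a mapping; `SQL_TO_AVRO` and `avro_schema` are hypothetical names, and the list of SQL types is deliberately incomplete. Sending the records to Kafka is left out because it needs a client library and a running broker.

```python
import json

# Hypothetical, incomplete mapping from SQL column types to Avro primitives.
SQL_TO_AVRO = {
    "BIGINT": "long",
    "INTEGER": "int",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
    "VARCHAR": "string",
    "TEXT": "string",
}


def avro_schema(record_name, columns):
    """Build an Avro record schema from (column name, SQL type) pairs."""
    fields = [
        {"name": name, "type": SQL_TO_AVRO[sql_type.upper()]}
        for name, sql_type in columns
    ]
    return {"type": "record", "name": record_name, "fields": fields}


schema = avro_schema(
    "ArtistGender", [("count", "BIGINT"), ("gender", "VARCHAR")]
)
print(json.dumps(schema, indent=2))
```

An unknown SQL type raises a `KeyError` here, which is a simple way to surface unsupported conversions early.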

How does it work?
[Diagram: Bouquet, Relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The Kafka broker(s) receive the data.

How does it work?
[Diagram: Bouquet, Relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The Hive metastore is updated and the HDFS connector writes the data into HDFS.
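To give an idea of what the metastore update amounts to, the sketch below derives a Hive table definition from an Avro record schema. The connector used in the talk may do this differently; the mapping `AVRO_TO_HIVE`, the function `hive_ddl`, and the location `/topics/artistgender` are assumptions for illustration only.

```python
# Hypothetical mapping from Avro primitive types to Hive column types.
AVRO_TO_HIVE = {
    "long": "BIGINT",
    "int": "INT",
    "string": "STRING",
    "double": "DOUBLE",
    "float": "FLOAT",
    "boolean": "BOOLEAN",
}


def hive_ddl(schema, location):
    """Return a CREATE EXTERNAL TABLE statement for an Avro record schema."""
    columns = ", ".join(
        f"`{field['name']}` {AVRO_TO_HIVE[field['type']]}"
        for field in schema["fields"]
    )
    table = schema["name"].lower()
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({columns}) "
        f"STORED AS AVRO LOCATION '{location}'"
    )


schema = {
    "type": "record",
    "name": "ArtistGender",
    "fields": [
        {"name": "count", "type": "long"},
        {"name": "gender", "type": "string"},
    ],
}
# The HDFS path is a made-up example.
ddl = hive_ddl(schema, "/topics/artistgender")
print(ddl)
```

Column names are wrapped in backticks so that names such as `count` cannot clash with Hive keywords.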

How to keep the data structured?
Use a schema registry (Avro in Kafka).
Each schema has a corresponding Kafka topic and a distinct Hive table.
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
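To show how such a schema keeps the data structured, here is a minimal sketch that checks a record against the ArtistGender schema before it would be published. It is a hand-rolled check for illustration, not the schema registry's validation; `AVRO_PYTHON_TYPES` and `conforms` are hypothetical names and only a few primitive types are covered.

```python
import json

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
""")

# Python types accepted for a few Avro primitives (sketch, not exhaustive).
AVRO_PYTHON_TYPES = {"long": int, "int": int, "string": str, "boolean": bool}


def conforms(record, schema):
    """Return True if `record` has exactly the schema's fields and types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    for field in fields:
        value = record[field["name"]]
        expected = AVRO_PYTHON_TYPES[field["type"]]
        # bool is a subclass of int in Python, so reject it for numbers.
        if isinstance(value, bool) and expected is not bool:
            return False
        if not isinstance(value, expected):
            return False
    return True


print(conforms({"count": 42, "gender": "female"}, SCHEMA))
print(conforms({"count": "42", "gender": "female"}, SCHEMA))
```

The second record is rejected because `count` arrives as a string rather than an integer.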

Challenges
- Automatic creation of a Kafka topic and a Hive table for each dataset from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Type conversion issues: for example, null is not supported in all cases (issue 272 on schema-registry).
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with Hortonworks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname.
Tachyon?
• Used as an in-memory filesystem to replace HDFS.
• Interacts with Spark through the HDFS plugin.
• Transparent from the user's point of view.

Status
Injection SQL -> Spark: OK
Spark usage: OK
Re-injection: In alpha stage.

Re-injection
Two solutions:
• The Spark user notifies Bouquet that the data has changed (using a custom function)
• Bouquet pulls the data from Spark

We use it for real!
We are collaborating with La Poste to use Spark and the re-injection mechanism together with Bouquet and a geographical visualisation.

In the future
• Notebook integration
• We have a DSL for the Bouquet API; we may want to add built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning)

Bouquet Architecture
[Diagram: DB, HD, SQL data, JDBC, Bouquet Server, Dynamic Caching & Indexing, Business Modeling, Multi-Tenant, REST API, OAuth2, Redis, Elastic, MongoDB, JS/SDK, Generic Apps, Custom Apps]
