February 16th
2016
louis.rabiet@squidsolutions.com
Migrating structured data between Hadoop
and RDBMS
Who am I?
• Full-stack engineer at Squid Solutions.
• Specialised in big data.
• Fun fact: sleeping by myself in my tent on top of the highest mountains in the world.
What do I do?
• Develop an analytics toolbox.
• No setup. No SQL. No compromise.
• Generate SQL with a REST API.
It is open source!
https://github.com/openbouquet
Today's topic
• You need scalability?
• You need a machine learning toolbox?
Hadoop is the solution.
• But you still need structured data?
Our tool provides a solution.
=> We need both!
What does that mean?
• Creation of a dataset in Bouquet
• Sending the dataset to Spark
• Enrichment inside Spark
• Re-injection into the original database
How do we do it?
[Diagram: User input, Bouquet, Relational DB, Spark. Step shown: Create and Send]
How does it work?
[Diagram repeated on each step: Bouquet, Relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
• The user selects the data. Bouquet generates the corresponding SQL code.
• Data is read from the SQL database.
• The BI tool creates an Avro schema and sends the data to Kafka.
• The Kafka broker(s) receive the data.
• The Hive metastore is updated and the HDFS connector writes into HDFS.
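The "read from SQL, then create an Avro schema" steps can be sketched as follows. This is only an illustration, not Bouquet's implementation: it uses an in-memory SQLite database in place of the real RDBMS, the type mapping `SQL_TO_AVRO` and both helper functions are hypothetical, the row values are made up, and the Kafka producer step is left out.

```python
import json
import sqlite3

# Assumed mapping from SQLite column types to Avro primitive types.
SQL_TO_AVRO = {"INTEGER": "long", "REAL": "double", "TEXT": "string", "BLOB": "bytes"}


def avro_schema_for_table(conn, table):
    """Build an Avro record schema from the column metadata of one table."""
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    fields = [
        {"name": name, "type": SQL_TO_AVRO[sql_type.upper()]}
        for _cid, name, sql_type, *_rest in columns
    ]
    return {"type": "record", "name": table, "fields": fields}


def read_records(conn, table, schema):
    """Read every row as a dict keyed by the schema's field names."""
    names = [field["name"] for field in schema["fields"]]
    column_list = ", ".join(f'"{name}"' for name in names)
    cursor = conn.execute(f'SELECT {column_list} FROM "{table}"')
    return [dict(zip(names, row)) for row in cursor]


# Illustrative data only; the real dataset would come from the user's selection.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "ArtistGender" ("count" INTEGER, "gender" TEXT)')
conn.executemany(
    'INSERT INTO "ArtistGender" VALUES (?, ?)', [(120, "female"), (340, "male")]
)

schema = avro_schema_for_table(conn, "ArtistGender")
records = read_records(conn, "ArtistGender", schema)
print(json.dumps(schema, indent=2))
```

In a real deployment, `records` would then be serialized with the schema and published to the Kafka topic for that dataset.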
How to keep the data structured?
Use a schema registry (Avro in Kafka).
Each schema has a corresponding Kafka topic and a distinct Hive table.
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
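To make the "one schema, one topic, one table" idea concrete, here is a minimal sketch that checks a record against the `ArtistGender` schema and derives the names of its topic and table. Avro primitive type names are lowercase (`long`, `string`), so the schema is written that way here. The helpers `conforms` and `targets_for`, and the convention of reusing the schema name for both the topic and the table, are assumptions for illustration; only `long` and `string` are handled.

```python
import json

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
""")

# Checks for the two Avro primitive types used above (an Avro long is 64-bit signed).
CHECKS = {
    "long": lambda v: isinstance(v, int)
    and not isinstance(v, bool)
    and -(2 ** 63) <= v < 2 ** 63,
    "string": lambda v: isinstance(v, str),
}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(CHECKS[field["type"]](record[field["name"]]) for field in fields)


def targets_for(schema):
    """Hypothetical convention: topic and Hive table both reuse the schema name."""
    return {"kafka_topic": schema["name"], "hive_table": schema["name"]}


print(conforms({"count": 120, "gender": "female"}, SCHEMA))
print(targets_for(SCHEMA))
```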
Challenges
- Automatic creation of a Kafka topic and a Hive table for each dataset from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Issues with type conversion: for example, null is not supported in all cases (issue 272 on schema-registry).
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with HortonWorks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname.
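On the null point: in Avro, a field that may be null is declared as a union that includes `"null"`, and a `null` default requires `"null"` to be the first branch. The sketch below shows that declaration for a nullable SQL column. It does not reproduce or fix the schema-registry issue cited above, and the helper name `nullable` is hypothetical.

```python
import json


def nullable(field):
    """Return a copy of an Avro field whose type also accepts null."""
    avro_type = field["type"]
    existing = avro_type if isinstance(avro_type, list) else [avro_type]
    # Keep "null" first so that a null default is valid for the union.
    branches = ["null"] + [branch for branch in existing if branch != "null"]
    return {**field, "type": branches, "default": None}


gender = nullable({"name": "gender", "type": "string"})
print(json.dumps(gender))
```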
Tachyon?
• Used as an in-memory filesystem to replace HDFS.
• Interacts with Spark through the HDFS plugin.
• Transparent from the user's point of view.
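One way to picture the transparency: from the Spark side, only the URI of the dataset changes. The sketch below rewrites an `hdfs://` location into a `tachyon://` one. The host names and the path are made up, the helper `to_tachyon_uri` is hypothetical, and 19998 is assumed to be the Tachyon master port.

```python
from urllib.parse import urlsplit, urlunsplit


def to_tachyon_uri(hdfs_uri, master_host, master_port=19998):
    """Keep the path of an hdfs:// URI but point it at a Tachyon master."""
    path = urlsplit(hdfs_uri).path
    return urlunsplit(("tachyon", f"{master_host}:{master_port}", path, "", ""))


uri = to_tachyon_uri("hdfs://namenode:8020/warehouse/artistgender", "tachyon-master")
print(uri)

# With an existing SparkContext `sc`, the read call itself would stay the same:
# lines = sc.textFile(uri)
```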
Status
Injection SQL -> Spark: OK
Spark usage: OK
Re-injection: In alpha stage.
Re-injection
Two solutions:
• The Spark user notifies Bouquet that the data has changed (using a custom function)
• Bouquet pulls the data from Spark
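The difference between the two options can be sketched with two small in-process classes. Everything here is hypothetical (class names, methods, the version counter) and only illustrates push versus pull; it says nothing about how Bouquet or Spark actually implement re-injection.

```python
class SparkDataset:
    """Stand-in for an enriched dataset living on the Spark side."""

    def __init__(self):
        self.version = 0
        self.rows = []
        self.listeners = []

    def write(self, rows):
        self.rows = list(rows)
        self.version += 1
        # Push model: the "custom function" notifies every registered listener.
        for notify in self.listeners:
            notify(self)


class BouquetImporter:
    """Stand-in for the side that re-injects rows into the original database."""

    def __init__(self):
        self.seen_version = 0
        self.rows = []

    def on_change(self, dataset):
        # Push model: called by the Spark side when data has changed.
        self._load(dataset)

    def poll(self, dataset):
        # Pull model: check for a new version and load it if there is one.
        if dataset.version > self.seen_version:
            self._load(dataset)
            return True
        return False

    def _load(self, dataset):
        self.rows = list(dataset.rows)
        self.seen_version = dataset.version


# Push: the importer is notified as soon as the dataset is written.
pushed = SparkDataset()
push_importer = BouquetImporter()
pushed.listeners.append(push_importer.on_change)
pushed.write([{"count": 120, "gender": "female"}])

# Pull: nothing happens until the importer polls.
pulled = SparkDataset()
pull_importer = BouquetImporter()
pulled.write([{"count": 340, "gender": "male"}])
first_poll = pull_importer.poll(pulled)
second_poll = pull_importer.poll(pulled)
```

Push gives lower latency but needs a hook on the Spark side; pull keeps Spark untouched at the cost of polling.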
We use it for real!
We are collaborating with La Poste to use Spark and the re-injection mechanism
together with Bouquet and a geographical visualisation.
In the future
• Notebook integration
• We have a DSL for the Bouquet API; we may want to add built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning)
QUESTIONS
OPENBOUQUET.IO
Bouquet Architecture
[Architecture diagram. Labels: DB, HD, SQL DATA, JDBC, Bouquet Server, Dynamic Caching & Indexing, Business Modeling, REST API, OAuth2, Multi-Tenant, REDIS, Elastic, MongoDB, JS/SDK, Generic Apps, Custom Apps]

HUG France Feb 2016 - Migrating structured data between Hadoop and RDBMS, by Louis Rabiet (Squid Solutions)
