Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analysis, Astra Project

2017-10-28
Yosuke Hara (原陽亮) 
Rakuten Institute of Technology 
Rakuten, Inc. rev. 1.0.5

R&D Projects
- Skylab: A Microservices Framework
- LeoFS: A Distributed Storage
- Astra: A Distributed SQL Database For Data Analytics

Introducing Astra
* “Astra” is the code name of a product under development

Background
More “Connected Things” In The World
Consumer Applications to Represent 63% of Total IoT Applications in 2017
[Figure: IoT Units Installed Base by Category, in Millions of Units]
 Category                    | 2016    | 2017    | 2018    | 2020
-----------------------------+---------+---------+---------+---------
 Consumer                    | 3,963   | 5,244.3 | 7,036.3 | 12,863
 Business: Cross-Industry    | 1,102.1 | 1,501   | 2,132.6 | 4,381.4
 Business: Vertical-Specific | 1,316.6 | 1,635.4 | 2,027.7 | 3,171
 Total                       | 6.4B    | 8.4B    | 11.2B   | 20.4B
2017 shares: Consumer 63%, Business: Cross-Industry 18%, Business: Vertical-Specific 19%
Total installed base: +31% from 2016 to 2017
Source: Gartner (January 2017)

Providing A Database That
Anyone Can Use To Analyze Data

Initial Concept
Provides Components of DataLake as a Service
[Diagram labels: Self-Service Analytics; Data Science + DataLake; Data Governance, Job Scheduler + Distributed Computing, Data Store; Astra, Skylab; Spark, Hadoop]

Current Concept
Advanced Data Analysis In Semi-Realtime At Low Cost
[Diagram: Streaming Data and Un/Semi-Structured Data -> Store Data Into Astra -> Aggregate and Analyze Data -> Find Insights]
[Diagram stages: Data -> Intelligence -> Action (Tools / Apps, Automated Systems)]

Current Concept: Depends on Single Source Of Truth
Astra’s Components
- Distributed Computing For Massive-Parallel Processing
- Distributed Database For Aggregation and Analysis
- Distributed Storage (DataLake Store)
Other diagram labels: Self-Service Analytics, Data Governance, In-place Analysis

Unified Components
- Database: SQL Engine, Ad-hoc Query
- Data Science: Analysis Functions On The Distributed Computing
- Data Store: Various Data Without Limit
- Reliability, Scalability, and Massive Parallel Processing

The Features - ANSI SQL99 Standard
Conforms To ANSI SQL99 Standard
• Communication With Any BI / Data Visualization Tools, and Apps
• Able To Call All Of Astra’s Functions, UDFs and ML With SQL
astra:test> SELECT workclass, COUNT(income)
-> AS income_count
-> FROM adult_income
-> WHERE income = '<=50K'
-> GROUP BY workclass
-> ORDER BY workclass;
    workclass     | income_count
------------------+--------------
 ?                |         2534
 Federal-gov      |          871
 Local-gov        |         2209
 Never-worked     |           10
 Private          |        26519
 Self-emp-inc     |          757
 Self-emp-not-inc |         2785
 State-gov        |         1451
 Without-pay      |           19
(9 rows)
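The aggregation above is plain standard SQL, so it can be tried locally without Astra. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine; the toy rows are made up for illustration and are not the adult_income data from the slide.

```python
import sqlite3

# Toy rows standing in for the adult_income table; the values are made up.
ROWS = [
    ("Private", "<=50K"),
    ("Private", ">50K"),
    ("Private", "<=50K"),
    ("State-gov", "<=50K"),
    ("Local-gov", ">50K"),
]


def income_counts(rows):
    """Run the slide's aggregation against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE adult_income (workclass TEXT, income TEXT)")
    conn.executemany("INSERT INTO adult_income VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT workclass, COUNT(income) AS income_count "
        "FROM adult_income "
        "WHERE income = '<=50K' "
        "GROUP BY workclass "
        "ORDER BY workclass"
    )
    result = cursor.fetchall()
    conn.close()
    return result


for workclass, count in income_counts(ROWS):
    print(f"{workclass:<18}| {count}")
```

Because the query uses only standard clauses (WHERE, GROUP BY, ORDER BY), the same statement should be portable to any engine that follows the standard.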

The Features - Advanced Data Analytics
Advanced Data Analytics On The Distributed Computing, Massive-Parallel Processing
• Built-In Analysis Functions and UDF
• Machine Learning
[Diagram: analysis cycle of Business Understanding -> Data Understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment, with Feedback]
Able To Repeat Trial And Error w/o Limit

The Features - Availability and Scalability
High Availability
• Automated Data Replication And Recovery, and Failover
High Scalability
• An Elastic Cluster - Nodes That Can Flexibly Attach And Detach
[Diagram: Clients exchange Request / Response with the Coordinator(s) over HTTP; the Coordinator(s) and Workers are labeled Monitoring Resources, Scheduling Jobs, Requesting Jobs, Message with Gossip Protocol, and Circuit Breaker*]
* Circuit Breaker: martinfowler.com/bliki/CircuitBreaker.html
Figure: Akka Circuit breaker
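The slide names the Circuit Breaker pattern for job requests but does not show how Astra implements it. The following is a minimal, generic sketch of the pattern in Python, not Astra's or Akka's code; the class name, the thresholds, and the injectable clock are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Generic closed / open / half-open breaker (illustrative only)."""

    def __init__(self, max_failures=3, reset_timeout=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state() == "open":
            raise CircuitOpenError("target is considered unavailable")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A coordinator could wrap each request to a worker in `breaker.call(...)`, so that a failing worker is skipped quickly instead of being retried on every job; whether Astra wires it this way is not stated in the slides.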

High-level Architecture
Clients: Astra CLI (Operate Astra, Original Data Load), SQL over ODBC/JDBC
Database Layer: AstraSQL (SQL Engine), AstraBase (Multi-Coordinator, Workers)
DataStore Layer: Astra DataStore
Component responsibilities, in the order listed on the slide:
- Original Data
- Semi-Structured Data
- Cold Data
- Columnar Tables
- Metadata Store
- Record Operation
- Record Set Cache (Hot Data)
- Distributed Computing
- Data Analysis
- Data Converter
  - Semi-Structured Data To Columnar Table

LeoFS For Astra DataStore
LeoFS is a software-defined storage (SDS) for DataLake and Web
LeoFS is an enterprise open source storage system: a highly available, distributed, eventually consistent object/blob store
Goals:
- High Availability
- High Cost Performance Ratio
- High Scalability

Processing Flow - Store a CSV File, Then Query Data
[Store a CSV File]
1-1. Put Original Data w/AstraCLI
1-2. Create Metadata
2. Store the Data and Metadata
[Execute Query]
3. Query Data For Aggregation Or Data Analysis
[Convert Data Format At Async]
4. Request Converting Data Format of a Table
5. Convert Data Format of a Table and Change Table’s Metadata
6. Store Converted Data
Diagram components: AstraCLI, AstraSQL, AstraBase Coordinator(s) (Resource Monitor + Scheduler), AstraBase Workers, Astra DataStore (LeoFS)
Diagram interfaces: REST-API, gRPC, S3-API, O/JDBC

Processing Flow - Query for Advanced Analysis
[Execute Query]
1. Execute SQL For Data Analysis
2-1. Request Data Analysis to AstraBase
2-2. Request Message to AstraBase’s Workers
[Retrieve Records]
3-1. Retrieve Target Records from the Cache
3-2. Retrieve Target Records From LeoFS (Cache Miss)
4. Process Data Analysis in Parallel
[Reply]
5. Reply To AstraBase Coordinator, Then Summarize the Result on the Coordinator
Diagram components: AstraSQL, AstraBase Coordinator(s) (Resource Monitor + Scheduler), AstraBase Workers
Diagram interfaces: O/JDBC, gRPC, S3-API
To Handle Multiple Data Formats In A Table
1. Data Load
Store Files Into Astra (Original Data, Semi-Structured Files)
- Data Validation
- Data Verification
- Data Type Inference
- Partition Into Multiple Chunks
- Store Chunks and Metadata
2. Data Conversion At Async
- CSV / TSV / JSON To Parquet / CarbonData SerDes
- Data is partitioned by a condition of a specified column
- Self-Service Analytics Remains Possible Even During Data Conversion
Data Storage: Supports Multiple Data Formats And SerDes
Supported Data Formats and SerDes
- CSV, TSV, and Custom Delimiter Files
- JSON
- RegEx SerDes for Unstructured Data
- Parquet SerDes (A Columnar Storage Format)
- CarbonData SerDes (A Columnar Storage Format)
Supported Compression Methods
- SNAPPY
- ZLIB
- GZIP
- LZO
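A RegEx SerDes turns unstructured text lines into rows by matching them against a pattern with one capture group per column. The sketch below shows the idea with Python's re module; the log layout, the pattern, and the field names are hypothetical and do not come from Astra.

```python
import re

# Hypothetical access-log layout; pattern and field names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)


def deserialize(line):
    """Parse one text line into a row dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    return row


def serialize(row):
    """Write a row dict back into the original line layout."""
    return '{host} [{time}] "{request}" {status}'.format(**row)
```

Lines that do not match the pattern yield `None` here; a real SerDes would need a policy for such lines (skip, reject, or store as raw text), which the slides do not specify.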

Data Type Inference
[Figure: table schema in CSV format and in Parquet format, with an example of METADATA as JSON]
Stores Each File Into Astra Data Store, LeoFS
Machine Learning on Astra - Modeling
[Request Modeling]
1. Request Modeling To An Initiator Of AstraBase
[Create A Model, Then Store It]
2. Generate Tasks From A Job On A Coordinator
3. Request A Task To Workers
4-1. Execute Function(s) In Parallel On Each Worker
4-2. Load Data From Data Store If It Is Not In The Cache
5. Summarize The Result On A Coordinator, Then Store The Model Into The Cluster To Reuse
Diagram components: AstraSQL, AstraBase Coordinator(s) (Resource Monitor + Scheduler), AstraBase Workers
Diagram interfaces: O/JDBC, gRPC, S3-API
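The modeling flow (split a job into tasks, run them in parallel on workers, summarize on a coordinator, store the model) can be illustrated with a model whose training reduces to mergeable partial sums. The sketch below fits a simple linear regression this way; the choice of model, the function names, and the in-memory `MODEL_STORE` are assumptions for illustration, since the slides do not say which algorithms Astra ships or how models are stored.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition):
    """Step 4-1: compute partial sums for one partition of (x, y) pairs."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return n, sx, sy, sxy, sxx


def fit_on_coordinator(partitions):
    """Steps 2, 3 and 5: one task per partition, then merge the partial sums."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(worker_task, partitions))
    n, sx, sy, sxy, sxx = (sum(values) for values in zip(*partials))
    # Closed-form least squares for y = slope * x + intercept.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return {"slope": slope, "intercept": intercept}


MODEL_STORE = {}  # hypothetical stand-in for storing the model in the cluster


def create_model(name, partitions):
    """Fit the model and keep it under a name so later queries can reuse it."""
    MODEL_STORE[name] = fit_on_coordinator(partitions)
    return MODEL_STORE[name]
```

The design point is that workers return only small summaries rather than raw records, so the coordinator's work stays cheap regardless of how much data each worker scans.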

Visualizing Data With 3rd Party Tools
Communicates With Data Visualization And BI Tools
Dundas BI
Qlik Sense
Microsoft PowerBI

Future Plans
Timeline: By Oct/E, 2017 | Nov, 2017 - June/E, 2018 | Q3 2018
Milestones: Alpha -> 1st Beta -> 2nd Beta -> Publish It
- Alpha
  - Un/Semi-Structured Data and Parquet SerDes Support
  - BI Tools and Visualization Tools Integration
- 1st Beta, Step-Growth Phase
  - Record Set Cache
  - Distributed Computing For UDF and ML
  - Other SerDes Support
