
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analysis, Astra Project

Astra is a distributed SQL database for data analysis and prediction. We aim to achieve near-real-time data analysis and to deliver the components of a Data Lake as a Service, of which Astra is one. Another feature of Astra is its integration with machine learning, which supports many kinds of data analysis.

Published in: Technology

  1. Rakuten Technology Conference 2017: A Distributed SQL Database For Data Analysis, Astra Project. 2017-10-28. Yosuke Hara (原 陽亮), Rakuten Institute of Technology, Rakuten, Inc. rev. 1.0.5
  2. R&D Projects: Skylab (A Microservices Framework), LeoFS (A Distributed Storage), Astra (A Distributed SQL Database For Data Analytics)
  3. Introducing Astra. * “Astra” is the code name of a product under development
  4. One of the Backgrounds: More “Connected Things” In The World. Consumer Applications to Represent 63% of Total IoT Applications in 2017. Figure: IoT Units Installed Base by Category (Millions of Units). Source: Gartner (January 2017)

       Category                    |    2016 |    2017 |    2018 |    2020
      -----------------------------+---------+---------+---------+---------
       Consumer                    |   3,963 | 5,244.3 | 7,036.3 |  12,863
       Business: Cross-Industry    | 1,102.1 |   1,501 | 2,132.6 | 4,381.4
       Business: Vertical-Specific | 1,316.6 | 1,635.4 | 2,027.7 |   3,171
       Total                       |    6.4B |    8.4B |   11.2B |   20.4B

      2017 shares: Consumer 63%, Business: Cross-Industry 18%, Business: Vertical-Specific 19%. Total growth from 2016 to 2017: +31%
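The headline figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the three number series in the chart map to the categories as Consumer (3,963 to 12,863), Business: Cross-Industry (1,102.1 to 4,381.4), and Business: Vertical-Specific (1,316.6 to 3,171); the function names are made up for this example.

```python
# Units in millions, read off the chart; the series-to-category mapping is an assumption.
UNITS_MILLIONS = {
    "Consumer": {2016: 3963.0, 2017: 5244.3, 2018: 7036.3, 2020: 12863.0},
    "Business: Cross-Industry": {2016: 1102.1, 2017: 1501.0, 2018: 2132.6, 2020: 4381.4},
    "Business: Vertical-Specific": {2016: 1316.6, 2017: 1635.4, 2018: 2027.7, 2020: 3171.0},
}


def total(year):
    """Total installed base (millions of units) for one year."""
    return sum(series[year] for series in UNITS_MILLIONS.values())


def share(category, year):
    """Fraction of the yearly total contributed by one category."""
    return UNITS_MILLIONS[category][year] / total(year)


def growth(from_year, to_year):
    """Relative growth of the total between two years."""
    return total(to_year) / total(from_year) - 1


for year in (2016, 2017, 2018, 2020):
    print(year, f"{total(year) / 1000:.1f}B")
print(f"Consumer share 2017: {share('Consumer', 2017):.1%}")
print(f"Growth 2016 to 2017: {growth(2016, 2017):.1%}")
```

Under this mapping the totals round to the 6.4B, 8.4B, 11.2B, and 20.4B labels, the consumer share to 63%, and the 2016 to 2017 growth to 31%. The vertical-specific share computes to about 19.5%, which the slide shows as 19%.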
  5. Providing A Database With Which Anyone Can Analyze Data
  6. Initial Concept: Provides Components of DataLake as a Service. Diagram labels: Data Science + DataLake; Data Governance; Job Scheduler + Distributed Computing; Data Store; Astra; Skylab; Spark, Hadoop; Self-Service Analytics
  7. Current Concept: Advanced Data Analysis In Semi-Realtime At Low Cost. Diagram labels: Streaming Data; Un/Semi-Structured Data; Store Data Into Astra; Aggregate and Analyze Data; Find Insights; Data Intelligence; Action; Tools / Apps; Automated Systems
  8. Current Concept: Depends on Single Source Of Truth. Astra’s Components: Distributed Computing For Massive-Parallel Processing + Distributed Database For Aggregation and Analysis + Distributed Storage (DataLake Store). Other diagram labels: Self-Service Analytics; Data Governance; In-place Analysis
  9. Features
  10. Unified Components. Diagram labels: Database; SQL Engine; Data Science; Analysis Functions On The Distributed Computing; Reliability, Scalability, and Massive Parallel Processing; Ad-hoc Query; Various Data Without Limit; Data Store
  11. The Features - ANSI SQL99 Standard. Conforms To The ANSI SQL99 Standard
      • Communication With Any BI / Data Visualization Tools, and Apps
      • Able To Call All of Astra’s Functions, UDFs and ML With SQL

      astra:test> SELECT workclass, COUNT(income)
               ->   AS income_count
               ->   FROM adult_income
               ->  WHERE income = '<=50K'
               ->  GROUP BY workclass
               ->  ORDER BY workclass;
          workclass     | income_count
      ------------------+--------------
       ?                |         2534
       Federal-gov      |          871
       Local-gov        |         2209
       Never-worked     |           10
       Private          |        26519
       Self-emp-inc     |          757
       Self-emp-not-inc |         2785
       State-gov        |         1451
       Without-pay      |           19
      (9 rows)
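Because the query on this slide uses only standard SQL (COUNT, WHERE, GROUP BY, ORDER BY), it can be tried against any SQL engine. The sketch below runs the same statement on an in-memory SQLite database through Python's `sqlite3` module; SQLite is only a stand-in here, not Astra, and the sample rows are hypothetical rather than the real `adult_income` data.

```python
import sqlite3

# Hypothetical sample rows (workclass, income); not the real adult_income data.
SAMPLE_ROWS = [
    ("Private", "<=50K"),
    ("Private", "<=50K"),
    ("Private", ">50K"),
    ("State-gov", "<=50K"),
    ("Local-gov", ">50K"),
]

# The statement from the slide, unchanged apart from whitespace.
QUERY = """
SELECT workclass, COUNT(income) AS income_count
  FROM adult_income
 WHERE income = '<=50K'
 GROUP BY workclass
 ORDER BY workclass
"""


def count_low_income_by_workclass(rows):
    """Load rows into a throwaway table and run the slide's query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE adult_income (workclass TEXT, income TEXT)")
        conn.executemany("INSERT INTO adult_income VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = count_low_income_by_workclass(SAMPLE_ROWS)
for workclass, income_count in result:
    print(f"{workclass:<16} | {income_count}")
```

A client talking to Astra would presumably send the same statement over ODBC/JDBC instead; only the connection layer would differ.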
  12. The Features - Advanced Data Analytics. Advanced Data Analytics On The Distributed Computing, Massive-Parallel Processing
      • Built-In Analysis Functions and UDF
      • Machine Learning
      Cycle shown in the diagram: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback. Able To Repeat Trial And Error w/o Limit
  13. The Features - Availability and Scalability
      High Availability
      • Automated Data Replication And Recovery, and Failover
      High Scalability
      • An Elastic Cluster - Nodes That Can Flexibly Attach And Detach
      Diagram labels: Clients; Request; Response; HTTP; Coordinator(s); Monitoring Resources; Scheduling Jobs; Requesting Jobs; Worker (x5); Message with Gossip Protocol; Circuit Breaker
      * Circuit Breaker: martinfowler.com/bliki/CircuitBreaker.html. Circuit Breaker Figure: Akka Circuit breaker
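The slide names the circuit breaker pattern without showing how Astra implements it. As a generic illustration of the pattern described at the linked page, here is a minimal sketch with the usual closed, open, and half-open states. The class name, parameters, and thresholds are assumptions made for this example and are not taken from Astra or Akka.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Generic circuit breaker sketch; not Astra's or Akka's implementation."""

    def __init__(self, max_failures=3, reset_timeout=5.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call in half-open state, or too many failures
            # in closed state, opens the circuit again.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

A coordinator could wrap each request to a worker in `breaker.call(...)`, so that a worker that keeps failing is skipped quickly instead of being retried on every job.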
  14. Architecture
  15. High-level Architecture. Diagram labels: Clients; Astra CLI; SQL over ODBC/JDBC; SQL Engine; Database Layer; DataStore Layer; AstraSQL; AstraBase; Astra DataStore; Astra Multi-Coordinator; Workers; Load; Operate; Original Data. Listed items: - Original Data - Semi-Structured Data - Cold Data - Columnar Tables - Metadata Store - Record Operation - Record Set Cache (Hot Data) - Distributed Computing - Data Analysis - Data Converter - Semi-Structured Data To Columnar Table
  16. LeoFS For Astra DataStore. LeoFS is a software-defined storage (SDS) for DataLake and Web. LeoFS is an Enterprise Open Source Storage, and it is a highly available, distributed, eventually consistent object/blob store. Goals: - High Availability - High Cost Performance Ratio - High Scalability
  17. Processing Flow - Store a CSV file, Then Query Data
      [Store a CSV File]
      1-1. Put Original Data w/AstraCLI
      1-2. Create Metadata
      2. Store the Data and Metadata
      [Execute Query]
      3. Query Data For Aggregation Or Data Analysis
      [Convert Data Format At Async]
      4. Request Converting Data Format of a Table
      5. Convert Data Format of a Table and Change Table’s Metadata
      6. Store Converted Data
      Diagram labels: AstraCLI; AstraSQL; AstraBase; AstraBase Coordinator(s); AstraBase Workers; Resource Monitor + Scheduler; Astra DataStore (LeoFS); REST-API; S3-API; gRPC; O/JDBC
  18. Processing Flow - Query for Advanced Analysis
      [Execute Query]
      1. Execute SQL For Data Analysis
      2-1. Request Data Analysis to AstraBase
      2-2. Request Message to AstraBase’s Workers
      [Retrieve Records]
      3-1. Retrieve Target Records from the Cache
      3-2. Retrieve Target Records From LeoFS (Cache Miss)
      4. Process Data Analysis in Parallel
      [Reply]
      5. Reply To AstraBase Coordinator, Then Summarize the Result on the Coordinator
      Diagram labels: AstraSQL; AstraBase; AstraBase Coordinator(s); AstraBase Workers; Resource Monitor + Scheduler; Astra DataStore (LeoFS); S3-API; gRPC; O/JDBC
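Steps 3-1 through 5 of this flow (read from the cache, fall back to LeoFS on a miss, process in parallel, summarize on the coordinator) can be sketched in a few lines. Everything here is hypothetical and chosen for illustration: the `Worker` class, the dictionary standing in for LeoFS, the chunk ids, and the use of a simple sum as the "analysis". Threads stand in for separate worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Hypothetical worker with a record-set cache in front of an object store."""

    def __init__(self, store):
        self.store = store      # stand-in for LeoFS: chunk id -> list of numbers
        self.cache = {}         # stand-in for the record set cache (hot data)
        self.cache_misses = 0

    def fetch(self, chunk_id):
        # Step 3-1: try the cache first.
        if chunk_id in self.cache:
            return self.cache[chunk_id]
        # Step 3-2: on a cache miss, read from the store and keep a copy.
        self.cache_misses += 1
        records = self.store[chunk_id]
        self.cache[chunk_id] = records
        return records

    def partial_sum(self, chunk_ids):
        # Step 4: each worker processes its own chunks.
        return sum(sum(self.fetch(cid)) for cid in chunk_ids)


def run_query(workers, chunk_ids):
    """Coordinator side: fan out to workers, then summarize (step 5)."""
    # Round-robin assignment of chunks to workers.
    assignments = [chunk_ids[i::len(workers)] for i in range(len(workers))]
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        partials = list(
            pool.map(lambda pair: pair[0].partial_sum(pair[1]),
                     zip(workers, assignments))
        )
    return sum(partials)
```

Running the same query twice against the same workers shows the effect of the cache: the second run returns the same total without any further reads from the store.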
  19. 1. Data Load: Store Files Into Astra (Original Data, Semi-Structured Files)
      - Data Validation
      - Data Verification
      - Data Type Inference
      - Partition Into Multiple Chunks (data is partitioned by a condition on a specified column)
      - Store Chunks and Metadata
      Able To Handle Multiple Data Formats In A Table
      2. Data Conversion At Async: CSV / TSV / JSON To Parquet / CarbonData SerDes
      Able To Do Self-Service Data Analytics Even During Data Conversion
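The slide does not say how data type inference or partitioning work inside Astra. As one plausible reading, the sketch below infers a type per CSV column by trying progressively looser parsers, then partitions rows by the value of a chosen column. The type names `integer` and `varchar` follow the DESCRIBE output shown elsewhere in this deck; `double`, the function names, and the equality-based partitioning rule are assumptions for this example.

```python
import csv
import io
from collections import defaultdict


def infer_type(values):
    """Return 'integer', 'double', or 'varchar' for a column of strings."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "varchar"


def load_csv(text, partition_column):
    """Infer a schema and split rows into chunks keyed by one column's value."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    schema = {col: infer_type([row[col] for row in rows]) for col in columns}
    chunks = defaultdict(list)
    for row in rows:
        chunks[row[partition_column]].append(row)
    return schema, dict(chunks)


SAMPLE_CSV = "age,workclass,income\n39,State-gov,<=50K\n50,Private,>50K\n38,Private,<=50K\n"
schema, chunks = load_csv(SAMPLE_CSV, partition_column="workclass")
print(schema)
print({key: len(rows) for key, rows in chunks.items()})
```

A production loader would also need rules for nulls, dates, and malformed rows, which is presumably where the validation and verification steps on the slide come in.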
  20. Data Storage: Supports Multiple Data Formats And SerDes
      Supported Data Formats and SerDes
      - CSV, TSV, and Custom Delimiter Files
      - JSON
      - RegEx SerDes for Unstructured Data
      - Parquet SerDes (A Columnar Storage Format)
      - CarbonData SerDes (A Columnar Storage Format)
      Supported Compression Methods
      - SNAPPY
      - ZLIB
      - GZIP
      - LZO
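To make two of the listed items concrete, the sketch below shows the general idea behind a RegEx SerDe (a regular expression with named groups turns an unstructured line into columns) and a round trip through ZLIB and GZIP using Python's standard library. The log layout, the pattern, and the function names are invented for this example; Astra's own SerDe configuration is not shown in the slides, and SNAPPY and LZO are left out because they need third-party packages.

```python
import gzip
import re
import zlib

# Hypothetical log layout: "<timestamp> <LEVEL> <free-text message>".
LOG_PATTERN = re.compile(r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)")


def deserialize(line, pattern=LOG_PATTERN):
    """Turn one unstructured line into a column dict, or None if it does not match."""
    match = pattern.fullmatch(line)
    return match.groupdict() if match else None


# Two of the compression methods named on the slide, from the standard library.
COMPRESSORS = {
    "ZLIB": (zlib.compress, zlib.decompress),
    "GZIP": (gzip.compress, gzip.decompress),
}


def roundtrip(data, method):
    """Compress and decompress bytes; return (compressed, restored)."""
    compress, decompress = COMPRESSORS[method]
    packed = compress(data)
    return packed, decompress(packed)


print(deserialize("2017-10-28T10:00:00 INFO job started"))
packed, restored = roundtrip(b"workclass,income\n" * 100, "GZIP")
print(len(restored), "bytes restored from", len(packed), "compressed bytes")
```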
  21. Data Type Inference. Stores Each File Into Astra Data Store, LeoFS. Figure panels: CSV Format; Parquet Format; Table Schema; An Example of METADATA as JSON
  22. Machine Learning on Astra - Modeling
      [Request A Modeling]
      1. Request Modeling From An Initiator Of AstraBase
      2. Generate Tasks From A Job On A Coordinator
      3. Request A Task To Workers
      4-1. Execute Function(s) In Parallel On Each Worker
      4-2. Load Data From The Data Store If It Does Not Exist In The Cache
      [Create A Model, Then Store It]
      5. Summarize The Result On A Coordinator, Then Store The Model Into The Cluster To Reuse
      Diagram labels: AstraSQL; AstraBase; AstraBase Coordinator(s); AstraBase Workers; Resource Monitor + Scheduler; Astra DataStore (LeoFS); S3-API; gRPC; O/JDBC
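The modeling flow on this slide (split a job into tasks, run them in parallel on workers, summarize on a coordinator, store the model for reuse) resembles a map-and-combine computation. The sketch below illustrates that shape with a deliberately simple model, a least-squares line fitted from per-task partial sums. The choice of model, all function names, and the in-memory `MODEL_STORE` dictionary are assumptions for this example; the slides do not say which algorithms Astra runs or how models are stored.

```python
from concurrent.futures import ThreadPoolExecutor

MODEL_STORE = {}  # hypothetical stand-in for "store the model into the cluster"


def make_tasks(points, n_tasks):
    """Step 2: split one job into tasks (here, interleaved slices of the data)."""
    return [points[i::n_tasks] for i in range(n_tasks)]


def run_task(chunk):
    """Step 4-1: a worker computes partial sums for its chunk of (x, y) points."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return (n, sx, sy, sxy, sxx)


def summarize(partials):
    """Step 5: the coordinator combines partial sums into a least-squares line."""
    n, sx, sy, sxy, sxx = (sum(column) for column in zip(*partials))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return {"slope": slope, "intercept": intercept}


def train(name, points, n_tasks=4):
    """Run the whole flow and keep the resulting model under a name for reuse."""
    tasks = make_tasks(points, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partials = list(pool.map(run_task, tasks))
    model = summarize(partials)
    MODEL_STORE[name] = model
    return model


points = [(x, 2 * x + 1) for x in range(10)]
model = train("demo_line", points, n_tasks=3)
print(model)
```

The point of the design is that workers only return small summaries, so the coordinator's work does not grow with the size of the data each worker scanned.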
  23. Integration With BI Tools
  24. Integration With Tableau (BI Tool)

      astra:test> DESCRIBE adult_income
               -> ;
           Column      |  Type   | Extra | Comment
      -----------------+---------+-------+---------
       age             | integer |       |
       workclass       | varchar |       |
       fnlwgt          | integer |       |
       education       | varchar |       |
       educational-num | integer |       |
       marital-status  | varchar |       |
       occupation      | varchar |       |
       relationship    | varchar |       |
       race            | varchar |       |
       gender          | varchar |       |
       capital-gain    | integer |       |
       capital-loss    | integer |       |
       hours-per-week  | varchar |       |
       native-country  | varchar |       |
       income          | varchar |       |
      (15 rows)

      astra:test> SELECT workclass, COUNT(income)
               ->   as income_count
               ->   FROM adult_income
               ->  WHERE income = '<=50K'
               ->  GROUP BY workclass
               ->  ORDER BY workclass;
          workclass     | income_count
      ------------------+--------------
       ?                |         2534
       Federal-gov      |          871
       Local-gov        |         2209
       Never-worked     |           10
       Private          |        26519
       Self-emp-inc     |          757
       Self-emp-not-inc |         2785
       State-gov        |         1451
       Without-pay      |           19
      (9 rows)
  25. Visualizing Data With 3rd Party Tools. Communicates With Data Visualization And BI Tools: Dundas BI, Qlik Sense, Microsoft PowerBI
  26. Future Plans
  27. Future Plans. Timeline labels: By Oct/E, 2017; Nov, 2017 - June/E, 2018; Q3 2018. Milestones: Alpha; 1st Beta; 2nd Beta; Publish It
      Alpha
      - Un/Semi-Structured Data and Parquet SerDes Support
      - BI Tools and Visualization Tools Integration
      1st Beta, Step-Growth Phase
      - Record Set Cache
      - Distributed Computing For UDF and ML
      - Other SerDes Support
  28. THANK YOU
