DATA ORCHESTRATION SUMMIT
2019
Alluxio Innovations for Structured Data
Gene Pang | Alluxio
Motivation
Alluxio Structured Data Management
Developer Preview in Alluxio 2.1.0
Demo
Conclusion
Outline
Motivation
Common Alluxio Use Cases
…
…
Unified Interface
Unified Namespace
Caching and Locality
SQL Engines are popular
Storage Systems      SQL Frameworks
Files/Objects        Tables
Directories          Schemas
Raw Bytes            Rows/Columns
Cost-efficiency      Compute-optimized
Durability           Computation
Impedance Mismatch
Further Expand Benefits!
Benefits of Alluxio Data Orchestration
Storage Systems
SQL Frameworks
Caching
Unified Interface/Namespace
Schema-Aware Optimizations
Compute-Optimized Formats
Physical Data Independence
Demo
20 seconds vs. 3 seconds
Alluxio
Structured Data Management
Integrating SQL Engines with Alluxio
Provide Structured Data APIs
Focus on how frameworks interact with data
High-Level Philosophy
Cache Logical Data Access
Focus on caching what frameworks want
Alluxio Structured Data Management
Alluxio Structured Data Management Vision
Storage System
Transformation Service
Structured Data and Metadata
Logical Data Access Layer
Structured Data Client
SQL Engine
Engine
Developer Preview in Alluxio 2.1.0
Try out initial components!
Target Environment
Presto
Hive Connector
Hive Metastore
Storage
Alluxio Structured Data Management Preview
Presto
Alluxio Caching Service
Alluxio Catalog Service
Alluxio Transformation Service
Hive Connector
Alluxio Connector
Hive Metastore
Storage
Alluxio Catalog Service
Alluxio Catalog Service
Hive Metastore
Hive Under Database
Functionality
Manages metadata for structured data
Abstracts other database catalogs as Under Databases (UDBs)
Benefits
Schema-aware optimizations
Simple deployment
alluxio table attachdb <udb type> <udb uri> <udb db name>
associate an Alluxio database with a UDB database
alluxio table detachdb <db name>
remove the association created by attachdb
alluxio table ls [<db name> [<table name>]]
display information in the catalog
alluxio table sync <db name>
synchronize the Alluxio catalog with the UDB metadata
How to Use Alluxio Catalog CLI
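As a rough illustration of how the four catalog commands fit together, the sketch below attaches a Hive Metastore database, lists it, syncs it, and detaches it. The UDB type `hive`, the metastore URI `thrift://metastore-host:9083`, the database name `default`, and the `DRY_RUN` switch are assumptions made for this example, not values from the slides. It also assumes the Alluxio database takes the same name as the UDB database. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch of a catalog workflow built from the commands on the slide.
# Assumed, hypothetical values; adjust for a real cluster.
UDB_TYPE="hive"
UDB_URI="thrift://metastore-host:9083"
DB_NAME="default"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

catalog_workflow() {
  # Associate an Alluxio database with the UDB database.
  run alluxio table attachdb "$UDB_TYPE" "$UDB_URI" "$DB_NAME"
  # Display what the catalog now knows about the database.
  run alluxio table ls "$DB_NAME"
  # Pick up metadata changes made in the UDB since the attach.
  run alluxio table sync "$DB_NAME"
  # Remove the association again.
  run alluxio table detachdb "$DB_NAME"
}

catalog_workflow
```

Setting `DRY_RUN=0` would run the commands against a live cluster instead of printing them.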
Tighter integration with Presto
New plugin based on the Presto Hive connector
Available in Alluxio 2.1.0 distribution
Future: Merge connector into Presto codebase
Alluxio Presto Connector
etc/catalog/cat_alluxio.properties
connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=HOSTNAME:19998
Alluxio Presto Connector Configuration
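One way to put this configuration in place is sketched below: a small script that writes the three properties into a Presto `etc/catalog` directory. The `PRESTO_ETC` and `MASTER_HOST` variables are assumptions introduced for this example; `HOSTNAME` is kept as the placeholder from the slide and should be replaced with the actual Alluxio master host.

```shell
#!/bin/sh
# Write the catalog properties from the slide into a Presto etc/ directory.
# PRESTO_ETC and MASTER_HOST are hypothetical knobs for this sketch.
PRESTO_ETC="${PRESTO_ETC:-./etc}"
MASTER_HOST="${MASTER_HOST:-HOSTNAME}"

write_catalog() {
  mkdir -p "$PRESTO_ETC/catalog"
  cat > "$PRESTO_ETC/catalog/cat_alluxio.properties" <<EOF
connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=${MASTER_HOST}:19998
EOF
}

write_catalog
```

Presto names a catalog after its properties file, so with this file in place tables would be addressed as `cat_alluxio.<schema>.<table>`.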
Transformation Service
Transform data to be compute-optimized, independent of the storage format
Coalesce
Format Conversion (CSV to Parquet)
alluxio table transform <db name> <table name>
initiate a transformation on a table
alluxio table transformStatus <transform id>
display the status for a transformation
How to Use Transformation CLI
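The two transformation commands might be combined as in the sketch below. The database name `default`, the `DRY_RUN` switch, and the `TRANSFORM_ID` variable are assumptions for illustration; `store_sales` is the table used in the demo. The slides do not show how the transform id is reported, so it is passed in by hand rather than parsed from command output.

```shell
#!/bin/sh
# Sketch: start a transformation, then check on it.
# DB_NAME and TRANSFORM_ID are hypothetical; adjust for a real cluster.
DB_NAME="${DB_NAME:-default}"
TABLE_NAME="${TABLE_NAME:-store_sales}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Initiate a transformation on a table.
start_transform() {
  run alluxio table transform "$DB_NAME" "$TABLE_NAME"
}

# Display the status of a transformation, given its id.
check_transform() {
  run alluxio table transformStatus "$1"
}

start_transform
check_transform "${TRANSFORM_ID:-<transform id>}"
```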
Familiar Alluxio Concepts
Structured Data                              Filesystem
Under Database                               Under Filesystem
Attach/Detach database                       Mount/Unmount filesystem
Transparent access of tables                 Transparent access of files
Transparent caching of portions of tables    Transparent caching of portions of files
Demo!
Two isolated 10-node AWS clusters
Presto + Hive Metastore + S3 Data
Presto + Alluxio + Hive Metastore + S3 Data
TPC-DS dataset on S3
CSV format
~10,000 files
Demo
Attached existing Hive database into Alluxio Catalog
Alluxio Catalog served table metadata for Presto
Transformed store_sales by coalescing and converting CSV to Parquet
Demo Summary
Presto without Alluxio: 20s
Alluxio Transformations: 7s
Alluxio Transformations with Caching: 3s
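For a sense of scale, the demo timings can be turned into speedup factors relative to the 20s baseline. The `speedup` helper below is a hypothetical name introduced for this arithmetic only.

```shell
#!/bin/sh
# Speedup relative to the baseline: baseline_seconds / new_seconds.
speedup() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1fx\n", base / t }'
}

speedup 20 7   # transformations only: prints 2.9x
speedup 20 3   # transformations with caching: prints 6.7x
```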
Conclusion
User community feedback/collaboration is important!
Potential projects
New UDB implementations (AWS Glue)
More conversion formats (JSON)
DDL/DML workloads (CREATE TABLE, INSERT, etc.)
New Client APIs for structured data (Arrow)
Future Work
Try it out!
Documentation
Provide feedback
Feature requests and issues on GitHub: Alluxio/alluxio
Developer Preview Available in Alluxio 2.1.0
Thank You!
