Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and HBase for persistence. This talk highlights a new Native Spark Datasource, which enables seamless data movement between Spark DataFrames and Splice Machine tables without serialization and deserialization. This datasource makes machine learning libraries such as MLlib native to the Splice RDBMS. Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. Together, the transactional capabilities of Splice Machine, the plethora of DataFrame-compatible libraries, and MLflow manage a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create full-cycle machine learning capabilities on an integrated platform: from transactional updates to data wrangling, experimentation, and deployment, and back again.
2. Gene Davis, Splice Machine
Splice Machine’s Use of Apache Spark and MLflow
#UnifiedAnalytics #SparkAISummit
3. Splice Machine
• What are we?
– A scale-out RDBMS that enables simultaneous transactions (OLTP) and analytics (OLAP)
– Powers Operational AI: the ability to run AI applications in real time
• Who uses us?
– Companies in financial services, healthcare, supply chain, etc.
– One example: 7 PB, 2B record updates/day, 2M queries/day with sub-second response time
• How do we do it?
– Transactional SQL engine on top of HBase and Spark
• “Dual engine” architecture
– Many delivery options: on-premises or cloud service (AWS, Azure, bespoke cloud, etc.)
5. The Three Dimensions of Intelligence
• OLAP: What has happened in the past that might impact you?
• OLTP: What is happening right now?
• ML: What will happen in the future?
6. The Three Dimensions of Intelligence
• OLTP, OLAP, ML
• Key platforms are duct-taped together, leading to:
– High infrastructure costs
– Latency in decision making
– Isolation from business processes
9. Data Science Pain Points
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
10. Data Science Pain Points
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?
● What data did I use?
● What algorithms/parameters gave the best model?
● Why didn’t I get the same results?
● What libraries are used?
● What model version is deployed?
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
11. MLflow and ML Manager
• Splice Machine chose MLflow
– MLflow Tracking: Track experiment runs and parameters
– MLflow Models: Package model artifacts
• Splice ML Manager
– Machine Learning on the Splice Machine Stack
– MLflow Tracking and Models
– Includes UI to Deploy to Amazon SageMaker
17. ML Manager
• Beta launched in March
– MLflow v0.8
• Available at cloud.splicemachine.com
• MLManager() API is open source at:
– https://github.com/splicemachine/pysplice
– (subject to change per MLflow 1.0 API)