
Splice Machine's use of Apache Spark and MLflow


Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) built on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale, using Spark for all analytical computations and HBase for persistence. This talk highlights a new Native Spark Datasource, which enables seamless data movement between Spark DataFrames and Splice Machine tables without serialization and deserialization. This datasource makes machine learning libraries such as MLlib native to the Splice RDBMS. Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine, combined with the plethora of DataFrame-compatible libraries and MLflow's capabilities, support a complete, real-time workflow of data-to-insights-to-action. In this presentation we demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform: from transactional updates to data wrangling, experimentation, and deployment, and back again.

Published in: Data & Analytics

Splice Machine's use of Apache Spark and MLflow

  1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics
  2. Gene Davis, Splice Machine: Splice Machine’s use of Apache Spark and MLflow #UnifiedAnalytics #SparkAISummit
  3. Splice Machine
     • What are we?
       – A scale-out RDBMS that enables simultaneous transactions (OLTP) and analytics (OLAP)
       – Powers Operational AI: the ability to run AI applications in real time
     • Who uses us?
       – Companies in financial services, healthcare, supply chain, etc.
       – One example: 7 PB of data, 2B record updates/day, 2M queries/day with sub-second response time
     • How do we do it?
       – Transactional SQL engine on top of HBase and Spark ("dual engine" architecture)
       – Many delivery options: on-premises, cloud service (AWS, Azure, bespoke cloud, etc.)
  4. Operational AI
     An integrated data platform for real-time AI applications (on premises or in the cloud), combining:
     • Operational Database: scale-out, OLTP, fast
     • Enterprise Data Warehouse: in-memory, OLAP, massively parallel
     • ML Models: notebooks, algorithms, model workflow
     Together these power business intelligence, artificial intelligence, and operational intelligence for intelligent decisions.
  5. The Three Dimensions of Intelligence
     • OLAP: What has happened in the past that might impact you?
     • OLTP: What is happening right now?
     • ML: What will happen in the future?
  6. The Three Dimensions of Intelligence
     When these key platforms (OLTP, OLAP, ML) are duct-taped together, the result is:
     • High infrastructure costs
     • Latency in decision making
     • Isolation from business processes
  7. Intelligent Action - Before
  8. Intelligent Action - After
  9. Data Science Pain Points
     Data Scientist:
     • Is my data ready to go?
     • Is it still relevant?
     • Do my features still align?
     Data Engineer:
     • The ETL process changed again – now what?
     • The Data Scientist requested a different level of granularity – how do I do that?
  10. Data Science Pain Points
     Data Scientist:
     • Is my data ready to go?
     • Is it still relevant?
     • Do my features still align?
     • What data did I use?
     • What algorithms/parameters gave the best model?
     • Why didn’t I get the same results?
     • What libraries are used?
     • What model version is deployed?
     Data Engineer:
     • The ETL process changed again – now what?
     • The Data Scientist requested a different level of granularity – how do I do that?
  11. MLflow and ML Manager
     • Splice Machine chose MLflow
       – MLflow Tracking: track experiment runs and parameters
       – MLflow Models: package model artifacts
     • Splice ML Manager
       – Machine learning on the Splice Machine stack
       – MLflow Tracking and Models
       – Includes a UI to deploy to Amazon SageMaker
  12. ML Manager Architecture
     (diagram: Splice Machine Data Platform, on premises, with the Native Spark Data Source and Deployment Automation)
  13. Native Spark Datasource
     • Efficient interface from Splice relational tables into Spark DataFrames (and back again)
     • No serialization/deserialization
     • Examples:
       – interestingDf = spliceContext.df("select * from interesting_table")
       – spliceContext.insert(dfWithData, 'table_name')
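The two calls above sketch the round trip between Splice tables and Spark DataFrames. As a purely illustrative stand-in (not the real Splice context, which runs SQL against the database and exchanges Spark DataFrames without ser/de), a toy in-memory object with the same call shape might look like:

```python
class ToySpliceContext:
    """Toy in-memory stand-in for the Native Spark Datasource interface.

    Illustrative only: the real context executes SQL against Splice Machine
    and returns Spark DataFrames. Here "tables" are lists of tuples and
    df() only understands 'select * from <table>'.
    """

    def __init__(self):
        self.tables = {"interesting_table": [(1, "claim_a"), (2, "claim_b")]}

    def df(self, sql):
        # Naively parse 'select * from <table>' and return that table's rows.
        table = sql.lower().split("from")[-1].strip()
        return list(self.tables.get(table, []))

    def insert(self, rows, table_name):
        # Append rows into the named table, creating it if needed.
        self.tables.setdefault(table_name, []).extend(rows)


ctx = ToySpliceContext()
interesting_df = ctx.df("select * from interesting_table")  # read side
ctx.insert([(3, "claim_c")], "interesting_table")           # write side
```

The point of the real interface is that both directions avoid a serialization boundary; the toy only mimics the call shape, not that property.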
  14. Accessing MLflow Capabilities
     • Start with Splice’s MLManager
       – manager = MLManager()
       – A convenience class on top of MLflow
     • APIs:
       – manager.create_experiment()
       – manager.set_active_experiment()
       – manager.create_new_run()
       – manager.log_param()
       – manager.log_metric()
       – manager.log_spark_model()
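The API list above mirrors MLflow's tracking concepts: experiments contain runs, and runs record parameters and metrics. As an illustration of what those calls conceptually record (a toy, not Splice's MLManager or MLflow itself, which delegate to a tracking server and can also log Spark models as artifacts):

```python
class ToyTracker:
    """Toy in-memory tracker mirroring the MLManager/MLflow call shape.

    Illustrative only: experiments map to lists of runs, and each run is a
    dict of logged params and metrics.
    """

    def __init__(self):
        self.experiments = {}        # experiment name -> list of runs
        self.active_experiment = None
        self.active_run = None

    def create_experiment(self, name):
        self.experiments[name] = []

    def set_active_experiment(self, name):
        self.active_experiment = name

    def create_new_run(self):
        run = {"params": {}, "metrics": {}}
        self.experiments[self.active_experiment].append(run)
        self.active_run = run

    def log_param(self, key, value):
        self.active_run["params"][key] = value

    def log_metric(self, key, value):
        self.active_run["metrics"][key] = value


# Usage follows the same sequence as the slide's API list
# (experiment/run names and values here are made up for illustration).
manager = ToyTracker()
manager.create_experiment("claim_fraud")
manager.set_active_experiment("claim_fraud")
manager.create_new_run()
manager.log_param("max_depth", 5)
manager.log_metric("auc", 0.91)
```

Recording every run this way is what answers the earlier pain-point questions ("what data did I use?", "what parameters gave the best model?").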
  15. MLflow UI
  16. Deployment Automation
  17. ML Manager
     • Beta launched in March – MLflow v0.8
     • Available at:
     • MLManager() API open source at:
       (subject to change per the MLflow 1.0 API)
  18. DEMO
  20. Many Disparate Tools
     • Data sources: OLTP – Oracle, Cassandra, Dynamo; OLAP – Redshift, Snowflake, S3
     • Notebooks: Apache Zeppelin, Jupyter
     • Data manipulation: Python, Pandas, Scikit, Spark
     • Machine learning: MLlib, R
     • Experimentation tracking: MLflow
     • Deployment: SageMaker, AzureML
  21. Insurance Claim Example
  22. Insurance Claim Example