ARCHITECTING AN OPEN SOURCE
DATA SCIENCE PLATFORM: 2018 EDITION
Dr. David Talby
CTO, Pacific AI
@davidtalby
in/davidtalby
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
AT THE BEGINNING, THERE WAS SEARCH
Integrate Data
ETL
Streaming
Quality
Enrichment
Dataflows
Data Analyst Data Scientist
SCOPE
Discover & Visualize
SQL
Search
Visualization
Dashboards
Real-Time Alerts
Train Models
ML, DL, DM, NLP, …
Explore & Visualize
Train & Optimize
Collaboration
Workflows
Productize Models
Deploy APIs
Publish APIs
CI & CD for Models
Measurement
Feedback
App Developer Data Engineer
Infrastructure
Deployment Orchestration Security Monitoring Single Sign-On Backup Scaling
GOALS
Enterprise Grade
Scales from GB to PB
Unified & Modular
Cutting Edge
CONSTRAINTS
No Commercial Software
No Copyleft
No SaaS
Build It
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
APACHE NIFI
NIFI FEATURES
Web-based dataflow user interface
Seamless experience between design, control, feedback, and monitoring
Highly configurable
Loss tolerant vs guaranteed delivery
Low latency vs high throughput
Dynamic prioritization
Flow can be modified at runtime
Back pressure
Data Provenance
Track dataflow from beginning to end
Designed for extension
Build your own processors and more (120+ available out-of-the-box)
Enables rapid development and effective testing
Secure
SSL, SSH, HTTPS, encrypted content, etc.
Multi-tenant authorization and internal authorization/policy management
APACHE KYLO
• Like NiFi, but wish for something simpler?
• Small team: Looking to quickly get started
• Large enterprise: Enable self service
• Meet Apache Kylo
• Self-Serve data ingestion & wrangling
• Search metadata, data lineage and profile
• Monitor data quality health in your feeds
• A layer on top of Apache NiFi
APACHE SPARK
SPARK SQL FEATURES
Distributed SQL Engine
Seamless integration with Spark DataFrames
ANSI SQL 2003 support: All 99 queries of TPC-DS
Extensible via User Defined Functions (UDF)
High performance
New “Catalyst” cost-based optimizer in Spark 2.2
Project Tungsten: “Joining a Billion Rows per Second on a Laptop”
2.5x performance gains between 1.6 and 2.0
2018 Improvements
Spark on Kubernetes, Pandas UDFs with Apache Arrow, new ML algorithms
Structured Streaming: Kafka, Stream to Stream Joins, ML models on Streams
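Spark SQL itself needs a SparkSession (`spark.sql(...)`, `spark.udf.register(...)`), which is heavy to demo here. As a self-contained sketch of the same ANSI-SQL-plus-UDF pattern, the stdlib `sqlite3` module supports the identical query shape and its own UDF registration; table and function names are made up for illustration.

```python
import sqlite3

# Spark SQL runs plain ANSI SQL over DataFrames; the same query shape
# is shown here against an in-memory SQLite database for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)])

# A user-defined function; Spark's equivalent would be
# spark.udf.register("with_tax", lambda x: x * 1.2)
conn.create_function("with_tax", 1, lambda x: x * 1.2)

rows = conn.execute(
    "SELECT user, with_tax(SUM(amount)) AS total "
    "FROM events GROUP BY user ORDER BY user"
).fetchall()
# → [('alice', 18.0), ('bob', 30.0)]
```

In Spark the same SQL would run distributed over a DataFrame registered as a temp view; only the session setup differs.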
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
APACHE SUPERSET
• No-code environment for exploring and visualizing data
• Open source alternative to Tableau, Power BI, Qlik, Looker, SiSense, etc.
• Build & share dashboards
• 30+ out-of-the-box visualizations
• Authentication & role-based access
• Integrates with most RDBMSs using SQLAlchemy
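Superset reaches each database through a SQLAlchemy connection URI, so adding a data source is mostly a matter of supplying one. A minimal sketch of the URI shape, parsed with the stdlib to call out its parts; the host, credentials, and database name are placeholders.

```python
from urllib.parse import urlsplit

# Superset connects to databases via SQLAlchemy URIs of this shape.
# Everything after the scheme below is a made-up example.
uri = "postgresql://analyst:secret@warehouse.internal:5432/sales"

parts = urlsplit(uri)
assert parts.scheme == "postgresql"       # SQLAlchemy dialect
assert parts.hostname == "warehouse.internal"
assert parts.port == 5432
assert parts.path == "/sales"             # database name
```

Swapping the dialect (`mysql://`, `presto://`, `hive://`, ...) is what makes "most RDBMSs" work: Superset only ever sees the SQLAlchemy interface.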
KIBANA
TIMELION
KIBANA FEATURES
Full-text and faceted search
Full text query language: Boolean operators, proximity, boosting
Faceted search: Filter by field, value ranges, date ranges, sort, limit, pagination
Time series analysis: aggregates, windowing, offsetting, trending, comparisons
Geospatial search: Search by shape, bounding box, polygon, by distance or range
Visualizations & Dashboards
All the basics: Area, pie, bar, heatmap, table, metric, map, scatter, timeline, tile
Custom visualizations using Vega and Vega-Lite
Drag & drop creation and editing of visualizations and dashboards
Dashboards can be dynamically filtered by time, queries, filters
Publish, embed and share dashboards
Real-time updates
Performant
Fast interactive queries, faceting and filtering
REST API and clients in all major languages
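Kibana's search features map directly onto the Elasticsearch query DSL underneath it. A sketch of the JSON body one would POST to an index's `_search` endpoint, combining a full-text match, a date-range filter, and a facet (terms aggregation); the index and field names (`message`, `status`, `@timestamp`) are examples, not anything from the deck.

```python
import json

# Boolean full-text search + date-range filter + faceting, expressed
# in the Elasticsearch query DSL that Kibana generates for you.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "timeout"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-7d"}}}],
        }
    },
    "aggs": {  # faceted search: document counts per status value
        "by_status": {"terms": {"field": "status"}}
    },
    "size": 20,
}
body = json.dumps(query)
```

Kibana builds exactly this kind of body from its UI controls, which is why everything it shows is also reachable from the REST API.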
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
THE NOTEBOOK: WHAT ABOUT AN IDE?
• If developing on a local machine,
let people use what they like
• If that’s not possible due to
privacy or compliance issues, or
if you must provide the solution:
• A Jupyter Notebook is an
interactive web app that includes
code, graphs, & documentation
• It supports over 60 languages
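One reason notebooks integrate so well into a platform: an `.ipynb` file is just JSON (nbformat 4), a list of cells plus metadata. A minimal skeleton, with placeholder cell contents:

```python
import json

# A Jupyter notebook is a JSON document: metadata plus a list of
# markdown and code cells. Cell contents here are placeholders.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {"kernelspec": {"name": "python3",
                                "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Churn analysis\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 + 1)\n"]},
    ],
}
as_json = json.dumps(notebook, indent=1)
```

Because the format is plain JSON, notebooks diff, version, and render outside Jupyter itself, which matters once they live in source control.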
THE NEXT GENERATION: JUPYTER LAB
• Multi-tab code consoles
• Multiple languages
• Side by side editing
• Mirror outputs
• Data & image viewers
• File explorer
• Console editor
• Extensible
TEAM WORK: JUPYTER HUB
LANGUAGE & LIBRARY SELECTION
“While in 2014 I wrote about Four main languages for Analytics,
Data Mining, Data Science being R, Python, SQL, and SAS, the
5 main languages in 2017 appear to be Python, R, SQL, Spark
& Tensorflow.”
2018 vs. 2017: Python +11%, R -14%
[Chart from the 19th annual KDnuggets Software Poll: Use of Deep Learning Tools]
2018 vs. 2017: TensorFlow +32%, Keras +108%
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
CLIPPER.AI: LOW LATENCY, MULTI-ENGINE MODEL SERVER
• Built-in support for scikit-learn,
R, Spark ML, TensorFlow
• Easily extend to your framework
• Designed for millisecond latency
• Container per model version
• Update & roll back models
• Scaling with Kubernetes
• Deploy models on CPUs, GPUs
or a hybrid combination
• In the works: Policies, feedback
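Once a model is deployed, Clipper serves each registered application behind a plain REST endpoint, so any app developer can call it with an HTTP POST. A sketch of that request, built but not sent; the host and application name are placeholders, and the `/{app}/predict` path with an `"input"` JSON body follows Clipper's documented pattern as of 2018.

```python
import json

# Clipper exposes each application at
#   POST http://<host>:1337/<app-name>/predict
# (1337 is its default query port). Host and app name are made up.
app_name = "churn-model"
url = f"http://clipper.internal:1337/{app_name}/predict"

# The body wraps one feature vector in "input":
payload = json.dumps({"input": [0.2, 1.5, 3.1]})
```

The response carries the prediction plus a flag indicating whether the latency SLO forced a default answer, which is what "designed for millisecond latency" buys you in practice.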
KONG API GATEWAY
API Gateway built on NGINX
Scalable
Modular with plugins
Authentication
Basic Auth, OpenID,
OAuth, HMAC, LDAP, JWT
Security
ACL, CORS, IP Restriction,
Bot Detection, SSL
Traffic Control
Proxy Caching, Rate Limiting,
Size Limits, Request Termination
Logging & Analytics
Galileo, Datadog, Runscope
TCP, HTTP, File, Syslog, StatsD
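Kong is configured entirely through its admin REST API (port 8001 by default). A sketch of the payloads one would POST to put a model API behind the gateway with rate limiting and key authentication, using the Service/Route object model Kong introduced around 0.13; all names and URLs below are placeholders.

```python
# JSON payloads for Kong's admin API; nothing is sent here, these are
# the request bodies one would POST. Names/URLs are made up.
service = {"name": "model-api",                 # POST /services
           "url": "http://clipper.internal:1337"}
route = {"paths": ["/predict"]}                 # POST /services/model-api/routes
rate_limit = {"name": "rate-limiting",          # POST /services/model-api/plugins
              "config": {"minute": 100}}
key_auth = {"name": "key-auth"}                 # require an API key per consumer

admin_calls = [
    ("POST", "/services", service),
    ("POST", "/services/model-api/routes", route),
    ("POST", "/services/model-api/plugins", rate_limit),
    ("POST", "/services/model-api/plugins", key_auth),
]
```

After these four calls, clients hit the gateway's proxy port with an API key and Kong enforces authentication and the 100-requests-per-minute limit before traffic ever reaches the model server.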
COLLABORATION, CI & CD
Plan
Projects, Boards, Issues,
Milestones, Teams
Create
Merge, Preview, Commit,
Branch, Lock, Discuss
Verify
Automated pipelines,
graphs, history, scaling
Package
Built-in container registry
Release
Continuous integration &
continuous deployment
Configure & Monitor
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
KUBERNETES
Portable Containers
Public, Private, Hybrid,
or Multi-Cloud
Deployment
Automation, Co-Location,
Storage Mounting, Secrets
Auto-*
-Scaling, -Healing, -Restart,
-Placement, -Replication
Rolling Updates
Load Balancing
Service Discovery
Monitoring Resources
Accessing & Ingesting Logs
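The Kubernetes features above (replication, rolling updates, self-healing) are all declared in one Deployment manifest. A minimal `apps/v1` Deployment, expressed as the dict you would serialize to YAML; the image name and labels are placeholders.

```python
# A minimal Kubernetes Deployment (apps/v1) as a Python dict;
# the container image and labels are made-up examples.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "model-server"},
    "spec": {
        "replicas": 3,                       # keep 3 pods running (self-healing)
        "selector": {"matchLabels": {"app": "model-server"}},
        "strategy": {                        # zero-downtime rolling updates
            "type": "RollingUpdate",
            "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
        },
        "template": {
            "metadata": {"labels": {"app": "model-server"}},
            "spec": {"containers": [{
                "name": "model-server",
                "image": "registry.internal/model-server:1.0",
                "ports": [{"containerPort": 8080}],
            }]},
        },
    },
}
```

Pushing a new image tag and re-applying this manifest is the whole "update & roll back models" story: Kubernetes surges one new pod at a time and never drops below three ready replicas.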
PROMETHEUS & GRAFANA
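Prometheus works by scraping a plain-text `/metrics` endpoint from each service; Grafana then charts the stored series. A sketch of that exposition format, with a made-up counter name and labels:

```python
# Prometheus scrapes a plain-text page where each sample line is
#   metric_name{label="value",...} value
# The metric and labels below are illustrative, not from the deck.
metrics_page = "\n".join([
    "# HELP http_requests_total Total HTTP requests served.",
    "# TYPE http_requests_total counter",
    'http_requests_total{method="get",code="200"} 1027',
    'http_requests_total{method="post",code="500"} 3',
])

# Comment lines carry metadata; the rest are samples Grafana can
# later query via PromQL, e.g. rate(http_requests_total[5m]).
samples = [line for line in metrics_page.splitlines()
           if not line.startswith("#")]
```

Because the format is this simple, any component of the platform can expose metrics with a few lines of code and no agent beyond the scraper.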
KEYCLOAK
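Keycloak delivers the platform's single sign-on over standard OpenID Connect, so every component authenticates against the same token endpoint. A sketch of the token request a client would POST; the realm, client, and credentials are placeholders, while the endpoint path is Keycloak's default of the 2018-era releases.

```python
from urllib.parse import urlencode

# Keycloak's token endpoint (default path in 2018-era releases):
#   /auth/realms/<realm>/protocol/openid-connect/token
# Realm, client, and user below are made-up examples.
realm = "data-platform"
token_url = (f"https://sso.internal/auth/realms/{realm}"
             "/protocol/openid-connect/token")

# Form-encoded body; the direct-access (password) grant is used here
# only for illustration, web apps would use the authorization code flow.
form = urlencode({
    "grant_type": "password",
    "client_id": "superset",
    "username": "analyst",
    "password": "secret",
})
```

The JSON response contains an access token that downstream services (e.g. the API gateway) can validate, which is what makes single sign-on across the whole platform possible.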
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
The Big Picture
• This is a complex, major enterprise platform
• It’s far from free: Cost is in integration, training & ops
• Why open source?
1. Often, outright better technology
2. Faster innovation
3. More native integrations
4. More books, talks, tutorials, posts & answers
5. Cheaper, both to begin and to scale
Common Questions
Q: Do I need it all on Day One?
A: No. Use what you need, know where it fits later.
Q: What if I already have another tool in place?
A: Keep it. Architecture is about incremental evolution.
Q: What if I don’t have the in-house knowledge?
A: Outsource, but require training & onboarding.
Q: What often gets overlooked?
A: Keeping components continuously up to date.
Summary: If you remember one thing…
Build the simplest platform that serves
everyone required to turn science into $$$
Data Analyst Data Scientist App Developer Data Engineer
david@pacific.ai
@davidtalby
in/davidtalby
THANK YOU!

Editor's Notes

  • #4 You need a platform if you’re building a set of solutions in the data science space.
  • #6 We’re going open source to get a better solution, not just to save money.
  • #7 The constraints are not meant to say that other types of solutions aren’t worth the money – they often are, but starting with these constraints gives you a baseline of expectations.
  • #12 Automate the flow of data between systems