ARCHITECTING AN OPEN SOURCE
DATA SCIENCE PLATFORM: 2018 EDITION
Dr. David Talby
CTO, Pacific AI
@davidtalby
in/davidtalby
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
AT THE BEGINNING, THERE WAS SEARCH
Integrate Data
ETL
Streaming
Quality
Enrichment
Dataflows
Data Analyst Data Scientist
SCOPE
Discover & Visualize
SQL
Search
Visualization
Dashboards
Real-Time Alerts
Train Models
ML, DL, DM, NLP, …
Explore & Visualize
Train & Optimize
Collaboration
Workflows
Productize Models
Deploy APIs
Publish APIs
CI & CD for Models
Measurement
Feedback
App Developer Data Engineer
Infrastructure
Deployment Orchestration Security Monitoring Single Sign-On Backup Scaling
GOALS
Enterprise Grade
Scales from GB to PB
Unified & Modular
Cutting Edge
CONSTRAINTS
No Commercial Software
No Copyleft
No SaaS
Build It
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
APACHE NIFI
NIFI FEATURES
Web-based dataflow user interface
Seamless experience between design, control, feedback, and monitoring
Highly configurable
Loss tolerant vs guaranteed delivery
Low latency vs high throughput
Dynamic prioritization
Flow can be modified at runtime
Back pressure
Data Provenance
Track dataflow from beginning to end
Designed for extension
Build your own processors and more (120+ available out-of-the-box)
Enables rapid development and effective testing
Secure
SSL, SSH, HTTPS, encrypted content, etc.
Multi-tenant authorization and internal authorization/policy management
APACHE KYLO
• Like NiFi, but wish for something simpler?
• Small team: Looking to quickly get started
• Large enterprise: Enable self service
• Meet Apache Kylo
• Self-Serve data ingestion & wrangling
• Search metadata, data lineage and profile
• Monitor data quality health in your feeds
• A layer on top of Apache NiFi
APACHE SPARK
SPARK SQL FEATURES
Distributed SQL Engine
Seamless integration with Spark DataFrames
ANSI SQL 2003 support: All 99 queries of TPC-DS
Extensible via User Defined Functions (UDF)
High performance
New “Catalyst” cost-based optimizer in Spark 2.2
Project Tungsten: “Joining a Billion Rows per Second on a Laptop”
2.5x performance gains between 1.6 and 2.0
2018 Improvements
Spark on Kubernetes, Pandas UDFs with Apache Arrow, new ML algorithms
Structured Streaming: Kafka, Stream to Stream Joins, ML models on Streams
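Spark SQL itself needs a SparkSession (`spark.sql(...)`, `spark.udf.register(...)`), which is heavy to demo here. As a self-contained sketch of the same ANSI-SQL-plus-UDF pattern, the stdlib `sqlite3` module supports the identical query shape and its own UDF registration; table and function names are made up for illustration.

```python
import sqlite3

# Spark SQL runs plain ANSI SQL over DataFrames; the same query shape
# is shown here against an in-memory SQLite database for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)])

# A user-defined function; Spark's equivalent would be
# spark.udf.register("with_tax", lambda x: x * 1.2)
conn.create_function("with_tax", 1, lambda x: x * 1.2)

rows = conn.execute(
    "SELECT user, with_tax(SUM(amount)) AS total "
    "FROM events GROUP BY user ORDER BY user"
).fetchall()
# → [('alice', 18.0), ('bob', 30.0)]
```

In Spark the same SQL would run distributed over a DataFrame registered as a temp view; only the session setup differs.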
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
APACHE SUPERSET
• No-code environment for exploring and visualizing data
• Open source alternative to Tableau, Power BI, Qlik, Looker, SiSense, etc.
• Build & share dashboards
• 30+ out-of-the-box visualizations
• Authentication & role-based access
• Integrates with most RDBMSs using SQLAlchemy
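Superset reaches each database through a SQLAlchemy connection URI, so adding a data source is mostly a matter of supplying one. A minimal sketch of the URI shape, parsed with the stdlib to call out its parts; the host, credentials, and database name are placeholders.

```python
from urllib.parse import urlsplit

# Superset connects to databases via SQLAlchemy URIs of this shape.
# Everything after the scheme below is a made-up example.
uri = "postgresql://analyst:secret@warehouse.internal:5432/sales"

parts = urlsplit(uri)
assert parts.scheme == "postgresql"       # SQLAlchemy dialect
assert parts.hostname == "warehouse.internal"
assert parts.port == 5432
assert parts.path == "/sales"             # database name
```

Swapping the dialect (`mysql://`, `presto://`, `hive://`, ...) is what makes "most RDBMSs" work: Superset only ever sees the SQLAlchemy interface.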
KIBANA
TIMELION
KIBANA FEATURES
Full-text and faceted search
Full text query language: Boolean operators, proximity, boosting
Faceted search: Filter by field, value ranges, date ranges, sort, limit, pagination
Time series analysis: aggregates, windowing, offsetting, trending, comparisons
Geospatial search: Search by shape, bounding box, polygon, by distance or range
Visualizations & Dashboards
All the basics: Area, pie, bar, heatmap, table, metric, map, scatter, timeline, tile
Custom visualizations using Vega and Vega-Lite
Drag & drop creation and editing of visualizations and dashboards
Dashboards can be dynamically filtered by time, queries, filters
Publish, embed and share dashboards
Real-time updates
Performant
Fast interactive queries, faceting and filtering
REST API and clients in all major languages
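Kibana's search features map directly onto the Elasticsearch query DSL underneath it. A sketch of the JSON body one would POST to an index's `_search` endpoint, combining a full-text match, a date-range filter, and a facet (terms aggregation); the index and field names (`message`, `status`, `@timestamp`) are examples, not anything from the deck.

```python
import json

# Boolean full-text search + date-range filter + faceting, expressed
# in the Elasticsearch query DSL that Kibana generates for you.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "timeout"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-7d"}}}],
        }
    },
    "aggs": {  # faceted search: document counts per status value
        "by_status": {"terms": {"field": "status"}}
    },
    "size": 20,
}
body = json.dumps(query)
```

Kibana builds exactly this kind of body from its UI controls, which is why everything it shows is also reachable from the REST API.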
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
THE NOTEBOOK: WHAT ABOUT AN IDE?
• If developing on a local machine,
let people use what they like
• If that’s not possible due to
privacy or compliance issues, or
if you must provide the solution:
• A Jupyter Notebook is an
interactive web app that includes
code, graphs, & documentation
• It supports over 60 languages
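One reason notebooks integrate so well into a platform: an `.ipynb` file is just JSON (nbformat 4), a list of cells plus metadata. A minimal skeleton, with placeholder cell contents:

```python
import json

# A Jupyter notebook is a JSON document: metadata plus a list of
# markdown and code cells. Cell contents here are placeholders.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {"kernelspec": {"name": "python3",
                                "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Churn analysis\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 + 1)\n"]},
    ],
}
as_json = json.dumps(notebook, indent=1)
```

Because the format is plain JSON, notebooks diff, version, and render outside Jupyter itself, which matters once they live in source control.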
THE NEXT GENERATION: JUPYTER LAB
• Multi-tab code consoles
• Multiple languages
• Side by side editing
• Mirror outputs
• Data & image viewers
• File explorer
• Console editor
• Extensible
TEAM WORK: JUPYTER HUB
LANGUAGE & LIBRARY SELECTION
“While in 2014 I wrote about Four main languages for Analytics,
Data Mining, Data Science being R, Python, SQL, and SAS, the
5 main languages in 2017 appear to be Python, R, SQL, Spark
& Tensorflow.”
2018 vs. 2017: Python +11%, R -14%
[Chart from the 19th annual KDnuggets Software Poll: Use of Deep Learning Tools]
2018 vs. 2017: TensorFlow +32%, Keras +108%
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
CLIPPER.AI: LOW LATENCY, MULTI-ENGINE MODEL SERVER
• Built-in support for scikit-learn,
R, Spark ML, TensorFlow
• Easily extend to your framework
• Designed for millisecond latency
• Container per model version
• Update & roll back models
• Scaling with Kubernetes
• Deploy models on CPUs, GPUs
or a hybrid combination
• In the works: Policies, feedback
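Once a model is deployed, Clipper serves each registered application behind a plain REST endpoint, so any app developer can call it with an HTTP POST. A sketch of that request, built but not sent; the host and application name are placeholders, and the `/{app}/predict` path with an `"input"` JSON body follows Clipper's documented pattern as of 2018.

```python
import json

# Clipper exposes each application at
#   POST http://<host>:1337/<app-name>/predict
# (1337 is its default query port). Host and app name are made up.
app_name = "churn-model"
url = f"http://clipper.internal:1337/{app_name}/predict"

# The body wraps one feature vector in "input":
payload = json.dumps({"input": [0.2, 1.5, 3.1]})
```

The response carries the prediction plus a flag indicating whether the latency SLO forced a default answer, which is what "designed for millisecond latency" buys you in practice.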
KONG API GATEWAY
API Gateway built on NGINX
Scalable
Modular with plugins
Authentication
Basic Auth, OpenID,
OAuth, HMAC, LDAP, JWT
Security
ACL, CORS, IP Restriction,
Bot Detection, SSL
Traffic Control
Proxy Caching, Rate Limiting,
Size Limits, Request Termination
Logging & Analytics
Galileo, Datadog, Runscope
TCP, HTTP, File, Syslog, StatsD
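Kong is configured entirely through its admin REST API (port 8001 by default). A sketch of the payloads one would POST to put a model API behind the gateway with rate limiting and key authentication, using the Service/Route object model Kong introduced around 0.13; all names and URLs below are placeholders.

```python
# JSON payloads for Kong's admin API; nothing is sent here, these are
# the request bodies one would POST. Names/URLs are made up.
service = {"name": "model-api",                 # POST /services
           "url": "http://clipper.internal:1337"}
route = {"paths": ["/predict"]}                 # POST /services/model-api/routes
rate_limit = {"name": "rate-limiting",          # POST /services/model-api/plugins
              "config": {"minute": 100}}
key_auth = {"name": "key-auth"}                 # require an API key per consumer

admin_calls = [
    ("POST", "/services", service),
    ("POST", "/services/model-api/routes", route),
    ("POST", "/services/model-api/plugins", rate_limit),
    ("POST", "/services/model-api/plugins", key_auth),
]
```

After these four calls, clients hit the gateway's proxy port with an API key and Kong enforces authentication and the 100-requests-per-minute limit before traffic ever reaches the model server.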
COLLABORATION, CI & CD
Plan
Projects, Boards, Issues,
Milestones, Teams
Create
Merge, Preview, Commit,
Branch, Lock, Discuss
Verify
Automated pipelines,
graphs, history, scaling
Package
Built-in container registry
Release
Continuous integration &
continuous deployment
Configure & Monitor
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
KUBERNETES
Portable Containers
Public, Private, Hybrid,
or Multi-Cloud
Deployment
Automation, Co-Location,
Storage Mounting, Secrets
Auto-*
-Scaling, -Healing, -Restart,
-Placement, -Replication
Rolling Updates
Load Balancing
Service Discovery
Monitoring Resources
Accessing & Ingesting Logs
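The Kubernetes features above (replication, rolling updates, self-healing) are all declared in one Deployment manifest. A minimal `apps/v1` Deployment, expressed as the dict you would serialize to YAML; the image name and labels are placeholders.

```python
# A minimal Kubernetes Deployment (apps/v1) as a Python dict;
# the container image and labels are made-up examples.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "model-server"},
    "spec": {
        "replicas": 3,                       # keep 3 pods running (self-healing)
        "selector": {"matchLabels": {"app": "model-server"}},
        "strategy": {                        # zero-downtime rolling updates
            "type": "RollingUpdate",
            "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
        },
        "template": {
            "metadata": {"labels": {"app": "model-server"}},
            "spec": {"containers": [{
                "name": "model-server",
                "image": "registry.internal/model-server:1.0",
                "ports": [{"containerPort": 8080}],
            }]},
        },
    },
}
```

Pushing a new image tag and re-applying this manifest is the whole "update & roll back models" story: Kubernetes surges one new pod at a time and never drops below three ready replicas.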
PROMETHEUS & GRAFANA
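Prometheus works by scraping a plain-text `/metrics` endpoint from each service; Grafana then charts the stored series. A sketch of that exposition format, with a made-up counter name and labels:

```python
# Prometheus scrapes a plain-text page where each sample line is
#   metric_name{label="value",...} value
# The metric and labels below are illustrative, not from the deck.
metrics_page = "\n".join([
    "# HELP http_requests_total Total HTTP requests served.",
    "# TYPE http_requests_total counter",
    'http_requests_total{method="get",code="200"} 1027',
    'http_requests_total{method="post",code="500"} 3',
])

# Comment lines carry metadata; the rest are samples Grafana can
# later query via PromQL, e.g. rate(http_requests_total[5m]).
samples = [line for line in metrics_page.splitlines()
           if not line.startswith("#")]
```

Because the format is this simple, any component of the platform can expose metrics with a few lines of code and no agent beyond the scraper.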
KEYCLOAK
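Keycloak delivers the platform's single sign-on over standard OpenID Connect, so every component authenticates against the same token endpoint. A sketch of the token request a client would POST; the realm, client, and credentials are placeholders, while the endpoint path is Keycloak's default of the 2018-era releases.

```python
from urllib.parse import urlencode

# Keycloak's token endpoint (default path in 2018-era releases):
#   /auth/realms/<realm>/protocol/openid-connect/token
# Realm, client, and user below are made-up examples.
realm = "data-platform"
token_url = (f"https://sso.internal/auth/realms/{realm}"
             "/protocol/openid-connect/token")

# Form-encoded body; the direct-access (password) grant is used here
# only for illustration, web apps would use the authorization code flow.
form = urlencode({
    "grant_type": "password",
    "client_id": "superset",
    "username": "analyst",
    "password": "secret",
})
```

The JSON response contains an access token that downstream services (e.g. the API gateway) can validate, which is what makes single sign-on across the whole platform possible.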
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
The Big Picture
• This is a complex, major enterprise platform
• It’s far from free: Cost is in integration, training & ops
• Why open source?
1. Often, outright better technology
2. Faster innovation
3. More native integrations
4. More books, talks, tutorials, posts & answers
5. Cheaper, both to begin and to scale
Common Questions
Q: Do I need it all on Day One?
A: No. Use what you need, know where it fits later.
Q: What if I already have another tool in place?
A: Keep it. Architecture is about incremental evolution.
Q: What if I don’t have the in-house knowledge?
A: Outsource, but require training & onboarding.
Q: What often gets overlooked?
A: Keeping components continuously up to date.
Summary: If you remember one thing…
Build the simplest platform that serves
everyone required to turn science into $$$
Data Analyst Data Scientist App Developer Data Engineer
david@pacific.ai
@davidtalby
in/davidtalby
THANK YOU!

Editor's Notes

  • #4 You need a platform if you’re building a set of solutions in the data science space.
  • #6 We’re going open source to get a better solution, not just to save money.
  • #7 The constraints are not meant to say that other types of solutions aren’t worth the money – they often are, but starting with these constraints gives you a baseline of expectations.
  • #12 Automate the flow of data between systems