Automating Machine Learning
API, bindings, BigMLer and Basic Workflows
#BSSML16
December 2016
#BSSML16 Automating Machine Learning December 2016 1 / 29
Outline
1 Introduction: ML as a System Service
2 ML as a RESTful Cloudy Service
3 Client-side workflows: REST API and bindings
4 Client-side workflows: BigMLer
Machine Learning as a System Service
The goal
Machine Learning as a system-level service
The means
• APIs: ML building blocks
• Abstraction layer over feature engineering
• Abstraction layer over algorithms
• Automation
The Roadmap
RESTful-ish ML Services
• Excellent abstraction layer
• Transparent data model
• Immutable resources and UUIDs: traceability
• Simple yet effective interaction model
• Easy access from any language (API bindings)
Algorithmic complexity and computing-resource management problems are mostly washed away
RESTful done right: Whitebox resources
• Your data, your model
• Model reverse engineering becomes moot
• Maximizes reach (Web, CLI, desktop, IoT)
Higher-level Machine Learning
Example workflow: Batch Centroid
Objective: Label each row in a Dataset with its associated centroid.
We need to...
• Create Dataset
• Create Cluster
• Create BatchCentroid from Cluster and Dataset
• Save BatchCentroid as new Dataset
Example workflow: building blocks
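# create a dataset from an existing source
curl -X POST "https://bigml.io/dataset?$AUTH" \
     -H "Content-Type: application/json" \
     -d '{"source": "source/56fbbfea200d5a3403000db7"}'
# create a cluster from the dataset
curl -X POST "https://bigml.io/cluster?$AUTH" \
     -H "Content-Type: application/json" \
     -d '{"dataset": "dataset/43ffe231a34fff333000b65"}'
# create a batch centroid from the cluster and the dataset
curl -X POST "https://bigml.io/batchcentroid?$AUTH" \
     -H "Content-Type: application/json" \
     -d '{"dataset": "dataset/43ffe231a34fff333000b65",
          "cluster": "cluster/33e2e231a34fff333000b65"}'
# retrieve the resulting dataset
curl -X GET "https://bigml.io/dataset/1234ff45eab8c0034334?$AUTH"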
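The same building blocks can be sketched with Python's standard library. This is only an illustration: the helper name build_create_request is hypothetical, the auth string is a placeholder standing in for the slide's $AUTH, and the request is built but never sent.

```python
import json
import urllib.request

BASE_URL = "https://bigml.io"  # host as shown on the slide


def build_create_request(resource_type, payload, auth):
    """Build (but do not send) the POST request that creates a resource.

    `auth` stands in for the slide's $AUTH query string; its exact
    format is an assumption here.
    """
    url = f"{BASE_URL}/{resource_type}?{auth}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder credentials, for illustration only.
request = build_create_request(
    "dataset",
    {"source": "source/56fbbfea200d5a3403000db7"},
    "username=alice;api_key=PLACEHOLDER",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send it.
```

Every resource type (dataset, cluster, batchcentroid) follows the same pattern: POST a small JSON document that references the parent resources by id.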
Example workflow: Web UI
Automation via bindings
from bigml.api import BigML

api = BigML()

# `source` is assumed to hold the path or URL of the training data
project = api.create_project({'name': 'ToyBoost'})
orig_source = api.create_source(source,
                                {"name": "ToyBoost",
                                 "project": project['resource']})
api.ok(orig_source)
orig_dataset = api.create_dataset(orig_source, {"name": "Boost"})
api.ok(orig_dataset)

# assumed starting point: the first round trains on the original dataset
trainset = api.get_dataset(orig_dataset)
for loop in range(0, 10):
    api.ok(trainset)
    model = api.create_model(trainset, {
        "name": "ToyBoost - Model%d" % loop,
        "objective_fields": ["letter"],
        "excluded_fields": ["weight"],
        "weight_field": "100011"})
    api.ok(model)
    # ask for the predictions to be stored as a new dataset
    batchp = api.create_batch_prediction(model, trainset, {
        "name": "ToyBoost - Result%d" % loop,
        "all_fields": True,
        "header": True,
        "output_dataset": True})
    api.ok(batchp)
    batchp = api.get_batch_prediction(batchp)
    batchp_dataset = api.get_dataset(
        batchp['object']['output_dataset_resource'])
Example workflow: Python bindings
from bigml.api import BigML
api = BigML()
source = 'source/5643d345f43a234ff2310a3e'
# create dataset and cluster, waiting for both
dataset = api.create_dataset(source)
api.ok(dataset)
cluster = api.create_cluster(dataset)
api.ok(cluster)
# create a batch centroid with output to dataset
centroid = api.create_batch_centroid(cluster, dataset,
                                     {'output_dataset': True,
                                      'all_fields': True})
api.ok(centroid)
# wait again, via polling, until the dataset is finished
batch_dataset_id = centroid['object']['output_dataset_resource']
batch_dataset = api.get_dataset(batch_dataset_id)
api.ok(batch_dataset)
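The waiting steps above boil down to polling a resource until its status says it is done. A minimal sketch of that loop follows; it is not the bindings' implementation of api.ok. The status codes (5 for finished, -1 for faulty) and the response layout are assumptions, and make_fake_fetch is a hypothetical stand-in for a real GET call.

```python
import time

FINISHED = 5   # assumed "finished" status code
FAULTY = -1    # assumed "faulty" status code


def wait_until_finished(fetch, resource_id, interval=0.0, max_tries=10):
    """Poll `fetch(resource_id)` until the resource is finished or faulty.

    `fetch` is a hypothetical callable returning a dict shaped like
    {"object": {"status": {"code": ...}}}.
    """
    for _ in range(max_tries):
        resource = fetch(resource_id)
        code = resource["object"]["status"]["code"]
        if code == FINISHED:
            return resource
        if code == FAULTY:
            raise RuntimeError(f"{resource_id} failed")
        time.sleep(interval)
    raise TimeoutError(f"{resource_id} not finished after {max_tries} polls")


def make_fake_fetch(codes):
    """Return a stand-in for a GET call that yields the given codes in order."""
    remaining = iter(codes)

    def fetch(resource_id):
        return {"resource": resource_id,
                "object": {"status": {"code": next(remaining)}}}

    return fetch


# Simulated resource that is queued, then in progress, then finished.
done = wait_until_finished(make_fake_fetch([1, 3, 5]), "dataset/example")
print(done["object"]["status"]["code"])
```

Every step of a client-side workflow needs a wait of this kind, which is part of the bookkeeping the following slides count among the costs of bindings.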
Client-side automation via bindings
Strengths of bindings-based solutions
• Versatility: maximum flexibility and the possibility of encapsulation (via proper engineering)
• Native: easy to support any programming language
• Offline: whitebox models allow local use of resources (e.g., real-time predictions)
from bigml.model import Model
model_id = 'model/5643d345f43a234ff2310a3e'
# Download of (whitebox) resource
local_model = Model(model_id)
# Purely local calculations
local_model.predict({'plasma glucose': 132})
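What "purely local calculations" means for a whitebox tree model can be sketched as follows. The nested-dict layout, the threshold, and the outputs are all made up for illustration; this is not the structure of a downloaded model nor the bindings' prediction code.

```python
def predict(node, inputs):
    """Walk a toy decision tree until a leaf is reached.

    Internal nodes carry "field", "threshold", "left" and "right";
    leaves carry "output". This layout is hypothetical.
    """
    while "output" not in node:
        value = inputs[node["field"]]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["output"]


# Made-up single-split tree, for illustration only.
TOY_TREE = {
    "field": "plasma glucose",
    "threshold": 123,
    "left": {"output": False},
    "right": {"output": True},
}

print(predict(TOY_TREE, {"plasma glucose": 132}))  # prints True
```

Once the model description is on the client, a prediction is a plain tree walk with no network round trip.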
Problems of bindings-based solutions
• Complexity: lots of details outside the problem domain
• Reuse: no inter-language compatibility
• Scalability: client-side workflows are hard to optimize
Not enough abstraction
Simple workflow in a one-liner
# 1-click cluster
bigmler cluster \
    --output-dir output/job \
    --train data/iris.csv \
    --test-datasets output/job/dataset \
    --remote \
    --to-dataset
# the created dataset id:
cat output/job/batch_centroid_dataset
Simple automation: “1-click” tasks
# "1-click" ensemble
bigmler --train data/iris.csv \
    --number-of-models 500 \
    --sample-rate 0.85 \
    --output-dir output/iris-ensemble \
    --project "vssml tutorial"
# "1-click" dataset with parameterized fields
bigmler --train data/diabetes.csv \
    --no-model \
    --name "4-featured diabetes" \
    --dataset-fields \
    "plasma glucose,insulin,diabetes pedigree,diabetes" \
    --output-dir output/diabetes \
    --project vssml_tutorial
Rich, parameterized workflows: cross-validation
# --cross-validation: parameterized input
# --k-folds: number of folds during validation
bigmler analyze --cross-validation \
    --dataset $(cat output/diabetes/dataset) \
    --k-folds 3 \
    --output-dir output/diabetes-validation
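The idea behind the number of folds can be sketched locally: every row is held out exactly once, and the rest is used for training. The helper below is hypothetical and only illustrates the split; it says nothing about how the command samples rows on the service.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) row-index lists for k folds over range(n_rows)."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    for fold in range(k):
        test = indices[fold::k]          # every k-th row, offset by the fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)
```

With 3 folds, three models are trained and evaluated, each on a different held-out third of the rows, and the evaluations are then averaged.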
Rich, parameterized workflows: feature selection
# --features: parameterized input
# --k-folds: number of folds during validation
# --staleness: stopping criterion
# --optimize: optimization metric
# --penalty: algorithm parameter
bigmler analyze --features \
    --dataset $(cat output/diabetes/dataset) \
    --k-folds 2 \
    --staleness 2 \
    --optimize precision \
    --penalty 1 \
    --output-dir output/diabetes-features-selection
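How a stopping criterion and a penalty can shape a feature search is sketched below with a simplified greedy forward selection. This is not BigMLer's algorithm. It assumes that staleness counts consecutive steps without improvement and that the penalty is a fixed cost per selected feature; the scoring function and its gains are made up, standing in for a cross-validated metric.

```python
def select_features(features, score, staleness=2, penalty=0.0):
    """Greedy forward selection with a staleness stop and a size penalty.

    `score(subset)` is a hypothetical callable returning the metric to
    maximize; each selected feature costs `penalty`.
    """
    selected = []
    best_subset, best_score = [], float("-inf")
    stale = 0
    remaining = list(features)
    while remaining and stale < staleness:
        # Try adding each remaining feature and keep the best candidate.
        candidates = [
            (score(selected + [f]) - penalty * (len(selected) + 1), f)
            for f in remaining
        ]
        step_score, step_feature = max(candidates)
        selected = selected + [step_feature]
        remaining.remove(step_feature)
        if step_score > best_score:
            best_subset, best_score = list(selected), step_score
            stale = 0
        else:
            stale += 1
    return best_subset, best_score


# Made-up gains per feature, for illustration only.
GAINS = {"plasma glucose": 0.5, "insulin": 0.25, "diabetes pedigree": 0.0}


def toy_score(subset):
    return sum(GAINS[f] for f in subset)


print(select_features(list(GAINS), toy_score, staleness=2, penalty=0.125))
```

The penalty keeps a feature that adds nothing from being selected just because it does no harm, and the staleness limit bounds how long the search continues once the score stops improving.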
Client-side Machine Learning Automation
Problems of client-side solutions
• Complex: too fine-grained, leaky abstractions
• Cumbersome: error handling, network issues
• Hard to reuse: tied to a single programming language
• Hard to scale: parallelization is again a problem
• Hard to generalize: CLI tools like bigmler hide complexity at the cost of flexibility
The algorithmic complexity and computing-resource management problems that had been mostly washed away are back!
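The "cumbersome" item is easy to see in code: every remote call in a client-side workflow needs some protection against transient network failures. A minimal sketch of such a wrapper follows; the helper name, the retried exception types, and the backoff policy are assumptions, not part of any binding.

```python
import time


def with_retries(call, attempts=3, base_delay=0.0,
                 retry_on=(ConnectionError, TimeoutError)):
    """Run `call()`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(base_delay * 2 ** attempt)


# Hypothetical remote call that fails twice before succeeding.
state = {"calls": 0}


def flaky_create():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient network issue")
    return "dataset/example"


print(with_retries(flaky_create))
```

None of this logic belongs to the machine learning problem itself, yet each script, in each language, has to carry its own copy.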
Questions?

BSSML16 L8. REST API, Bindings, and Basic Workflows
