Intermediate Workflows
The BigML Team
May 2016
The BigML Team Intermediate Workflows May 2016 1 / 18
Outline
1 Creating a model per cluster
2 Predicting with the closest model for a new instance
Model per cluster
• Given an input dataset, create a G-means cluster
• For each resulting centroid, create a dataset
• Model and evaluate each centroid dataset separately
• Store results for later use (predictions)
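The steps above can be modelled locally in a few lines of Python. This is only a sketch of the data flow, not BigML code: the centroids are supplied by hand rather than discovered by G-means, each stand-in "model" is simply the mean target of its cluster, and the names `nearest_centroid` and `make_cluster_models` are made up for the illustration.

```python
from collections import defaultdict
from statistics import mean


def nearest_centroid(point, centroids):
    """Return the id of the centroid closest to `point` (squared Euclidean)."""
    return min(
        centroids,
        key=lambda cid: sum((p - c) ** 2 for p, c in zip(point, centroids[cid])),
    )


def make_cluster_models(rows, centroids):
    """rows: list of (features, target); centroids: {centroid_id: coordinates}.

    Returns {centroid_id: model}. The stand-in "model" for a centroid is
    the mean target of the rows assigned to it.
    """
    by_centroid = defaultdict(list)
    for features, target in rows:
        by_centroid[nearest_centroid(features, centroids)].append(target)
    return {cid: mean(targets) for cid, targets in by_centroid.items()}


# Hypothetical toy data: two well-separated groups of points.
rows = [((0.0, 0.1), 1.0), ((0.2, 0.0), 3.0), ((5.0, 5.1), 10.0), ((4.9, 5.0), 20.0)]
centroids = {"c0": (0.0, 0.0), "c1": (5.0, 5.0)}

models = make_cluster_models(rows, centroids)
print(models)
```

The returned map plays the role of the "centroid id to model" map that the workflow stores for later predictions.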
Global workflow
;; Given a dataset id, creates a cluster and a model for each of the
;; resulting centroid datasets. Returns the cluster id, a map from
;; centroid id to its model, a map from centroid id to its name, and
;; a map from centroid id to its evaluation.
(define (make-cluster-models ds-id)
  (let (cluster-id (create-and-wait-cluster {"dataset" ds-id})
        cluster (fetch cluster-id)
        cid (get cluster "resource")
        cids (centroids-prop cluster "id")
        cnames (centroids-prop cluster "name")
        centroid-dataset-ids (create-centroid-datasets cid cids)
        mod-ids (create-centroid-models centroid-dataset-ids)
        mod-evs (create-model-and-evaluations centroid-dataset-ids)
        models (map head mod-evs)
        evals (map (lambda (x) (nth x 1)) mod-evs))
    [cid (make-map cids models) (make-map cids cnames) (make-map cids evals)]))

;; Extracts the given property from every centroid of a fetched cluster
(define (centroids-prop cluster prop)
  (let (cs (get-in cluster ["clusters" "clusters"]))
    (map (lambda (c) (get c prop)) cs)))
Dataset and Model per cluster
;; Given a cluster and its centroids, generate associated datasets
(define (create-centroid-datasets cluster-id centroid-ids)
  (wait-forever* (create* "dataset" (for (id centroid-ids)
                                      {"cluster" cluster-id "centroid" id}))))

;; Given a list of dataset ids, create a model for each one
(define (create-centroid-models dataset-ids)
  (create* "model" (for (id dataset-ids) {"dataset" id})))

;; Interlude: list comprehension with "for"
(for (x [1 2 3]) {"value" x}) ;; => [{"value" 1} {"value" 2} {"value" 3}]
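For readers more used to Python, the `for` comprehension and the `make-map` calls used in the global workflow have close analogues in a list comprehension and `dict(zip(...))`. The snippet below is only an analogy with made-up ids and names; it does not call BigML.

```python
# Hypothetical ids and names, for illustration only.
cluster_id = "cluster/0123"
centroid_ids = ["000000", "000001", "000002"]
centroid_names = ["Cluster 0", "Cluster 1", "Cluster 2"]

# Counterpart of: (for (id centroid-ids) {"cluster" cluster-id "centroid" id})
requests = [{"cluster": cluster_id, "centroid": cid} for cid in centroid_ids]

# Counterpart of: (make-map cids cnames)
names_by_centroid = dict(zip(centroid_ids, centroid_names))

print(requests)
print(names_by_centroid)
```

In the WhizzML code, the list of request maps is what `create*` receives, one map per resource to create.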
Evaluations
;; Samples a dataset with a fixed seed. With oob set to true, returns
;; the rows left out of the sample instead of the sampled ones.
(define (sample-dataset ds-id rate oob)
  (create-dataset {"sample_rate" rate
                   "origin_dataset" ds-id
                   "out_of_bag" oob
                   "seed" "whizzml-example"}))

;; Creates a model with 80% of the input dataset and evaluates it
;; with the remaining 20%. Returns a map with the ids of the model,
;; the training and test datasets, and the evaluation.
(define (evaluate-model-on-dataset ds-id)
  (let (training-id (sample-dataset ds-id 0.8 false)
        test-id (sample-dataset ds-id 0.8 true)
        _ (wait-forever training-id)
        model-id (create-model {"dataset" training-id})
        _ (wait-forever* [test-id model-id])
        ev-id (create-evaluation {"model" model-id "dataset" test-id}))
    {"model" model-id
     "training" training-id
     "test" test-id
     "evaluation" ev-id}))

;; Evaluates each dataset, deletes the intermediate model and sampled
;; datasets, and returns the list of fetched evaluations.
(define (create-evaluations ds-ids)
  (let (mds (map evaluate-model-on-dataset ds-ids))
    (for (md mds)
      (let (mid (get md "model")
            eid (get md "evaluation"))
        (wait-forever eid)
        (delete* [mid (get md "training") (get md "test")])
        (fetch eid)))))
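The 80/20 split works because both calls to `sample-dataset` share the same seed and rate, so the out-of-bag call yields the complement of the in-bag one. The Python sketch below models that idea with the standard `random` module. It is an assumption-laden stand-in: it does not reproduce BigML's actual sampling algorithm, and the `sample_dataset` function here operates on a plain list.

```python
import random


def sample_dataset(rows, rate, out_of_bag, seed="whizzml-example"):
    """Return the sampled rows, or their complement when out_of_bag is true."""
    rng = random.Random(seed)  # same seed -> same selection on every call
    n_in_bag = round(len(rows) * rate)
    in_bag = set(rng.sample(range(len(rows)), n_in_bag))
    # Keep in-bag rows normally, and the left-out rows when out_of_bag is true.
    return [row for i, row in enumerate(rows) if (i in in_bag) != out_of_bag]


rows = list(range(10))  # hypothetical dataset of ten rows
training = sample_dataset(rows, 0.8, False)
test = sample_dataset(rows, 0.8, True)

print(training, test)
```

Changing the seed or the rate in only one of the two calls would break the guarantee that training and test rows are disjoint.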
Final workflow
(define (make-cluster-models ds-id)
  (let (cluster-id (create-and-wait-cluster {"dataset" ds-id})
        cluster (fetch cluster-id)
        cid (get cluster "resource")
        cids (centroids-prop cluster "id")
        cnames (centroids-prop cluster "name")
        centroid-dataset-ids (create-centroid-datasets cid cids)
        mod-ids (create-centroid-models centroid-dataset-ids)
        evals (create-evaluations centroid-dataset-ids))
    [cid (make-map cids mod-ids) (make-map cids cnames) (make-map cids evals)]))

(define dataset-id (create-and-wait-dataset {"source" source-id}))
(define full-result (make-cluster-models dataset-id))
(define cluster-id (nth full-result 0))
(define models (nth full-result 1))
(define names (nth full-result 2))
(define evaluations (nth full-result 3))

https://github.com/whizzml/examples/tree/master/model-per-cluster/cr
Outline
1 Creating a model per cluster
2 Predicting with the closest model for a new instance
Accessing execution outputs
;; Value of the n-th output of a fetched execution
(define (get-output exec n)
  (nth (nth (get-in exec ["execution" "outputs"]) n) 1))

(define (get-cluster-id exec) (get-output exec 1))
(define (get-models exec) (get-output exec 2))
(define (get-centroid-names exec) (values (get-output exec 3)))
(define (get-model-ids exec) (values (get-models exec)))
(define (get-centroid-ids exec) (keys (get-models exec)))
(define (get-model exec cid) (get (get-models exec) cid))
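The accessors read the execution's `outputs` list positionally and take the element at index 1 of each entry. The Python sketch below mirrors them on a hand-written execution. The resource ids, the output names and the first entry are hypothetical, and treating each entry as a `[name, value, type]` triple is an assumption: the slides only confirm that the value sits at index 1.

```python
# Hypothetical execution resource, trimmed to the part the accessors read.
execution = {
    "execution": {
        "outputs": [
            ["dataset-id", "dataset/aaa", "dataset-id"],  # position 0: assumed
            ["cluster-id", "cluster/bbb", "cluster-id"],
            ["models", {"000000": "model/ccc", "000001": "model/ddd"}, "map"],
            ["names", {"000000": "Cluster 0", "000001": "Cluster 1"}, "map"],
        ]
    }
}


def get_output(execution, n):
    """Value of the n-th output, i.e. element 1 of the n-th entry."""
    return execution["execution"]["outputs"][n][1]


def get_cluster_id(execution):
    return get_output(execution, 1)


def get_models(execution):
    return get_output(execution, 2)


def get_centroid_names(execution):
    return list(get_output(execution, 3).values())


def get_model(execution, centroid_id):
    return get_models(execution)[centroid_id]


print(get_cluster_id(execution), get_model(execution, "000001"))
```

Because the lookup is positional, reordering the outputs of the first script would silently change what these accessors return.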
Individual predictions
• Given an input map (new instance), find its centroid
• Find the model associated with that centroid
• Perform a single prediction using that model and input
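These three steps amount to a routing decision. The Python sketch below shows that routing on toy data. It is a local model of the idea, not the BigML calls: centroids are plain coordinate tuples, the two "models" are ordinary functions, and every name in it is invented for the example.

```python
def nearest_centroid(point, centroids):
    """Return the id of the centroid closest to `point` (squared Euclidean)."""
    return min(
        centroids,
        key=lambda cid: sum((p - c) ** 2 for p, c in zip(point, centroids[cid])),
    )


# Stand-in models: one ordinary function per centroid.
def low_model(point):
    return sum(point)


def high_model(point):
    return 10 * sum(point)


centroids = {"c0": (0.0, 0.0), "c1": (5.0, 5.0)}
models = {"c0": low_model, "c1": high_model}


def predict_by_cluster(point, centroids, models):
    centroid_id = nearest_centroid(point, centroids)  # 1) find the centroid
    model = models[centroid_id]                       # 2) look up its model
    return {"prediction": model(point),               # 3) predict with it
            "model": model.__name__,
            "centroid": centroid_id}


result = predict_by_cluster((4.0, 6.0), centroids, models)
print(result)
```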
Individual predictions
;; Assign a centroid to the given input data, e.g.
;; (find-input-centroid "cluster/123453567890978675aade45" {"000000" 3})
(define (find-input-centroid cluster-id input-data)
  (let (pid (create-and-wait-centroid {"cluster" cluster-id
                                       "input_data" input-data})
        pred (fetch pid))
    (get pred "centroid_id")))

;; Predict for the input data using the model of its centroid
(define (predict-by-cluster exec-id input-data)
  (let (exec (fetch exec-id)
        cluster-id (get-cluster-id exec)
        centroid-id (find-input-centroid cluster-id input-data)
        model-id (get-model exec centroid-id)
        pred (fetch (create-and-wait-prediction {"model" model-id
                                                 "input_data" input-data})))
    {"prediction" (get pred "prediction")
     "model" model-id
     "centroid" centroid-id}))
Batch predictions
• Given an input dataset, predict for each row using the model associated with the row’s centroid
1 Create a batchcentroid dataset that assigns to each new instance its centroid
2 Split the resulting batchcentroid dataset into one dataset per centroid value
3 Perform a regular batchprediction for each of the single-centroid datasets
4 Combine all resulting batchprediction datasets into a single dataset
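The four steps above can be traced on in-memory data with a short Python sketch. It is an illustration under simplifying assumptions: rows are plain numbers, `assign_centroid` is a hypothetical stand-in for the batchcentroid step, and each model is an ordinary function.

```python
from collections import defaultdict


def batchpredict_by_cluster(rows, assign_centroid, models):
    # 1) "Batch centroid": tag each row with its centroid.
    tagged = [(assign_centroid(row), row) for row in rows]

    # 2) Split into one subset per centroid value.
    subsets = defaultdict(list)
    for centroid_id, row in tagged:
        subsets[centroid_id].append(row)

    # 3) Predict each subset with its own model, and
    # 4) combine all the results into a single list.
    combined = []
    for centroid_id, subset in subsets.items():
        model = models[centroid_id]
        combined.extend(
            {"row": row, "centroid": centroid_id, "prediction": model(row)}
            for row in subset
        )
    return combined


# Hypothetical stand-ins for the cluster and its per-centroid models.
def assign_centroid(row):
    return "low" if row < 10 else "high"


models = {"low": lambda x: x * 2, "high": lambda x: x + 100}

result = batchpredict_by_cluster([1, 20, 3], assign_centroid, models)
print(result)
```

Note that the combined output is grouped by centroid, so rows do not necessarily come back in their original order.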
Global workflow
(define (batchpredict-by-cluster exec-id input-dataset-id)
  (let (exec (fetch exec-id)                     ;; 1) Fetch execution
        centroid-names (get-centroid-names exec) ;;    and extract info:
        model-ids (get-model-ids exec)           ;;    cluster, models and
        cluster-id (get-cluster-id exec)         ;;    centroid names.
        ;; 2) Split the input dataset: one subdataset per centroid
        ds-ids (split-by-clusters input-dataset-id cluster-id centroid-names)
        ;; 3) Batchpredict on each subdataset using its model
        ;;    and combine the results in a single dataset
        result (make-predictions ds-ids model-ids))
    ;; 4) Get rid of intermediate datasets
    (delete* ds-ids)
    result))
Split input dataset
;; Keeps the rows whose "centroid" field equals the given centroid name
(define (filter-centroid ds-id centroid-name)
  (let (cname "centroid"
        fl (flatline "(= {{centroid-name}} (f {{cname}}))"))
    (create-dataset {"origin_dataset" ds-id "lisp_filter" fl})))

;; Assigns a centroid to every row and creates one dataset per centroid
(define (split-by-clusters ds-id cluster-id centroid-names)
  (let (bc (create-and-wait-batchcentroid {"cluster" cluster-id
                                           "dataset" ds-id
                                           "output_dataset" true
                                           "all_fields" true
                                           "distance" false
                                           "centroid_name" "centroid"})
        cds-id (get (fetch bc) "output_dataset_resource")
        _ (wait-forever cds-id)
        ds-ids (map (lambda (cid) (filter-centroid cds-id cid)) centroid-names))
    (wait-forever* ds-ids)))
Making batchpredictions and combining results
(define (batchprediction-dataset bp-id)
  (get (fetch bp-id) "output_dataset_resource"))

(define (make-predictions ds-ids model-ids)
  (let (p-ids (for (n (range (count ds-ids)))
                (let (mid (nth model-ids n)
                      dsid (nth ds-ids n))
                  (create-batchprediction {"model" mid
                                           "dataset" dsid
                                           "all_fields" true
                                           "output_dataset" true
                                           "confidence" true})))
        ds-ids (map batchprediction-dataset (wait-forever* p-ids))
        _ (wait-forever* ds-ids)
        ds-id (create-and-wait-dataset {"origin_datasets" ds-ids}))
    (delete* ds-ids)
    ds-id))
Library-based scripts
Script for single predictions
(define result (predict-by-cluster execution-id input-data))
Script for batch predictions
(define result (batchpredict-by-cluster execution-id dataset-id))
https://github.com/whizzml/examples/tree/master/model-per-cluster/us
