4. SPARK MBUTO
• Spark PoC to easily create, run and test pipelines and
workflows
• Pipelines are made of sequential steps in a SparkJobApp
• Each step is a SparkJob
• All jobs share the same Spark/SQL context
• Jobs are run consecutively by the JobRunner
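The job/runner structure described on this slide can be sketched in a few lines. This is only an illustration in plain Python, not the project's actual API: `SharedContext` is a hypothetical stand-in for the shared Spark/SQL context, and the fields and methods of `SparkJob` and `JobRunner` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class SharedContext:
    """Hypothetical stand-in for the Spark/SQL context shared by all jobs."""
    store: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SparkJob:
    """One pipeline step: a name plus a function that reads/writes the context."""
    name: str
    run: Callable[[SharedContext], None]


class JobRunner:
    """Runs the jobs one after another against the same context."""

    def __init__(self, context: SharedContext) -> None:
        self.context = context
        self.history: List[str] = []

    def run_all(self, jobs: List[SparkJob]) -> SharedContext:
        for job in jobs:
            job.run(self.context)          # every job sees the same context
            self.history.append(job.name)  # record the execution order
        return self.context


def load(ctx: SharedContext) -> None:
    ctx.store["data"] = [1, 2, 3, 4]


def square(ctx: SharedContext) -> None:
    ctx.store["squared"] = [x * x for x in ctx.store["data"]]


def total(ctx: SharedContext) -> None:
    ctx.store["total"] = sum(ctx.store["squared"])


pipeline = [SparkJob("load", load), SparkJob("square", square), SparkJob("total", total)]
runner = JobRunner(SharedContext())
result = runner.run_all(pipeline)
```

Because every step only talks to the shared context, a step can be tested on its own by handing it a context that already holds its inputs.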
13. IMAGE CLASSIFICATION
• Multiclass image classification:
1. Choose a model (NN, SVM, tree…)
2. Train/test model (with labeled images)
3. Predict the label of new images
4. Tune the model
14. IMAGE RETRIEVAL
• Image retrieval:
1. Choose metric (Euclidean, cosine…)
2. Build dictionary
3. Train/test the model
4. Query and search
5. Tune the model
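To show why the choice of metric in step 1 matters for step 4, here is a minimal brute-force query sketch. The index contents, the image names and the `query` helper are made up for illustration; only the two metrics come from the slide.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def query(index: List[Tuple[str, Vector]], q: Vector,
          metric: Callable[[Vector, Vector], float], k: int = 1) -> List[str]:
    """Brute-force search: rank every indexed vector by distance to q."""
    ranked = sorted(index, key=lambda item: metric(q, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical index of image feature vectors.
index = [("img_a", [1.0, 0.0]), ("img_b", [10.0, 10.0]), ("img_c", [0.0, 3.0])]
q = [4.0, 4.0]

nearest_euclid = query(index, q, euclidean)        # closest in absolute position
nearest_cosine = query(index, q, cosine_distance)  # closest in direction
```

On this toy index the two metrics return different images for the same query: Euclidean distance favours `img_c`, while cosine distance favours `img_b`, which points in the same direction as the query.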
19. CLASSIFICATION & RETRIEVAL
• Keypoint extraction from each image
• Clustering on the keypoint universe
• Represent each image as a weighted cluster vector
• Train & test the model
• Query the model (find the most similar images)
[Diagram: Feature Engineering => Build the Dictionary => Build the classifier => Query the model]
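The "weighted cluster vector" step can be sketched as follows. The slide does not say which weighting is used, so this example assumes plain normalized frequencies; the codebook and descriptors are toy values, and `encode_image` is a hypothetical helper name.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(descriptor: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook centroid closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def encode_image(descriptors: List[Vector], codebook: List[Vector]) -> List[float]:
    """Represent an image as the normalized frequency of each cluster.

    Assumption: the weight is the share of keypoints assigned to each cluster.
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_cluster(d, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]


# Toy codebook with three clusters and one image with four keypoint descriptors.
codebook = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
image_descriptors = [[1.0, 1.0], [0.5, -0.5], [9.0, 11.0], [1.0, 9.0]]
vector = encode_image(image_descriptors, codebook)
```

Every image ends up with a vector of the same length as the codebook, whatever its number of keypoints, which is what lets a classifier or a nearest-neighbour search compare images directly.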
20. C. & R. JOBS
• Load the whole dataset
• Extract keypoints
• Reduce the keypoint universe
• Transform the feature space
• Create the dictionary (aka Codebook)
• Train, test & evaluate the classifier
• Query and get prediction
[Diagram labels: DATA, TRAIN, CLASSIFIER, MODEL, PREDICTION]
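As an illustration of the "create the dictionary" job, the sketch below builds a codebook by clustering keypoints. It assumes k-means (Lloyd's algorithm) with a deterministic initialization on the first k points; the slides do not name the clustering algorithm actually used, and `build_codebook` is a hypothetical name.

```python
from typing import List

Point = List[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> Point:
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def build_codebook(keypoints: List[Point], k: int, iterations: int = 10) -> List[Point]:
    """Assumed approach: k-means, initialized with the first k keypoints."""
    centroids = [list(p) for p in keypoints[:k]]
    for _ in range(iterations):
        # Assignment step: attach each keypoint to its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in keypoints:
            idx = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


# Toy keypoint universe with two obvious groups.
keypoints = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
             [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]]
codebook = build_codebook(keypoints, k=2)
```

The resulting centroids are the "visual words" of the codebook; on real data the initialization and the number of clusters would both need tuning.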
23. KNN IMPLEMENTATION
• k-NN is a comparison-based model: the similarity metric is crucial!
• Nearest-neighbour search (in the codebook) is the pain point:
• KDTree: not parallel (although…)
• LSH: hyperparameters are difficult to tune
• MetricTree: disjoint feature-point regions
• Spill tree: too many shared points
=> HybridTree
24. HYBRIDTREE
• The TopTree is a metric tree
• The SubLeaf trees are spill trees, trained in parallel
• Nodes can be:
• OVERLAP => defeatist search
• NON OVERLAP => backtracking
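The two node types and their search rules can be sketched on a single machine as follows. This is a simplified illustration, not the implementation behind the slides: it does not reproduce the TopTree/SubLeaf split or the parallel training, and the parameter names `tau` (overlap buffer) and `rho` (maximum share of points a child may hold before the node falls back to a non-overlapping split) are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


def dist(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Node:
    points: Optional[List[Point]] = None  # set on leaves only
    origin: Optional[Point] = None        # first pivot
    direction: Optional[Point] = None     # unit vector towards the second pivot
    mid: float = 0.0                      # split position along the direction
    overlap: bool = False                 # OVERLAP or NON OVERLAP node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def project(p: Point, node: Node) -> float:
    return sum((x - o) * d for x, o, d in zip(p, node.origin, node.direction))


def build(points: List[Point], tau: float, rho: float = 0.7, leaf_size: int = 2) -> Node:
    if len(points) <= leaf_size:
        return Node(points=list(points))
    a = max(points, key=lambda p: dist(p, points[0]))  # far-apart pivot heuristic
    b = max(points, key=lambda p: dist(p, a))
    span = dist(a, b)
    if span == 0.0:                                    # all points identical
        return Node(points=list(points))
    node = Node(origin=a, direction=tuple((y - x) / span for x, y in zip(a, b)),
                mid=span / 2)
    proj = [project(p, node) for p in points]
    # Spill-tree split: points within tau of the boundary go to both children.
    left = [p for p, t in zip(points, proj) if t <= node.mid + tau]
    right = [p for p, t in zip(points, proj) if t >= node.mid - tau]
    node.overlap = max(len(left), len(right)) <= rho * len(points)
    if not node.overlap:
        # Too many shared points: fall back to a disjoint metric-tree split.
        left = [p for p, t in zip(points, proj) if t <= node.mid]
        right = [p for p, t in zip(points, proj) if t > node.mid]
    node.left = build(left, tau, rho, leaf_size)
    node.right = build(right, tau, rho, leaf_size)
    return node


def search(node: Node, q: Point, best=None):
    """Return (distance, point) for the best candidate found."""
    if node.points is not None:
        for p in node.points:
            d = dist(p, q)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    t = project(q, node)
    near, far = (node.left, node.right) if t <= node.mid else (node.right, node.left)
    best = search(near, q, best)
    # OVERLAP => defeatist search: never visit the other child.
    # NON OVERLAP => backtrack when the boundary is closer than the best hit.
    if not node.overlap and (best is None or abs(t - node.mid) < best[0]):
        best = search(far, q, best)
    return best


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (5.0, 0.0),
          (7.0, 0.0), (8.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
tree = build(points, tau=0.5)
hit = search(tree, (7.6, 0.0))
```

Defeatist search is approximate: on an OVERLAP node the true neighbour can be missed when it lies further than `tau` beyond the boundary, which is the price paid for skipping the backtracking.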
25. NEURAL NETWORK
• Convolutional networks work well with images
• Hyperparameter tuning is the pain point, but it can
be automated (see the new algorithm)
• Training is not trivial, and updating the model is
error-prone
26. WHAT MORE?
• Feature engineering
• Hyperparameter tuning
• Parallel optimizations
• Persist/update steps
• Ensemble models
[Diagram labels: DATA, Combiner, PREDICTION, Normalizer, pipelineModel, Cross Validator]
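Hyperparameter tuning with a cross validator can be sketched as below. The example assumes a toy one-dimensional k-NN classifier and a simple interleaved k-fold split; none of the function names come from the slides, and a real pipeline would plug in its own model and metric.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[float, str]  # (feature value, label)


def knn_predict(train: List[Sample], x: float, k: int) -> str:
    """Majority label among the k training samples closest to x."""
    neighbours = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def k_folds(data: List[Sample], n_folds: int) -> List[List[Sample]]:
    """Interleaved split: sample i goes to fold i % n_folds."""
    return [data[i::n_folds] for i in range(n_folds)]


def cv_accuracy(data: List[Sample], k: int, n_folds: int = 3) -> float:
    folds = k_folds(data, n_folds)
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(data)


def tune(data: List[Sample], candidates: List[int], n_folds: int = 3) -> int:
    """Pick the candidate k with the best cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(data, k, n_folds))


# Toy labeled data with two well-separated classes.
data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (5.0, "b"), (5.1, "b"), (5.2, "b")]
accuracy_k1 = cv_accuracy(data, k=1)
best_k = tune(data, candidates=[1, 3])
```

The same loop works for any hyperparameter: only the candidate list and the scoring function change.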