From 300k uncategorised images of watches to production models that recognise 58 classes with 90% precision, at scale.
With more than 60 million new listings every month on our platform, it can be cumbersome to narrow down your options to the exact item you want, and we don’t want to bother our users with extremely detailed forms to describe their listings. In fact, we want to extract those characteristics ourselves, automatically. That’s why we built an attribute extraction system that uses Deep Neural Networks for Image Classification and Object Localization to classify the many properties of watches, e.g. brand, type, style, color, and materials of dials and straps. This talk will focus on the approach we took to tackle the unique challenges of such a problem, the techniques we used, how to evaluate them, and how to scale them using Amazon SageMaker and TensorFlow Serving.
Call Girls In Mahipalpur O9654467111 Escorts Service
Structure Unstructured Data
1. Structure unstructured data
From 300k uncategorised images of watches to production models that
recognise 58 classes with 90% precision, at scale
Carmine Paolino
OLX Group
def preprocess_image(image_buffer):
    """Decode and preprocess one JPEG the same way Keras preprocesses
    inputs for its InceptionResNetV2 model (see https://goo.gl/je7BCx).

    Args:
        image_buffer: 0-D string tensor holding the raw JPEG bytes.

    Returns:
        A float32 tensor of shape [IMAGE_SIZE, IMAGE_SIZE, 3] with pixel
        values rescaled to [-1, 1].
    """
    # Decode the string as an RGB JPEG.
    image = tf.image.decode_jpeg(image_buffer, channels=3)
    # After this point, all image pixels reside in [0, 1) until the very
    # end, when they're rescaled to (-1, 1). The various adjust_* ops all
    # require this range for dtype float.
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    # resize_bilinear expects a batch, so temporarily add a batch dim.
    image = tf.expand_dims(image, 0)
    image = tf.image.resize_bilinear(
        image, [IMAGE_SIZE, IMAGE_SIZE], align_corners=False)
    image = tf.squeeze(image, [0])
    # Finally, rescale from [0, 1) to [-1, 1] (Inception-style inputs).
    image = tf.subtract(image, 0.5)
    image = tf.multiply(image, 2.0)
    return image


def export_to_tf(model_path, classes_json, export_dir, top_k):
    """Convert a trained Keras model into a TensorFlow Serving SavedModel.

    Builds a serving graph that accepts serialized tf.Example protos
    (feature key 'image/encoded' holding raw JPEG bytes), attaches the
    Keras-compatible preprocessing, runs the model, and exposes the top-k
    class names and scores through the standard classification signature.

    Args:
        model_path: Path to the saved Keras model (HDF5) to load.
        classes_json: Open file-like object containing a JSON array of
            class names, ordered by model output index.
        export_dir: Directory where the SavedModel will be written.
        top_k: Number of highest-scoring predictions to expose per image.
    """
    # Inference mode must be set BEFORE the model graph is built/loaded;
    # otherwise layers such as Dropout/BatchNorm may keep their training
    # behavior in the exported graph.
    K.set_learning_phase(0)
    model = keras.models.load_model(model_path)
    # Deal with serialized inputs: parse tf.Example protos for the JPEGs.
    serialized_tf_example = tf.placeholder(tf.string, name='tf_example')
    feature_configs = {
        'image/encoded': tf.FixedLenFeature(shape=[], dtype=tf.string),
    }
    tf_example = tf.parse_example(serialized_tf_example, feature_configs)
    jpegs = tf_example['image/encoded']
    # Attach preprocessing to the graph, one image at a time.
    images = tf.map_fn(preprocess_image, jpegs, dtype=tf.float32)
    output = model(images)
    # Output the top-k predictions (scores and class indexes).
    prediction_scores, prediction_indexes = tf.nn.top_k(output, top_k)
    # Map class indexes back to human-readable class names.
    classes = json.loads(classes_json.read())
    prediction_table = tf.contrib.lookup.index_to_string_table_from_tensor(
        tf.constant(classes))
    prediction_classes = prediction_table.lookup(
        tf.to_int64(prediction_indexes))
    # Build the signature_def_map for the standard classification API.
    classification_inputs = tf.saved_model.utils.build_tensor_info(
        serialized_tf_example)
    classification_outputs_classes = tf.saved_model.utils.build_tensor_info(
        prediction_classes)
    classification_outputs_scores = tf.saved_model.utils.build_tensor_info(
        prediction_scores)
    classification_signature = (
        tf.saved_model.signature_def_utils.build_signature_def(
            inputs={
                signature_constants.CLASSIFY_INPUTS: classification_inputs
            },
            outputs={
                signature_constants.CLASSIFY_OUTPUT_CLASSES:
                    classification_outputs_classes,
                signature_constants.CLASSIFY_OUTPUT_SCORES:
                    classification_outputs_scores
            },
            method_name=signature_constants.CLASSIFY_METHOD_NAME))
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    with K.get_session() as sess:
        builder.add_meta_graph_and_variables(
            sess=sess,
            tags=[tf.saved_model.tag_constants.SERVING],
            clear_devices=True,
            # Initializes the index -> class-name lookup table at load time.
            legacy_init_op=tf.tables_initializer(),
            signature_def_map={
                signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                    classification_signature
            })
        builder.save()