The document describes productionalizing a machine learning model for price prediction that was initially developed using Python notebooks. Key aspects include:
1) The ML pipeline extracts data from various sources, transforms it, trains XGBoost models for price classification and regression, and uploads results to S3.
2) A Java-based web service was developed to serve predictions using the trained models. It performs the same data transformations and vectorization as the notebooks using Java libraries.
3) Extensive unit and integration tests were written to ensure the web service produces identical results to the Python notebooks on both the training and test data. The tests load models and configuration from a zip file produced by the notebooks.
3. 3
PROJECT INFO
Customer:
A Canadian company that provides different fleet management services.
E.g. it runs a call center that handles all the maintenance and repairs of vehicles (acts as a “proxy”
between a client and service providers).
Use case:
A fleet owner contacts the agent to ask for assistance with the maintenance. The agent contacts
nearby service providers, gets offers, selects the supplier, negotiates the price for each line of the
maintenance order.
Problem:
a) Price negotiation takes agent’s time
b) Agents need to remember the details on cars/makes/models/spare parts to properly validate the
price.
Solution:
Price Prediction Web Service (based on ML) which predicts the maintenance price based on the
information about the vehicle, type of service, client location, etc.
4. 4
DATA SCIENCE SCOPE
Data Extraction
Destination: parquet files on HDFS,
scope: last 2 years, output: 28 million rows
Data Transformation
Filtering, joining, new fields/expressions,
destination: parquet files on HDFS, output: 5 mln. rows
ML Pipeline
label encoding, one hot encoding, vector assembling, training
XGBoost models, performance metrics.
Data Sources
Sybase, Cassandra, others
Typical scope of data scientist’s work:
We made two models:
• classification model answering the question “is the price for the repair relevant or not?”
• regression model (customer’s choice) answering the question “what is the recommended price for this maintenance item?”
2 times boost in agents time
1.10 times decrease in savings (due to FN)
5. 5
SOLUTION ARCHITECTURE
Client
application
(scoring service
consumer)
Scoring Web
Service
(SOAP+REST)
Stack: Java, Spring
Data Extraction
Destination: parquet files on HDFS,
scope: last 2 years, output: 28 million rows
Data Transformation
Filtering, joining, new fields/expressions,
destination: parquet files on HDFS, output: 5 mln. rows
ML Pipeline
label encoding, one hot encoding, vector assembling, training
XGBoost models, performance metrics.
Data Sources
Sybase, Cassandra, others
Uploading Results
Upload models, training metrics, lookup tables, labels for
category variables. Destination: S3, output: 150MB zip archive
HTTP (SOAP)
requests for
price prediction
Downloading
new models
Scheduled
run of
training
process
Administrator
HTTP (REST)
management
commands
Model Storage
(HDFS/S3)
models, variables, labels,
lookup tables
6. 6
TECHNOLOGY STACK
Training part:
• Jupyter notebook
• Spark 2.3.1
• Pandas, Numpy, Scikit-learn and other libraries
• XGBoost
Scoring part (web service):
• Java
• Spring Boot
• xgboost-predictor-java library
https://github.com/komiya-atsushi/xgboost-predictor-java
• Lots of other open source Java libraries
7. 7
TASKS AT DIFFERENT ENVIRONMENTS
Environments: Exploration/Development and Production. Parts: Training and Scoring.
Exploration/Development, Training
Environment:
• Python, Jupyter Notebook
• Spark/Scikit/XGBoost, etc.
Tasks:
• Get input data
• Rename fields
• Check values
• Modify field values
• Add new fields
• Filter rows
• Join other tables
• ML Pipeline tasks:
• Label encoding
• One hot encoding
• Train/test split
• Model training
• Metrics calculation
Scoring
Environment:
• Java, Spring Boot, etc.
Tasks/Challenges:
We need to do the same things in Java:
“rename fields”, “check values”, “modify
field values”, “add fields”, “filter rows”,
“join other tables” and some of the
“ML pipeline tasks” (“label encoding”,
“one hot encoding”, “scoring by model”)
Challenges:
a) How to represent data?
b) What libraries to use for transformation?
c) What libraries to use for ML pipeline
tasks?
d) What libraries to use for scoring?
Production, Training
Environment:
• the same as exploration/development
Goal:
re-use the same code as much as possible.
Other tasks that we need to do here:
• Scheduled running of the whole training
cycle
• Uploading of results to some storage
(S3/HDFS)
• Alerting if metrics are below the
expectations
• Alerting if errors occurred during
training
9. 9
Spring Boot Web Service
Price Prediction Service Library
SCORING WEB SERVICE ARCHITECTURE
REST
Controller
SOAP
Endpoint
HTTP
Request
Transformers
Vectorization
(Label Encoder,
One-hot
encoder)
Scoring
(ML
models)
HTTP
Response
Payload
Logging
ML Data
Serialized Models + vectorization info (variables, labels, data types).
Lookup tables (for enriching the feature records).
The scoring web service is a pure Java solution that uses the artifacts (“ML Data”) produced by the Python training code.
*Many other things (payload shipment on S3/HDFS, updater of the ML Data files, management interface) are not shown here and will be shown later.
10. 10
PRICE PREDICTION SERVICE LIBRARY
Price Prediction Service Library
Transformers
Vectorization
(Label Encoder,
One-hot
encoder)
Scoring
ML Data
Serialized Models + vectorization info (variables, labels, data types).
Lookup tables (for enriching the feature records).
Input Output
Input: maintenance order
• Order: VIN, country, supplier_id
• Line: repair code, ATA category, quantity.
Note: NO FEATURES HERE
Scoring web service needs to do the same
things as notebook, but on smaller data (5-10
lines per order):
a) It filters records (e.g. “remove Mexico data”)
b) It adds columns (=features), e.g. VIN =>
make, model, engine size, etc. Often done by
doing a lookup (=join to other table);
c) It generates features, e.g. ata_key =
ata_category + “_” + ata_subcategory.
Output: price prediction for every line.
11. 11
NOTEBOOK DATA DUMPS
Training notebook dumps lots of things: models, lookup tables, data for the integration tests
Aggregations
Root data
and lookups
Variables
configuration
ML models
and test
dataset
12. 12
BASE CLASSES
VIN | Supplier_ID | Repair code | ATA cat. | ATA subcat | Qty
1HGBH41JXMN109186 | 123456 | REP | 74 | 001003 | 1
[Same columns] Make Model Fuel Type … Supplier City
[Same columns] BMW X5 petrol … San Francisco
FeatureRecord - container for
features
FeatureRecordGroupedSet –
grouping by any field (in our
case – by order id)
Transformers – enrich FR with features:
• LookupTransformer – adds new columns
• ApplicableTransformer – stops processing
if some field is not in the lookup table
• OilGroupTransformer – groups similar
records
• Etc. (many others exist)
Make_BMW Make_Ford … Quantity … Prediction
1.0 0 … 1 … $55
MultiModel:
vectorizes the feature record and
does prediction
13. 13
LOOKUP TRANSFORMER
Purpose: enrich the feature record with real features by making a lookup. Backed with:
• InMemoryIndexedDataFrame – a fast in-memory lookup
• IndexedDataFrameReader – reader of the df.csv.gz + df.schema pair of files
.schema file:
.csv.gz file example:
ata_ctgy_cd,ata_sub_ctgy_cd,ata_cd_long_desc,english_cd_long_desc,cd_stat_ind
17,001100,NEW TIRE RADIAL STEEL BELTED,NEW TIRE RADIAL STEEL BELTED,A
17,003001,USED TIRE,USED TIRE,A
10,02,010045,MIRROR SPOT,MIRROR SPOT,A
14. 14
FULL CHAIN TRANSFORMER
Full chain transformer
combines all the atomic
transformations to one chain
Running transform() gives the
same result as if the same
records were passed through
notebook’s PySpark ETL code.
15. 15
OTHER CLASSES
MultiModel does two things: vectorization of the feature record (into a sparse vector of
doubles) and prediction. It encapsulates many XGBoost models (a separate model for every
ATA code – a subject of repair).
It uses biz.k11i.xgboost (https://github.com/komiya-atsushi/xgboost-predictor-java).
MultiModelReader is a class that loads the global configuration and all the models from
the config provider.
ConfigProvider is an abstraction that allows reading resources from one place.
Currently there is just one implementation – ZipConfigProvider (reads everything
from a single zip file).
PricePredictionService – a class which combines FullChainTransformer and MultiModel.
16. 16
CONFIGURATION
Configuration structure (contents of the zip file):
• gg
    • agg.json - aggregations configuration file
    • <set of folders for each ata_key>
        • <set of pairs .csv.gz+.schema for each aggregation>,
          e.g. agg_vin_model.csv.gz, agg_vin_model.schema
• config
    • global.json – configuration file that describes all models, their variables, and
      possible labels for cat. variables
• models
    • <for every ata_key: .bin, .config and .txt file> - models serialized by XGBoost
• lookup
    • <pairs of .csv.gz + .schema files for lookups>
ZipConfigProvider reads this zip file (size ≈ 200MB) produced by the notebook.
Uncompressed size in the Java structures that we chose: 1.8 GB.
18. 18
UNIT VS. INTEGRATION TESTING
Unit test | Integration test
Result depends only on Java code | Result also depends on external systems/data
Easy to write and verify | Setup of an integration test might be complicated
A single class/unit is tested in isolation | One or more components are tested
All dependencies are mocked if needed | No mocking is used (or only unrelated components are mocked)
Test verifies only the implementation of code | Test verifies the implementation of individual components and their interconnection behavior when they are used together
A unit test uses only JUnit/TestNG and a mocking framework | An integration test can use real containers and real DBs as well as special integration testing frameworks (e.g. Arquillian or DbUnit)
Mostly used by developers | Integration tests are also useful to QA, DevOps, Help Desk
A failed unit test is always a regression (if the business has not changed) | A failed integration test can also mean that the code is still correct but the environment has changed
Unit tests in an Enterprise application should last about 5 minutes | Integration tests in an Enterprise application can last for hours
19. 19
INTEGRATION TESTING
Goal: to ensure that the web service is doing EXACTLY THE SAME THINGS as the training notebook does.
Training notebook outputs:
• mldata_20180807_093948.zip (200MB) - scoring configuration (ML models, lookups, variable configuration, etc.)
• mldata_test_20180807_093948.zip (600 MB) – scoring configuration + IT data:
What do we check:
• Take the input dataset (VIN, supplier_id, country, odometer reading, ATA category/subcategory, repair code, parts
quantity).
• Pass through FullChainTransformer and check if “Features by Python” = “Features by Java”
• Get predictions using MultiModel and check if “Prediction by Python” = “Prediction by Java”
Input Data
(all test dataset but JUST INPUT features)
VIN, supplier_id, ATA code, odom. reading, qnty
1M records
Test Data
with ALL features and predictions
All features (make, model, etc) + predictions
400K records
20. 20
INTEGRATION TESTING
Maven life cycle phases:
1. validate - checks if the project is correct and all information is available
2. compile - compiles source code into binary artifacts
3. test - executes the tests
4. package - takes the compiled code and packages it, for example into a JAR
5. integration-test - takes the packaged result and executes additional tests that require the packaging
6. verify - performs checks if the package is valid
7. install - install the result of the package phase into the local Maven repository
8. deploy - deploys the package to a target, i.e. remote repository
Example of how to run:
mvn clean install -Dmldata=/path/to/mldata_test_20181010_150000.zip
21. 21
CHECKED HASH MAP AS FEATURE CONTAINER
CheckedHashMap – an override of the HashMap, but with explicit operations.
Goal: avoid errors, increase control.
Overridden operations:
• put(): fails if the key already exists in the map
• get(): fails if the key doesn’t exist in the map
• remove(): fails if the key doesn’t exist in the map
New Operations:
• overwrite(): the key must exist, otherwise it will fail
• overwriteIfExists(): overwrites if the key exists, otherwise does nothing
• putIfNotExists(): doesn’t fail, works only if the key doesn’t exist
• putOrOverwrite(): no matter if the key exists or not, puts or overwrites it there
Left as is:
• getOrDefault(): if the key doesn’t exist, it will return a default
All operations do NOT allow null keys!
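The Java source of CheckedHashMap is not part of this transcript. To illustrate the semantics listed above, here is a minimal sketch in Python; the class name `CheckedMap`, the snake_case method names and the chosen exception types are assumptions made for this illustration, not the original API.

```python
class CheckedMap:
    """Illustrative Python port of the CheckedHashMap rules (hypothetical names)."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _require_key(key):
        # Mirrors "All operations do NOT allow null keys".
        if key is None:
            raise ValueError("null (None) keys are not allowed")

    def put(self, key, value):
        """Fails if the key already exists."""
        self._require_key(key)
        if key in self._data:
            raise KeyError(f"put(): key already exists: {key!r}")
        self._data[key] = value

    def get(self, key):
        """Fails if the key does not exist (instead of returning None)."""
        self._require_key(key)
        if key not in self._data:
            raise KeyError(f"get(): key does not exist: {key!r}")
        return self._data[key]

    def remove(self, key):
        """Fails if the key does not exist."""
        self._require_key(key)
        if key not in self._data:
            raise KeyError(f"remove(): key does not exist: {key!r}")
        return self._data.pop(key)

    def overwrite(self, key, value):
        """The key must exist, otherwise the call fails."""
        self.get(key)
        self._data[key] = value

    def overwrite_if_exists(self, key, value):
        """Overwrites only when the key exists, otherwise does nothing."""
        self._require_key(key)
        if key in self._data:
            self._data[key] = value

    def put_if_not_exists(self, key, value):
        """Never fails on duplicates; stores only when the key is absent."""
        self._require_key(key)
        self._data.setdefault(key, value)

    def put_or_overwrite(self, key, value):
        """Stores the value whether or not the key exists."""
        self._require_key(key)
        self._data[key] = value

    def get_or_default(self, key, default):
        """Returns the default when the key is absent."""
        self._require_key(key)
        return self._data.get(key, default)
```

The point of the design is that a read of a feature nobody stored raises immediately, so a missing feature shows up as an error at the place where it is read rather than as a silent null further down the pipeline.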
22. 22
SERIALIZATION IN HEX
Double and float values inside CSV/LIBSVM files are written like this:
• 3.5/0000000000000c40 - for double values
• 5.1016541/c040a340 – for float values
At the python side:
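The slide's Python code is not included in this transcript. The two sample values above are consistent with writing the readable decimal value, a slash, and the IEEE 754 bytes in little-endian order as hex, so a minimal sketch under that assumption could look like this (the function names are hypothetical):

```python
import struct


def encode_double(value: float) -> str:
    """Return 'decimal/hex' where hex is the little-endian 64-bit IEEE 754 form."""
    return f"{value!r}/{struct.pack('<d', value).hex()}"


def encode_float32(value: float) -> str:
    """Return 'decimal/hex' where hex is the little-endian 32-bit IEEE 754 form."""
    return f"{value!r}/{struct.pack('<f', value).hex()}"


def decode_double(cell: str) -> float:
    """Ignore the human-readable part and rebuild the exact double from the hex part."""
    _, hex_part = cell.split("/")
    return struct.unpack("<d", bytes.fromhex(hex_part))[0]


def decode_float32(cell: str) -> float:
    """Rebuild the exact 32-bit float (returned as a Python float) from the hex part."""
    _, hex_part = cell.split("/")
    return struct.unpack("<f", bytes.fromhex(hex_part))[0]
```

Reading the hex part instead of the decimal text avoids any difference in decimal parsing or rounding between Python and Java, which matters when the integration tests require identical features and predictions on both sides.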
23. 23
SERIALIZATION IN HEX
At the Java side:
• HexParser class with helper static
methods to parse hex values – see the
code.
• Using SuperCSV library – to read the
CSV files.
• Made a class ParseDoubleHex extending
SuperCSV’s CellProcessor leveraging
HexParser to get values out of
3.5/0000000000000c40 .
• The same for float type.
24. 24
PRETTY PRINTING OF ERRORS
Advice: do a “pretty print” of error data. If you don’t do it, fixing bugs will be hard.
25. 25
PRETTY PRINTING OF ERRORS
Example of how easy it is to fix errors when we have pretty printing:
26. 26
MEMORY OPTIMIZATION FOR IT
Before re-design of integration tests: 11 GB
After re-design (added partitioning): 3.8 GB (three times less)
28. 28
TECHNICAL DEBT
Citing “Hidden Technical Debt in Machine Learning Systems”:
It may be surprising to the academic community to know that only a tiny fraction of the
code in many ML systems is actually devoted to learning or prediction.
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
29. 29
Spring Boot Web Service
Price Prediction
Service Library
SCORING WEB SERVICE ARCHITECTURE
REST
Controller
SOAP
Endpoint
HTTP
Request
HTTP
Response
Payload
Logging
ML Data
(called a “config”)
Config
Updater
Payload
Shipping
Management
REST Controller
Client
application
(scoring service
consumer)
Model Storage
(HDFS/S3)
models, variables,
labels, lookup tables
HTTP
Request
Payload Storage
(HDFS/S3)
Daily zip files of all the
payloads
Payloads
Scheduler
(Spring)
VIN Decoder
(external REST service)
30. 30
PRICE PREDICTION CONTROLLER/ENDPOINT
Endpoints:
• /rest – for REST (RestControler class)
• /soap – for SOAP (SoapEndpoint class)
• /manage – for management REST requests
Overrides of default behavior:
• for REST: override exception resolver, JSON message
converter (fails on unknown properties)
• for SOAP: proper handling of InvalidXmlException
and SoapMessageCreation – doing “500 Internal
Server Error” instead of “400 bad request”.
• For both: async uploads of payloads (derived
MessageDispatcherServlet and DispatcherServlet)
31. 31
PROCESSING ALGORITHM
The algorithm is common for REST controller and SOAP endpoint:
• validate the input data
• get the instance of the PricePredictionService from the manager
• perform prediction
• check the status. If it is OK – return the result, otherwise log the error and return the error.
34. 34
MANAGEMENT CONTROLLER
The web UI is rendered
by Swagger + Springfox
Caveat #1:
it cannot generate good
examples for the
properties of type
Map<Long, SomeClass>
Caveat #2:
It doesn’t cover SOAP
/swagger-ui.html - automatically generated endpoint for help on the REST methods
36. 36
MEMORY CLEANING
MemoryCleaner: utility class for cleaning memory.
Used on SWITCH operation.
• uses RuntimeUtil.gc() from jlibs-core (https://santhosh-tekuri.github.io/jlibs/), which guarantees
that garbage collection happens, contrary to System.gc()
• Runs a thread that retries freeing up RAM every T seconds as long as
• the previous attempt freed up less than X MB
• no more than N attempts have been made
Reason: old REST requests may still be running and hold
a reference to the old instance of the
PricePredictionService.
38. 38
SONARQUBE
SonarQube is a tool for continuous inspection of code quality.
It helps to find potential bugs, does code review, checks unit tests, coverage, etc.
Supports 20 programming languages.
39. 39
SONARQUBE: CODE COVERAGE
Advice: try to cover all the code with unit tests.
[Screenshot: this bug was found after covering these lines with unit tests.]
41. 41
SONARQUBE: TRUE POSITIVES
Reference on explanation why we need the interrupt:
https://stackoverflow.com/questions/4906799/why-invoke-thread-currentthread-interrupt-in-a-catch-interruptexception-block
45. 45
RUNNING NOTEBOOKS OFFLINE
Goal: to re-use the code in exploration/development and in production
How it works:
nbrun.py notebook.ipynb -o out -k "pyspark 2.3.1" -e dev -i 3 -z results_{}.zip -t 180 -m 5
-o = output folder
-k = kernel name
-e = environment
-i = the cell where to insert the environment load
-z = the zip file pattern (to put the ipynb + *.py files after the run)
-t = timeout for the kernel to start
-m = maximum number of times to try to start the kernel
Algorithm:
• read the .ipynb file (just the code: omitting the output which may be there)
• insert some cells into position “-i”: a cell with profile name and override of the print function
• create the “-o” output folder, put all modules there, and make this folder the working one.
• start the kernel, run the cells, output results to the “-o” folder
• zip the contents of the “-o” folder (notebook with executed content + py files)
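The first two steps of the algorithm (dropping stored outputs and inserting an environment cell at position “-i”) can be sketched as follows. The real nbrun.py relies on nbformat; this sketch uses only the `json` module to stay dependency-free, and the function name `prepare_notebook` and the contents of the inserted cell are assumptions for illustration.

```python
import json


def prepare_notebook(ipynb_text: str, insert_at: int, env_name: str) -> dict:
    """Strip outputs from an nbformat-4 notebook and insert an environment cell."""
    notebook = json.loads(ipynb_text)

    # Keep just the code: outputs left over from interactive runs are dropped.
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None

    # Hypothetical inserted cell: it overrides ENV_NAME for the offline run.
    env_cell = {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": f'ENV_NAME = "{env_name}"\n',
    }
    notebook["cells"].insert(insert_at, env_cell)
    return notebook
```

The returned dictionary can then be written back with `json.dumps` and handed to whatever executes the notebook; execution, timeouts and kernel retries are outside this sketch.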
46. 46
RUNNING NOTEBOOKS OFFLINE
How nbrun works:
• Nbformat – https://github.com/jupyter/nbformat
for reading/writing ipynb, inserting/editing cells
• Nbconvert – https://github.com/jupyter/nbconvert
for running the notebook with a specified kernel and getting the output
Applied tricks:
• Override nbconvert.preprocessors.ExecutePreprocessor:
• preprocess_cell: to measure execution time and add
cell.metadata["ExecuteTime"] = {"end_time": time_end, "start_time": time_start}
• run_cell: to log the execution start/end and result just into console output of nbrun.py
• preprocess: to fix the bugs with shutting down the kernel process when we could not
connect to it, to change the timeout for kernel start, and to retry starting if something
failed (which happens quite often with “heavy” kernels like PySpark).
47. 47
ENVIRONMENT REPLACEMENT
This environment will be
loaded at development
time
Here nbrun.py will insert
a cell overriding the
ENV_NAME
This code
will dynamically load the
env_${ENV_NAME}.py
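The notebook cells shown on this slide are not part of the transcript. A minimal sketch of how a notebook could load env_${ENV_NAME}.py dynamically is given below; the helper name `load_environment`, the default value of `ENV_NAME` and the convention of exporting only upper-case names are assumptions.

```python
import importlib.util
from pathlib import Path

# Default used at development time; in an offline run the cell inserted by
# nbrun.py would reassign ENV_NAME before the loading code executes.
ENV_NAME = "dev"


def load_environment(env_name: str, folder: str = ".") -> dict:
    """Execute env_<env_name>.py and return its upper-case variables as a dict."""
    path = Path(folder) / f"env_{env_name}.py"
    spec = importlib.util.spec_from_file_location(f"env_{env_name}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return {name: value for name, value in vars(module).items() if name.isupper()}


# In the notebook, one possible usage would be:
# globals().update(load_environment(ENV_NAME))
```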
48. 48
PATCHED PRINT FUNCTION
Notebook’s output
System’s output
(nbrun.py)
The print() function is overridden and inserted by nbrun.py.
Goal: to allow having the output in two places – the notebook and stdout of nbrun.py
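The override itself appears only as a screenshot in the original slides. One way such a patched print could be written is sketched below; the factory `make_tee_print` and the idea of passing the second stream explicitly are assumptions, not the code that nbrun.py actually inserts.

```python
import builtins

# Keep a reference to the real print so the patched version can delegate to it.
_original_print = builtins.print


def make_tee_print(mirror):
    """Build a print replacement that also writes everything to `mirror`."""

    def tee_print(*args, **kwargs):
        # First behave exactly like the normal print (goes to the notebook output).
        _original_print(*args, **kwargs)
        # Mirror only ordinary prints; calls that target an explicit file are left alone.
        if "file" not in kwargs:
            _original_print(*args, **dict(kwargs, file=mirror))

    return tee_print


# Inside a kernel one could, for example, install it like this:
# import sys
# builtins.print = make_tee_print(sys.__stdout__)
```

Whether `sys.__stdout__` of the kernel process actually reaches the console of nbrun.py depends on how the kernel is launched, so treat the stream choice as something to verify in your own setup.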
49. 49
ENVIRONMENT PARAMETERS
Environment parameters:
• FILESYSTEM, DB_NAME: place where we will store temporary tables.
Supports: s3, hdfs, cassandra, local, FiloDB
• HDFS/LOCAL/S3 parameters (depending on the type)
• Spark unpersisting parameters: mode which will force the Spark to unpersist the dataframes
(related to a bug with Spark 1.4 which caused cascaded unpersisting)
• Upload parameters: S3/HDFS parameters of where to upload the results of the training
• Metrics limits: upper limits for ML metrics (if met, then the new models will be uploaded)
• Datasets: location, table names, SQL statements of how to get the source data
• Thresholds for category variables (e.g. “train for top 1000 makes, ignore the others”)
• TOP_N: how many models to train
• VIN decoder parameters: how to decode the VINs in the case if they’re absent in the lookup
table
50. 50
NOTEBOOK STRUCTURE
Notebook’s code is split into sections:
• Environment loading (if run offline – will be made by
nbrun.py)
• Loading modules, initializing shared variables
• Data extraction
• Data transformations (filtering, joining, new features)
• Model training
• Building model metrics and analysis
• Dumping of artifacts (models, lookup, etc.), zipping results
and metrics
Each section reads data from the previous section results and
saves its own results.
Each section can be switched off during development (to save
execution time).
51. 51
MODULES
Functions were moved to the modules.
Reasons:
• Easy to debug
• Easy to see errors
• Easy to do unit tests
• PyCharm capabilities of
code navigation and type hints
Reference: https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html
52. 52
SHARED VARIABLES
Common variables are shared between modules.
Example of how to share variables:
1) Create a module shared.py with variables with
the same names as in the notebook
2) Call “init” at the beginning:
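The code from this slide is missing in the transcript, so the following is only a sketch of the pattern. To keep it self-contained, the module shared.py is simulated with `types.ModuleType`; the variable names `spark` and `ENV` and the signature of `init` are hypothetical.

```python
import types

# Stand-in for a real file shared.py; in a project this would be its own module
# that both the notebook and the helper modules import.
shared = types.ModuleType("shared")
shared.spark = None  # would hold the Spark session created in the notebook
shared.ENV = None    # would hold the loaded environment parameters


def init(spark, env):
    """Called once at the beginning of the notebook to publish common variables."""
    shared.spark = spark
    shared.ENV = env


shared.init = init


def describe_environment() -> str:
    """Example of a function in another module that reads the shared state."""
    return f"running in {shared.ENV['name']}"
```

After `shared.init(spark, ENV)` has run in the notebook, functions moved into modules can read `shared.spark` and `shared.ENV` without receiving them as arguments on every call.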
53. 53
MEMORY PROFILING OF THE NOTEBOOK
Memory profiling and usage of “del” at the end of sections:
A simple way is to write a decorator.
Then apply it to “heavy”, frequently used functions:
@profile
def your_func():
    ...
and you will see how much memory the
notebook’s kernel consumed before and after
the function call.
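The decorator itself is not shown in the transcript. A minimal sketch using only the standard library is given below; it relies on `tracemalloc`, which tracks Python-level allocations only, whereas the original presumably reported the memory of the whole kernel process. Treat it as an approximation of the idea rather than the authors' implementation.

```python
import functools
import tracemalloc


def profile(func):
    """Print traced memory before and after each call of the decorated function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Start tracing only if nobody else has started it already.
        started_here = not tracemalloc.is_tracing()
        if started_here:
            tracemalloc.start()
        before, _ = tracemalloc.get_traced_memory()
        try:
            return func(*args, **kwargs)
        finally:
            after, peak = tracemalloc.get_traced_memory()
            mib = 2 ** 20
            print(
                f"{func.__name__}: before={before / mib:.1f} MiB, "
                f"after={after / mib:.1f} MiB, peak={peak / mib:.1f} MiB"
            )
            if started_here:
                tracemalloc.stop()

    return wrapper
```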
54. 54
MEMORY PROFILING OF THE NOTEBOOK
Each section ends with checking unreleased Pandas and Spark dataframes.
Results:
• decreased memory of the notebook from 12GB to 4GB
• removed all cached Spark dataframes from the cluster memory (=decreased the demands for cluster
resources).
56. 56
HIGH LEVEL API
Goals:
• to simplify usage of Spark dataframes/SQL
• to provide implicit caching defaults for common operations
• to minimize errors.
Commonly used functions:
• load_df(db_name, table_name, ...), save_df(df, db_name, table_name, ...)
• change_df(a_select_cols, a_drop, a_replace, a_rename, a_add, a_distinct, a_order_by, a_drop_end, a_filter_df,
a_filter_columns, a_filter_not_df, a_filter_not_columns, a_where)
• join_df, group_by
• filter_by_where, filter_by_df, filter_by_not_df, filter_by_threshold, filter_duplicates
Example: new_df = change_df(df, a_add={"new_col": "case when col > 0.0 then 1 else 0 end"})
instead of new_df = df.withColumn("new_col", F.when(df["col"] > 0.0, 1).otherwise(0))
57. 57
CREATEDATAFRAME MONKEY PATCH
Problem:
the sqlContext.createDataFrame() function doesn’t contain a numSlices parameter (that is present in the
sc.parallelize() and defines the number of partitions). This is true up to 2.3.1
Why it is important:
to control the number of partitions for conversion from the Pandas dataframe into Spark dataframe.
Solution: patch three functions (code is in the sparkdf.py):
• SparkSession._createFromLocal = _createFromLocalMonkeyPatch
• SparkSession.createDataFrame = createDataFrameMonkeyPatch_session
• SQLContext.createDataFrame = createDataFrameMonkeyPatch_sqlcontext
In all of the overrides add the numSlices and pass it accordingly to the sc.parallelize()
58. 58
NOTEBOOK KERNEL PARAMETERS
Parameters that we strongly advise applying for a standalone Spark cluster (explained below the listing):
{
"display_name": "pyspark cluster - ibobak - 3e 3c",
"language": "python",
"argv": ["/opt/conda/envs/py27/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
"env": {
"SPARK_HOME":"/opt/spark",
"PYTHONPATH":"/opt/spark/python/lib/py4j-0.10.4-src.zip:/opt/spark/python",
"PYTHONSTARTUP":"/opt/spark/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS":" --packages com.databricks:spark-avro_2.11:3.2.0
--driver-memory 5G --executor-memory 10G --num-executors 3 --executor-cores 3 --total-executor-cores 9
--master spark://10.4.12.36:7077
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=1024m
--conf spark.driver.extraJavaOptions="-Xss16m" --conf spark.executor.extraJavaOptions="-Xss16m"
--conf spark.cassandra.output.consistency.level=ALL --conf spark.cassandra.input.consistency.level=ALL pyspark-shell"
}
}
All three parameters num-executors, executor-cores and total-executor-cores must be specified (otherwise the number of cores
will be unpredictable). The serialization parameters are strongly advised to speed up dataframe caching. -Xss16m is advised
to avoid stack overflow errors. The Cassandra parameters are needed when you write to Cassandra and don’t want
records to be lost after saving and re-loading them.
60. 60
JENKINS JOBS
Jenkins jobs to run different parts of the flow:
• ml_build_scoreapi: builds the price prediction web service (Java), runs the unit tests, does SonarQube analysis,
uploads the jar to the artifactory.
• ml_config_scoreapi: takes the new config files from Git and puts them on the server (dev/qa/prod), restarts the
service.
• ml_deploy_scoreapi: takes the jar file from the artifactory and puts it on the server (dev/qa/prod), restarts
the web service.
• ml_deploy_training: takes the notebooks from Git and puts them to the working folder of the training server.
• ml_run_training: runs on the training server these steps:
• Data extraction into parquet files.
• Offline running of the notebooks (using nbrun.py)
• Integration tests
• Checking ML metrics
• Uploading the training results on S3
• Issuing /manage/syncswitch HTTP GET-request on the working instance of the scoring web service.
The customer selected the regression model based on business demands and a low-risk approach: the maintenance order has to be manually reviewed anyway to check it against policies (e.g. “does the client have the right to wash the car?”).
XGBoost actually replaced Spark’s GBT and RF after we discovered that on 5 million rows it is much better to train locally on multiple cores instead of doing this with Spark.
We looked at MLeap in December but ran into a set of problems, including no XGBoost support at that time. In the scoring part we need to transform just a couple of rows: much of the transformation logic either vanishes or changes, so no “copy-paste” of the training notebook’s code is possible (actually, that would not be possible anyway: the notebook has PySpark code, while we need Java).
There are two modules (jars):
price-prediction-service.jar (the library): does all the price prediction stuff
price-prediction-web.jar (the web service): wraps the library into a Spring Boot microservice and provides a management interface, ML data version switching/syncing, payload shipments, VIN decoder testing, etc.
After all the dumps have finished, the notebook creates two zip files: mldata_YYYYMMDD_hhmmss.zip and mldata_test_YYYYMMDD_hhmmss.zip (the extended one, for integration tests).
FeatureRecord:
features (CheckedHashMap<String,Object>) – for holding feature names and values
stopInfo (Map<String, String>) – for marking the record as “stopped”
updateInfo (Map<String, String>) – for outputting information that “prediction was made to an updated value of some feature”
predictionInfo (Map<String,String>) – for outputting the price prediction (potentially we can put there several predictions)
All transformers should work like this:
If the record is stopped, it should return it immediately without any processing
If the record is “alive”, it should create a copy of it and apply the transformation, then return the transformed record.
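The contract above (stopped records pass through untouched, live records are copied before being changed) can be illustrated with a small sketch. It is written in Python for brevity although the real classes are Java; `FeatureRecord` is reduced to two fields, and the class `LookupSketch`, which both adds columns and stops the record on a missing key, is a hypothetical simplification rather than the actual LookupTransformer.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FeatureRecord:
    """Reduced stand-in for the Java FeatureRecord: features plus stop information."""

    features: dict = field(default_factory=dict)
    stop_info: dict = field(default_factory=dict)

    @property
    def stopped(self) -> bool:
        return bool(self.stop_info)


class LookupSketch:
    """Enriches a record with columns found in an in-memory lookup table."""

    def __init__(self, key: str, table: dict):
        self.key = key
        self.table = table

    def transform(self, record: FeatureRecord) -> FeatureRecord:
        # Rule 1: a stopped record is returned immediately, without processing.
        if record.stopped:
            return record
        # Rule 2: a live record is copied, and only the copy is modified.
        result = copy.deepcopy(record)
        row = self.table.get(result.features[self.key])
        if row is None:
            result.stop_info[self.key] = "value not found in lookup table"
        else:
            result.features.update(row)
        return result
```

Working on a copy keeps the input record intact, which makes it possible to compare a record before and after each step of the chain when a test fails.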
We considered many implementations (MySQL, SQLite, Memcached) but settled on this simple one for one reason: in our case it fits into 2 GB of RAM. If it did not, we would do something more complicated.
ATA = American Trucking Associations
Do not use XGBoost4J: slow, buggy, platform dependent (uses JNI)
It is often a temptation to work in the training notebook with a dataset which already contains some of the features, like “make”, “model”, etc.
The rule is as follows: if the scoring web service doesn’t get this feature from the client, then your notebook MUST have an ETL part that takes this feature from some lookup, some external source, or anything else.
Such explicit control of feature management allowed us to avoid errors; in particular, they were easily detected during integration tests.
The concept is as follows: if you did not put something in as a feature, then please DO NOT TRY TO GET IT. But Java’s HashMap, TreeMap and others simply return null if you call map.get("Feature_X") when Feature_X doesn’t exist because no one ever put it.
Often in the testing code we see just “assertEquals(target, actual)”. OK, the assertion will fail. But how much time will you spend searching for the reason?
In our case, we built a TreeMap of features and printed it to the logs, so that after running all tests we got detailed output and could fix the problem in 10 minutes.
The freeware comparison tool is called WinMerge.
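The idea of printing all features in sorted order, so that two outputs can be compared line by line, can be sketched as follows. The original does this in Java with a TreeMap; the function name `describe_mismatch` and the output layout here are assumptions for illustration.

```python
def describe_mismatch(expected: dict, actual: dict) -> str:
    """Render both feature sets sorted by name, flagging the lines that differ."""
    lines = []
    for name in sorted(set(expected) | set(actual)):
        expected_value = expected.get(name, "<missing>")
        actual_value = actual.get(name, "<missing>")
        marker = "  " if expected_value == actual_value else "!="
        lines.append(f"{marker} {name}: python={expected_value!r} java={actual_value!r}")
    return "\n".join(lines)
```

Attaching this text to the assertion message (or writing it to the log) shows directly which feature diverged, instead of only reporting that two large records are not equal.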
The second picture is intentionally decreased by height to see how much less memory it started to use.
There is much more work to do than you think at the beginning.
Orange blocks are those pieces of functionality that are helper stuff: we could live without them, but life with them becomes easier.
The prediction method is called “safePredict()” instead of just “predict()”. That means that we do not raise any exception from there: the error information is encapsulated in the response. This pattern allows the controller/endpoint not to care about ML internals: it is not their business which exception may happen inside the price prediction service. The PredictionResponse has getErrorDescriptionForServerLog(), whose result is logged for further analysis on the server side, but it won’t go out of the service to the external world.
This is a sample prediction request and response in JSON format. As you can see, not every order line contains a prediction: some may contain a “stop” message, others may contain an additional “update” message which notifies that “we’ve made the prediction for this item forcibly setting quantity = 1”, so no matter what value was there – 1 or not 1 – we predicted as if it was “1”.
Serialization/deserialization into XML and JSON is done using the same classes – PredictionRequest/PredictionResponse. Fields are annotated both with JAXB and Jackson annotations.
Swagger is good enough but NOT the best tool for documenting. A huge problem is properties of type Map<anything, SomeClass>: it cannot automatically generate good examples for the UI that lets you test how the service works.
To enable Swagger with a @Configuration class RestMvcConf extends WebMvcConfigurationSupport, we had to add
@Override
protected void configureDefaultServletHandling(DefaultServletHandlerConfigurer configurer) {
    configurer.enable();
}
otherwise the default request handler for /** was not enabled.
A simple approach that we use everywhere is to return Map<String, Object> from /info methods: it is converted to JSON automatically and we don’t need to care about types.
SonarQube (formerly Sonar) is an open source platform developed by SonarSource for continuous inspection of code quality. It performs automatic reviews with static analysis of code to detect bugs, code smells and security vulnerabilities in 20+ programming languages. SonarQube offers reports on duplicated code, coding standards, unit tests, code coverage, code complexity, comments, bugs, and security vulnerabilities.
Why do we need interrupts:
when you catch the InterruptedException and swallow it, you essentially prevent any higher-level methods/thread groups from noticing the interrupt, which may cause problems. By calling Thread.currentThread().interrupt(), you set the interrupt flag of the thread, so higher-level interrupt handlers will notice it and can handle it appropriately.
An override of the print function is needed so that everything the notebook’s cells print goes both to the notebook and to the stdout of nbrun.py.
env_dev.py, env_prod.py, etc. are standard Python files that contain variable assignments like
VIN_DECODER_ENABLED=True
VIN_DECODER_TIMEOUT_SECONDS=60
VIN_DECODER_URL="https://servername:port/vinData"
etc.