Heads-up for Deploying Scikit-learn Models to Production: Quick Checklist


These are just heads-ups specific to scikit-learn, not a full workflow you can follow.

Delivering accurate model predictions in production and at scale is a large subject in its own right, requiring things like model monitoring and logging.

General workflow

[Figure: General workflow for deploying a trained sklearn model into production]

Cached data

Be careful with cached information.

We normally cache models, processed data and other things that take too much time to build.

It's important to have checks to make sure cached data matches what we are actually expecting.
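One possible way to implement such a check is to store a small manifest next to each cached file and verify it on load. The sketch below is only an illustration: the helper names (`save_with_manifest`, `load_checked`), the manifest layout and the column names are all hypothetical, not part of any library.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def manifest_path(path: Path) -> Path:
    # Hypothetical convention: the manifest sits next to the cached file.
    return path.with_name(path.name + ".manifest.json")


def save_with_manifest(obj, path: Path, columns: list) -> None:
    """Pickle obj and record a checksum plus the columns it was built from."""
    payload = pickle.dumps(obj)
    path.write_bytes(payload)
    manifest = {"sha256": hashlib.sha256(payload).hexdigest(), "columns": columns}
    manifest_path(path).write_text(json.dumps(manifest))


def load_checked(path: Path, expected_columns: list):
    """Load the cached object only if it matches what the caller expects."""
    payload = path.read_bytes()
    manifest = json.loads(manifest_path(path).read_text())
    if hashlib.sha256(payload).hexdigest() != manifest["sha256"]:
        raise ValueError("cached file does not match its manifest")
    if manifest["columns"] != expected_columns:
        raise ValueError("cached data was built from different columns")
    return pickle.loads(payload)


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "features.pkl"
    save_with_manifest({"rows": [[1, 2]]}, cache_file, ["age", "income"])
    restored = load_checked(cache_file, ["age", "income"])
    try:
        # The code now expects an extra column, so the cache is stale.
        load_checked(cache_file, ["age", "income", "city"])
        stale_detected = False
    except ValueError:
        stale_detected = True
```

The same idea extends to other expectations, such as row counts or the library versions used to build the cache.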

Asserts

Asserts have their place.

Python is not a statically-typed language so there will be times when simple mistakes that could be caught at compile-time will leak into run-time.

  • Asserts can check the sanity of your data and results.

  • Asserts can be used to filter cases where you have syntactically valid data which makes no sense.

    • In other words, cases where your models would still deliver predictions, but probably nonsensical ones.
  • Asserts help inform the reader about what some specific piece of code does, and what the underlying assumptions are.

    • In other words, asserts are useful documentation too.
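As a rough illustration of the points above, the sketch below wraps a prediction call in a few asserts. The `age` feature, its plausible range and the stand-in `dummy_model` are assumptions made up for this example. Note that Python strips asserts when run with the `-O` flag, so checks that must always run should raise explicit exceptions instead.

```python
def dummy_model(rows):
    # Hypothetical stand-in for a fitted model: one score per input row.
    return [0.5 for _ in rows]


def predict_with_checks(model, rows):
    # Input sanity: syntactically valid rows can still be nonsensical.
    assert len(rows) > 0, "expected at least one row to score"
    for row in rows:
        assert 0 <= row["age"] <= 120, f"implausible age: {row['age']}"

    scores = model(rows)

    # Output sanity: these also document what callers can rely on.
    assert len(scores) == len(rows), "one score per input row"
    assert all(0.0 <= s <= 1.0 for s in scores), "scores are probabilities"
    return scores


scores = predict_with_checks(dummy_model, [{"age": 42}, {"age": 7}])
```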

Namespaces and pickled objects

If you pickle a classifier or a pipeline that uses custom classes and other resources, these must also be available (importable under the same module path) at inference time.

For example, if you use custom steps in a pipeline, external data or things like that for training your model, these must also be available at inference time.
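The reason is that pickle stores a reference to the class (its module and name) rather than the class's code. A minimal sketch, using a hypothetical `TextCleaner` class as a stand-in for a custom pipeline step:

```python
import pickle


class TextCleaner:
    """Hypothetical custom preprocessing step."""

    def clean(self, text):
        return text.strip().lower()


payload = pickle.dumps(TextCleaner(), protocol=4)

# The payload names the class, but does not contain the method body.
class_name_stored = b"TextCleaner" in payload
method_code_stored = b"strip" in payload

# Loading works here only because TextCleaner is defined in this process.
restored = pickle.loads(payload)
```

In a serving process where the class cannot be imported from the same module, the load step fails instead, so the module that defines the class needs to ship together with the pickled model.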

Training vs. prediction time

Data preprocessing at prediction time must be exactly the same as at training time.

A mismatch between the two is called training/serving skew.

If you require lots of preprocessing, such as extracting features from text, creating artificial features from incoming data and/or processing categorical data, you need to make sure the exact same process is done at prediction time!

Important: Wrap the whole preprocessing/classification code into just a few methods and call those at training and prediction times.

  • This is especially true for categorical data and one-hot encoding with pd.get_dummies():

    • Categorical data must be encoded into dummy variables using the very same mapping at training and at inference time!
    • Always fix the category list before calling pd.get_dummies() (for example, by casting the column to a categorical dtype with explicit categories), to make sure you are encoding categorical data with dummy variables in the very same positions as you did at training time.
    • Always use dummy_na=True, so that missing values get their own column.
  • For vectorizing text:

    • Unseen text must be vectorized with the vectorizer fitted on the training data!
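The categorical-data advice above could be sketched as follows. The `color` column, its category list and the `preprocess` helper are hypothetical. The idea is that one function, with one frozen category list, is called at both training and prediction time.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical category list, frozen at training time and stored with the model.
COLOR_DTYPE = CategoricalDtype(categories=["red", "green", "blue"])


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Single preprocessing entry point shared by training and serving code."""
    out = df.copy()
    # Casting to a fixed categorical dtype pins both the set and order of dummies.
    out["color"] = out["color"].astype(COLOR_DTYPE)
    return pd.get_dummies(out, columns=["color"], dummy_na=True)


train = pd.DataFrame({"size": [1.0, 2.0, 3.0], "color": ["red", "green", "blue"]})
serve = pd.DataFrame({"size": [4.0, 5.0], "color": ["green", None]})

train_features = preprocess(train)
serve_features = preprocess(serve)

# Same columns in the same positions, even though serve has fewer categories.
assert list(train_features.columns) == list(serve_features.columns)
```

Without the cast, `pd.get_dummies()` would create columns only for the values present in each dataframe, so the serving-time feature matrix could end up with a different shape than the one the model was trained on.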

Dependencies

Watch out for library versions.

When you train models, you use specific versions of libraries like numpy, pandas or scikit-learn.

If someone runs the very same code you wrote using slightly different library versions, things may break or, worse, silently produce different results.

  • Always make evident what library versions you are using:

    • pin dependency versions
    • print pandas.__version__, np.__version__ and so on at the top of notebooks
    • define versions in a requirements.txt file
    • encode the whole environment (OS-level stuff too) in a Dockerfile or something like that.
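Beyond printing versions, a startup check can compare what is installed against what the model was trained with. The sketch below uses the standard-library `importlib.metadata`. The helper name and the pinned version numbers are hypothetical placeholders, not recommendations.

```python
from importlib import metadata


def find_version_mismatches(pinned: dict) -> dict:
    """Return {package: (expected, installed)} for every package that differs."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# Hypothetical pins recorded at training time.
PINNED = {"numpy": "1.26.4", "pandas": "2.2.2", "scikit-learn": "1.4.2"}

for name, (expected, installed) in find_version_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

Whether a mismatch should only be logged or should abort startup is a design choice that depends on how strict the deployment needs to be.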

Missing data

Watch out for NULLs and missing data.

It's very common for there to be missing data at inference time.

Most sklearn classifiers will throw errors if your data contains None or np.nan values, so you must remove or fill them before predicting.

  • Always use fillna() on each dataframe column with suitable default values such as:

    • "" empty string for text data
    • 0 or some other marker for numerical data
    • Missing categorical data should be handled by pd.get_dummies(), with dummy_na=True
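A minimal sketch of the per-column defaults for text and numerical data, assuming two made-up columns (`review_text` and `num_purchases`); the right default values depend on the model and the data:

```python
import numpy as np
import pandas as pd

# Hypothetical per-column defaults, decided at training time.
DEFAULTS = {"review_text": "", "num_purchases": 0}


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same defaults at training and at inference time."""
    return df.fillna(value=DEFAULTS)


incoming = pd.DataFrame(
    {
        "review_text": ["great product", None],
        "num_purchases": [3.0, np.nan],
    }
)

filled = fill_missing(incoming)

# Fail early if anything is still missing before the model sees the data.
assert not filled.isna().any().any()
```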

Log predictions

You must log all predictions made by a model, be it real-time or batch.

At a very minimum, you must enable logging for:

  • Input Features (features for each item processed)

  • Model output (e.g. scores, classes)

  • Scoring time (the exact timestamp the model was called)
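The three items above could be captured in one structured log record per prediction. This is only a sketch using the standard library: the logger name, record layout and example values are assumptions, and a real system might write to a database or message queue instead.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("predictions")  # hypothetical logger name


def log_prediction(features: dict, output: dict) -> dict:
    """Log one prediction as a JSON line and return the record."""
    record = {
        "scored_at": datetime.now(timezone.utc).isoformat(),  # scoring time
        "features": features,  # input features for this item
        "output": output,  # model output, e.g. class and score
    }
    logger.info(json.dumps(record))
    return record


record = log_prediction(
    features={"age": 42, "plan": "premium"},
    output={"label": "churn", "score": 0.87},
)
```

Logging the features exactly as the model received them (after preprocessing) makes it easier to reproduce a given prediction later.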