
The world of MLOps is a tangled web of tools and services.
We have tools that perform model and data tracking, versioning, experiment management, CI/CD pipelines, continuous learning, and so on. There’s no shortage of tools that do all sorts of complex tasks.
One of the major tasks after training a custom ML model on your data is performing inference in production.
It’s a task that many of these services complicate to the extent that companies have to hire multiple engineers just to manage the deployment pipelines for their models.
Which is where this idea comes in: making the process of getting ML models inference-ready quicker and easier.
A way of making ML models ready for inference in a matter of minutes.
A deployment pipeline that just works.
The entire premise of this venture is to reduce the time it takes to get a model into production.
Let’s see how to do that. Here’s my rough workflow diagram:
The rough diagram above shows a minimum viable prototype of what should happen when a custom ML model needs to be deployed.
Here are some of the steps I’ve imagined should occur:
The model upload needs to be quick and without fuss (a drag-and-drop frontend)
The service needs to have some sort of automatic versioning and updates built in
After uploading the model, the user should have no other tasks to perform
API endpoints for performing inference on the model need to be available as a result of the upload (a rough sketch of that flow follows this list)
The user takes those endpoints and uses them however she likes in her app’s backend or frontend.
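To make that flow concrete, here’s a minimal sketch of what the upload endpoint could look like on the backend. It assumes FastAPI, local disk storage, and a hypothetical deploy_model helper (covered later); none of the names or paths are fixed.

```python
# A minimal sketch of the upload flow, assuming FastAPI.
# `deploy_model` is a hypothetical helper that would version the model
# and spin up an inference container (more on that further down).
import os
from uuid import uuid4

from fastapi import FastAPI, File, UploadFile

app = FastAPI()


@app.post("/models")
async def upload_model(model_file: UploadFile = File(...)):
    model_id = str(uuid4())  # naive automatic versioning: one id per upload

    # Persist the uploaded model artifact (object storage would work too).
    os.makedirs("models", exist_ok=True)
    with open(f"models/{model_id}.bin", "wb") as f:
        f.write(await model_file.read())

    # Hypothetical: build and run an inference container for this model,
    # e.g. endpoint = deploy_model(model_id), then hand the endpoint back.
    return {
        "model_id": model_id,
        "inference_endpoint": f"/models/{model_id}/predict",
    }
```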
With me so far?
Building a small prototype
Now that we’ve seen some of the features for the service, what are the most difficult tasks to perform?
the frontend for uploading models is simple enough.
the backend to automatically create API endpoints needs some thought.
There needs to be a way for:
the backend to know what kind of model was uploaded
the backend to then take an inference script from the user or build one itself
the backend to then generate and run an API that performs inference from the uploaded model
Two of the easiest solutions to this are:
defining inference scripts for different kinds of tasks and keeping them ready to use in the backend
getting an inference script defined in a specific format from the user
The two solutions come with their own pros and cons. It’s not easy to see which one will be simpler to build.
If we go with the first approach, we need to have a list of different models/tasks listed on the frontend that will make it easy for the backend to know which inference script to kick off.
If we go with the second approach, we need to define a “spec” for the script that the user needs to upload along with the model.
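As an illustration of the second approach, the “spec” could be as small as a single Python file that exposes a class with two agreed-upon methods. This is only a sketch of one possible contract; the names ModelHandler, load, and predict (and the pickle/scikit-learn-style model) are assumptions, not an existing standard.

```python
# inference.py -- one possible "spec" for a user-supplied inference script.
# The backend only relies on the agreed-upon load() and predict() methods.
import pickle
from typing import Any, List


class ModelHandler:
    def load(self, model_path: str) -> None:
        """Load the serialized model from the path the backend provides."""
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def predict(self, inputs: List[List[float]]) -> List[Any]:
        """Run inference on a batch and return JSON-serializable output."""
        # Assumes a scikit-learn-style model; users adapt this to their framework.
        return self.model.predict(inputs).tolist()
```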
Anyway, let’s move on.
Once we do have a script, we can make a dynamic API.
How do we do that?
Docker.
The steps can be:
define API scripts that can load models and read inputs to return predictions (a sketch follows this list)
make a dynamic docker image from the backend and run a container with the API
return the endpoints to the user to perform inference
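Here’s a minimal sketch of what such a generic API script could look like, assuming FastAPI and the hypothetical ModelHandler contract from earlier; the container would run this with the model path passed in through an environment variable.

```python
# serve.py -- a generic inference API the container runs for an uploaded model.
# Assumes the hypothetical ModelHandler contract sketched earlier.
import os
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from inference import ModelHandler  # the user's (or backend's) inference script

MODEL_PATH = os.environ.get("MODEL_PATH", "/models/model.bin")

handler = ModelHandler()
handler.load(MODEL_PATH)

app = FastAPI()


class PredictRequest(BaseModel):
    inputs: List[List[float]]


@app.post("/predict")
def predict(request: PredictRequest):
    # Delegate to the handler and return JSON-serializable predictions.
    return {"predictions": handler.predict(request.inputs)}
```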
Here’s a very simple, very poor demo:
There are two obvious ways to use Docker here:
spin up a custom image from the inference script with the Docker API (sketched below), or
use something like s2i (source-to-image) automation, which doesn’t need a Dockerfile at all.
I haven’t tested the latter yet, but it seems like a less tightly-coupled option compared to defining custom Dockerfiles for each and every inference script.
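For the first option, a rough sketch with docker-py (the Python Docker SDK) could look like the following; the image tag, build context layout, and port mapping are placeholder choices, and the build context is assumed to contain a Dockerfile that copies in the API script, the inference script, and the model.

```python
# A sketch of the "custom image per inference script" option using docker-py.
import docker

client = docker.from_env()


def deploy_model(model_id: str, build_context: str, host_port: int) -> str:
    # Build a custom image for this model (the slow, cold-start step).
    image, _build_logs = client.images.build(
        path=build_context,
        tag=f"inference-{model_id}:latest",
    )

    # Run the container, mapping the API's port to a free host port.
    client.containers.run(
        image.id,
        name=f"inference-{model_id}",
        ports={"8000/tcp": host_port},
        detach=True,
    )

    # Hand the resulting endpoint back to the user.
    return f"http://localhost:{host_port}/predict"
```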
The complexity will change as requirements become larger
This app is a very simple MVP born of an intrusive thought: it should be way easier to deploy ML models to production than it currently is.
There are so many details to uncover and so many things to improve such as:
a Docker image takes a long time to build (it’s a cold start problem)
replacing a running container with another and having it correctly versioned is not as simple as it looks on paper
which moving parts to include and which to leave out is a great question to answer. Questions like whether users should be able to define their own endpoints, and whether inference scripts should be hardcoded or generated dynamically, will take some experimentation to figure out.
Thanks for reading!