Getting Started with MLOps
MLOps is a hot topic in the data science world at present, but adopting this philosophy can be an intimidating task. There is an abundance of tools and frameworks out there that claim to make the transition easy, so how does one know what’s actually going to be useful? In this post, I’ll demystify the situation a little by highlighting the problems we’re actually trying to solve with an MLOps workflow. I’ll do this by taking you through a “case study” of a fictitious company, which will hopefully give you a sense of how to solve some common problems that plague early-stage data science teams.
A quick note about the approach I’ve taken with this case study: adopting an MLOps culture can start with new processes or with new tooling. Process should always inform the choice of tooling, not the other way around, so the focus of this case study is the workflow. We want to introduce new processes in small, digestible ways to maximise the chance of them actually being adopted. So at each stage, I’ll explain the problem and how we’ll solve it, and provide just enough tooling to support that change.
There’s no one-size-fits-all with this sort of thing, so I’ve been fairly opinionated in my choice of implementation — it’s easier to follow this with concrete examples. The solutions should translate well to any number of environments and applications, though. The important takeaway is recognising the problems and general solution approach.
Here’s our imaginary situation:
Daintree™ is an online store which has just begun to use data science in its operations. A few POCs have been implemented, and it’s becoming clear that there is business value in deploying Machine Learning models, so they’re looking to scale up those operations. The business runs as a microservice architecture, and services are run on a Kubernetes cluster. They use Python as their primary model development language.
One of the models their data scientists have developed enables more effective product recommendations by combining collaborative filtering data with the Word2Vec algorithm. Here’s a quick-and-dirty implementation that uses a dataset from a different online retailer named after a rainforest:
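A minimal sketch of the idea, assuming the review dump is line-separated JSON with `reviewerID` and `asin` fields (the field names and the gensim parameters are my assumptions, not a definitive implementation):

```python
import json
from collections import defaultdict

def load_sequences(path):
    """Group product IDs (ASINs) by reviewer, in the order they appear.

    Each reviewer's history becomes one 'sentence' of items: this is the
    collaborative-filtering signal that Word2Vec trains on.
    """
    history = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            history[record["reviewerID"]].append(record["asin"])
    # Sequences of length 1 carry no co-occurrence information.
    return [items for items in history.values() if len(items) > 1]

# Training is then a one-liner with gensim (parameters are illustrative):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=load_sequences("reviews.json"),
#                    vector_size=100, window=5, min_count=2, sg=1)
#   model.wv.most_similar("<some-asin>")  # products 'similar' to a given item
```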
The data science team developed this model by experimenting in notebooks on their laptops, and then handing the code off to the ops team for deployment. At this point, the team structure looks like this:
Improvement 1: Reproducibility
At present, the handoff between the data science team and the ops team is a pain point. The ML experiments are all ad hoc, which gives the data scientists a lot of freedom and flexibility, but makes it hard to actually operationalise any of the results. The main problem boils down to reproducibility: it’s difficult to share logic and dependencies, so there is no guarantee that the code will actually run on the ops team’s machines, or on a server. The ops team are aware of the value of containerisation: the rest of the microservices run on Kubernetes, after all.
The data scientists are given the responsibility of containerising their experiments. This means much of the heavy lifting of actually deploying the app as a service is already done by the time it reaches ops.
It’s important to make this transition easy, so we want to provide our data scientists with as many resources as possible to encourage containerisation by default. The following minimal Dockerfile is a good start:
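Something along these lines, assuming the entrypoint script is called `main.py` and dependencies live in `requirements.txt` (both names are placeholders):

```dockerfile
FROM python:3.8-slim

WORKDIR /application

# Install dependencies first so this layer is cached between code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
ENTRYPOINT ["python", "main.py"]
```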
All the image tags are `dev`, because these scripts are designed to be run locally. When model versions are finalised, the tags should be updated to a more unique identifier. The git commit hash is a good choice! To this end, githooks can be set up to do all the building and pushing automatically on every git commit. A script like this might work as a post-commit hook:
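For example (the registry URL and image name are placeholders for your own):

```shell
#!/bin/sh
# .git/hooks/post-commit: build and push an image tagged with the commit hash.
REGISTRY="daintree.container-registry.example/ml"
IMAGE="item2vec"
TAG="$(git rev-parse --short HEAD)"

docker build -t "$REGISTRY/$IMAGE:$TAG" .
docker push "$REGISTRY/$IMAGE:$TAG"

# Keep :latest pointing at the most recent commit's image.
docker tag "$REGISTRY/$IMAGE:$TAG" "$REGISTRY/$IMAGE:latest"
docker push "$REGISTRY/$IMAGE:latest"
```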
The team that created the Item2Vec model above would build their container images with `docker build -t daintree.container-registry.example/ml/item2vec:dev .` while still in development. Any other member of the team should be able to `docker pull daintree.container-registry.example/ml/item2vec:latest` or `docker pull daintree.container-registry.example/ml/item2vec:<commit-hash>` and run a working version of the project in a container locally if they want to use it themselves.
It’s really important that the data science team feels like they are actually deriving value from this change (and any of the suggested changes in this post). We also want to avoid introducing too much too soon. Improving the reproducibility of the code is a great place to start — it’s digestible, and should hopefully demonstrate its value early on.
Containers are just one way to achieve this goal: Python virtual environments or VMs serve a similar purpose (although I’d argue that virtual environments should be used with containers). Containerisation is my preferred route, but choose whatever fits your business best.
The team structure now looks like this:
Improvement 2: Modularity
Daintree has been collecting data for a long time, and their storage pattern is fairly mature — data is stored in `.csv` files in their preferred cloud storage solution. In the interests of keeping this blog post cloud-agnostic, let’s say they run MinIO in their Kubernetes cluster. Data is being loaded from roughly the same sources for every experiment, and even a lot of the preprocessing and cleaning steps are repeated. All the logic to perform these steps is being rewritten again and again, and we’d much prefer single unified implementations.
We want our data scientists to shift their thinking from a “monolithic” architecture to a “pipeline” architecture. This is a significant shift in thought process, but is often the mark of a mature data science team.
In practice, one way to facilitate this is to containerise individual modules of data processing logic, and run experiments as a sequence of container tasks. The containers need to share input and output data, so we could mount a volume at some agreed-upon location, and read/write data from/to that location. It’s important, then, that the file format of the data is consistent, so that there is a standard interface by which pipeline components communicate with one another. Given that the existing data storage format is `.csv` files, this seems as good a format to use as any.
We should provide the team with templates to make adopting this mentality as easy as possible:
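A minimal template might look like the following; the directory conventions (`/application/data/input` and `/application/data/output`) and the use of the standard library’s `csv` module are my assumptions:

```python
import csv
import os

# Agreed-upon mount points inside the container (illustrative).
INPUT_DIR = "/application/data/input"
OUTPUT_DIR = "/application/data/output"

def transform(rows):
    """Component-specific logic goes here; the template passes data through."""
    return rows

def run(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR):
    """Read every CSV in the input directory, transform, write to output."""
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(input_dir):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(input_dir, name), newline="") as f:
            rows = transform(list(csv.DictReader(f)))
        if not rows:
            continue
        with open(os.path.join(output_dir, name), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```

Each component copies this template and fills in its own `transform`.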
It’s worth noting that it’s just the intermediate representation that needs to be uniform. Connectors can be written for any number of data sources to perform the initial conversion to the common interface.
The Item2Vec model has fairly well-defined stages: loading, cleaning, preprocessing and training. These would be good candidates for a series of modules. For example, the data cleaning module might look like:
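For instance, sketched with the standard library (the field names are assumptions about the review dataset):

```python
def clean(rows):
    """Drop records missing the fields Item2Vec needs, and de-duplicate.

    Expects a list of dicts (as produced by csv.DictReader) and returns
    a cleaned list in the same shape, preserving order.
    """
    required = ("reviewerID", "asin")
    seen = set()
    cleaned = []
    for row in rows:
        # Discard rows with missing or empty required fields.
        if not all(row.get(field) for field in required):
            continue
        # Discard duplicate (reviewer, item) pairs.
        key = (row["reviewerID"], row["asin"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

This would slot into the `transform` hook of the component template.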
However, the dataset we load for this project is actually a `.json` file that we download from a URL, so it doesn’t fit this pattern. The benefit of templates is that they aren’t rigid — we can adapt the template to work for this use case:
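Adapted along these lines (the helper names are illustrative, and the download step is kept separate from the conversion):

```python
import csv
import json
import urllib.request

def download_jsonl(url):
    """Fetch a line-separated JSON dataset: one JSON object per line."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8").splitlines()

def jsonl_to_csv(lines, out_path):
    """Convert JSON lines to the pipeline's common CSV interface."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Union of keys across records, in a stable order, as the CSV header.
    fieldnames = sorted({key for rec in records for key in rec})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
```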
And voilà, we’ve created a general data connector for downloading line-separated `.json` files and converting them to our common file format. Push this as a containerised app, and everyone can use the same implementation of this logic.
As mentioned above, our containers now need to mount volumes to be able to share data between stages. We can update the provided `docker run` one-liner:
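For instance (with the image name carried over from the earlier build step):

```shell
docker run \
  -v "$(pwd)/data:/application/data" \
  daintree.container-registry.example/ml/item2vec:dev
```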
Now any data placed in the local `data` directory will be available to the script.
We could also improve the structure of Daintree’s ML git repository to reflect this new pipeline architecture:
```
DaintreeML
└── components
    ├── common
    ├── project1
    │   ├── task1
    │   └── task2
    └── project2
```
We could fairly easily amend the earlier build-and-push script to operate on every subdirectory, and push new containers for each component that has been changed as part of a CI/CD process.
Of all the changes I’m proposing in this post, this one is probably the most dramatic. But it’s also the one from which you’ll likely see the most value. Adopting a pipeline architecture for your ML experiments is crucial to ongoing success — every ML experiment can build on the resources of previous ones, which should dramatically speed up iteration in your team.
Again, containers are perhaps not strictly necessary for this; you could play around with creating `pip` packages for your library code as an alternative, although containers are my preferred solution.
This is the new structure:
Improvement 3: Centralised Caching
The large datasets required by ML applications mean each run of the pipeline can take hours to complete from start to finish. This only serves to slow down the iteration speed of the data science team, especially when they’re usually only developing one component at a time. To speed up the development cycle, it makes sense to cache intermediate results and only run the sections of the pipeline that have changed. Thanks to the changes above, the team are sharing logic between themselves, but it’s likely that many of their intermediate results are also reusable. Hence, instead of mounting docker volumes locally, we should organise a central repository to store input and output data.
A good choice for this is a cloud storage service like S3 or GCS. As I mentioned above, Daintree uses MinIO as their storage service. We should create a bucket in which to store our ML artifacts — let’s call it `ml-artifacts` — and come up with a sane prefixing strategy so that it’s easy to find what we’re looking for, say one that keys objects by project, pipeline run, and component.
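One illustrative layout (my assumption, not the only sensible choice):

```
ml-artifacts/<project>/<pipeline-run-id>/<component>/input/...
ml-artifacts/<project>/<pipeline-run-id>/<component>/output/...
```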
The script template becomes:
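Sketched with the `minio` client library; the endpoint, bucket name, and credential handling are illustrative, and the client is passed in explicitly so components stay easy to test:

```python
import csv
import io
import os

BUCKET = "ml-artifacts"  # the artifact bucket created above

def make_client():
    # Imported lazily so the read/write helpers work without minio installed.
    from minio import Minio
    return Minio(
        "minio.daintree.internal:9000",  # illustrative in-cluster endpoint
        access_key=os.environ["MINIO_ACCESS_KEY"],
        secret_key=os.environ["MINIO_SECRET_KEY"],
        secure=False,
    )

def read_csv(client, key):
    """Fetch a CSV object from the artifact bucket as a list of dicts."""
    data = client.get_object(BUCKET, key).read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(data)))

def write_csv(client, key, rows):
    """Serialise rows to CSV and upload them to the artifact bucket."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    payload = buf.getvalue().encode("utf-8")
    client.put_object(BUCKET, key, io.BytesIO(payload), len(payload))
```

A component then reads its input with `read_csv`, transforms it, and writes its output with `write_csv`, using keys that follow the prefixing scheme.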
NB: MinIO’s API is drop-in compatible with Amazon S3, which is where the `s3://` prefix comes from.
The run command now no longer requires the `-v data:/application/data` flag, but does require environment variables to be set, i.e. `-e MINIO_ACCESS_KEY -e MINIO_SECRET_KEY`.
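A local run might then look something like:

```shell
docker run \
  -e MINIO_ACCESS_KEY \
  -e MINIO_SECRET_KEY \
  daintree.container-registry.example/ml/item2vec:dev
```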
Daintree’s data scientists can now more easily share inputs and outputs of their processing logic with one another. If a particular team member is developing one stage in isolation, they are able to bootstrap their development by using the last cached inputs and outputs. This is also useful for debugging purposes — if a stage isn’t producing expected results, there is an auditable trail of input/output pairs that may be used to determine under what conditions the bug appears.
You might notice that the prefixing scheme isn’t actually enforced in the template. Sure, it’s important to have consistency across experiments, but there’s also every possibility that the original structure isn’t actually the best choice, and the team might decide down the track that they’d like to change it. Since the template gets copied and reused again and again, an enforced prefixing scheme would make any later change a mammoth task of updating the code of every component in the repo. My preference is to trust the team to follow the standard, at least initially.
Now our team looks like this:
Improvement 4: Scalability
The data science team are running training and inference locally on their laptops at present. This works fine for comparatively small datasets and jobs, but there comes a point where it’s necessary to scale up the experiment. To achieve this, there should be a mechanism for the data scientists to offload tasks to an elastically scalable pool of compute resources.
Given that the company uses Kubernetes, a good solution is to schedule processing stages as pods on the cluster. The stages form cohesive pipelines, so let’s deploy a workflow manager to run experiments end-to-end. Argo Workflows is Kubernetes-native, and has a fairly simple operating model, so it’s my preferred choice, but there are plenty of alternatives (AWS Step Functions, Luigi, Airflow, Google Cloud Workflows).
Argo Workflows supports configuring an artifact repo where intermediate results between workflow steps can be stored. This sounds an awful lot like the MinIO repo set up in the previous step, so Daintree configures their MinIO service as the artifact repo. With this change, they also no longer need to explicitly interface with the MinIO API when running pipelines on Argo. They update the previous template to provide a command line option to store inputs and outputs ‘locally’, which they’ll use when running pipelines on Argo.
Argo declares workflows as Kubernetes resources, but we’d really prefer not to have to train the entire team in `.yaml` syntax and the Kubernetes operating model. If we make a few basic assumptions around what most of the pipelines will look like, we can write a script that does most of the heavy lifting.
Let’s assume that the pipelines are sequential, without any parallel steps, and each step passes a single directory of data to the next. We’ll let the user configure the path to the original data source, and the path to which the output of the final stage will be written.
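A sketch of such a script under those assumptions; the manifest shape follows Argo’s Workflow schema, but the artifact names and paths are illustrative and should be treated as a starting point:

```python
import json
import sys

def build_workflow(config):
    """Translate a simple sequential pipeline config into an Argo Workflow.

    Assumes each step reads /application/data/input and writes
    /application/data/output, matching the component template.
    """
    names = [step["name"] for step in config["steps"]]
    step_entries = []
    templates = []
    for i, step in enumerate(config["steps"]):
        input_artifact = {"name": "data", "path": "/application/data/input"}
        if i == 0:
            # The first step pulls the raw data straight from the artifact repo.
            input_artifact["s3"] = {"key": config["input"]}
        output_artifact = {"name": "data", "path": "/application/data/output"}
        if i == len(names) - 1:
            # The final step's output is written back at a well-known key.
            output_artifact["s3"] = {"key": config["output"]}
        templates.append({
            "name": step["name"],
            "inputs": {"artifacts": [input_artifact]},
            "outputs": {"artifacts": [output_artifact]},
            "container": {"image": step["image"]},
        })
        entry = {"name": step["name"], "template": step["name"]}
        if i > 0:
            # Wire the previous step's output artifact into this step's input.
            entry["arguments"] = {"artifacts": [{
                "name": "data",
                "from": "{{steps.%s.outputs.artifacts.data}}" % names[i - 1],
            }]}
        step_entries.append([entry])
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": config["name"] + "-"},
        "spec": {
            "entrypoint": "pipeline",
            "templates": [{"name": "pipeline", "steps": step_entries}] + templates,
        },
    }

if __name__ == "__main__" and len(sys.argv) > 1:
    # Argo (like kubectl) also accepts JSON manifests, which avoids a YAML
    # dependency: python build_workflow.py pipeline.json > workflow.json
    with open(sys.argv[1]) as f:
        print(json.dumps(build_workflow(json.load(f)), indent=2))
```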
It’s a pretty quick-and-dirty script, but ought to work for the overwhelming majority of cases. The pipeline itself is configured with a `.json` file. The config for the Item2Vec project would look like this:
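For instance (the image names, prefixes, and `<commit-hash>` placeholder are illustrative):

```json
{
  "name": "item2vec",
  "input": "item2vec/raw/reviews.json",
  "output": "item2vec/models/",
  "steps": [
    {"name": "load", "image": "daintree.container-registry.example/ml/item2vec-load:<commit-hash>"},
    {"name": "clean", "image": "daintree.container-registry.example/ml/item2vec-clean:<commit-hash>"},
    {"name": "preprocess", "image": "daintree.container-registry.example/ml/item2vec-preprocess:<commit-hash>"},
    {"name": "train", "image": "daintree.container-registry.example/ml/item2vec-train:<commit-hash>"}
  ]
}
```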
The neat thing is that these pipeline specs could be stored in an entirely different git repository to the components: the glue between them is the container registry. In fact, it’s not a bad idea to keep component logic and pipeline config separate, because pipelines won’t necessarily reflect the latest versions of the component code.
We could (and should) encourage the data scientists to gradually familiarise themselves with Argo’s API, and make small changes to the generated workflow document. Eventually, it would make more sense for the Workflow manifests to be written manually — this script should be considered nothing more than a tool to make the transition easier.
Once a valid workflow manifest has been created, the data science team submits it to the cluster with `argo submit workflow-manifest.yaml`, and can access the Argo UI to monitor execution. The basic idea here is that the team should still experiment locally with small subsets of the data, and when ready to train on the entire dataset, the pipeline can be run with the workflow manager. This means they are no longer as constrained by data sizes, or the time taken to run the pipeline.
The team now no longer needs to interact with the registry directly, but just send requests to the compute cluster:
It’s a completely made-up example, but hopefully some or all of these problems seem familiar.
To summarise what we achieved here, the data scientists are now:
- Containerising their application code
- Versioning models according to git commit
- Structuring ML projects as pipelines of discrete components
- Caching results in a centralised repository of data
- Running large-scale pipelines in a shared compute cluster
- Much more effectively sharing code and data
Beyond experimentation, improvements like these also enable more effective monitoring and deployment of models. Possible next steps for building on this framework include specifying a common interface and template for APIs that serve model results; or using a workflow engine to automatically retrain the model based on some sort of event trigger.
It’s worth mentioning that there are existing frameworks and toolkits that provide similar functionality to what we’ve ended up with in this example. In fact, this architecture is effectively a lightweight implementation of Kubeflow Pipelines, which actually uses both MinIO and Argo workflows under the hood. However, all the tooling in the world won’t help if the processes aren’t in place to use it effectively. The point of these building blocks is each one introduces new (and possibly unfamiliar) process to a data science team in a digestible way.
The tooling is deliberately basic, so that it hides as little complexity as possible. This way, the team has the flexibility to change anything that isn’t working for them. New process should never be too prescriptive — if kept flexible, eventually the workflow settles to a point where it’s clear what is and isn’t actually useful. That’s the point where it’s easiest to assess the value of different frameworks, and perhaps consider deploying something a bit more industrial-strength.
So if you’re wanting to elevate the operations of your ML workflow, consider starting by taking these steps (or something similar to them). If you’d like some help with implementing tooling or process in your data science team, get in touch with Eliiza, and we might be able to help you out.