February 24, 2023

A swift guide to experiment tracking with MLflow



Data science professionals are no strangers to the arduous process of trial and error. Day to day, we run into cases where we need to keep track of branching and experimentation in ML pipelines. This quickly gets out of hand once a project grows large enough, and remembering the experiments run on an old project is nearly impossible.

Our objective is to provide reproducible results with optimized metrics for a solution. As the number of experiments on both the data and the model side grows, data scientists tend to forget the small changes they made, and reproducibility suffers greatly. This is where experiment tracking comes into play.

Table of contents:

  1. Experiment tracking
  2. Prerequisites
  3. Tracking experiments with MLflow
  4. Limitations and alternatives
  5. A few parting words

Experiment tracking

Illustration of Model Management found on neptune.ai

Experiment tracking is the process of keeping a record of all the relevant information from a Machine Learning experiment. This includes tracking source code, environment modifications, data and model changes, among others. When experimenting and iterating through data and model modifications, things quickly get out of hand and data scientists tend to forget exactly what was used for a specific run. Experiment tracking greatly improves reproducibility, organization and optimization.

You might think that there are already tools to optimize an ML model, such as hyperparameter optimization, but these processes are not automated. The first experiment tracking solution that comes to mind is a spreadsheet. But it is error-prone: the user has to log everything manually, there is no standard format for the sheet, and it carries no knowledge of the data and preprocessing steps used.

This is where MLflow comes into play. It’s an open-source platform for the machine learning lifecycle, addressing the whole process of building and maintaining models. MLflow is composed of four main modules:

  • MLflow Tracking: An API to log parameters, code, and results in machine learning experiments and compare them using an interactive UI.
  • MLflow Projects: A code packaging format for reproducible runs using Conda and Docker, so you can share your ML code with others.
  • MLflow Models: A model packaging format and tools that let you easily deploy your model (from most common ML libraries) to batch and real-time scoring on platforms such as Docker, Apache Spark, Azure ML and AWS SageMaker.
  • MLflow Model Registry: A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of MLflow Models.

We’ll focus mainly on Tracking, as it is of the most interest to data scientists. Models and the Model Registry will be touched on as well. The MLflow Tracking module lets you organize your experiments into units referred to as ‘runs’. With each run you can track:

  • Start & End Time: Start and end time of the run.
  • Source: Name of the file to launch the run, or the project name and entry point for the run if run from an MLflow Project.
  • Parameters: Key-value input parameters of your choice. Both keys and values are strings.
  • Metrics: Key-value metrics, where the value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model’s loss function is converging), and MLflow records and lets you visualize the metric’s full history.
  • Artifacts: Output files in any format. For example, you can record images (for example, PNGs), models (for example, a pickled scikit-learn model), and data files (for example, a Parquet file) as artifacts.

MLflow also automatically logs extra information such as the source code, the author, the git version of the code, and the execution time and date. The Tracking component provides APIs for Python, R and Java, as well as a REST API.


Prerequisites

We will be exploring MLflow in Python, so we need to install MLflow itself and a backend. We will use SQLAlchemy with a SQLite database as the backend store for runs.

pip install mlflow
pip install sqlalchemy

Tracking experiments with MLflow

After the installation, you can run the UI server locally with a SQLite backend store, which also enables the model registry:

mlflow ui --backend-store-uri sqlite:///mlflow.db

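This tells MLflow where to store the experiment metadata, such as runs, parameters, metrics and tags. Artifact files are kept in a separate artifact store, by default on the local filesystem. In this case the backend is a SQLite database, but it can also be a remote tracking server, a Databricks workspace, or another supported database.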

Following the link printed by the command takes us to the home screen of the UI.

With the server up and running, we can move on to experimenting. To link our code to the MLflow server, we need the tracking URI (Uniform Resource Identifier) and an experiment name. The tracking URI is the URI of our backend, which is sqlite:///mlflow.db in our case. The experiment name is the name of the task under which all the different models and runs will be grouped. To connect your code to the backend, initialize a connection by adding:

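import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("diabetes-progression")  # example name, choose your own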

If the experiment name does not exist yet, MLflow will automatically create it for you.

With the MLflow backend connected, we can start tracking. The dataset we’ll use is the scikit-learn diabetes dataset, where the target is a measure of disease progression.

We’ll create a RandomForestRegressor as our model. To start tracking, we can either use a context manager or manually start and stop runs. A context manager fits our case best, as it handles all the opening and closing logic behind MLflow.

from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(diabetes_X, diabetes_y)
model = RandomForestRegressor()

with mlflow.start_run():
	model.fit(X_train, y_train)

The next step is to log all the necessary information. For instance, we can log the parameters, set tags, and record the metadata, as well as the model and the data used.

As we might train different models in a single experiment, it’s good practice to log the model name as a tag, so we can easily search for it in the MLflow backend. To do that, we add the following inside the run:

mlflow.set_tag("model", "RandomForestRegressor")

Once logged, the tag shows up among the run’s tags in the MLflow UI.

Next, we would like to log some metrics to evaluate the experiment, along with the parameters used for the model. To do so, we can use mlflow.log_metric for metrics, and mlflow.log_param for a single parameter or mlflow.log_params for a whole dict.


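# Add to the imports at the top of the script.
import numpy as np
from sklearn.metrics import mean_squared_error

	# Continuing inside the `with mlflow.start_run():` block.
	y_hat_test = model.predict(X_test)
	y_hat_train = model.predict(X_train)

	# mean_squared_error returns the MSE, so take the square root to get the RMSE.
	train_rmse = np.sqrt(mean_squared_error(y_train, y_hat_train))
	test_rmse = np.sqrt(mean_squared_error(y_test, y_hat_test))

	mlflow.log_params(model.get_params())
	mlflow.log_metric("train-rmse", train_rmse)
	mlflow.log_metric("test-rmse", test_rmse)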

After running the experiment, the run page in the UI lists everything that was logged: all the model parameters, as well as the metrics and tags.

Now that we have a way of logging the parameters, metrics and metadata, we also want to log the data and the model to stay consistent and allow reproducibility. These two fall under artifacts. MLflow has a nice way of logging and saving models for PyTorch, Scikit-Learn, XGBoost and many others.

To save the scikit-learn model and the data, we add the following:

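from pathlib import Path

import mlflow.sklearn
import numpy as np

output_dir = Path("diabetes_artifacts")


def log_data(data, output_dir: Path, name: str) -> None:
    """Save an array locally as .npy and log the file as an MLflow artifact."""
    data_dir = output_dir / "data"
    # Create the local folder on first use.
    data_dir.mkdir(parents=True, exist_ok=True)
    data_path = data_dir / f"{name}.npy"
    np.save(data_path, data)
    mlflow.log_artifact(str(data_path), "data")


# Call these inside the `with mlflow.start_run():` block so they attach to that run.
mlflow.sklearn.log_model(model, "model")
log_data(X_test, output_dir, "x_test")
log_data(y_test, output_dir, "y_test")
log_data(X_train, output_dir, "x_train")
log_data(y_train, output_dir, "y_train")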

Artifacts are logged by path: we first save the file locally and then pass its path to MLflow, which copies it into the run’s artifact store. All the artifacts and models are shown in the lower sections of the run page in the web UI.

MLflow also provides a code snippet showing how to run any model that has been tracked, be it a PyTorch neural network or a Scikit-Learn model, through a simple API. All that’s needed is the run id: with it, we can fetch the model from MLflow and use it within a few lines of code. MLflow also gathers the requirements that were used and places them into a python_env.yaml, which lets others recreate and use your environment easily.

We are also presented with the option to register the model if we are satisfied with the results. This can further be moved to the dev stage, staging and eventually production.

With this we can easily share the model with other teams and move it forward, or compare different models.

MLflow also provides automatic logging, where it logs nearly everything there is in the run. This is not always desirable, as it adds a lot of noise, but it’s a fast way to track your experiment.

Limitations and alternatives

MLflow might seem like a suitable solution, but it has its limitations as well. The first concerns users and authentication: it does not have any notion of teams, users, or authentication, so you have to look for workarounds if you want to use it in a team.

The lack of authentication also raises security concerns, so MLflow must be used carefully and accessed through a VPN if it is linked to a deployment worker. A workaround is to use the paid version of Databricks, which includes an ML platform with MLflow integrated and provides a notion of users, teams and authentication.

The second limiting factor is data versioning. If you want full reproducibility, you have to use external data versioning tools.

Finally, there is no built-in model and data monitoring system. MLflow focuses only on experiment tracking and model management. NannyML is a great solution for post-deployment model monitoring and performance estimation.

Some of the biggest alternatives to MLflow are Neptune, Comet and Weights & Biases. All of these approaches have their ups and downs, and each is best suited to specific use cases. Among them, MLflow is the only open-source solution that is free to use. It provides a rich interface for experiment tracking and has a community behind it that keeps updating and patching it.

A few parting words


I hope this short tutorial and brief introduction to MLflow has helped you get started with modern tools for experiment tracking, instead of logging everything on a piece of paper or in Excel. I recommend checking the official MLflow documentation, which is friendly and easy to follow. Happy experimenting, and be sure to follow us to keep up with advances in AI, tutorials for the tools data scientists use every day, and lots of other fun and educational content.

Thank you for reading!


[1] https://mlflow.org/docs/latest/index.html

[2] https://neptune.ai/blog/ml-experiment-tracking

[3] https://www.element61.be/en/resource/4-ways-how-mlflow-can-facilitate-your-machine-learning-development

[4] https://www.databricks.com/product/managed-mlflow

[5] https://www.nannyml.com/

