Starting Your Data Science Project With Metaflow? The MNIST Use-Case.

Ayotomiwa Salau · Published in Geek Culture · May 2, 2021


The priority of data scientists lies in picking out the right features and building and deploying their models. They do not want to be particularly bothered with other aspects such as model versioning, job scheduling, flow architecture, and compute resource management, all of which are needed to make operationalizing data science successful.

This is where Metaflow comes in.

Metaflow is an open-source tool by Netflix for managing data science workflows. It aims to boost the productivity of data scientists by allowing them to focus on actual data science work and by facilitating faster productionization of their deliverables.

If you are familiar with Airflow or Luigi, then you will understand the function of Metaflow. It allows you to run your data science process in steps: each step is a node in the process, and the nodes are connected like a graph.

This structure is called a DAG (Directed Acyclic Graph). In the hypothetical example used in this post, the flow trains two versions of a model in parallel and outputs the highest accuracy score.

One other interesting attribute of Metaflow is that it ships with a lightweight service that provides a centralized place to inspect and track all your flow executions. By default, Metaflow uses a local directory to keep track of all metadata on executions from your laptop, while the data generated at each step of a run is stored as data artifacts.

You can use a local Jupyter notebook to interact with data artifacts from all your previous executions as well as currently running ones. However, deploying the Metaflow service alongside Amazon S3 as a datastore is helpful if you would like to share results with your peers and track your work without fear of losing any state.

As we proceed, we will see an example of executing a flow with Metaflow and of inspecting data artifacts.

Basic components of Metaflow

A Metaflow DAG essentially consists of a flow, steps, and transitions.

  • Flow: the instance that manages all the code for the pipeline. It is a Python class, in this case class MyFlow(FlowSpec), as sketched below.
  • Steps: a step is the smallest resumable unit of computation, delimited by the @step decorator. Steps are Python methods on the MyFlow class, in this case start, fitA, fitB, eval, and end.
  • Transitions: the links between steps; they can be of different types (linear, branch, and foreach). There are more details in the documentation.
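
To make these pieces concrete, here is a minimal sketch of what such a flow might look like; the class name and the placeholder scores are purely illustrative, not actual model training:

from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        # load or generate the data here, then branch into two model steps
        self.next(self.fitA, self.fitB)

    @step
    def fitA(self):
        self.score = 0.90   # placeholder for training model A and scoring it
        self.next(self.eval)

    @step
    def fitB(self):
        self.score = 0.95   # placeholder for training model B and scoring it
        self.next(self.eval)

    @step
    def eval(self, inputs):
        # a join step receives the state of both branches as 'inputs'
        self.best_score = max(inp.score for inp in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("Best accuracy:", self.best_score)

if __name__ == "__main__":
    MyFlow()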

There are 3 components around the flow:

  • The datastore is the place where all the data (data artifacts) generated along the flow is stored
  • The metadata service is the place where information on the executions of the flow is stored
  • The client is the component that connects to the datastore to access the data and to the metadata service to get information on the flow

These components around the flow allow data scientists to resume a run, inspect run metadata, do hybrid runs and collaborate.

Setting up Metaflow

You can make use of Metaflow both locally and on a remote server like AWS.

In either case, you can install Metaflow simply with

pip install metaflow

Or upgrade an already installed Metaflow

pip install --upgrade metaflow

You may prefer the remote option given the scale of the data and the resources needed, or the local option where fast interaction is required but resources are constrained.

There is also the hybrid method to get the best of both worlds, local and remote, by switching easily between the two. Metaflow snapshots all data and code in the cloud automatically. This means that you can inspect, resume, and restore any previous Metaflow execution without having to worry that the fruits of your hard work are lost.
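
For example, assuming Metaflow has been configured for AWS (e.g. with metaflow configure aws), a single heavy step can be pushed to AWS Batch with the @batch decorator while the rest of the flow runs locally. The class name and resource numbers below are illustrative:

from metaflow import FlowSpec, step, batch

class HybridFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # this step runs on AWS Batch with 4 CPUs and ~16 GB of memory,
    # while the other steps still run on your local machine
    @batch(cpu=4, memory=16000)
    @step
    def train(self):
        self.trained = True   # placeholder for the actual training work
        self.next(self.end)

    @step
    def end(self):
        print("Done:", self.trained)

if __name__ == "__main__":
    HybridFlow()

Alternatively, every step of a flow can be sent to AWS Batch without changing the code by running it with the --with batch option on the command line.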

Starting your Data science/ Machine learning project — MNIST

For this post, we will build a data science workflow for training a machine learning model using the MNIST data set.

The MNIST dataset is originally a database of images of handwritten digits, but for this project the image data has been converted to CSV format, with each pixel value used as a feature, i.e. the images have been vectorized.

The data file contains 60,000 examples and their labels. Each row consists of 785 values: the first value is the label (a number from 0 to 9) and the remaining 784 values are the pixel values (each a number from 0 to 255).

The flow involves data extraction, data preparation, a data split, model fitting and prediction, and model evaluation.

This flow can be represented in code as shown in the steps below.

Walking through each step in the graph

The first step was to set the path to the MNIST data file and load the dataset into a Pandas data frame.
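
A sketch of how this step might look is shown below; the class name MNISTFlow, the step names, and the file path are assumptions used for illustration, and the snippets that follow continue this same class:

from metaflow import FlowSpec, step

class MNISTFlow(FlowSpec):

    @step
    def start(self):
        import pandas as pd
        # path to the MNIST CSV file; adjust it to wherever your copy lives
        self.path = "mnist_train.csv"
        self.df = pd.read_csv(self.path)
        self.next(self.data_preparation)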

The second step was preparing the data by extracting the features and the label from the whole dataset.
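
Continuing the sketch, the first column holds the digit label and the remaining 784 columns hold the pixel values:

    @step
    def data_preparation(self):
        # separate the label column from the 784 pixel columns
        self.labels = self.df.iloc[:, 0]
        self.features = self.df.iloc[:, 1:]
        self.next(self.data_split)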

The next step was to split the data into train and test sets in a 50:50 ratio.
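
A possible split step, using scikit-learn's train_test_split with test_size=0.5 to get the 50:50 ratio and then branching into the two models:

    @step
    def data_split(self):
        from sklearn.model_selection import train_test_split
        # hold out half of the data for testing, then fan out to both models
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.features, self.labels, test_size=0.5, random_state=42)
        self.next(self.fitA, self.fitB)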

Then we fit and predicted with the Gaussian Naive Bayes model.
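
The Gaussian Naive Bayes branch might look like this:

    @step
    def fitA(self):
        from sklearn.naive_bayes import GaussianNB
        # fit the model on the training set and predict on the test set
        self.model_name = "Gaussian Naive Bayes"
        model = GaussianNB()
        model.fit(self.X_train, self.y_train)
        self.predictions = model.predict(self.X_test)
        self.next(self.join)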

Next, we fit and predicted with the Random Forest model.
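
And the Random Forest branch:

    @step
    def fitB(self):
        from sklearn.ensemble import RandomForestClassifier
        # same pattern as fitA, with a Random Forest classifier
        self.model_name = "Random Forest"
        model = RandomForestClassifier()
        model.fit(self.X_train, self.y_train)
        self.predictions = model.predict(self.X_test)
        self.next(self.join)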

Here, we joined the two branches on which the classification models were trained and merged the data artifacts from the two branches.
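
In the sketch, the join step receives the state of both branches as inputs; the branch-specific artifacts (model name and predictions) are collected explicitly, and merge_artifacts carries the shared ones forward:

    @step
    def join(self, inputs):
        # gather predictions from each branch, then merge the common artifacts
        self.results = [(inp.model_name, inp.predictions) for inp in inputs]
        self.merge_artifacts(inputs, exclude=['model_name', 'predictions'])
        self.next(self.evaluate)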

Then we evaluated the models.
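
The evaluation step could then score each branch's predictions against the held-out labels:

    @step
    def evaluate(self):
        from sklearn.metrics import accuracy_score
        # accuracy of each model on the 50% test split
        self.scores = {name: accuracy_score(self.y_test, preds)
                       for name, preds in self.results}
        self.next(self.end)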

And ended the run.
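
The end step closes the sketch by printing the scores:

    @step
    def end(self):
        for name, score in self.scores.items():
            print(f"{name}: {score:.3f}")

if __name__ == "__main__":
    MNISTFlow()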

Putting the steps above together gives the full flow file.
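
Assuming the sketch above is saved as mnist_flow.py (a hypothetical filename), the DAG can be printed and the flow executed from the command line with

python mnist_flow.py show
python mnist_flow.py run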

The result of the flow run shows that the Random Forest model performed better than the Gaussian Naive Bayes model, with a score of 93.6% compared to 56.7%.

With Metaflow, we can also inspect our flow, i.e. check the list of flows, the number of runs, the latest run, the list of steps, and the data artifacts at each step. This is particularly helpful when debugging your flow.

When a step fails, the run log reports the failing step along with its traceback, so you can see exactly where the error occurred.

Below, you can see how to inspect your flow; this can be done through a Jupyter notebook or your CLI. We are using a Jupyter notebook for this project.
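
A sketch of that kind of inspection with the Metaflow Client API, using the flow and artifact names from the sketch above, might look like this:

from metaflow import Flow, get_metadata

print(get_metadata())            # where Metaflow is reading metadata from

flow = Flow('MNISTFlow')         # the flow name from the sketch above
run = flow.latest_run            # or flow.latest_successful_run
print(run.id, run.successful)

for step in run:                 # the steps of this run
    print(step.id)

# data artifacts of a successfully finished run, e.g. the accuracy scores
print(run.data.scores)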

So we covered an overview of Metaflow, its basic components, an example flow execution using the MNIST dataset, and inspecting data artifacts.

Hope it was informative. Do share.

Cheers
