There’s been a tremendous rise in machine learning applications lately, but are they actually useful to industry? Their real value is determined by successful deployments and effective production-level operations.
According to a survey by Algorithmia, 55% of companies have never deployed a machine learning model, and 85% of models never make it to production. The main reasons for this failure are a lack of talent, the absence of processes to manage change, and the lack of automated systems. To tackle these challenges, the engineering and operational discipline of DevOps must be brought into machine learning development, which is exactly what MLOps is about.
What is MLOps?
MLOps, also known as Machine Learning Operations for Production, is a set of standardized practices used to build, deploy, and govern the lifecycle of ML models. In simple words, MLOps is a bunch of technical engineering and operational tasks that allow your machine learning model to be used by other users and applications across the organization.
MLOps lifecycle
There are seven stages in the MLOps lifecycle, executed iteratively, and the success of a machine learning application depends on the success of each of these steps. A problem at one step can force backtracking to a previous step to check for bugs introduced there. Let’s understand what happens at every step of the MLOps lifecycle:
- ML development: The foundational step, covering the creation of a complete pipeline from data processing through model training and evaluation code.
- Model Training: Once the setup is ready, the next logical step is to train the model. Continuous training functionality is also needed here to adapt to new data or address specific changes.
- Model Evaluation: Performing inference with the trained model and checking the accuracy/correctness of its outputs.
- Model Deployment: Once the proof-of-concept stage is accomplished, the model is deployed according to industry requirements to face real-life data.
- Prediction Serving: After deployment, the model is ready to serve predictions on incoming data.
- Model Monitoring: Over time, problems such as concept drift can make the results inaccurate, so continuous monitoring of the model is essential to ensure proper functioning.
- Data and Model Management: A part of the central system that manages the data and models. It includes maintaining storage, keeping track of different versions, ease of accessibility, security, and configuration across various cross-functional teams.
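To make the Model Monitoring stage concrete, here is a minimal sketch of a drift check that compares the mean of an incoming feature batch against its training-time baseline. The feature values and the alert threshold are hypothetical, and real monitoring tools use more robust statistics; this only illustrates the idea:

```python
from statistics import mean, stdev

def drift_score(baseline, incoming):
    """Z-score of the incoming batch mean against the training baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(mean(incoming) - mu) / sigma

# Hypothetical cholesterol values seen at training time
baseline = [233, 250, 204, 236, 192, 294, 263, 199, 168, 239]
# A new batch whose distribution has shifted upward
incoming = [340, 355, 348, 362, 351]

if drift_score(baseline, incoming) > 3.0:  # alert threshold is an arbitrary choice
    print("possible data drift detected")
```

A monitoring job would run such a check on every batch of serving data and trigger retraining or an alert when the score crosses the threshold.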
PyCaret and MLflow
PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes, in the notebook environment of your choice.
MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: MLflow Tracking, MLflow Projects, MLflow Models, and the Model Registry.
Let’s get started
It is easier to understand the MLOps process, PyCaret, and MLflow through an example. For this exercise we’ll use the Heart Disease UCI dataset from Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci. This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient and is integer-valued from 0 (no presence) to 4. First, we’ll install PyCaret, import the libraries, and load the data:
!pip install pycaret pandas shap

import pandas as pd
from pycaret.classification import *

df = pd.read_csv('heart.csv')
df.head()
Common to all modules in PyCaret, setup is the first and the only mandatory step in any machine learning experiment using PyCaret. This function takes care of all the data preparation required prior to training models. Here we will pass log_experiment = True and experiment_name = 'diamond'; this tells PyCaret to automatically log all metrics, hyperparameters, and model artifacts behind the scenes as you progress through the modeling phase. This is possible thanks to the integration with MLflow.
cat_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'thal']
experiment = setup(df, target='target', categorical_features=cat_features, log_experiment=True, experiment_name='diamond')
Now that the data is ready, let’s train the models using the compare_models function. It trains all the algorithms available in the model library and evaluates multiple performance metrics using k-fold cross-validation.
best_model = compare_models()
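To build intuition for what compare_models does internally, here is a stdlib-only sketch of a cross-validated comparison loop. The helper names and the toy majority-class model are purely illustrative, not PyCaret’s actual API or internals:

```python
from statistics import mean

def k_fold_accuracy(X, y, fit, predict, k=5):
    """Illustrative k-fold cross-validation loop (not PyCaret internals)."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]  # simple interleaved folds
    scores = []
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        acc = mean(1.0 if predict(model, X[i]) == y[i] else 0.0 for i in test_idx)
        scores.append(acc)
    return mean(scores)

# Toy "algorithm": always predict the most common training label
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = [[i] for i in range(10)]
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # mostly class 0
print(k_fold_accuracy(X, y, fit_majority, predict_majority))  # 0.8
```

compare_models runs this kind of loop for every estimator in its library and ranks them by the averaged metrics.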
Let’s now finalize the best model, i.e., train the best model on the entire dataset (including the test set), and then save the pipeline as a pickle file. The save_model function saves the entire pipeline (including the model) as a pickle file on your local disk.
save_model(best_model, model_name='ridge-model')
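Conceptually, save_model boils down to serializing the fitted pipeline object to disk so it can be restored later. Here is a minimal stdlib sketch of that idea; the Pipeline class is a toy stand-in, not PyCaret’s implementation:

```python
import pickle

class Pipeline:
    """Toy stand-in for a fitted preprocessing + model pipeline."""
    def __init__(self, coef):
        self.coef = coef
    def predict(self, rows):
        # dummy linear score per row
        return [sum(c * x for c, x in zip(self.coef, row)) for row in rows]

pipe = Pipeline(coef=[0.5, -0.25])

# Persist the whole object, analogous to save_model(...)
with open('ridge-model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

# Reload it later or in another process, analogous to load_model(...)
with open('ridge-model.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored.predict([[2.0, 4.0]]))  # [0.0]
```

Because the entire pipeline is serialized (not just the model), the restored object applies the same preprocessing to new data that it saw during training.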
Remember that we passed log_experiment = True in the setup function, along with experiment_name = 'diamond'. Now we can launch the MLflow UI to see the logs of all the models and the pipeline:
mlflow ui
Now open your browser and navigate to localhost:5000 to see the MLflow tracking UI with all the logged runs.
Now, we can load this model at any time and run predictions on new data:
model = load_model('ridge-model')
model.predict(df.tail())
So, that’s how an end-to-end machine learning model is trained, saved, and deployed, ready to serve predictions in production.
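The Prediction Serving stage amounts to wrapping the loaded pipeline in an endpoint. As an illustration only, here is a stdlib-only sketch with a hypothetical score function standing in for the real model; this is not how PyCaret or MLflow serve models, just the shape of the idea:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Hypothetical scoring function standing in for the loaded pipeline
def score(features):
    return 1 if features.get("chol", 0) > 240 else 0

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)
        payload = json.dumps({"prediction": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)
    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"chol": 286}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    result = json.loads(resp.read())
print(result)
server.shutdown()
```

In a real deployment you would use a production-grade server and load the pickled pipeline at startup, but the contract is the same: features in, prediction out.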