November 29, 2021

MLOps - Real-World Application

1 Introduction

MLOps or ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. The term is a compound of machine learning and the continuous development practice of DevOps.

In the first part of this article, I will try to present and discuss the steps needed to prepare a machine learning model for deployment in production. The second part will cover deployment and automation of the ML system creation process.

The topic will be illustrated with a simple example: a website that lets you play a text game. The game content is generated automatically by an appropriately fine-tuned language model. For presentation purposes, the system is called TextGameGen.

Figure 1: MLOps is a set of practices at the intersection of Machine Learning, DevOps, and Data Engineering

2 ML Project Lifecycle

The machine learning project life cycle can be divided into several steps:

• defining the scope of the project - consists of defining the project, selecting key evaluation metrics (apart from the machine learning model itself, the operation of the entire system can also be assessed), and estimating the computing resources required to operate the system.

• defining data and establishing baselines - at this stage, the data required to train (or tune) the model should be collected and specified. The baseline (expressed using the selected evaluation metrics) required to determine the system quality should also be defined.

• data preparation and organization - concerns data processing, feature engineering, and labeling (e.g.: labeling emails as "spam" or "ham").

• modeling - the most important step in the project implementation. It consists of selecting and training a model, and then its testing and optimization (improving the quality of the code, selecting better hyperparameters after the initial evaluation).

• deployment - at this stage, the machine learning model is deployed and integrated with the end application (e.g.: website, mobile application). This phase of the cycle also includes system monitoring and management.

After defining the scope of the project, the next stages can be performed manually or in the form of an automatic pipeline. The level of their automation defines the entire workflow of the project preparation and deployment process. Typically, three types of such processes are mentioned:

• manual process (MLOps level 0) - this is the simplest MLOps pipeline. Each stage is performed manually, which makes it difficult to integrate the operations preparing the model for deployment. Such a process also makes it difficult to actively monitor the behavior of the model, which makes quality control problematic. The manual process is sufficient when there is no need to tune the model, prepare new models (with a different architecture) or the data itself doesn’t change significantly over time. The manual workflow is shown in the graphic below:

MLOps – manual process. Source: mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#devops_versus_mlops

• ML pipeline automation (MLOps level 1) - its goal is continuous training of the model by automating the ML pipeline, which in turn gives continuous delivery (CD) of the machine learning model (or models) in production. To use new data to retrain models in production, you have to add automated data and model validation steps, as well as pipeline triggers and metadata management, to the pipeline. Automating the ML process allows you to conduct faster experiments, train the model on the latest data, and continuously deliver new models. The workflow with an automated machine learning process is shown in the following image:

MLOps – ML pipeline automation. Source: mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#devops_versus_mlops

• CI/CD pipeline automation (MLOps level 2) - its goal is a fast and reliable update of the pipelines themselves in production. On top of the automated data validation, model validation, triggers, and metadata management known from level 1, an automated CI/CD system builds, tests, and deploys the pipeline components. This lets data scientists quickly explore new ideas around feature engineering, model architecture, and hyperparameters, and have the resulting pipeline changes delivered to the target environment automatically. The workflow with CI/CD pipeline automation is shown in the following image:

MLOps – CI/CD pipeline automation. Source:

Regardless of the level of pipeline automation, some manual labor is always required to prepare the initial model for deployment. This will be presented in the first part of this article, while in the second part we will take a closer look at the stage of deploying the model into a production environment, and the automation of individual steps. In other words, the first part will cover the areas of data science and data engineering, while the second will cover DevOps.

Finally, it is worth explaining why the lifecycle of a machine learning project is called a cycle. After deploying the model in production, it may happen that customer feedback indicates the need to adjust the model to new data, prepare new models or improve code quality. This requires us to go back to the previous steps (data preparation, model training etc.) and redeploy a new model. This creates a cycle.

Figure 2: ML project lifecycle. Based on:

3 Scope of the ML project

The first step in preparing any machine learning system is to define the scope of the project. We do this by:

• defining the project - what is to be done (e.g.: "spam" content detection system, character recognition, etc.) and how it can be done (training the model from scratch or maybe tuning the existing one), what tools will be needed.

• determining the main evaluation metrics of the system (e.g.: model accuracy, latency, throughput).

• estimation of required computing resources - important in the context of selecting the model size and training scheme (pre-training vs fine-tuning).

3.1 TextGameGen - the Scope of the Project

This is how we scope our example TextGameGen project:

• What is to be done?

A system that allows you to play a text-based game, based on content automatically created by a fine-tuned language model. Text generation capabilities of the model will be used for content of the story, item descriptions, and dialogues with encountered characters. The system will be available through a website.

• How can this be done?

By tuning the GPT-2 generative language model on selected texts and integrating it with a web application.

• What technologies will be used?

The system was prepared using:

– Python 3.8.5

– HuggingFace transformers 3.3.1 (a library of pretrained language models)

– PyTorch 1.8.1, and 1.7.1+cu101 (required in the Google Colaboratory environment)

– NumPy 1.19.4

– the Google Colaboratory platform

The next step is to select the evaluation metrics, which may cover several important areas such as the quality of the machine learning model (accuracy, F1 score, perplexity, etc.), system performance, and the expected time for server response.

For the TextGameGen system, the perplexity metric was used to evaluate the quality of the machine learning model. The system was also assessed in terms of the waiting time for the server response.
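As a reminder of what the perplexity metric measures, here is a minimal sketch of the standard definition: the exponential of the mean negative log-likelihood per token. The function name and the toy input are assumptions made for illustration; this is not the evaluation code used for TextGameGen.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical toy case: the model gives each of 4 tokens probability 0.25,
# so it is as uncertain as a uniform choice among 4 options.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))  # approximately 4
```

Lower perplexity means the model is less "surprised" by the evaluation text, which is why it is a common quality metric for language models.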

The last element of the project scope is the estimation of the required computing resources. For the TextGameGen system, an Nvidia K80 graphics card (12 GB VRAM) in the free version of the Google Colaboratory environment and an Nvidia GTX 1060 card (6 GB VRAM) were used. This limited memory ruled out the larger GPT-2 models (large and XL).

4 Define Data and Establish Baseline

At this stage, the data necessary to train the model should be collected and the baseline for the key system quality assessment metrics should be established.

Data can be obtained in several different ways, depending on the resources available:

• ready-made data sets - the fastest way of obtaining data is to use ready-made sets from the web. The advantage of this solution is that the data is often pre-processed and properly labeled. The disadvantage is that such datasets may be dedicated to a specific type of task, and quality can also be an issue (poor-quality graphics, text with linguistic and spelling errors). An example of such a set is the well-known MNIST database.

• scraping web content - the advantage of this solution is the virtually unlimited amount of available content. The main disadvantage is the unorganized structure of the obtained data, which increases the time required for preprocessing and data organization.

• obtaining data from external entities - training data can be obtained from third parties such as company clients or partner companies, research institutions, etc.

• own data set preparation - this is a very labor intensive approach and should be avoided if possible.

If the quality and quantity of the collected data are satisfactory, you can proceed to determine the baseline for the selected project metrics. A baseline defines the minimum performance that the system must achieve on the selected evaluation metrics for the project to be considered ready for deployment in production. This can be, for example, 90% model accuracy, 10 ms of waiting time for the server response, and 5 ms for the model to generate the content.
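A baseline of this kind can be written down as data and checked automatically. The sketch below is only an illustration: the metric names, the `BASELINES` structure, and the measured values are hypothetical, while the thresholds are the example values from the paragraph above.

```python
# Hypothetical baselines, using the example thresholds from the text.
# "min": the measured value must be at least the threshold.
# "max": the measured value must not exceed the threshold.
BASELINES = {
    "accuracy": ("min", 0.90),
    "server_response_ms": ("max", 10.0),
    "generation_ms": ("max", 5.0),
}


def failed_baselines(measured, baselines=BASELINES):
    """Return names of metrics that miss their baseline or were not measured."""
    failures = []
    for name, (kind, threshold) in baselines.items():
        value = measured.get(name)
        if value is None:
            failures.append(name)
        elif kind == "min" and value < threshold:
            failures.append(name)
        elif kind == "max" and value > threshold:
            failures.append(name)
    return failures


# Made-up measurements for illustration only.
measured = {"accuracy": 0.93, "server_response_ms": 12.0, "generation_ms": 4.2}
print(failed_baselines(measured))  # ['server_response_ms']
```

An empty list would mean that every baseline is met and the project can be considered ready for deployment.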

4.1 TextGameGen - Define Data and Establish Baseline

In our case, training data was sourced from the Harry Potter book series and the novels based on the book Lord of the Rings. The texts were prepared in the form of txt files. In addition, the collection has been supplemented with texts based on dialogues from the game The Elder Scrolls III: Morrowind.

The baseline selected for the model was a perplexity of 16.53 (the perplexity of the trained gpt-2 large model). The baseline selected for the application server response time was no more than 10 ms.

5 Labeling and Organizing the Data

Obtained datasets are very often unstructured, whereas machine learning models cope best with structured data such as tables. They may also contain incorrect values, carry redundant information, or be of poor quality. Models also expect numerical input, so for text data word embeddings can already be created at this stage.

For some ML tasks (such as text classification) it is necessary to assign appropriate labels to the data (e.g.: "spam" or "not spam") or to perform additional feature engineering on the training data. In the case of translation or text generation systems, this could include supplementing the data with special tokens indicating the end and beginning of a sentence.

Once the data is properly organized and labeled, you can divide it into two subsets: training (used to train the model) and validation (used for an initial assessment of the model's quality).

5.1 TextGameGen - Organizing the Data

Here’s what we did with our data for the TextGameGen app:

5.1.1 Data Cleansing

The first set consisted of the dialogues from the game The Elder Scrolls III: Morrowind (exported as a txt file using the Creation Kit tool available with the Game of the Year edition on the Steam platform). Originally the data was divided into columns (dialogue ID number, category, topic, content, etc.). We extracted only the content of the dialogues. The second set consisted of texts based on excerpts from the Harry Potter books and novels based on the book Lord of the Rings. Redundant lines, sentences, and artifacts (special characters, adjacent characters) were removed.

Figure 3: An example of dialogues from the game The Elder Scrolls III: Morrowind before extracting their content. The pandas library was used to better visualize the content of the dialogue file.

The texts created in this way were then divided into individual sentences, using the sent_tokenize function from the NLTK library.

5.1.2 Training Corpus Preparation

The first step in building the corpus was to add special tokens to the collected texts. For the text generation task (and the GPT-2 model used), the <|endoftext|> token was added to the end of each sequence element, while the beginning was extended with one of the following special tokens:

<|action|> – for each description of the events from the used novels

<|dialogue|> – for each dialogue from the used novels

<|quest|> – for each dialogue from the game The Elder Scrolls III: Morrowind

Sample sentences in the finished corpus are as follows:

<|action|> Chest heaving with emotion, Wood turned to Harry. <|endoftext|>

<|dialogue|> ’It wears off after a while,’ said Hermione, waving her hand impatiently. <|endoftext|>

<|quest|> I am looking for a rare book, Vampires of Vvardenfell. <|endoftext|>

The necessity for adding tokens <|dialogue|> and <|quest|> results from the following facts: in the case of the game The Elder Scrolls III: Morrowind the text usually addressed the interlocutor (you) directly, while the content itself concerned the tasks that should (or had to) be performed. On the other hand, dialogues from the books were less direct and often related to the events in the novel itself.
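The token scheme above can be sketched as a small helper. The function name `tag_sentence` and the `kind` keys are hypothetical and were chosen for this illustration; only the token strings themselves come from the article.

```python
END_TOKEN = "<|endoftext|>"

# Special prefix tokens described above, keyed by a hypothetical "kind" label.
PREFIX_TOKENS = {
    "action": "<|action|>",
    "dialogue": "<|dialogue|>",
    "quest": "<|quest|>",
}


def tag_sentence(sentence, kind):
    """Wrap one sentence with its prefix token and the end-of-text token."""
    prefix = PREFIX_TOKENS[kind]  # raises KeyError for an unknown kind
    return f"{prefix} {sentence.strip()} {END_TOKEN}"


print(tag_sentence("I am looking for a rare book, Vampires of Vvardenfell.", "quest"))
# <|quest|> I am looking for a rare book, Vampires of Vvardenfell. <|endoftext|>
```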

The sentences prepared in this way were concatenated in one text file, which was then divided into appropriate subsets:

train – consisting of 90% of the texts in the prepared corpus (train_ratio)

dev – consisting of 5% of the corpus (dev_ratio) and containing validation data

test – consisting of 5% of the corpus (test_ratio) and containing test data

The corpus prepared in this way will be used to fine-tune the language model.
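A 90/5/5 split like the one described above could look roughly as follows. This is a sketch under assumptions: the function name, the shuffling step, and the fixed seed are illustrative choices, and the article does not say whether the sentences were shuffled before splitting.

```python
import random


def split_corpus(lines, train_ratio=0.90, dev_ratio=0.05, test_ratio=0.05, seed=0):
    """Split a list of corpus lines into train, dev, and test subsets."""
    if abs(train_ratio + dev_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(lines)
    # Assumption: shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_dev = int(len(shuffled) * dev_ratio)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


# Toy corpus of 200 already-tagged lines, made up for the example.
corpus = [f"<|action|> sentence {i} <|endoftext|>" for i in range(200)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))  # 180 10 10
```

Giving the remainder to the last subset guarantees that no line is lost when the corpus size is not evenly divisible by the ratios.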

6 Modeling

At this stage, the actual machine learning model is prepared. In case of our manual process, it will be delivered in the form of a binary that needs to be loaded in the target application. To smoothly go through the modeling stage, follow these steps:

• model selection – the first step is to select the appropriate architecture and model size (this step can also be performed at the stage of determining the scope of the project)

• hyperparameters setting – the values of hyperparameters (learning rate, batch size, number of iterations) should be defined in order to train the model properly

• model training – model training is performed

• model evaluation – model is tested on the validation and test sets to verify if we meet our baselines

• model performance test – model performance (in terms of response time) is measured (does it allow us to achieve the required baseline)

6.1 TextGameGen - Modeling

With the training and validation data prepared, the next step was to choose the appropriate language model. From the models available in the transformers library, the GPT-2 model was selected, which, due to its architecture (autoregressive model), works best in text generation tasks.

The model is available in four versions differing in the number of parameters: gpt2 (117 million parameters), gpt2-medium (345 million parameters), gpt2-large (774 million parameters) and gpt2-xl (1558 million parameters) 1.

Due to technical limitations, small gpt2 and gpt2-medium models were tested.

1 Source:

6.2 Fine-Tuning of the Selected GPT-2 Model

Before fine-tuning started, the Google Colaboratory environment was prepared (notebook available at: ) by installing Python 3.8.1 (at the time of the experiments, Python 3.6.9 was installed by default) and the required packages.

After proper preparation of the work area, a special training script was created, in which at the beginning the used modules were imported and the selected gpt2 model was downloaded along with the appropriate configuration and tokenizer.

The next step was to add functions to train the model. For this purpose, the Trainer class from the transformers library was used. The parameters of the learning process were set using the TrainingArguments class. Most of the parameters were left at their default values (available at: ). The output directory and the batch size were changed; the batch size was set to 64 to speed up the learning process. The number of training epochs was also increased to four.

The prepared training script was run in the Google Collaboratory environment. By default, cross entropy was used as the cost function.

Cross entropy and calculated perplexity values after epoch 4 are as follows:

• gpt2 – loss = 2.562578125, perplexity = 27.1647,

• gpt2-medium – loss = 1.9130078125, perplexity = 16.5429.

Due to the lower value of the cost function and perplexity, gpt2-medium model was used as the underlying language model in the text generator for the TextGameGen system.

7 Deployment

The last step is to integrate the model with the final application. The standard deployment process can be broken down into several steps:

1. model installation - you have to share resources for your model (e.g.: model Pytorch file) and code that is used for the model prediction on the input data

2. defining the service configuration - describes the container (e.g.: Docker platform) and files to use during service initialization, and deployment configuration (available computing resources, basic system configuration)

3. local testing of the service - relies on local loading, running, and testing of the model on the input data in the simulation of the production environment (e.g.: website in the debug mode)

4. testing the resulting service - the model is loaded and tested in the final product

5. monitoring and management - while supervising the operation of the service, feedback is collected in order to diagnose the error, retrain, and deploy new models

7.1 TextGameGen - Deployment

After the modeling stage, the resources of the fine-tuned GPT-2 model (pytorch_model.bin, config.json, etc.) are available. With the model assets ready, the next step is to release the code. In the prepared system this was a simple script (listing 1) containing the generate function, which runs the generate method from the transformers library for the selected model.

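Listing 1: Generating the output from the model
# min_length and max_length are equal, so the output length is fixed:
# the prompt tokens plus the requested number of new tokens
output = model.generate(input_ids=input_ids,
                        max_length=length + len(encoded_prompt[0]),
                        min_length=length + len(encoded_prompt[0]))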

The prepared files were then registered on the server (in this case, the local computer). The service was a web application written in Python using the Flask library. First, local tests were performed with the application running in debug mode. After the local tests passed, the service was exposed externally using the localtunnel tool, 2 which makes the local host reachable from outside quickly and easily. At this stage the entire system was tested, including the baseline for the server response time (the result was 10 ms, so the deployment was considered successful).
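A response-time check of this kind can be sketched with the standard library alone. Everything in the block is an assumption made for illustration: `measure_latency_ms` is a hypothetical helper, and `fake_generate` stands in for the real request handler, which in the actual system would call the Flask endpoint and the model.

```python
import statistics
import time


def measure_latency_ms(handler, payload, repeats=20):
    """Call handler(payload) repeatedly; return the median wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        handler(payload)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_generate(prompt):
    # Hypothetical stand-in for the real request handler / model call.
    return prompt + " ..."


BASELINE_MS = 10.0  # response-time baseline stated in the article
latency = measure_latency_ms(fake_generate, "You enter a dark cave.")
print(f"median latency: {latency:.3f} ms, within baseline: {latency <= BASELINE_MS}")
```

The median is used rather than the mean so that a single slow call (for example the first one, while caches are cold) does not dominate the result.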


Figure 4: The main window of the TextGameGen system. 1 – main game window, 2 – window with the generated text (model operation test), 3 – the field for the input sequence.

8 Summary of Part I

After going through all the steps of the cycle, the first machine learning model was successfully deployed. Suppose, however, that monitoring the system shows the need for a new generator option that creates texts based only on the novel Alice in Wonderland. To achieve this, it would be necessary to return to the data preparation stage, go through the modeling process again, and redeploy the new model. Suppose also that, after some time, the data science team shows that other values of the generation parameters improve the quality of the resulting texts. In that case, we would have to go back to the modeling stage. To avoid preparing and deploying every new model manually, these steps should be automated. In part II we will discuss how the steps described here can be automated in order to achieve code integration and continuous delivery of new models to the production environment.


[1] MLOps: Continuous delivery and automation pipelines in machine learning -
