In brief

- Anomaly detection is the identification of items that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior. These may have been generated by a different mechanism or appear inconsistent with the remainder of a particular data set
- Traditional detection methods perform well when dealing with binary classification of anomalies but are not designed to capture the temporal/spatial correlation pattern between data points
- In this article, we consider an application for anomaly detection using deep learning techniques and neural networks (NNs) implemented with the PyTorch framework

Thanks to data science and, in particular, machine learning, businesses can better understand and perform preventive and timely maintenance on processes that might cause high losses when they don’t work properly.

Anomaly detection, also known as outlier detection, is the identification of items, events or observations that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior. Such items, events or observations may have been generated by a different mechanism or appear inconsistent with the remainder of that data set.

To deal with an anomaly detection problem, traditional methods are usually applied. Some of these methods could be seasonal autoregressive moving average (ARIMA), deviation from static location parameters from an underlying distribution, quantile regression, ensemble techniques using feature bagging, isolation forests and cluster analysis-based anomaly detection.

The above techniques perform well when dealing with data points and events labeled “anomaly” with a binary classification logic. Unfortunately, they are not designed to capture the temporal/spatial correlation pattern between data points. Although classical time series models can model those patterns, we can resort to deep learning techniques in neural networks (NNs) to achieve more meaningful results.

In this article, we will consider an application for anomaly detection using an RNN autoencoder-based technique where a sequence of data points is labeled as either “normal” or “anomaly” behavior. The NN is implemented with a PyTorch framework.

## Deep learning solutions

**Autoencoder**

Figure 1. The autoencoder structure, from left to right: The vector sequence as input, the encoder NN for dimensionality reduction, the latent encoded vector, the decoder NN and the output of the model

In Figure 1, the encoder is a neural network trained to encode the input sequence into a compact representation called the latent vector. The decoder does the opposite by trying to reconstruct the input sequence starting from its compressed encoded representation. To determine whether an input sequence is anomalous, it is compared with the autoencoder output. If the reconstruction error exceeds a chosen threshold, the sequence is considered anomalous.

The training strategy to achieve high accuracy consists of only training the autoencoder on normal data, i.e., when the system that generates the data works properly. A normal sequence fed into the model will show an output close to its input; however, with anomalous sequences, the output is expected to deviate significantly by at least a chosen meaningful threshold.

*The above is one of the cases where overfitting might be beneficial as the success of the models lies in the ability to not generalize to anomalous sequence data. Good and bad performance of the model is the key to correct classification*.

## Recurrent neural networks

The choice of the encoder is crucial, and its architecture must capture the pattern structure of the data and keep their most important characteristics in a reduced representation. The same can be said for the choice of the decoder, which, starting from the reduced vector, must expand the encoded information to a full-size sequence output.

Recurrent neural networks (RNNs) are particularly capable of capturing the interdependence of the data thanks to their recurrent mechanism. The main idea is that if we have a sequence of size *T* and *t* is the index of the observation ranging in [1, 2, …, *t*, …, *T*], when the observation at time *t* is passed into the network, the past observations at times [*t*-1, *t*-2, …] influence the output result.

RNNs have been so successful that researchers have implemented modifications of the vanilla algorithm like the gated recurrent unit (GRU) and the long short-term memory (LSTM). Apart from solving the vanilla RNN vanishing gradient problem, they are capable of interpolating longer time series dependencies thanks to structures called gates and memory cells.

## Data source

The original dataset can be found on Kaggle as the pump sensor data, where a water pump system is monitored because it does not work for some periods. The following motivation drives the investigation and attempts to find a solution:

**Problem statement: ***There were 7 system failures last year. Those failures cause huge problems for many people, leading to some serious living problem*s *for a number of families. The team can’t see any pattern in the data when the system goes down, so they are not sure where to focus their attention (from the above source)*.

The dataset columns are:

**timestamp**—**sensor_##**— reading value of each sensor, where ## is the sensor number ranging from 00 to 51, inclusively.**machine_status**— has the values ‘NORMAL’ when the water pump is working properly, or ‘BROKEN’ or ‘RECOVERING’ otherwise

Firstly, the column **machine_status** is dropped in favor of a new column called **target**, where** **elements are assigned the value 0 when the **machine_status** is ‘NORMAL’ or 1 otherwise.

Figure 2. The blue bar over the ‘0’ label represents the number of points under which the water pump is functioning properly, and the ‘1’ bar represents the number of points when the water pump is not working or is recovering

The water pump system was not working at the moments represented at 14484 time points. However, this graph would not tell us how many times the system went down. Although the problem statement already clarifies that issue, we need to better study the temporal pattern for periods when the water pump system was not functioning.

The article focuses more on applying ML techniques than packaging a final solution to the problem; hence, we will work only with **sensor_02**.

Sensor values: Blue shows when the system is working nominally, whereas yellow shows anomalies/regime shifts.

The water pump system does not work approximately 6% of the time, and there are 7 failures:

Failure 1: 945 minutes, 15.75 hours

Failure 2: 3111 minutes, 51.85 hours

Failure 3: 1313 minutes, 21.88 hours

Failure 4: 606 minutes, 10.1 hours

Failure 5: 8391 minutes, 139.85 hours

Failure 6: 42 minutes, 0.7 hour

Failure 7: 76 minutes, 1.27 hours

The first 4 failures endured for a maximum of 2 days, and the longest failure lasted 6 days, whereas the later failures were solved in less than 2 hours.

The goal is to identify when the system is not working by looking at the last available sequence of points in time with an arbitrary size *t.** *The choice of the size *t* depends on many factors. If *t* is set to 60, last-hour measurements are observed to determine whether the system is working.

## Data wrangling

The data provided by the sensor is a long time series consisting of 220,320 observations. These data are scaled using the Python class sklearn.preprocessing.StandardScaler(), which centers the data to a 0 mean series and unit variance.

The scaler is fitted only on normal data and then applied to the whole series. Once scaled, a rolling window with a size of 60 data points with a lag of 45 steps creates a dataset of 1-hour sequences. For each sequence, the column **target **counts the number of anomalous points.

In this way, 4895 sequences are generated, some of which are shown below:

The red highlighted rows are discarded due to their ambiguity in whether they are anomalous or not, while the green rows with target valued 0 are normal sequences, when the value 60 is anomalous.

As sequences with anomalies between 1 and 59 can signal partial working of the water pump system. only rows whose target columns are valued at 0 or 60 are retained, representing normal and anomalous sequences, respectfully. Some of the rows present not a number (NaN) values filled with the respective sequences’ medians.

Due to PyTorch being used as a Python library for building the autoencoder architecture, the sequences are cast to torch tensors and, eventually, the dataset is partitioned into the training (85%), validation (5%), and test (10%) subdatasets which include only ‘normal’ sequences. In contrast, the anomalous sequences are grouped as an ‘anomalous’ test subdataset due to the design of the training procedure.

## PyTorch implementation

The LSTM Autoencoder implementation will be shown and explained below, as the GRU and Vanilla RNN in terms of code are a simple variation of LSTM.

## Encoder

The encoder’s full code is presented below:

from torch import nn, distributions, exp, log import torch class LSTM_Encoder(nn.Module): def __init__(self, n_features, seq_len, embedding_dim=20): super(LSTM_Encoder, self).__init__() self.seq_len, self.n_features = seq_len, n_features self.embedding_dim = embedding_dim self.rnn = nn.LSTM( input_size=n_features, hidden_size=embedding_dim, num_layers=1, batch_first=True ) def forward(self, x): x = x.reshape((1, self.seq_len, self.n_features)) _, (hidden_n, _) = self.rnn(x) return hidden_n.reshape((self.n_features, self.embedding_dim))

The following is a line-by-line explanation of the encoder implementation:

from torch import nn, distributions, exp, log import torch

These are all elements of the torch: torch.nn allows a neural network to be built, whereas torch.distributions, torch.exp and torch.log are needed for the variational autoencoder.

class RNN_Encoder(nn.Module): def __init__(self, n_features, seq_len, embedding_dim=20): super(RNN_Encoder, self).__init__() # The same as super().__init__() self.seq_len, self.n_features = seq_len, n_features self.embedding_dim = embedding_dim self.rnn = nn.LSTM( input_size=n_features, hidden_size=self.embedding_dim, num_layers=1, batch_first=True )

The RNN_Encoder class inherits the functionality of the nn.Module class, *self.seq_len* is the length of the input time series sequence (60 in this case), *self.n_features* is the input number of sequences corresponding to 1 here, as only the data of sensor_02 are taken into consideration. The attribute *self.embedding*_dim is the size of the reduced size vector where the information of the input sequence is encoded to.

*self.rnn* is the recurrent neural network to be implemented, and its LSTM layer has the following parameters:

*input_size*is the number of feature sequences (only one for sensor_02)*hidden_size*is the size of the hidden layer*num_layers*is the number of hidden layers of the RNN architecture*batch_first*is set to true when the input torch vector dimension has to be in a specific order (batch size, length of the sequence and number of features)

def forward(self, x): x = x.reshape((1, self.seq_len, self.n_features)) _, (hidden_n, _) = self.rnn(x) return hidden_n.reshape((self.n_features, self.embedding_dim))

The forward method defines how the input is passed through the network. It consists of a reshape of the input torch vector for this application because *b*atch_first is set to true; for the subsequent LSTM step the torch tensor dimension is (1, 60, 1).

The LSTM pass will produce a first torch tensor corresponding to the output of the NN and a second bidimensional torch tensor corresponding to the hidden layer and the cell layer. The Encoder returns the latent vector from the hidden layer.

## Decoder

class LSTM_Decoder(nn.Module): def __init__(self, n_features=1, seq_len=60, input_dim=20): super(LSTM_Decoder, self).__init__() self.seq_len, self.input_dim = seq_len, input_dim self.hidden_dim, self.n_features = 2 * input_dim, n_features self.rnn = nn.LSTM( input_size=input_dim, hidden_size=self.hidden_dim, num_layers=1, batch_first=True ) self.output_layer = nn.Linear(self.hidden_dim, n_features) def forward(self, x): x = x.repeat(self.seq_len, self.n_features) x = x.reshape((self.n_features, self.seq_len, self.input_dim)) x, (hidden_n, cell_n) = self.rnn(x) x = x.reshape((self.seq_len, self.hidden_dim)) return self.output_layer(x)

A similar procedure inverse to the encoder’s procedure is followed for the decoder. Here the input sequence is size 20 of the encoded vector, which is mapped to an input of double the size of the input sequence. The output sequence must be 60 as the initial input. For this reason, a single linear feed-forward layer is employed.

For the forward pass, the data are repeated 60 times; these sequences are encoded to a sequence of 40 data points and eventually to a 60-point sequence.

## Training

The training procedure is shown below. There is no batch gradient descend, and the code shown is not encapsulated in a function or a class for educational purposes.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) criterion = nn.L1Loss(reduction='sum').to(device) history = dict(train=[], val=[]) best_loss = 10000.0 best_model_wts = copy.deepcopy(model.state_dict()) # train_dataset # val_dataset

The first line of code regulates GPU usage when available. The optimizer is the known Adam optimizer with the learning parameter 0.0003. The experiments prove this value is optimal for training: Bigger values do not lead to convergence, while smaller values bring slower convergence. The loss is the sum of the absolute difference.

A dictionary called *history *is instantiated to save the training dataset and validation dataset loss across epochs. The variable *best_loss* is an arbitrary big value as the placeholder for the highest loss: When it is exceeded in any of the subsequent epochs, the model parameters will be saved within the object *best_model_wts*.

There is the set of sequences *train_dataset* and *val_dataset* — the training and validation set consisting only of normal sequences.

for epoch in range(50): model = model.train() train_losses = [] for seq_true in train_dataset: seq_true = seq_true.to(device) seq_pred = model(seq_true) loss = criterion(seq_pred, seq_true) optimizer.zero_grad() loss.backward() optimizer.step() train_losses.append(loss.item()) val_losses = [] model = model.eval() with torch.no_grad(): for seq_true in val_dataset: seq_true = seq_true.to(device) seq_pred = model(seq_true) loss = criterion(seq_pred, seq_true) val_losses.append(loss.item()) train_loss = np.mean(train_losses) val_loss = np.mean(val_losses) history['train'].append(train_loss) history['val'].append(val_loss) if val_loss < best_loss: best_loss = val_loss best_model_wts = copy.deepcopy(model.state_dict())

The training starts as usual in an NN. For 50 epochs, each sequence is moved to GPU if available, passed through the model, the loss is computed, and the neural network parameters are adjusted thanks to backpropagation. Then the training and validation losses are saved, and the best-performing model is chosen through the epochs.

** **

## Threshold

Once the model is fitted, it is necessary to decide which threshold has to be specified for discriminating between normal and anomalous sequences.

Figure 3. Distribution of loss values for the training dataset using the best model

The distribution is skewed to the right, i.e., a more significant number of reconstructed sequences exist that are closer to the inputs rather than further from them.

One strategy is to set a quantile of the distribution as the threshold. If the quantile is set to be high, more anomalous sequences will be labeled as normal; on the contrary, a lower quantile means normal sequences will be labeled as anomalous. The choice of threshold is highly dependent on the business problem, and a cost analysis should be done to determine the optimal threshold. This information, in this case, is not available; therefore, an arbitrary 90% quantile seems reasonable.

## Performance metrics

Now, the normal and anomalous sequence test subdatasets can be used to compute crucial metrics for this binary classification problem. In particular, with the normal sequences, the number of true negative (TN) and false positive (FP) sequences is computed, and with the anomalous sequences, the number of FN and TP sequences is obtained.

- TP is the number of anomalous sequences labeled as anomalous by the model — i.e., the reconstruction error is greater than the threshold
- TN is the number of normal sequences labeled as normal by the model — i.e., the reconstruction error is smaller than the threshold
- FP is the number of normal sequences reconstructed badly by the model and, hence, labeled as anomalous
- FN is the number of anomalous sequences labeled as normal since the model could reconstruct close to the input for the threshold selected

From these quantities, the following metrics can be obtained:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

The precision is the model’s capability to discriminate the input sequence only when it is anomalous, and it needs to be maximized if the most important solution must avoid false water system failure alarms. The highest recall metric is preferred when the crucial thing is to identify most of the anomalies.

For this problem, the accuracy is the percentage of correctly labeled sequences, whether anomalous or normal.

## Results

The results of some of our experiments are shown below. The Vanilla RNN, GRU, and LSTM have been implemented similarly, as the demonstrated code highlighted by the suffix *_light *where only one hidden layer exists, a deeper version with two hidden layers. The results also include a variational autoencoder with a one hidden layer RNN.

All the methods perform well according to the metrics considered. The GRU and LSTM implementations have a clear advantage in recall, compared with RNN, which is the worst performing, especially when using the lighter architecture; it’s probably unable to capture some parts of sequences complexity. The VAE implementation improves the performance of the LSTM architecture. The other two RNN types are not equally beneficial; on the contrary, they deteriorate the TP.

## Summary

We can conclude that there is no preferred model of architecture among those considered that significantly outperforms the others for this problem. The GRU architecture seems a good choice in production from the MLOps perspective, as it has shown higher performance than RNN and it is lighter than LSTM, and the latter is also the slowest to train.

In this article, we have shown some snippets of code that might help data scientists and developers to implement those ML techniques with more profound theoretical knowledge rather than hard writing the code. It is important to note that our code is educational, and when delivering a model into production, the codebase should follow best practices, such as unit testing, linting, OOP paradigm, etc.

The null treatment can be done with better techniques, such as smoothing kernels, and other metrics such as F1, ROC AUC curve, etc. can be considered to choose the optimal model for the problem’s needs. Also, for the variational autoencoder, the training function has to be endowed by a more complex loss function that takes into consideration the KL convergence.