The Pipeline: End-to-End Walkthrough

Understanding each stage of the anomaly detection pipeline: from raw logs to anomaly scores.

Pipeline Overview

The DeepSentry pipeline consists of 6 sequential stages. Each stage takes input from the previous stage and produces output for the next:

┌──────────────────────┐
│      RAW LOGS        │
│     HDFS_1.log       │
└──────────┬───────────┘
           │
  [STAGE 1: PREPARE]
           │
           ├─→ Split into train/val
           ├─→ Extract messages
           └─→ Build vocabulary
           │
           ▼
┌──────────────────────┐
│ PREPARED TEXT DATA   │
│ • train.txt          │
│ • val.txt            │
│ • vocab.pkl          │
└──────────┬───────────┘
           │
  [STAGE 2: TEXT TRAINING]
  LSTM Autoencoder
           │
           ├─→ Learn text patterns
           ├─→ Compression to 128 dims
           └─→ Reconstruction quality
           │
           ▼
┌──────────────────────┐
│  TEXT AUTOENCODER    │
│ • encoder.h5         │
│ • decoder.h5         │
│ • metadata.json      │
└──────────┬───────────┘
           │
  [STAGE 3: DATASET ENCODE]
           │
           ├─→ Compress all logs
           ├─→ Create sequences
           └─→ Save vectors
           │
           ▼
┌──────────────────────┐
│ ENCODED SEQUENCES    │
│ • train_encoded.pkl  │
│ • val_encoded.pkl    │
│ • test_encoded.pkl   │
└──────────┬───────────┘
           │
  [STAGE 4: ANOMALY TRAINING]
  Bidirectional LSTM
           │
           ├─→ Learn temporal patterns
           ├─→ Predict next sequence
           └─→ Calculate errors
           │
           ▼
┌──────────────────────┐
│ ANOMALY DETECTOR     │
│ • detector.h5        │
│ • threshold.json     │
│ • metadata.json      │
└──────────┬───────────┘
           │
  [STAGE 5: EVALUATION]
           │
           ├─→ Calculate AUC/metrics
           ├─→ Generate ROC curve
           └─→ Create report
           │
           ▼
┌──────────────────────┐
│  EVAL RESULTS        │
│ • metrics.json       │
│ • roc_curve.png      │
│ • precision_recall.* │
└──────────┬───────────┘
           │
  [STAGE 6: LIVE MONITORING]
           │
           ├─→ Load models
           ├─→ Score incoming logs
           └─→ Alert on anomalies
           │
           ▼
┌──────────────────────┐
│   REAL-TIME ALERTS   │
│  Anomalies flagged   │
└──────────────────────┘

Stage 1: Data Preparation

Input: Raw log file in Utah format (YYMMDD HHMMSS MESSAGE)
Output: Prepared text files and vocabulary
Module: src/tx/prepare.py

What Happens

  1. Read the raw log file line-by-line
  2. Extract the message portion (everything after HHMMSS)
  3. Split into training (80%) and validation (20%) sets
  4. Build a vocabulary of all unique words/tokens seen
  5. Save prepared data and vocabulary for the next stage
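The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual src/tx/prepare.py (which also preserves timestamps for sorting and pickles its outputs); the line format assumed is `YYMMDD HHMMSS MESSAGE`:

```python
import random

def prepare(lines, train_frac=0.8, seed=0):
    """Toy sketch of Stage 1: extract messages, split 80/20, build a vocabulary."""
    # Extract the message portion (everything after the two timestamp fields)
    messages = [line.split(maxsplit=2)[2] for line in lines if line.strip()]

    # Split into training and validation sets
    random.Random(seed).shuffle(messages)
    cut = int(len(messages) * train_frac)
    train, val = messages[:cut], messages[cut:]

    # Build a vocabulary mapping each unique token to an integer ID
    vocab = {}
    for msg in train:
        for tok in msg.split():
            vocab.setdefault(tok, len(vocab))
    return train, val, vocab
```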

Example

Input raw log:

220115 14:23:45 BlockReport processing took 120ms for 50000 blocks from 192.168.1.100
220115 14:23:46 Got block report from DataNode 192.168.1.100
220115 14:23:47 Verification complete

Output prepared data:

  • train.txt: Contains 80% of messages (with timestamps preserved for sorting)
  • val.txt: Contains 20% of messages
  • vocab.pkl: Dictionary mapping each unique word to an integer ID

Running Stage 1

bash dockerrun/run_text_autoencoder_prepare_data.sh

This runs the prepare module, which reads from your configured data directory and outputs to prepared_data/.

Stage 2: Text Encoding Training

Input: Prepared text (train.txt, val.txt) and vocabulary
Output: Trained text autoencoder model
Module: src/tx/train.py

What Happens

An LSTM autoencoder learns to compress log messages into dense vectors. Think of it as learning a "fingerprint" for each message type:

  • "BlockReport processing took 120ms" → [0.12, -0.45, 0.89, ...]
  • "Got block report from" → [0.34, 0.11, -0.23, ...]
  • "Verification complete" → [-0.56, 0.78, 0.01, ...]

The model learns these encodings by trying to reconstruct the original message from its compressed representation. Messages with similar meaning get similar vectors.

Training Details

  • Epochs: Typically 10-20 passes through the data
  • Batch size: 32-128 messages at a time
  • Vector dimension: 64-128 dimensions (configurable)
  • Learning rate: Adaptive, typically starts at 0.001

The model uses validation data to prevent overfitting. Training continues until validation loss plateaus.
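The "stop when validation loss plateaus" rule amounts to early stopping with a patience window. A minimal sketch of that check (the real training loop likely uses a framework callback such as Keras's EarlyStopping rather than hand-rolled logic):

```python
def should_stop(val_losses, patience=3):
    """Return True once validation loss has not improved for `patience` epochs."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```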

Running Stage 2

bash dockerrun/run_text_autoencoder_train.sh

This trains the text encoder, saving the model to text_autoencoder_model/.

Stage 3: Dataset Encoding

Input: Trained text encoder and full dataset
Output: Encoded vectors for all logs
Module: src/tx/encode.py

What Happens

Now that we have a trained text encoder, we use it to encode every log message in our dataset into vectors. This transforms raw text into numbers that the anomaly detector can process.

For each message:

  1. Tokenize the message (convert words to IDs)
  2. Run through the encoder portion of the autoencoder
  3. Get a vector output (e.g., 128 floats)
  4. Save the vector
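Step 1 is the only part we can sketch without the trained model: mapping words to the integer IDs from the Stage 1 vocabulary. The unknown-token fallback shown here is an assumption; the real encode module may handle out-of-vocabulary words differently:

```python
def tokenize(message, vocab, unk_id=0):
    """Convert a message to token IDs using the Stage 1 vocabulary.
    Words not seen at vocabulary-build time fall back to `unk_id`."""
    return [vocab.get(tok, unk_id) for tok in message.split()]
```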

Input/Output Example

Input:

"BlockReport processing took 120ms for 50000 blocks"

After encoding:

[0.123, -0.456, 0.789, ..., 0.234]  # 128-dimensional vector

Running Stage 3

bash dockerrun/run_text_autoencoder_encode_dataset.sh

This encodes the entire dataset and saves pickled vectors to encoded_dataset/.

Stage 4: Anomaly Detector Training

Input: Encoded vectors
Output: Trained anomaly detector model
Module: src/an/train.py

What Happens

Now we have sequences of vectors. A bidirectional LSTM learns what normal temporal patterns look like:

  • Which messages typically follow which
  • How quickly things normally change
  • What rate of events is typical

The model is trained to predict the next vector given the previous ones. When it can predict accurately, the sequence is normal. When prediction error is high, something unusual is happening.

Training Process

  • Create sliding windows of sequences (e.g., 10-step windows)
  • For each window, predict the final step given the first 9 steps
  • Train on normal (training) data
  • Evaluate on validation data
  • Save the best model
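The window construction above can be sketched directly: each 10-step window yields a (context, target) pair, where the first 9 steps are the model input and the last step is the prediction target. A minimal version (the real code operates on vector arrays, not scalars):

```python
def sliding_windows(vectors, window=10):
    """Split a sequence into (context, target) training pairs:
    first window-1 steps in, final step as the prediction target."""
    pairs = []
    for i in range(len(vectors) - window + 1):
        chunk = vectors[i:i + window]
        pairs.append((chunk[:-1], chunk[-1]))
    return pairs
```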

Running Stage 4

bash dockerrun/run_anomaly_detector_train.sh

This trains the anomaly detector and saves it to anomaly_trained_model/.

Stage 5: Evaluation

Input: Trained detector and test data
Output: Anomaly scores and evaluation metrics
Module: src/an/eval.py or src/an/analysis.py

What Happens

We run the trained model on test data and score each log entry:

  1. Take each sequence from the test data
  2. Feed it to the trained detector
  3. Compute the prediction error (how far the predicted vector was from the observed one)
  4. High error = anomalous, low error = normal
  5. Save scores to file
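The scoring step reduces to an error measure plus a threshold. The Euclidean distance used here is an assumption for illustration (the actual module may use MSE or another error), but the high-error-means-anomalous logic is the same:

```python
import math

def prediction_error(predicted, actual):
    """Euclidean distance between the detector's predicted vector
    and the vector actually observed."""
    return math.dist(predicted, actual)

def flag_anomalies(errors, threshold):
    """High error = anomalous, low error = normal."""
    return [e > threshold for e in errors]
```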

Evaluation with Ground Truth

If you have labeled data (ground truth anomalies), run the labeled evaluation:

bash dockerrun/run_anomaly_detector_eval_labeled.sh

This computes:

  • ROC curve: Trade-off between true positives and false positives
  • AUC: Area under ROC curve (1.0 = perfect, 0.5 = random guessing)
  • Precision: Of detected anomalies, how many are real?
  • Recall: Of real anomalies, how many did we catch?
  • F-beta score: Weighted harmonic mean of precision and recall (β > 1 weights recall more heavily; β = 1 gives the standard F1 score)
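Precision, recall, and F-beta follow directly from true/false positive and false negative counts. A stdlib sketch of the formulas (the eval module presumably uses a library such as scikit-learn for this):

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Compute precision, recall, and F-beta from boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    fbeta = ((1 + b2) * precision * recall / (b2 * precision + recall)
             if precision + recall else 0.0)
    return precision, recall, fbeta
```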

Stage 6: Live Monitoring

Input: Incoming log stream and trained models
Output: Real-time anomaly alerts
Module: src/live/main.py

What Happens

The live monitor reads logs in real-time and scores them as they arrive:

  1. Read incoming log messages
  2. Encode each message using the text encoder
  3. Score the vector using the anomaly detector
  4. Track rolling statistics (mean, std of recent scores)
  5. Alert when scores deviate from normal
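Steps 4-5 (rolling statistics plus deviation alerts) can be sketched with a bounded window and a z-score test. The window size and z threshold below are illustrative defaults, not the monitor's actual configuration:

```python
import statistics
from collections import deque

class RollingAlert:
    """Track a rolling window of recent anomaly scores and alert when a
    new score deviates from the window mean by more than z standard
    deviations."""

    def __init__(self, window=100, z=3.0):
        self.scores = deque(maxlen=window)
        self.z = z

    def observe(self, score):
        alert = False
        if len(self.scores) >= 2:
            mean = statistics.mean(self.scores)
            std = statistics.stdev(self.scores)
            alert = std > 0 and abs(score - mean) > self.z * std
        self.scores.append(score)
        return alert
```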

Running Live Monitoring

bash dockerrun/run_live_monitoring.sh

The live monitor runs continuously, reading from a configured log file or stream and outputting alerts to the console or a monitoring system.

Running the Complete Pipeline

To run all stages in order:

#!/bin/bash
# Run complete pipeline

bash dockerrun/run_text_autoencoder_prepare_data.sh
bash dockerrun/run_text_autoencoder_train.sh
bash dockerrun/run_text_autoencoder_encode_dataset.sh
bash dockerrun/run_anomaly_detector_train.sh
bash dockerrun/run_anomaly_detector_eval.sh

# Optionally, with ground truth evaluation:
bash dockerrun/run_anomaly_detector_eval_labeled.sh

# Start live monitoring:
bash dockerrun/run_live_monitoring.sh

Tip: Each stage can take anywhere from seconds (small datasets) to hours (large production logs). Since the stages run sequentially, use a terminal multiplexer like tmux or screen to keep long-running stages alive after you disconnect.