The Pipeline: End-to-End Walkthrough

Understanding each stage of the anomaly detection pipeline: from raw logs to anomaly scores.

Pipeline Overview

The DeepSentry pipeline consists of 6 sequential stages. Each stage takes input from the previous stage and produces output for the next:

┌──────────────────────┐
│      RAW LOGS        │
│     HDFS_1.log       │
└──────────┬───────────┘
           │
  [STAGE 1: PREPARE]
           │
           ├─→ Split into train/val
           ├─→ Extract messages
           └─→ Build vocabulary
           │
           ▼
┌──────────────────────┐
│ PREPARED TEXT DATA   │
│ • train.txt          │
│ • val.txt            │
│ • vocab.pkl          │
└──────────┬───────────┘
           │
  [STAGE 2: TEXT TRAINING]
  LSTM Autoencoder
           │
           ├─→ Learn text patterns
           ├─→ Compression to 128 dims
           └─→ Reconstruction quality
           │
           ▼
┌──────────────────────┐
│  TEXT AUTOENCODER    │
│ • encoder.h5         │
│ • decoder.h5         │
│ • metadata.json      │
└──────────┬───────────┘
           │
  [STAGE 3: DATASET ENCODE]
           │
           ├─→ Compress all logs
           ├─→ Create sequences
           └─→ Save vectors
           │
           ▼
┌──────────────────────┐
│ ENCODED SEQUENCES    │
│ • train_encoded.pkl  │
│ • val_encoded.pkl    │
│ • test_encoded.pkl   │
└──────────┬───────────┘
           │
  [STAGE 4: ANOMALY TRAINING]
  Bidirectional LSTM
           │
           ├─→ Learn temporal patterns
           ├─→ Predict next sequence
           └─→ Calculate errors
           │
           ▼
┌──────────────────────┐
│ ANOMALY DETECTOR     │
│ • detector.h5        │
│ • threshold.json     │
│ • metadata.json      │
└──────────┬───────────┘
           │
  [STAGE 5: EVALUATION]
           │
           ├─→ Calculate AUC/metrics
           ├─→ Generate ROC curve
           └─→ Create report
           │
           ▼
┌──────────────────────┐
│  EVAL RESULTS        │
│ • metrics.json       │
│ • roc_curve.png      │
│ • precision_recall.* │
└──────────┬───────────┘
           │
  [STAGE 6: LIVE MONITORING]
           │
           ├─→ Load models
           ├─→ Score incoming logs
           └─→ Alert on anomalies
           │
           ▼
┌──────────────────────┐
│   REAL-TIME ALERTS   │
│  Anomalies flagged   │
└──────────────────────┘

Stage 1: Data Preparation

Input: Raw log file in Utah format (YYMMDD HHMMSS MESSAGE)
Output: Prepared text files and vocabulary
Module: src/tx/prepare.py

What Happens

  1. Read the raw log file line-by-line
  2. Extract the message portion (everything after HHMMSS)
  3. Split into training (80%) and validation (20%) sets
  4. Build a vocabulary of all unique words/tokens seen
  5. Save prepared data and vocabulary for the next stage
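The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual src/tx/prepare.py (which also preserves timestamps for sorting and pickles its outputs); the line format assumed is `YYMMDD HHMMSS MESSAGE`:

```python
import random

def prepare(lines, train_frac=0.8, seed=0):
    """Toy sketch of Stage 1: extract messages, split 80/20, build a vocabulary."""
    # Extract the message portion (everything after the two timestamp fields)
    messages = [line.split(maxsplit=2)[2] for line in lines if line.strip()]

    # Split into training and validation sets
    random.Random(seed).shuffle(messages)
    cut = int(len(messages) * train_frac)
    train, val = messages[:cut], messages[cut:]

    # Build a vocabulary mapping each unique token to an integer ID
    vocab = {}
    for msg in train:
        for tok in msg.split():
            vocab.setdefault(tok, len(vocab))
    return train, val, vocab
```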

Example

Input raw log:

220115 14:23:45 BlockReport processing took 120ms for 50000 blocks from 192.168.1.100
220115 14:23:46 Got block report from DataNode 192.168.1.100
220115 14:23:47 Verification complete

Output prepared data:

  • train.txt: Contains 80% of messages (with timestamps preserved for sorting)
  • val.txt: Contains 20% of messages
  • vocab.pkl: Dictionary mapping each unique word to an integer ID

Running Stage 1

bash dockerrun/run_text_autoencoder_prepare_data.sh

This runs the prepare module, which reads from your configured data directory and outputs to prepared_data/.

Stage 2: Text Encoding Training

Input: Prepared text (train.txt, val.txt) and vocabulary
Output: Trained text autoencoder model
Module: src/tx/train.py

What Happens

An LSTM autoencoder learns to compress log messages into dense vectors. Think of it as learning a "fingerprint" for each message type:

  • "BlockReport processing took 120ms" → [0.12, -0.45, 0.89, ...]
  • "Got block report from" → [0.34, 0.11, -0.23, ...]
  • "Verification complete" → [-0.56, 0.78, 0.01, ...]

The model learns these encodings by trying to reconstruct the original message from its compressed representation. Messages with similar meaning get similar vectors.

Training Details

  • Epochs: Typically 10-20 passes through the data
  • Batch size: 32-128 messages at a time
  • Vector dimension: 64-128 dimensions (configurable)
  • Learning rate: Adaptive, typically starts at 0.001

The model uses validation data to prevent overfitting. Training continues until validation loss plateaus.
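The "stop when validation loss plateaus" rule amounts to early stopping with a patience window. A minimal sketch of that check (the real training loop likely uses a framework callback such as Keras's EarlyStopping rather than hand-rolled logic):

```python
def should_stop(val_losses, patience=3):
    """Return True once validation loss has not improved for `patience` epochs."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```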

Running Stage 2

bash dockerrun/run_text_autoencoder_train.sh

This trains the text encoder, saving the model to text_autoencoder_model/.

Stage 3: Dataset Encoding

Input: Trained text encoder and full dataset
Output: Encoded vectors for all logs
Module: src/tx/encode.py

What Happens

Now that we have a trained text encoder, we use it to encode every log message in our dataset into vectors. This transforms raw text into numbers that the anomaly detector can process.

For each message:

  1. Tokenize the message (convert words to IDs)
  2. Run through the encoder portion of the autoencoder
  3. Get a vector output (e.g., 128 floats)
  4. Save the vector
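Step 1 is the only part we can sketch without the trained model: mapping words to the integer IDs from the Stage 1 vocabulary. The unknown-token fallback shown here is an assumption; the real encode module may handle out-of-vocabulary words differently:

```python
def tokenize(message, vocab, unk_id=0):
    """Convert a message to token IDs using the Stage 1 vocabulary.
    Words not seen at vocabulary-build time fall back to `unk_id`."""
    return [vocab.get(tok, unk_id) for tok in message.split()]
```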

Input/Output Example

Input:

"BlockReport processing took 120ms for 50000 blocks"

After encoding:

[0.123, -0.456, 0.789, ..., 0.234]  # 128-dimensional vector

Running Stage 3

bash dockerrun/run_text_autoencoder_encode_dataset.sh

This encodes the entire dataset and saves pickled vectors to encoded_dataset/.

Stage 4: Anomaly Detector Training

Input: Encoded vectors
Output: Trained anomaly detector model
Module: src/an/train.py

What Happens

Now we have sequences of vectors. A bidirectional LSTM learns what normal temporal patterns look like:

  • Which messages typically follow which
  • How quickly things normally change
  • What rate of events is typical

The model is trained to predict the next vector given the previous ones. When it can predict accurately, the sequence is normal. When prediction error is high, something unusual is happening.

Training Process

  • Create sliding windows of sequences (e.g., 10-step windows)
  • For each window, predict the final step given the first 9 steps
  • Train on normal (training) data
  • Evaluate on validation data
  • Save the best model
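The window construction above can be sketched directly: each 10-step window yields a (context, target) pair, where the first 9 steps are the model input and the last step is the prediction target. A minimal version (the real code operates on vector arrays, not scalars):

```python
def sliding_windows(vectors, window=10):
    """Split a sequence into (context, target) training pairs:
    first window-1 steps in, final step as the prediction target."""
    pairs = []
    for i in range(len(vectors) - window + 1):
        chunk = vectors[i:i + window]
        pairs.append((chunk[:-1], chunk[-1]))
    return pairs
```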

Running Stage 4

bash dockerrun/run_anomaly_detector_train.sh

This trains the anomaly detector and saves it to anomaly_trained_model/.

Stage 5: Evaluation

Input: Trained detector and test data
Output: Anomaly scores and evaluation metrics
Module: src/an/eval.py or src/an/analysis.py

What Happens

We run the trained model on test data and score each log entry:

  1. Take each sequence from the test data
  2. Feed it to the trained detector
  3. Compute the prediction error (how far the predicted vector was from the observed one)
  4. High error = anomalous, low error = normal
  5. Save scores to file
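The scoring step reduces to an error measure plus a threshold. The Euclidean distance used here is an assumption for illustration (the actual module may use MSE or another error), but the high-error-means-anomalous logic is the same:

```python
import math

def prediction_error(predicted, actual):
    """Euclidean distance between the detector's predicted vector
    and the vector actually observed."""
    return math.dist(predicted, actual)

def flag_anomalies(errors, threshold):
    """High error = anomalous, low error = normal."""
    return [e > threshold for e in errors]
```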

Evaluation with Ground Truth

If you have labeled data (ground truth anomalies), run the labeled evaluation:

bash dockerrun/run_anomaly_detector_eval_labeled.sh

This computes:

  • ROC curve: Trade-off between true positives and false positives
  • AUC: Area under ROC curve (1.0 = perfect, 0.5 = random guessing)
  • Precision: Of detected anomalies, how many are real?
  • Recall: Of real anomalies, how many did we catch?
  • F-beta score: Weighted harmonic mean of precision and recall (β > 1 weights recall more heavily; β = 1 gives the standard F1 score)
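Precision, recall, and F-beta follow directly from true/false positive and false negative counts. A stdlib sketch of the formulas (the eval module presumably uses a library such as scikit-learn for this):

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Compute precision, recall, and F-beta from boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    fbeta = ((1 + b2) * precision * recall / (b2 * precision + recall)
             if precision + recall else 0.0)
    return precision, recall, fbeta
```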

Stage 6: Live Monitoring

Input: Incoming log stream and trained models
Output: Real-time anomaly alerts
Module: src/live/main.py

What Happens

The live monitor reads logs in real-time and scores them as they arrive:

  1. Read incoming log messages
  2. Encode each message using the text encoder
  3. Score the vector using the anomaly detector
  4. Track rolling statistics (mean, std of recent scores)
  5. Alert when scores deviate from normal
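Steps 4-5 (rolling statistics plus deviation alerts) can be sketched with a bounded window and a z-score test. The window size and z threshold below are illustrative defaults, not the monitor's actual configuration:

```python
import statistics
from collections import deque

class RollingAlert:
    """Track a rolling window of recent anomaly scores and alert when a
    new score deviates from the window mean by more than z standard
    deviations."""

    def __init__(self, window=100, z=3.0):
        self.scores = deque(maxlen=window)
        self.z = z

    def observe(self, score):
        alert = False
        if len(self.scores) >= 2:
            mean = statistics.mean(self.scores)
            std = statistics.stdev(self.scores)
            alert = std > 0 and abs(score - mean) > self.z * std
        self.scores.append(score)
        return alert
```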

Running Live Monitoring

bash dockerrun/run_live_monitoring.sh

The live monitor runs continuously, reading from a configured log file or stream and outputting alerts to the console or a monitoring system.

Running the Complete Pipeline

To run all stages in order:

#!/bin/bash
# Run complete pipeline

bash dockerrun/run_text_autoencoder_prepare_data.sh
bash dockerrun/run_text_autoencoder_train.sh
bash dockerrun/run_text_autoencoder_encode_dataset.sh
bash dockerrun/run_anomaly_detector_train.sh
bash dockerrun/run_anomaly_detector_eval.sh

# Optionally, with ground truth evaluation:
bash dockerrun/run_anomaly_detector_eval_labeled.sh

# Start live monitoring:
bash dockerrun/run_live_monitoring.sh

Tip: Each stage can take anywhere from seconds (small datasets) to hours (large production logs). Since the stages run sequentially, use a terminal multiplexer like tmux or screen to keep long-running stages alive after you disconnect.