Model Architecture: Deep Dive

Understanding the neural networks that power DeepSentry: autoencoders, LSTMs, and how they work together.

Overall Architecture

DeepSentry uses a two-stage approach:

┌──────────────────────────────────────────────────────────────┐
│                  STAGE 1: TEXT AUTOENCODER                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Input Message                                              │
│   "BlockReport processing took 120ms"                        │
│                      │                                       │
│                      ▼                                       │
│             [Embedding Layer]                                │
│                      │                                       │
│                      ▼                                       │
│          [LSTM Encoder (bidirectional)]                      │
│                      │                                       │
│                      ▼                                       │
│      [Bottleneck: 128-D Vector]                              │
│       v = [0.12, -0.45, 0.89, ...]                           │
│                      │                                       │
│                      ▼                                       │
│          [LSTM Decoder (reverses encoding)]                  │
│                      │                                       │
│                      ▼                                       │
│             [Output Layer]                                   │
│                      │                                       │
│                      ▼                                       │
│   Reconstructed Message                                      │
│   "BlockReport processing took 120ms"                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                 STAGE 2: ANOMALY DETECTOR                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Input Sequence (9 encoded vectors)                         │
│   [v1, v2, v3, v4, v5, v6, v7, v8, v9]                       │
│                      │                                       │
│                      ▼                                       │
│      [Bidirectional LSTM]                                    │
│      ├─ Forward pass  ──→                                    │
│      └─ Backward pass ←──                                    │
│                      │                                       │
│                      ▼                                       │
│   [Dense Prediction Layer]                                   │
│                      │                                       │
│                      ▼                                       │
│   Predicted next vector: v10'                                │
│                      │                                       │
│                      ▼                                       │
│   Calculate Error: distance(v10', v10)                       │
│   • Error < 0.1  = Normal ✓                                  │
│   • Error > 0.5  = Anomaly ⚠                                 │
│                                                              │
└──────────────────────────────────────────────────────────────┘
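
At a glance, inference through the two stages can be sketched in a few lines of Python. The names tokenize, encoder, detector, and threshold below are placeholders for the trained Stage 1 and Stage 2 components, not DeepSentry's actual API:

import math

# Illustrative end-to-end flow; `tokenize`, `encoder`, and `detector`
# stand in for the trained Stage 1 / Stage 2 components.
def score_window(messages, tokenize, encoder, detector, threshold=0.5):
    # Stage 1: compress each raw log message into a latent vector (e.g. 128-D).
    vectors = [encoder(tokenize(m)) for m in messages[:-1]]   # v1..v9
    actual_next = encoder(tokenize(messages[-1]))             # v10

    # Stage 2: predict what the next vector should look like.
    predicted_next = detector(vectors)                        # v10'

    # Anomaly score = distance between prediction and reality (Euclidean here).
    error = math.dist(predicted_next, actual_next)
    return error, error > threshold                           # (score, flagged?)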

Part 1: The Text Autoencoder

What is an Autoencoder?

An autoencoder is a neural network that learns to compress and decompress data. It has two parts:

  • Encoder: Takes input (a log message) and compresses it to a small vector
  • Decoder: Takes the compressed vector and reconstructs the original message

During training, we feed the model text and ask it to reproduce the exact same text. By forcing the model through a narrow "bottleneck" (the compressed vector), it learns what features are important to represent the message.

Why Use an Autoencoder for Logs?

Log messages are high-dimensional (thousands of possible words) but low-rank (repetitive patterns). An autoencoder:

  • Compresses: From ~5000-word vocabulary to 128-dim vector (40x compression)
  • Learns semantics: Messages with similar meaning get similar vectors
  • Unsupervised: Needs no labels, just learns from data
  • Lossy: Intentionally loses unimportant details

Architecture Details

TEXT AUTOENCODER LAYERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ENCODER HALF:
Input: [Word IDs]  e.g., [1234, 5678, 9012]
    ↓
Embedding: [Word Vectors]  e.g., [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    ↓
LSTM (forward): Processes left-to-right
    ↓
LSTM (backward): Processes right-to-left
    ↓
Bottleneck Dense Layer: Reduces to 128 dimensions
    ↓
Latent Vector: [0.12, -0.45, 0.89, ..., 0.23]  (128 floats)

DECODER HALF:
Latent Vector: [0.12, -0.45, 0.89, ..., 0.23]
    ↓
Dense Layer: Expands back up
    ↓
LSTM: Generates output word-by-word
    ↓
Softmax: Probability over vocabulary for each position
    ↓
Output: [1234, 5678, 9012]  (Word IDs, should match input)
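
Translated into a minimal PyTorch sketch (layer names, the default dimensions from the Key Hyperparameters table, and the way the forward and backward LSTM states are combined are assumptions for illustration, not DeepSentry's exact implementation):

import torch
import torch.nn as nn

class LogAutoencoder(nn.Module):
    """Sketch of the encoder/decoder stack described above."""
    def __init__(self, vocab_size=5000, embedding_size=64,
                 lstm_hidden_dim=64, encoder_dim=128):
        super().__init__()
        # ENCODER: embedding -> bidirectional LSTM -> 128-D bottleneck
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.encoder_lstm = nn.LSTM(embedding_size, lstm_hidden_dim,
                                    batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * lstm_hidden_dim, encoder_dim)
        # DECODER: expand latent vector -> LSTM -> logits over the vocabulary
        self.expand = nn.Linear(encoder_dim, lstm_hidden_dim)
        self.decoder_lstm = nn.LSTM(lstm_hidden_dim, lstm_hidden_dim,
                                    batch_first=True)
        self.output = nn.Linear(lstm_hidden_dim, vocab_size)

    def encode(self, word_ids):                      # (batch, seq_len)
        embedded = self.embedding(word_ids)          # (batch, seq_len, 64)
        _, (h_n, _) = self.encoder_lstm(embedded)    # h_n: (2, batch, 64)
        # Concatenate final forward and backward states, then compress.
        h = torch.cat([h_n[0], h_n[1]], dim=-1)      # (batch, 128)
        return self.bottleneck(h)                    # latent vector: (batch, 128)

    def decode(self, latent, seq_len):
        # Repeat the latent vector at every timestep and let the LSTM unroll it.
        h = self.expand(latent).unsqueeze(1).repeat(1, seq_len, 1)
        out, _ = self.decoder_lstm(h)                # (batch, seq_len, 64)
        return self.output(out)                      # softmax-ready logits

    def forward(self, word_ids):
        latent = self.encode(word_ids)
        return self.decode(latent, word_ids.size(1)), latent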
        

Training the Text Autoencoder

During training:

  1. Take a log message, e.g., "BlockReport processing took 120ms"
  2. Convert words to IDs: [1, 234, 567, 89, 101]
  3. Feed through encoder to get latent vector
  4. Feed latent vector through decoder to reconstruct message
  5. Compare reconstructed output to original input
  6. Calculate reconstruction loss (how wrong the reconstruction is)
  7. Update weights to reduce loss
  8. Repeat for thousands of messages

After training, we throw away the decoder and keep just the encoder. We use it to compress every log message in the dataset into vectors.
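
As a sketch, that loop might look like the following, reusing the LogAutoencoder class above and assuming a train_loader that yields batches of padded word-ID tensors (names and details are illustrative, not the project's actual training script):

import torch
import torch.nn as nn

model = LogAutoencoder(vocab_size=5000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()          # reconstruction loss per word position

for epoch in range(10):                    # epochs
    for word_ids in train_loader:          # batches of (batch, seq_len) word IDs
        logits, _ = model(word_ids)                          # reconstruct input
        loss = criterion(logits.transpose(1, 2), word_ids)   # compare to original
        optimizer.zero_grad()
        loss.backward()                    # update weights to reduce loss
        optimizer.step()

# After training: keep only the encoder and vectorise every log message.
with torch.no_grad():
    latent_vectors = [model.encode(batch) for batch in all_messages_loader]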

Part 2: The Anomaly Detector (Bidirectional LSTM)

Why LSTM for Sequences?

LSTMs (Long Short-Term Memory networks) are designed specifically for sequence data. Simple recurrent networks tend to lose information after only a few steps (the vanishing-gradient problem), but LSTMs use gating to maintain context across many steps.

Why this matters for logs:

  • Memory: An LSTM can remember events from 100 steps ago
  • Context: "authentication failed" means different things in different contexts
  • Patterns: LSTMs learn sequences like: failure → retry → timeout → error message

Bidirectional Processing

The anomaly detector uses a bidirectional LSTM:

  • Forward: Reads sequence left-to-right (past to present)
  • Backward: Reads sequence right-to-left (present to past)
  • Combined: Uses context from both directions

Why? Consider this sequence:

Log 1: "app started"
Log 2: "listening on port 8080"
Log 3: "memory usage: 45%"
Log 4: [ANOMALY] "unexpected error"
Log 5: "memory usage: 92%"
        

A forward-only LSTM might miss the anomaly at log 4 because it hasn't seen log 5 yet. A bidirectional LSTM sees that log 4 breaks the pattern established by the surrounding context.
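
A minimal PyTorch sketch of such a detector, using the default dimensions from the Key Hyperparameters table (the exact layer layout is an assumption, not necessarily DeepSentry's):

import torch
import torch.nn as nn

class AnomalyDetector(nn.Module):
    """Bidirectional LSTM that predicts the next latent vector (sketch)."""
    def __init__(self, encoder_dim=128, lstm_hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(encoder_dim, lstm_hidden_dim,
                            batch_first=True, bidirectional=True)
        self.predict = nn.Linear(2 * lstm_hidden_dim, encoder_dim)

    def forward(self, window):                 # window: (batch, 9, 128) = v1..v9
        out, _ = self.lstm(window)             # forward + backward context
        return self.predict(out[:, -1, :])     # predicted v10': (batch, 128)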

Reconstruction Error as Anomaly Score

The detector works by making predictions:

  1. Given vectors [v1, v2, ..., v9], predict what v10 should be
  2. Compare prediction to actual v10
  3. Large error = anomaly, small error = normal

Mathematically:

anomaly_score = distance(predicted_v10, actual_v10)

distance can be:
  • Euclidean: sqrt((p1-a1)² + (p2-a2)² + ... + (p128-a128)²)
  • Manhattan: |p1-a1| + |p2-a2| + ... + |p128-a128|
  • Cosine: 1 - (p·a) / (||p|| ||a||)
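
Any of these can be computed in a few lines of NumPy; the sketch below also applies the illustrative 0.5 cutoff from the Stage 2 diagram (not a tuned threshold):

import numpy as np

def anomaly_score(predicted, actual, metric="euclidean"):
    p, a = np.asarray(predicted), np.asarray(actual)   # 128-D vectors
    if metric == "euclidean":
        return np.sqrt(np.sum((p - a) ** 2))
    if metric == "manhattan":
        return np.sum(np.abs(p - a))
    if metric == "cosine":
        return 1 - np.dot(p, a) / (np.linalg.norm(p) * np.linalg.norm(a))
    raise ValueError(f"unknown metric: {metric}")

# Usage: is_anomaly = anomaly_score(predicted_v10, actual_v10) > 0.5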
        

Training the Anomaly Detector

The training data consists of normal sequences (from the training logs). The goal: minimize reconstruction error on normal data, so that any deviation signals an anomaly.

  1. Create 10-step sliding windows from training vectors
  2. For each window [v1...v9], predict v10
  3. Calculate loss: reconstruction error on v10
  4. Update LSTM weights to improve predictions
  5. Validate on separate validation data to prevent overfitting
  6. Save best model (lowest validation error)
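
Steps 1-4 might look like the following sketch, reusing the AnomalyDetector class above; train_vectors stands in for the stacked latent vectors from Stage 1, and validation/checkpointing (steps 5-6) is only indicated in comments:

import torch
import torch.nn as nn

# 1. Ten-step sliding windows over the sequence of latent vectors:
#    inputs = [v_i ... v_{i+8}], target = v_{i+9}.
def make_windows(vectors, window=10):            # vectors: (N, 128) tensor
    inputs = torch.stack([vectors[i:i + window - 1]
                          for i in range(len(vectors) - window + 1)])
    targets = torch.stack([vectors[i + window - 1]
                           for i in range(len(vectors) - window + 1)])
    return inputs, targets                       # (M, 9, 128), (M, 128)

detector = AnomalyDetector()
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)
criterion = nn.MSELoss()                         # assumed error measure on v10

inputs, targets = make_windows(train_vectors)
for epoch in range(10):
    predicted = detector(inputs)                 # steps 2-3: predict v10, score it
    loss = criterion(predicted, targets)
    optimizer.zero_grad()
    loss.backward()                              # step 4: update LSTM weights
    optimizer.step()
    # Steps 5-6 (not shown): evaluate on held-out windows each epoch and
    # keep the checkpoint with the lowest validation error.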

Why This Architecture?

DeepSentry combines autoencoders + LSTMs because:

Component          Problem It Solves
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Text Autoencoder   Log messages are variable-length text; the autoencoder converts them to fixed-size vectors, removing noise.
LSTM Detector      Sequences of vectors show temporal patterns; the LSTM detects deviations from normal patterns.
Bidirectional      Context before and after an event matters; bidirectional processing captures full context.

Key Hyperparameters

These control model behavior and are configurable in dockerconfig/:

Parameter          Default   Impact
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
embedding_size     64        Size of word vectors. Larger = more expressive, slower training
encoder_dim        128       Size of latent vectors. Larger = more information preserved
lstm_hidden_dim    64        LSTM internal size. Larger = more capacity to learn patterns
sequence_length    10        Length of sequences for anomaly training. Longer = captures longer patterns
epochs             10        Training iterations. More = better learning, but risk of overfitting
batch_size         32        Samples per gradient update. Larger = faster but less stable training
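
As an illustration only, a hyperparameter set matching these defaults could be expressed as a plain Python mapping; check the files under dockerconfig/ for the real key names and format:

# Illustrative defaults; the authoritative values live in dockerconfig/.
config = {
    "embedding_size":  64,    # word-vector size
    "encoder_dim":     128,   # latent-vector size
    "lstm_hidden_dim": 64,    # LSTM internal size
    "sequence_length": 10,    # window length for the anomaly detector
    "epochs":          10,    # training iterations over the data
    "batch_size":      32,    # samples per gradient update
}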

Intuition: Why Reconstruction Error Works

The core assumption: Normal sequences follow predictable patterns.

Example normal patterns in HDFS logs:

  • BlockReport usually followed by Got block report
  • Got block report usually followed by Verification complete
  • File deleted usually followed by Block report

When something anomalous happens:

  • A sequence breaks learned patterns
  • The LSTM can't predict well
  • Reconstruction error spikes

Example anomaly: normally messages about file deletion don't appear in rapid succession. If they do, that's unusual, and the model can't predict it well.

Key Insight: We're not looking for specific attack signatures. We're looking for deviations from normal system behavior. The model learns "what normal looks like" and flags everything else as suspicious.

Limitations and When to Consider Alternatives

  • Seasonal patterns: If your system has patterns that repeat on hourly/daily/weekly cycles, consider including time features
  • Concept drift: If your system fundamentally changes (new features, architecture), models trained on old data may not work well
  • Very large logs: For billions of messages, consider sampling or hierarchical approaches
  • Domain expertise: If you know specific attacks to detect, hybrid approaches combining rule-based + learned methods can be better