Model Architecture: Deep Dive
Overall Architecture
DeepSentry uses a two-stage approach:
STAGE 1: TEXT AUTOENCODER
━━━━━━━━━━━━━━━━━━━━━━━━━
Input Message
  "BlockReport processing took 120ms"
        ▼
[Embedding Layer]
        ▼
[LSTM Encoder (bidirectional)]
        ▼
[Bottleneck: 128-D Vector]
  v = [0.12, -0.45, 0.89, ...]
        ▼
[LSTM Decoder (reverses encoding)]
        ▼
[Output Layer]
        ▼
Reconstructed Message
  "BlockReport processing took 120ms"

STAGE 2: ANOMALY DETECTOR
━━━━━━━━━━━━━━━━━━━━━━━━━
Input Sequence (9 encoded vectors)
  [v1, v2, v3, v4, v5, v6, v7, v8, v9]
        ▼
[Bidirectional LSTM]
  ├─ Forward pass  ──→
  └─ Backward pass ←──
        ▼
[Dense Prediction Layer]
        ▼
Predicted next vector: v10'
        ▼
Calculate Error: distance(v10', v10)
  • Error < 0.1 = Normal ✓
  • Error > 0.5 = Anomaly ⚠
Part 1: The Text Autoencoder
What is an Autoencoder?
An autoencoder is a neural network that learns to compress and decompress data. It has two parts:
- Encoder: Takes input (a log message) and compresses it to a small vector
- Decoder: Takes the compressed vector and reconstructs the original message
During training, we feed the model text and ask it to reproduce the exact same text. By forcing the model through a narrow "bottleneck" (the compressed vector), it learns what features are important to represent the message.
Why Use an Autoencoder for Logs?
Log messages are high-dimensional (thousands of possible words) but low-rank (repetitive patterns). An autoencoder:
- Compresses: From ~5000-word vocabulary to 128-dim vector (40x compression)
- Learns semantics: Messages with similar meaning get similar vectors
- Unsupervised: Needs no labels, just learns from data
- Lossy: Intentionally loses unimportant details
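The compression figure above is simple arithmetic: a word represented as a one-hot vector over a ~5000-word vocabulary shrinks to a 128-float latent dimension. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the compression claim above.
vocab_size = 5000   # one-hot dimension for a ~5000-word vocabulary
latent_dim = 128    # autoencoder bottleneck size

ratio = vocab_size / latent_dim
print(f"{ratio:.0f}x compression")  # 39x, i.e. roughly the 40x quoted above
```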
Architecture Details
TEXT AUTOENCODER LAYERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ENCODER HALF:
Input: [Word IDs] e.g., [1234, 5678, 9012]
↓
Embedding: [Word Vectors] e.g., [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
↓
LSTM (forward): Processes left-to-right
↓
LSTM (backward): Processes right-to-left
↓
Bottleneck Dense Layer: Reduces to 128 dimensions
↓
Latent Vector: [0.12, -0.45, 0.89, ..., 0.23] (128 floats)
DECODER HALF:
Latent Vector: [0.12, -0.45, 0.89, ..., 0.23]
↓
Dense Layer: Expands back up
↓
LSTM: Generates output word-by-word
↓
Softmax: Probability over vocabulary for each position
↓
Output: [1234, 5678, 9012] (Word IDs, should match input)
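To make the encoder half concrete, here is an illustrative trace of tensor shapes through each layer, using the default sizes from the hyperparameter table (embedding_size=64, lstm_hidden_dim=64, encoder_dim=128). The function name is hypothetical and no real weights are involved; this only shows how dimensions change:

```python
# Shape trace of the encoder half for a message of seq_len word IDs.
# Sizes come from the hyperparameter defaults; purely illustrative.

def encoder_shapes(seq_len, embedding_size=64, lstm_hidden_dim=64, encoder_dim=128):
    """Return the output shape after each encoder layer."""
    return [
        ("input (word IDs)",   (seq_len,)),
        ("embedding",          (seq_len, embedding_size)),
        # final forward and backward LSTM states, concatenated:
        ("bidirectional LSTM", (2 * lstm_hidden_dim,)),
        ("bottleneck dense",   (encoder_dim,)),
    ]

for name, shape in encoder_shapes(seq_len=4):
    print(f"{name:20s} {shape}")
```

Note that the bidirectional LSTM collapses the variable-length sequence into a fixed-size state, which is what lets messages of any length end up as 128-D vectors.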
Training the Text Autoencoder
During training:
- Take a log message, e.g., "BlockReport processing took 120ms"
- Convert words to IDs, e.g., [1, 234, 567, 89] (one ID per word)
- Feed through encoder to get latent vector
- Feed latent vector through decoder to reconstruct message
- Compare reconstructed output to original input
- Calculate reconstruction loss (how wrong the reconstruction is)
- Update weights to reduce loss
- Repeat for thousands of messages
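The word-to-ID step at the start of that loop can be sketched in a few lines. This is a deliberately bare version: real tokenizers also handle casing, punctuation, and rare-word handling, and the function names here are illustrative:

```python
# Minimal sketch of the word -> ID step (steps 1-2 above).
# ID 0 is reserved for words never seen during training.

def build_vocab(messages):
    """Assign each distinct word in the training messages a small integer ID."""
    vocab = {"<unk>": 0}
    for msg in messages:
        for word in msg.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(msg, vocab):
    """Map a message to word IDs, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in msg.split()]

vocab = build_vocab(["BlockReport processing took 120ms",
                     "BlockReport processing took 95ms"])
print(encode("BlockReport processing took 120ms", vocab))  # [1, 2, 3, 4]
```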
After training, we throw away the decoder and keep just the encoder. We use it to compress every log message in the dataset into vectors.
Part 2: The Anomaly Detector (Bidirectional LSTM)
Why LSTM for Sequences?
LSTMs (Long Short-Term Memory networks) are specially designed for sequence data. Plain recurrent networks forget information quickly as gradients vanish over long sequences, but LSTMs use gated memory cells to maintain context across many steps.
Why this matters for logs:
- Memory: An LSTM can remember events from 100 steps ago
- Context: "authentication failed" means different things in different contexts
- Patterns: LSTMs learn sequences like: failure → retry → timeout → error message
Bidirectional Processing
The anomaly detector uses a bidirectional LSTM:
- Forward: Reads sequence left-to-right (past to present)
- Backward: Reads sequence right-to-left (present to past)
- Combined: Uses context from both directions
Why? Consider this sequence:
Log 1: "app started"
Log 2: "listening on port 8080"
Log 3: "memory usage: 45%"
Log 4: [ANOMALY] "unexpected error"
Log 5: "memory usage: 92%"
A forward-only LSTM might miss the anomaly at log 4 because it hasn't seen log 5 yet. A bidirectional LSTM sees that log 4 breaks the pattern established by the surrounding context.
Reconstruction Error as Anomaly Score
The detector works by making predictions:
- Given vectors [v1, v2, ..., v9], predict what v10 should be
- Compare prediction to actual v10
- Large error = anomaly, small error = normal
Mathematically:
anomaly_score = distance(predicted_v10, actual_v10)
distance can be:
• Euclidean: sqrt((p1-a1)² + (p2-a2)² + ... + (p128-a128)²)
• Manhattan: |p1-a1| + |p2-a2| + ... + |p128-a128|
• Cosine: 1 - (p·a) / (||p|| ||a||)
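The three candidate distance functions translate directly into plain Python (a production pipeline would typically use numpy or scipy equivalents):

```python
# The three distance options above, written out explicitly.
import math

def euclidean(p, a):
    return math.sqrt(sum((pi - ai) ** 2 for pi, ai in zip(p, a)))

def manhattan(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a))

def cosine(p, a):
    dot = sum(pi * ai for pi, ai in zip(p, a))
    norm_p = math.sqrt(sum(pi ** 2 for pi in p))
    norm_a = math.sqrt(sum(ai ** 2 for ai in a))
    return 1 - dot / (norm_p * norm_a)

predicted, actual = [1.0, 0.0], [0.0, 1.0]
print(euclidean(predicted, actual))  # 1.414...
print(manhattan(predicted, actual))  # 2.0
print(cosine(predicted, actual))     # 1.0 (orthogonal vectors)
```

Euclidean and Manhattan grow with vector magnitude, while cosine only measures direction; which behaves best as an anomaly score depends on how the latent space is scaled.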
Training the Anomaly Detector
Training data is normal sequences (from training logs). The goal: minimize reconstruction error on normal data, so any deviation signals an anomaly.
- Create 10-step sliding windows from training vectors
- For each window [v1...v9], predict v10
- Calculate loss: reconstruction error on v10
- Update LSTM weights to improve predictions
- Validate on separate validation data to prevent overfitting
- Save best model (lowest validation error)
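Step 1 of that list, turning the stream of encoded vectors into (window, target) training pairs, can be sketched like this. The function name is illustrative; window=10 matches sequence_length=10, so the model sees 9 vectors and predicts the 10th:

```python
# Sketch of sliding-window creation for anomaly-detector training.

def sliding_windows(vectors, window=10):
    """Split a vector stream into ([v1..v9], v10) input/target pairs."""
    pairs = []
    for i in range(len(vectors) - window + 1):
        chunk = vectors[i:i + window]
        pairs.append((chunk[:-1], chunk[-1]))  # inputs, target
    return pairs

# Toy example with scalar "vectors" for readability:
pairs = sliding_windows(list(range(1, 13)), window=10)
print(len(pairs))  # 3 windows from a stream of 12 vectors
print(pairs[0])    # ([1, 2, 3, 4, 5, 6, 7, 8, 9], 10)
```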
Why This Architecture?
DeepSentry combines autoencoders + LSTMs because:
| Component | Problem It Solves |
|---|---|
| Text Autoencoder | Log messages are variable-length text. Autoencoder converts to fixed-size vectors, removing noise. |
| LSTM Detector | Sequences of vectors show temporal patterns. LSTM detects deviations from normal patterns. |
| Bidirectional | Context before and after an event matters. Bidirectional processing captures full context. |
Key Hyperparameters
These control model behavior and are configurable in dockerconfig/:
| Parameter | Default | Impact |
|---|---|---|
| embedding_size | 64 | Size of word vectors. Larger = more expressive, slower training |
| encoder_dim | 128 | Size of latent vectors. Larger = more information preserved |
| lstm_hidden_dim | 64 | LSTM internal size. Larger = more capacity to learn patterns |
| sequence_length | 10 | Length of sequences for anomaly training. Longer = captures longer patterns |
| epochs | 10 | Training iterations. More = better learning, but risk of overfitting |
| batch_size | 32 | Samples per gradient update. Larger = faster epochs and smoother gradients, but may generalize worse |
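As a rough sketch, the defaults from the table might be collected in a single config object. The exact file format and key names used under dockerconfig/ may differ; this only shows the parameters and their types:

```python
# Hypothetical default configuration mirroring the table above.
DEFAULT_CONFIG = {
    "embedding_size": 64,    # word vector size
    "encoder_dim": 128,      # latent vector size
    "lstm_hidden_dim": 64,   # LSTM internal size
    "sequence_length": 10,   # window length for anomaly training
    "epochs": 10,            # training iterations
    "batch_size": 32,        # samples per gradient update
}

# Doubling model capacity for a larger log corpus might look like:
large_config = {**DEFAULT_CONFIG, "encoder_dim": 256, "lstm_hidden_dim": 128}
```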
Intuition: Why Reconstruction Error Works
The core assumption: Normal sequences follow predictable patterns.
Example normal patterns in HDFS logs:
- BlockReport usually followed by Got block report
- Got block report usually followed by Verification complete
- File deleted usually followed by Block report
When something anomalous happens:
- A sequence breaks learned patterns
- The LSTM can't predict well
- Reconstruction error spikes
Example anomaly: normally messages about file deletion don't appear in rapid succession. If they do, that's unusual, and the model can't predict it well.
Limitations and When to Consider Alternatives
- Seasonal patterns: If your system has patterns that repeat on hourly/daily/weekly cycles, consider including time features
- Concept drift: If your system fundamentally changes (new features, architecture), models trained on old data may not work well
- Very large logs: For billions of messages, consider sampling or hierarchical approaches
- Domain expertise: If you know specific attacks to detect, hybrid approaches combining rule-based + learned methods can be better