Model Architecture: Deep Dive
Overall Architecture
DeepSentry uses a two-stage approach:
STAGE 1: TEXT AUTOENCODER
━━━━━━━━━━━━━━━━━━━━━━━━━
Input Message
  "BlockReport processing took 120ms"
        ▼
[Embedding Layer]
        ▼
[LSTM Encoder (bidirectional)]
        ▼
[Bottleneck: 128-D Vector]
  v = [0.12, -0.45, 0.89, ...]
        ▼
[LSTM Decoder (reverses encoding)]
        ▼
[Output Layer]
        ▼
Reconstructed Message
  "BlockReport processing took 120ms"

STAGE 2: ANOMALY DETECTOR
━━━━━━━━━━━━━━━━━━━━━━━━━
Input Sequence (9 encoded vectors)
  [v1, v2, v3, v4, v5, v6, v7, v8, v9]
        ▼
[Bidirectional LSTM]
  ├─ Forward pass  ──→
  └─ Backward pass ←──
        ▼
[Dense Prediction Layer]
        ▼
Predicted next vector: v10'
        ▼
Calculate Error: distance(v10', v10)
  • Error < 0.1 = Normal ✓
  • Error > 0.5 = Anomaly ⚠
Part 1: The Text Autoencoder
What is an Autoencoder?
An autoencoder is a neural network that learns to compress and decompress data. It has two parts:
- Encoder: Takes input (a log message) and compresses it to a small vector
- Decoder: Takes the compressed vector and reconstructs the original message
During training, we feed the model text and ask it to reproduce the exact same text. By forcing the model through a narrow "bottleneck" (the compressed vector), it learns what features are important to represent the message.
Why Use an Autoencoder for Logs?
Log messages are high-dimensional (thousands of possible words) but low-rank (repetitive patterns). An autoencoder:
- Compresses: From ~5000-word vocabulary to 128-dim vector (40x compression)
- Learns semantics: Messages with similar meaning get similar vectors
- Unsupervised: Needs no labels, just learns from data
- Lossy: Intentionally loses unimportant details
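The compression figure above is simple arithmetic: a word represented as a one-hot vector over a ~5000-word vocabulary shrinks to a 128-float latent dimension. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the compression claim above.
vocab_size = 5000   # one-hot dimension for a ~5000-word vocabulary
latent_dim = 128    # autoencoder bottleneck size

ratio = vocab_size / latent_dim
print(f"{ratio:.0f}x compression")  # 39x, i.e. roughly the 40x quoted above
```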
Architecture Details
TEXT AUTOENCODER LAYERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ENCODER HALF:
Input: [Word IDs] e.g., [1234, 5678, 9012]
↓
Embedding: [Word Vectors] e.g., [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
↓
LSTM (forward): Processes left-to-right
↓
LSTM (backward): Processes right-to-left
↓
Bottleneck Dense Layer: Reduces to 128 dimensions
↓
Latent Vector: [0.12, -0.45, 0.89, ..., 0.23] (128 floats)
DECODER HALF:
Latent Vector: [0.12, -0.45, 0.89, ..., 0.23]
↓
Dense Layer: Expands back up
↓
LSTM: Generates output word-by-word
↓
Softmax: Probability over vocabulary for each position
↓
Output: [1234, 5678, 9012] (Word IDs, should match input)
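To make the encoder half concrete, here is an illustrative trace of tensor shapes through each layer, using the default sizes from the hyperparameter table (embedding_size=64, lstm_hidden_dim=64, encoder_dim=128). The function name is hypothetical and no real weights are involved; this only shows how dimensions change:

```python
# Shape trace of the encoder half for a message of seq_len word IDs.
# Sizes come from the hyperparameter defaults; purely illustrative.

def encoder_shapes(seq_len, embedding_size=64, lstm_hidden_dim=64, encoder_dim=128):
    """Return the output shape after each encoder layer."""
    return [
        ("input (word IDs)",   (seq_len,)),
        ("embedding",          (seq_len, embedding_size)),
        # final forward and backward LSTM states, concatenated:
        ("bidirectional LSTM", (2 * lstm_hidden_dim,)),
        ("bottleneck dense",   (encoder_dim,)),
    ]

for name, shape in encoder_shapes(seq_len=4):
    print(f"{name:20s} {shape}")
```

Note that the bidirectional LSTM collapses the variable-length sequence into a fixed-size state, which is what lets messages of any length end up as 128-D vectors.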
Training the Text Autoencoder
During training:
- Take a log message, e.g., "BlockReport processing took 120ms"
- Convert words to IDs, e.g., [1, 234, 567, 89] (one ID per word)
- Feed through encoder to get latent vector
- Feed latent vector through decoder to reconstruct message
- Compare reconstructed output to original input
- Calculate reconstruction loss (how wrong the reconstruction is)
- Update weights to reduce loss
- Repeat for thousands of messages
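The word-to-ID step at the start of that loop can be sketched in a few lines. This is a deliberately bare version: real tokenizers also handle casing, punctuation, and rare-word handling, and the function names here are illustrative:

```python
# Minimal sketch of the word -> ID step (steps 1-2 above).
# ID 0 is reserved for words never seen during training.

def build_vocab(messages):
    """Assign each distinct word in the training messages a small integer ID."""
    vocab = {"<unk>": 0}
    for msg in messages:
        for word in msg.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(msg, vocab):
    """Map a message to word IDs, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in msg.split()]

vocab = build_vocab(["BlockReport processing took 120ms",
                     "BlockReport processing took 95ms"])
print(encode("BlockReport processing took 120ms", vocab))  # [1, 2, 3, 4]
```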
After training, we throw away the decoder and keep just the encoder. We use it to compress every log message in the dataset into vectors.
Part 2: The Anomaly Detector (Bidirectional LSTM)
Why LSTM for Sequences?
LSTMs (Long Short-Term Memory networks) are specially designed for sequence data. Plain recurrent networks forget information quickly as gradients vanish over long sequences, but LSTMs use gated memory cells to maintain context across many steps.
Why this matters for logs:
- Memory: An LSTM can remember events from 100 steps ago
- Context: "authentication failed" means different things in different contexts
- Patterns: LSTMs learn sequences like: failure → retry → timeout → error message
Bidirectional Processing
The anomaly detector uses a bidirectional LSTM:
- Forward: Reads sequence left-to-right (past to present)
- Backward: Reads sequence right-to-left (present to past)
- Combined: Uses context from both directions
Why? Consider this sequence:
Log 1: "app started"
Log 2: "listening on port 8080"
Log 3: "memory usage: 45%"
Log 4: [ANOMALY] "unexpected error"
Log 5: "memory usage: 92%"
A forward-only LSTM might miss the anomaly at log 4 because it hasn't seen log 5 yet. A bidirectional LSTM sees that log 4 breaks the pattern established by the surrounding context.
Reconstruction Error as Anomaly Score
The detector works by making predictions:
- Given vectors [v1, v2, ..., v9], predict what v10 should be
- Compare prediction to actual v10
- Large error = anomaly, small error = normal
Mathematically:
anomaly_score = distance(predicted_v10, actual_v10)
distance can be:
• Euclidean: sqrt((p1-a1)² + (p2-a2)² + ... + (p128-a128)²)
• Manhattan: |p1-a1| + |p2-a2| + ... + |p128-a128|
• Cosine: 1 - (p·a) / (||p|| ||a||)
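The three candidate distance functions translate directly into plain Python (a production pipeline would typically use numpy or scipy equivalents):

```python
# The three distance options above, written out explicitly.
import math

def euclidean(p, a):
    return math.sqrt(sum((pi - ai) ** 2 for pi, ai in zip(p, a)))

def manhattan(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a))

def cosine(p, a):
    dot = sum(pi * ai for pi, ai in zip(p, a))
    norm_p = math.sqrt(sum(pi ** 2 for pi in p))
    norm_a = math.sqrt(sum(ai ** 2 for ai in a))
    return 1 - dot / (norm_p * norm_a)

predicted, actual = [1.0, 0.0], [0.0, 1.0]
print(euclidean(predicted, actual))  # 1.414...
print(manhattan(predicted, actual))  # 2.0
print(cosine(predicted, actual))     # 1.0 (orthogonal vectors)
```

Euclidean and Manhattan grow with vector magnitude, while cosine only measures direction; which behaves best as an anomaly score depends on how the latent space is scaled.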
Training the Anomaly Detector
Training data is normal sequences (from training logs). The goal: minimize reconstruction error on normal data, so any deviation signals an anomaly.
- Create 10-step sliding windows from training vectors
- For each window [v1...v9], predict v10
- Calculate loss: reconstruction error on v10
- Update LSTM weights to improve predictions
- Validate on separate validation data to prevent overfitting
- Save best model (lowest validation error)
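Step 1 of that list, turning the stream of encoded vectors into (window, target) training pairs, can be sketched like this. The function name is illustrative; window=10 matches sequence_length=10, so the model sees 9 vectors and predicts the 10th:

```python
# Sketch of sliding-window creation for anomaly-detector training.

def sliding_windows(vectors, window=10):
    """Split a vector stream into ([v1..v9], v10) input/target pairs."""
    pairs = []
    for i in range(len(vectors) - window + 1):
        chunk = vectors[i:i + window]
        pairs.append((chunk[:-1], chunk[-1]))  # inputs, target
    return pairs

# Toy example with scalar "vectors" for readability:
pairs = sliding_windows(list(range(1, 13)), window=10)
print(len(pairs))  # 3 windows from a stream of 12 vectors
print(pairs[0])    # ([1, 2, 3, 4, 5, 6, 7, 8, 9], 10)
```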
Why This Architecture?
DeepSentry combines autoencoders + LSTMs because:
| Component | Problem It Solves |
|---|---|
| Text Autoencoder | Log messages are variable-length text. Autoencoder converts to fixed-size vectors, removing noise. |
| LSTM Detector | Sequences of vectors show temporal patterns. LSTM detects deviations from normal patterns. |
| Bidirectional | Context before and after an event matters. Bidirectional processing captures full context. |
Key Hyperparameters
These control model behavior and are configurable in dockerconfig/:
| Parameter | Default | Impact |
|---|---|---|
| embedding_size | 64 | Size of word vectors. Larger = more expressive, slower training |
| encoder_dim | 128 | Size of latent vectors. Larger = more information preserved |
| lstm_hidden_dim | 64 | LSTM internal size. Larger = more capacity to learn patterns |
| sequence_length | 10 | Length of sequences for anomaly training. Longer = captures longer patterns |
| epochs | 10 | Training iterations. More = better learning, but risk of overfitting |
| batch_size | 32 | Samples per gradient update. Larger = faster epochs and smoother gradients, but may generalize worse |
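As a rough sketch, the defaults from the table might be collected in a single config object. The exact file format and key names used under dockerconfig/ may differ; this only shows the parameters and their types:

```python
# Hypothetical default configuration mirroring the table above.
DEFAULT_CONFIG = {
    "embedding_size": 64,    # word vector size
    "encoder_dim": 128,      # latent vector size
    "lstm_hidden_dim": 64,   # LSTM internal size
    "sequence_length": 10,   # window length for anomaly training
    "epochs": 10,            # training iterations
    "batch_size": 32,        # samples per gradient update
}

# Doubling model capacity for a larger log corpus might look like:
large_config = {**DEFAULT_CONFIG, "encoder_dim": 256, "lstm_hidden_dim": 128}
```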
Intuition: Why Reconstruction Error Works
The core assumption: Normal sequences follow predictable patterns.
Example normal patterns in HDFS logs:
- BlockReport usually followed by Got block report
- Got block report usually followed by Verification complete
- File deleted usually followed by Block report
When something anomalous happens:
- A sequence breaks learned patterns
- The LSTM can't predict well
- Reconstruction error spikes
Example anomaly: normally messages about file deletion don't appear in rapid succession. If they do, that's unusual, and the model can't predict it well.
Limitations and When to Consider Alternatives
- Seasonal patterns: If your system has patterns that repeat on hourly/daily/weekly cycles, consider including time features
- Concept drift: If your system fundamentally changes (new features, architecture), models trained on old data may not work well
- Very large logs: For billions of messages, consider sampling or hierarchical approaches
- Domain expertise: If you know specific attacks to detect, hybrid approaches combining rule-based + learned methods can be better