Configuration: Tuning the System
Configuration Overview
All configuration happens through YAML files in dockerconfig/. Each stage of the pipeline has its own config file:
| Stage | Config File | Purpose |
|---|---|---|
| Prepare | text_autoencoder_prepare_data.yml | Data preparation parameters |
| Text Training | text_autoencoder_train.yml | Text autoencoder architecture |
| Text Encoding | text_autoencoder_dataset_encoder.yml | Encoding the full dataset |
| Anomaly Training | anomaly_detector_train.yml | Anomaly detector architecture |
| Anomaly Eval | anomaly_detector_eval.yml | Test-time evaluation |
| Labeled Eval | anomaly_detector_eval_labeled.yml | Evaluation with ground truth |
| Live Monitoring | live_monitoring_config.yml | Real-time detection parameters |
| Logging | logging_config.yml | Output and debugging options |
Understanding Config Files
Config files use YAML format with placeholders and typed values:
parameters:
  root_dir: {ROOT_DATA_DIR}     # Replaced at runtime
  name: "my_experiment"         # String
  epochs: 10                    # Integer
  dropout: 0.2                  # Float
  bidirectional: true           # Boolean
  paths:                        # List
    - /path/one
    - /path/two
  nested:                       # Dictionary
    key1: value1
    key2: value2
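As a quick illustration (not project code), parsing such a file with PyYAML yields native Python types for each entry:

import yaml

raw = """
parameters:
  name: "my_experiment"
  epochs: 10
  dropout: 0.2
  bidirectional: true
  paths:
    - /path/one
    - /path/two
"""

params = yaml.safe_load(raw)["parameters"]
print(type(params["epochs"]))         # <class 'int'>
print(type(params["dropout"]))        # <class 'float'>
print(type(params["bidirectional"]))  # <class 'bool'>
print(params["paths"])                # ['/path/one', '/path/two']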
The Config Loader
src/config_loader.py handles all config loading. It:
- Loads the YAML file
- Replaces the {ROOT_DATA_DIR} placeholder with the actual path
- Validates that required keys are present
- Provides easy dictionary access to all values
Usage in code:
from src.config_loader import Config
config = Config.load_from_file(
"dockerconfig/text_autoencoder_train.yml",
root_data_dir="/path/to/data"
)
epochs = config.get("epochs", default=10)
embedding_size = config.get("embedding_size", default=64)
output_dir = config.get_path("output_model_dir")
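If you want a mental model of what the loader does, the sketch below reproduces the behavior listed above. It is illustrative only: the required-key set and internal details of src/config_loader.py are assumptions, not the real implementation.

import yaml

class Config:
    """Simplified stand-in for src/config_loader.py (illustrative only)."""

    REQUIRED_KEYS = ("root_dir",)  # assumption; the real loader defines its own set

    def __init__(self, values):
        self._values = values

    @classmethod
    def load_from_file(cls, path, root_data_dir):
        # Substitute the placeholder before parsing, then validate required keys.
        with open(path) as f:
            text = f.read().replace("{ROOT_DATA_DIR}", root_data_dir)
        values = yaml.safe_load(text)["parameters"]
        missing = [key for key in cls.REQUIRED_KEYS if key not in values]
        if missing:
            raise KeyError(f"Missing required config keys: {missing}")
        return cls(values)

    def get(self, key, default=None):
        return self._values.get(key, default)

    def get_path(self, key):
        return str(self._values[key])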
Key Configuration Parameters
Text Autoencoder Prepare
# text_autoencoder_prepare_data.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  raw_log_file: "{ROOT_DATA_DIR}/HDFS_1.log"
  raw_log_labels: "{ROOT_DATA_DIR}/labels.txt"
  # Data splitting
  training_log_data_fraction: 0.8   # 80% for training
  # Output locations
  prepared_data_dir: "{ROOT_DATA_DIR}/prepared_data"
  # Vocabulary limits
  min_word_count: 1                 # Include words appearing at least once
  max_vocab_size: 10000             # Limit to top 10k words
Key parameters:
- training_log_data_fraction: Split ratio (0.8 = 80% train, 20% validate)
- min_word_count: Ignore rare words. Higher = smaller vocabulary, faster training
- max_vocab_size: Hard limit on vocabulary size. Prevents memory issues on huge datasets
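To make the two vocabulary limits concrete, here is a rough sketch of how they could be applied when building the vocabulary; the real preparation code may differ in details such as special tokens:

from collections import Counter

def build_vocab(tokenized_logs, min_word_count=1, max_vocab_size=10000):
    """Keep words seen at least min_word_count times, capped at max_vocab_size entries."""
    counts = Counter(word for line in tokenized_logs for word in line)
    kept = [word for word, count in counts.most_common() if count >= min_word_count]
    return {word: index for index, word in enumerate(kept[:max_vocab_size])}

vocab = build_vocab([["Received", "block", "blk_123"], ["Received", "block", "blk_456"]])
print(len(vocab))  # 4 distinct words survive the filters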
Text Autoencoder Training
# text_autoencoder_train.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Architecture
  embedding_size: 64            # Size of word vectors
  encoder_hidden_size: 128      # LSTM hidden dimension
  # Bottleneck (latent representation)
  encoder_output_size: 128      # Compressed vector size
  # Training
  epochs: 10                    # Full passes through data
  batch_size: 32                # Samples per update
  dropout: 0.2                  # Regularization (prevents overfitting)
  # Learning
  learning_rate: 0.001          # Gradient descent step size
  validation_split: 0.2         # Use 20% of data for validation
  # Bidirectional option
  bidirectional: false          # Set true for a forward+backward LSTM
  # Checkpointing
  save_best_only: true          # Keep only the best model during training
  # Output
  output_model_dir: "{ROOT_DATA_DIR}/text_autoencoder_model"
Tuning guide:
- embedding_size: Larger = more expressive but slower. Try 32-256.
- encoder_output_size: The bottleneck compression. Larger = less loss. Try 64-256.
- epochs: How many passes through training data. More = better but slower. Try 5-20.
- dropout: Prevents overfitting. Try 0.1-0.5. Higher = more regularization.
- bidirectional: Use forward+backward for better understanding. Slower but better quality.
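The saved models use the .h5 format, which points at a Keras implementation. As a sketch only (the layer choices and vocab_size are assumptions, not the project's actual architecture), these parameters could map onto an encoder like this:

import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(vocab_size, embedding_size=64, encoder_hidden_size=128,
                  encoder_output_size=128, dropout=0.2, bidirectional=False):
    """Encode a sequence of token ids into a single fixed-size latent vector."""
    tokens = layers.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(vocab_size, embedding_size, mask_zero=True)(tokens)
    recurrent = layers.LSTM(encoder_hidden_size, dropout=dropout)
    if bidirectional:
        recurrent = layers.Bidirectional(recurrent)  # forward + backward pass
    x = recurrent(x)
    latent = layers.Dense(encoder_output_size)(x)    # the bottleneck
    return tf.keras.Model(tokens, latent)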
Anomaly Detector Training
# anomaly_detector_train.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Architecture
  lstm_hidden_size: 64          # LSTM internal dimension
  num_layers: 1                 # Number of LSTM layers
  dropout: 0.2                  # Dropout between layers
  # Sequence modeling
  sequence_length: 10           # Length of windows for training
  bidirectional: true           # Forward + backward processing
  # Training
  epochs: 10
  batch_size: 32
  learning_rate: 0.001
  # Checkpointing
  save_best_only: true
  # Input/output
  encoded_data_dir: "{ROOT_DATA_DIR}/encoded_dataset"
  output_model_dir: "{ROOT_DATA_DIR}/anomaly_trained_model"
Tuning guide:
- lstm_hidden_size: Capacity to learn patterns. Larger = more learning capacity but slower. Try 32-256.
- sequence_length: How many past vectors to use. Longer = captures longer patterns. Try 5-50.
- num_layers: Stack multiple LSTMs. Try 1-3. More = more expressive but harder to train.
- bidirectional: Usually true for anomaly detection (use context both ways).
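To see what sequence_length controls, here is a sketch of grouping the encoded log vectors into fixed-length windows for the detector; the project's actual data loader may do this differently:

import numpy as np

def make_windows(encoded_vectors, sequence_length=10):
    """Turn (num_logs, dim) encoded vectors into (num_windows, sequence_length, dim)."""
    num_windows = len(encoded_vectors) - sequence_length + 1
    return np.stack([encoded_vectors[i:i + sequence_length] for i in range(num_windows)])

windows = make_windows(np.random.rand(500, 128), sequence_length=10)
print(windows.shape)  # (491, 10, 128)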
Anomaly Evaluation
# anomaly_detector_eval.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Models to use
  text_encoder_model: "{ROOT_DATA_DIR}/text_autoencoder_model/encoder.h5"
  anomaly_model: "{ROOT_DATA_DIR}/anomaly_trained_model/detector.h5"
  # Test data
  encoded_test_data: "{ROOT_DATA_DIR}/encoded_dataset/test_encoded.pkl"
  # Scoring parameters
  sequence_length: 10           # Must match training length
  threshold: 0.5                # For binary classification
  # Output
  output_dir: "{ROOT_DATA_DIR}/eval_results"
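Applying the threshold (and, in the labeled evaluation, comparing against ground truth) amounts to the following; the scores and labels here are assumed inputs, not the project's file formats:

import numpy as np

def score_to_metrics(scores, labels, threshold=0.5):
    """Binarize anomaly scores and compute precision/recall against ground truth."""
    preds = (np.asarray(scores) > threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}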
Live Monitoring
# live_monitoring_config.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Models
  text_encoder_model: "{ROOT_DATA_DIR}/text_autoencoder_model/encoder.h5"
  anomaly_model: "{ROOT_DATA_DIR}/anomaly_trained_model/detector.h5"
  vocabulary: "{ROOT_DATA_DIR}/prepared_data/vocab.pkl"
  # Input log stream
  log_file: "/var/log/myapp.log"   # Or /dev/stdin for piping
  tail_mode: true                  # Start from end of file
  # Scoring
  sequence_length: 10
  window_size: 100                 # Sliding window for baseline stats
  threshold: 2.5                   # Alert if score > 2.5 std devs above mean
  # Output
  output_file: "/var/log/anomalies.log"
  verbose: true                    # Print all scores, not just alerts
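Note that this threshold is expressed in standard deviations over a rolling baseline, not as an absolute score. A rough sketch of that alerting logic (the real monitor's I/O and alert format will differ):

from collections import deque
import statistics

def alert_stream(scores, window_size=100, threshold=2.5):
    """Yield scores more than `threshold` std devs above the rolling mean."""
    recent = deque(maxlen=window_size)
    for score in scores:
        if len(recent) >= 2:
            mean = statistics.fmean(recent)
            std = statistics.pstdev(recent) or 1e-9
            if (score - mean) / std > threshold:
                yield score
        recent.append(score)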
Customizing for Your Logs
Example: Web Server Logs
You have web server logs like:
220115 14:23:45 GET /api/users 200 45ms
220115 14:23:46 POST /api/users 201 120ms
220115 14:23:47 GET /api/health 200 2ms
Recommended configuration adjustments:
# Shorter sequences (web requests are short-lived)
sequence_length: 5 # Instead of 10
# Smaller models (logs are consistent)
lstm_hidden_size: 32 # Instead of 64
embedding_size: 32 # Instead of 64
# More aggressive anomaly threshold
threshold: 2.0 # Requests are mostly uniform, easy to detect deviations
# Shorter training period
epochs: 5 # 1 day of logs is enough
batch_size: 64 # Larger batches for faster training on uniform logs
Example: Distributed System Logs (Like HDFS)
You have complex multi-node logs where messages come from different components:
# Longer sequences (distributed systems have longer patterns)
sequence_length: 20 # Instead of 10
# Larger models (complex patterns)
lstm_hidden_size: 128 # Instead of 64
embedding_size: 128 # Instead of 64
encoder_output_size: 256 # Larger bottleneck
# More nuanced anomaly threshold
threshold: 3.0 # Distributed systems have more variance
# Longer training period
epochs: 20 # More passes needed to learn behavior across all nodes
bidirectional: true # Important to understand context
batch_size: 16 # Smaller batches for stability
Best Practices for Configuration
- Start with defaults, then tune based on results
- Use smaller datasets for initial experiments
- Track which config changes improve performance
- Don't over-tune on test data (that's data leakage)
- Document changes in git
Parameter Tuning Workflow
- Baseline: Run with default config
- Experiment: Change one parameter at a time
- Evaluate: Check metrics on validation set
- Decide: Keep change if better, revert if worse
- Repeat: Try next parameter
Common Tuning Scenarios
Problem: Model training is slow
- Reduce epochs
- Reduce hidden_size
- Increase batch_size
- Use GPU (Chapter 2)
Problem: Model is overfitting (bad validation performance)
- Increase dropout
- Reduce hidden_size
- Reduce epochs
- Reduce encoder_output_size
Problem: Too many false positives (alerts for normal events)
- Increase threshold
- Increase sequence_length (longer context)
- More training data (ensure it's clean)
Problem: Missed anomalies (low detection rate)
- Decrease threshold
- Increase embedding_size / hidden_size
- Increase epochs
- Check for anomalies in training data