Configuration: Tuning the System
Configuration Overview
All configuration happens through YAML files in dockerconfig/. Each stage of the pipeline has its own config file:
| Stage | Config File | Purpose |
|---|---|---|
| Prepare | text_autoencoder_prepare_data.yml | Data preparation parameters |
| Text Training | text_autoencoder_train.yml | Text autoencoder architecture |
| Text Encoding | text_autoencoder_dataset_encoder.yml | Encoding the full dataset |
| Anomaly Training | anomaly_detector_train.yml | Anomaly detector architecture |
| Anomaly Eval | anomaly_detector_eval.yml | Test-time evaluation |
| Labeled Eval | anomaly_detector_eval_labeled.yml | Evaluation with ground truth |
| Live Monitoring | live_monitoring_config.yml | Real-time detection parameters |
| Logging | logging_config.yml | Output and debugging options |
Understanding Config Files
Config files use YAML format with placeholders and typed values:
parameters:
  root_dir: {ROOT_DATA_DIR}     # Replaced at runtime
  name: "my_experiment"         # String
  epochs: 10                    # Integer
  dropout: 0.2                  # Float
  bidirectional: true           # Boolean
  paths:                        # List
    - /path/one
    - /path/two
  nested:                       # Dictionary
    key1: value1
    key2: value2
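As a quick illustration (not project code), parsing such a file with PyYAML yields native Python types for each entry:

import yaml

raw = """
parameters:
  name: "my_experiment"
  epochs: 10
  dropout: 0.2
  bidirectional: true
  paths:
    - /path/one
    - /path/two
"""

params = yaml.safe_load(raw)["parameters"]
print(type(params["epochs"]))         # <class 'int'>
print(type(params["dropout"]))        # <class 'float'>
print(type(params["bidirectional"]))  # <class 'bool'>
print(params["paths"])                # ['/path/one', '/path/two']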
The Config Loader
src/config_loader.py handles all config loading. It:
- Loads the YAML file
- Replaces the {ROOT_DATA_DIR} placeholder with the actual path
- Validates that required keys are present
- Provides easy dictionary access to all values
Usage in code:
from src.config_loader import Config
config = Config.load_from_file(
"dockerconfig/text_autoencoder_train.yml",
root_data_dir="/path/to/data"
)
epochs = config.get("epochs", default=10)
embedding_size = config.get("embedding_size", default=64)
output_dir = config.get_path("output_model_dir")
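If you want a mental model of what the loader does, the sketch below reproduces the behavior listed above. It is illustrative only: the required-key set and internal details of src/config_loader.py are assumptions, not the real implementation.

import yaml

class Config:
    """Simplified stand-in for src/config_loader.py (illustrative only)."""

    REQUIRED_KEYS = ("root_dir",)  # assumption; the real loader defines its own set

    def __init__(self, values):
        self._values = values

    @classmethod
    def load_from_file(cls, path, root_data_dir):
        # Substitute the placeholder before parsing, then validate required keys.
        with open(path) as f:
            text = f.read().replace("{ROOT_DATA_DIR}", root_data_dir)
        values = yaml.safe_load(text)["parameters"]
        missing = [key for key in cls.REQUIRED_KEYS if key not in values]
        if missing:
            raise KeyError(f"Missing required config keys: {missing}")
        return cls(values)

    def get(self, key, default=None):
        return self._values.get(key, default)

    def get_path(self, key):
        return str(self._values[key])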
Key Configuration Parameters
Text Autoencoder Prepare
# text_autoencoder_prepare_data.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  raw_log_file: "{ROOT_DATA_DIR}/HDFS_1.log"
  raw_log_labels: "{ROOT_DATA_DIR}/labels.txt"
  # Data splitting
  training_log_data_fraction: 0.8   # 80% for training
  # Output locations
  prepared_data_dir: "{ROOT_DATA_DIR}/prepared_data"
  # Vocabulary limits
  min_word_count: 1                 # Include words appearing at least once
  max_vocab_size: 10000             # Limit to top 10k words
Key parameters:
- training_log_data_fraction: Split ratio (0.8 = 80% train, 20% validate)
- min_word_count: Ignore rare words. Higher = smaller vocabulary, faster training
- max_vocab_size: Hard limit on vocabulary size. Prevents memory issues on huge datasets
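To make the two vocabulary limits concrete, here is a rough sketch of how they could be applied when building the vocabulary; the real preparation code may differ in details such as special tokens:

from collections import Counter

def build_vocab(tokenized_logs, min_word_count=1, max_vocab_size=10000):
    """Keep words seen at least min_word_count times, capped at max_vocab_size entries."""
    counts = Counter(word for line in tokenized_logs for word in line)
    kept = [word for word, count in counts.most_common() if count >= min_word_count]
    return {word: index for index, word in enumerate(kept[:max_vocab_size])}

vocab = build_vocab([["Received", "block", "blk_123"], ["Received", "block", "blk_456"]])
print(len(vocab))  # 4 distinct words survive the filters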
Text Autoencoder Training
# text_autoencoder_train.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Architecture
  embedding_size: 64            # Size of word vectors
  encoder_hidden_size: 128      # LSTM hidden dimension
  # Bottleneck (latent representation)
  encoder_output_size: 128      # Compressed vector size
  # Training
  epochs: 10                    # Full passes through data
  batch_size: 32                # Samples per update
  dropout: 0.2                  # Regularization (prevents overfitting)
  # Learning
  learning_rate: 0.001          # Gradient descent step size
  validation_split: 0.2         # Use 20% of data for validation
  # Bidirectional option
  bidirectional: false          # Set true for a forward+backward LSTM
  # Checkpointing
  save_best_only: true          # Keep only the best model during training
  # Output
  output_model_dir: "{ROOT_DATA_DIR}/text_autoencoder_model"
Tuning guide:
- embedding_size: Larger = more expressive but slower. Try 32-256.
- encoder_output_size: The bottleneck compression. Larger = less loss. Try 64-256.
- epochs: How many passes through training data. More = better but slower. Try 5-20.
- dropout: Prevents overfitting. Try 0.1-0.5. Higher = more regularization.
- bidirectional: Use forward+backward for better understanding. Slower but better quality.
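The saved models use the .h5 format, which points at a Keras implementation. As a sketch only (the layer choices and vocab_size are assumptions, not the project's actual architecture), these parameters could map onto an encoder like this:

import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(vocab_size, embedding_size=64, encoder_hidden_size=128,
                  encoder_output_size=128, dropout=0.2, bidirectional=False):
    """Encode a sequence of token ids into a single fixed-size latent vector."""
    tokens = layers.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(vocab_size, embedding_size, mask_zero=True)(tokens)
    recurrent = layers.LSTM(encoder_hidden_size, dropout=dropout)
    if bidirectional:
        recurrent = layers.Bidirectional(recurrent)  # forward + backward pass
    x = recurrent(x)
    latent = layers.Dense(encoder_output_size)(x)    # the bottleneck
    return tf.keras.Model(tokens, latent)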
Anomaly Detector Training
# anomaly_detector_train.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Architecture
  lstm_hidden_size: 64          # LSTM internal dimension
  num_layers: 1                 # Number of LSTM layers
  dropout: 0.2                  # Dropout between layers
  # Sequence modeling
  sequence_length: 10           # Length of windows for training
  bidirectional: true           # Forward + backward processing
  # Training
  epochs: 10
  batch_size: 32
  learning_rate: 0.001
  # Checkpointing
  save_best_only: true
  # Input/output
  encoded_data_dir: "{ROOT_DATA_DIR}/encoded_dataset"
  output_model_dir: "{ROOT_DATA_DIR}/anomaly_trained_model"
Tuning guide:
- lstm_hidden_size: Capacity to learn patterns. Larger = more learning capacity but slower. Try 32-256.
- sequence_length: How many past vectors to use. Longer = captures longer patterns. Try 5-50.
- num_layers: Stack multiple LSTMs. Try 1-3. More = more expressive but harder to train.
- bidirectional: Usually true for anomaly detection (use context both ways).
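To see what sequence_length controls, here is a sketch of grouping the encoded log vectors into fixed-length windows for the detector; the project's actual data loader may do this differently:

import numpy as np

def make_windows(encoded_vectors, sequence_length=10):
    """Turn (num_logs, dim) encoded vectors into (num_windows, sequence_length, dim)."""
    num_windows = len(encoded_vectors) - sequence_length + 1
    return np.stack([encoded_vectors[i:i + sequence_length] for i in range(num_windows)])

windows = make_windows(np.random.rand(500, 128), sequence_length=10)
print(windows.shape)  # (491, 10, 128)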
Anomaly Evaluation
# anomaly_detector_eval.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Models to use
  text_encoder_model: "{ROOT_DATA_DIR}/text_autoencoder_model/encoder.h5"
  anomaly_model: "{ROOT_DATA_DIR}/anomaly_trained_model/detector.h5"
  # Test data
  encoded_test_data: "{ROOT_DATA_DIR}/encoded_dataset/test_encoded.pkl"
  # Scoring parameters
  sequence_length: 10           # Must match training length
  threshold: 0.5                # For binary classification
  # Output
  output_dir: "{ROOT_DATA_DIR}/eval_results"
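Applying the threshold (and, in the labeled evaluation, comparing against ground truth) amounts to the following; the scores and labels here are assumed inputs, not the project's file formats:

import numpy as np

def score_to_metrics(scores, labels, threshold=0.5):
    """Binarize anomaly scores and compute precision/recall against ground truth."""
    preds = (np.asarray(scores) > threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}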
Live Monitoring
# live_monitoring_config.yml
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Models
  text_encoder_model: "{ROOT_DATA_DIR}/text_autoencoder_model/encoder.h5"
  anomaly_model: "{ROOT_DATA_DIR}/anomaly_trained_model/detector.h5"
  vocabulary: "{ROOT_DATA_DIR}/prepared_data/vocab.pkl"
  # Input log stream
  log_file: "/var/log/myapp.log"   # Or /dev/stdin for piping
  tail_mode: true                  # Start from end of file
  # Scoring
  sequence_length: 10
  window_size: 100                 # Sliding window for baseline stats
  threshold: 2.5                   # Alert if score > 2.5 std devs above mean
  # Output
  output_file: "/var/log/anomalies.log"
  verbose: true                    # Print all scores, not just alerts
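Note that this threshold is expressed in standard deviations over a rolling baseline, not as an absolute score. A rough sketch of that alerting logic (the real monitor's I/O and alert format will differ):

from collections import deque
import statistics

def alert_stream(scores, window_size=100, threshold=2.5):
    """Yield scores more than `threshold` std devs above the rolling mean."""
    recent = deque(maxlen=window_size)
    for score in scores:
        if len(recent) >= 2:
            mean = statistics.fmean(recent)
            std = statistics.pstdev(recent) or 1e-9
            if (score - mean) / std > threshold:
                yield score
        recent.append(score)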
Customizing for Your Logs
Example: Web Server Logs
You have web server logs like:
220115 14:23:45 GET /api/users 200 45ms
220115 14:23:46 POST /api/users 201 120ms
220115 14:23:47 GET /api/health 200 2ms
Recommended configuration adjustments:
# Shorter sequences (web requests are short-lived)
sequence_length: 5 # Instead of 10
# Smaller models (logs are consistent)
lstm_hidden_size: 32 # Instead of 64
embedding_size: 32 # Instead of 64
# More aggressive anomaly threshold
threshold: 2.0 # Requests are mostly uniform, easy to detect deviations
# Shorter training period
epochs: 5 # 1 day of logs is enough
batch_size: 64 # Larger batches for faster training on uniform logs
Example: Distributed System Logs (Like HDFS)
You have complex multi-node logs where messages come from different components:
# Longer sequences (distributed systems have longer patterns)
sequence_length: 20 # Instead of 10
# Larger models (complex patterns)
lstm_hidden_size: 128 # Instead of 64
embedding_size: 128 # Instead of 64
encoder_output_size: 256 # Larger bottleneck
# More nuanced anomaly threshold
threshold: 3.0 # Distributed systems have more variance
# Longer training period
epochs: 20 # More passes needed to learn behavior across all nodes
bidirectional: true # Important to understand context
batch_size: 16 # Smaller batches for stability
Best Practices for Configuration
- Start with defaults, then tune based on results
- Use smaller datasets for initial experiments
- Track which config changes improve performance
- Don't over-tune on test data (that's data leakage)
- Document changes in git
Parameter Tuning Workflow
- Baseline: Run with default config
- Experiment: Change one parameter at a time
- Evaluate: Check metrics on validation set
- Decide: Keep change if better, revert if worse
- Repeat: Try next parameter
Common Tuning Scenarios
Problem: Model training is slow
- Reduce epochs
- Reduce hidden_size
- Increase batch_size
- Use GPU (Chapter 2)
Problem: Model is overfitting (bad validation performance)
- Increase dropout
- Reduce hidden_size
- Reduce epochs
- Reduce encoder_output_size
Problem: Too many false positives (alerts for normal events)
- Increase threshold
- Increase sequence_length (longer context)
- More training data (ensure it's clean)
Problem: Missed anomalies (low detection rate)
- Decrease threshold
- Increase embedding_size / hidden_size
- Increase epochs
- Check for anomalies in training data