Troubleshooting: Debugging and Problem Solving
General Debugging Checklist
When something fails, work through this checklist first:
- Check the error message carefully (often tells you exactly what's wrong)
- Verify files exist: `ls -la /path/to/file`
- Check disk space: `df -h`
- Check memory: `free -h`
- Enable verbose mode: add `--verbose` to commands
- Check logs in the output directory
- Try with a smaller dataset first
- Check configuration file syntax (valid YAML; see the snippet below)
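For the YAML check, a minimal sketch using PyYAML (assuming PyYAML is installed; substitute the path of the config you want to validate):
python -c "
import yaml
# Prints a parse error and traceback if the file is not valid YAML
with open('dockerconfig/text_autoencoder_prepare_data.yml') as f:
    yaml.safe_load(f)
print('YAML OK')
"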
Installation Issues
Error: ModuleNotFoundError: No module named 'tensorflow'
Cause: TensorFlow not installed or wrong Python environment.
Solution:
# Check which Python is active
which python
python --version
# If using venv, ensure it's activated
source venv/bin/activate
echo $VIRTUAL_ENV # Should show venv path
# Reinstall requirements
pip install --upgrade pip
pip install -r requirements.txt
# Verify TensorFlow
python -c "import tensorflow; print(tensorflow.__version__)"
Error: ImportError: libcuda.so.11.0: cannot open shared object file
Cause: CUDA not installed or incompatible version.
Solution:
# Check if GPU is available
nvidia-smi # Should show GPU info
# If GPU not found, use CPU-only TensorFlow
pip uninstall tensorflow -y
pip install tensorflow-cpu
# Or install GPU version with correct CUDA
pip install "tensorflow[and-cuda]"  # quoted so the shell doesn't expand the brackets
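To confirm what TensorFlow can actually see after reinstalling:
python -c "
import tensorflow as tf
# An empty list means TensorFlow is running CPU-only
print(tf.config.list_physical_devices('GPU'))
"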
Error: Docker: permission denied while trying to connect to the Docker daemon
Cause: Docker daemon not running or user not in docker group.
Solution:
# Start Docker daemon
sudo systemctl start docker
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker  # or log out and back in so the group change takes effect
# Verify
docker ps
Data Preparation Errors
Error: FileNotFoundError: [Errno 2] No such file or directory: 'HDFS_1.log'
Cause: Log file not in the configured location.
Solution:
# Check if file exists
ls -la /path/to/HDFS_1.log
# Check config file for correct path
grep "raw_log_file:" dockerconfig/text_autoencoder_prepare_data.yml
# Verify {ROOT_DATA_DIR} is being replaced correctly
python -c "from src.config_loader import Config; \
c = Config.load_from_file('dockerconfig/text_autoencoder_prepare_data.yml', \
root_data_dir='/path/to/data'); \
print(c.get('raw_log_file'))"
Error: ValueError: Expected 1D array, got 2D array instead
Cause: Log file format issue, missing messages, or malformed timestamps.
Solution:
# Check log file format (should be: YYMMDD HHMMSS MESSAGE)
head -5 /path/to/HDFS_1.log
# Count lines
wc -l /path/to/HDFS_1.log
# Check for empty lines
grep -c "^$" /path/to/HDFS_1.log
# Remove empty lines if any
sed -i '/^$/d' /path/to/HDFS_1.log
# Validate format with Python
python -c "
with open('HDFS_1.log') as f:
for i, line in enumerate(f):
if not line.strip():
print(f'Empty line at {i}')
parts = line.split(' ', 2)
if len(parts) < 3:
print(f'Bad format at line {i}: {line[:50]}')
if i > 100: break
"
Error: MemoryError during data preparation
Cause: Dataset too large to fit in memory.
Solution:
# Use a smaller dataset for testing
head -100000 /path/to/HDFS_1.log > /tmp/small.log
# Or increase system swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Check available memory
free -h
Training Issues
Error: RuntimeError: CUDA out of memory
Cause: GPU memory insufficient for batch size.
Solution:
# Reduce batch size in config
# Change batch_size from 32 to 16 or 8
# Or use CPU instead
DEEPSENTRY_GPU=0 python src/tx/train.py ...
# Or reduce model size
# Change embedding_size from 128 to 64
# Change lstm_hidden_size from 128 to 64
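If a smaller batch still overflows GPU memory, enabling memory growth sometimes helps. A sketch assuming the trainer builds its model through Keras/TensorFlow; these lines belong near the top of the training script, before any model is created:
# Allocate GPU memory on demand instead of reserving it all up front
import tensorflow as tf
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)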
Error: NaN loss during training
Cause: Learning rate too high, bad data, or numerical instability.
Solution:
# Reduce learning rate in config
# Change learning_rate from 0.001 to 0.0001
# Check for NaN or Inf in data
python -c "
import numpy as np
import pickle
with open('prepared_data/train.txt') as f:
for line in f:
if 'nan' in line.lower() or 'inf' in line.lower():
print(f'Found: {line}')
"
# Check data statistics (the random array below is a placeholder; substitute your real data)
python -c "
import numpy as np
data = np.random.randn(1000)  # placeholder: replace with your actual feature array
print(f'Mean: {np.mean(data)}, Std: {np.std(data)}, Min: {np.min(data)}, Max: {np.max(data)}')
"
Error: Model validation loss not improving
Cause: Model is not learning. Could be learning rate, model size, or data quality.
Solution:
# Try different learning rates
# 0.0001 (very small) → 0.001 → 0.01 (very large)
# Increase model capacity
# embedding_size: 64 → 128 → 256
# encoder_hidden_size: 128 → 256
# Check if the dataset is too small
# You generally need thousands of samples for good training
# Dump the raw training log (a plotting sketch follows below)
python -c "
with open('text_autoencoder_model/training_log.txt') as f:
    for line in f:
        print(line, end='')
"
Encoding Issues
Error: Index out of range in encoder
Cause: New words in test data that weren't in training vocabulary.
Solution:
# Ensure test data uses same log format and source
# Or update vocab with unknown word handling
# (This requires code changes; consider retraining on combined data)
# Check vocabulary size
python -c "
import pickle
with open('prepared_data/vocab.pkl', 'rb') as f:
vocab = pickle.load(f)
print(f'Vocabulary size: {len(vocab)}')
print(f'Sample words: {list(vocab.items())[:10]}')
"
Anomaly Detection Issues
Error: Sequence length mismatch
Cause: Training used different sequence_length than evaluation.
Solution:
# Check training config
grep "sequence_length:" dockerconfig/anomaly_detector_train.yml
# Check evaluation config
grep "sequence_length:" dockerconfig/anomaly_detector_eval.yml
# The two values must match; if they don't, retrain with the correct length
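To compare the two values directly, assuming sequence_length is a top-level key in both configs:
python -c "
import yaml
train_cfg = yaml.safe_load(open('dockerconfig/anomaly_detector_train.yml'))
eval_cfg = yaml.safe_load(open('dockerconfig/anomaly_detector_eval.yml'))
# Both printed values should be identical
print(train_cfg.get('sequence_length'), eval_cfg.get('sequence_length'))
"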
Error: No anomalies detected (all scores below threshold)
Cause: Model thinks everything is normal, or threshold is too high.
Debugging:
# Check anomaly score distribution
python -c "
import pickle
import numpy as np
with open('eval_results/anomaly_scores.pkl', 'rb') as f:
scores = pickle.load(f)
print(f'Min: {np.min(scores)}')
print(f'Max: {np.max(scores)}')
print(f'Mean: {np.mean(scores)}')
print(f'Std: {np.std(scores)}')
print(f'Median: {np.median(scores)}')
"
# Try lower threshold
# threshold: 2.5 → 1.5 or 1.0
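One way to pick a starting threshold from the distribution itself (a heuristic sketch; the right percentile depends on your expected anomaly rate):
python -c "
import pickle
import numpy as np
with open('eval_results/anomaly_scores.pkl', 'rb') as f:
    scores = np.asarray(pickle.load(f))
# Flagging the top 1% of scores is a common starting point
print('99th percentile:', np.percentile(scores, 99))
"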
Error: Too many false positives (alerts for normal logs)
Cause: Threshold too low, model overfitting, or training data contaminated.
Solution:
# Increase threshold
# threshold: 1.0 → 2.5 or 3.0
# Check training data for anomalies
python -c "
# Manually inspect training logs
with open('prepared_data/train.txt') as f:
for i, line in enumerate(f):
if 'error' in line.lower() or 'fail' in line.lower():
print(f'Line {i}: {line[:100]}')
if i > 1000: break
"
# Retrain with cleaner data if needed
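For a first cleaning pass, a crude shell filter reusing the keywords above; review what it removes before retraining on the output:
grep -viE 'error|fail' prepared_data/train.txt > prepared_data/train_clean.txt
wc -l prepared_data/train_clean.txt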
Live Monitoring Issues
Error: Process hangs, stops reading logs
Cause: File I/O stalled, network issue, or infinite loop.
Solution:
# Kill and restart
pkill -f "python src/live/main.py"
# Run with timeout (Linux)
timeout 3600 bash dockerrun/run_live_monitoring.sh
# Check if log file is being updated
tail -f /var/log/app.log | head -20
# Check for permission issues
ls -la /var/log/app.log
Error: No alerts being produced
Cause: Models not loaded, wrong config, or all scores below threshold.
Solution:
# Check with verbose mode
python src/live/main.py \
--config dockerconfig/live_monitoring_config.yml \
--verbose
# Verify model paths exist
ls -la text_autoencoder_model/encoder.h5
ls -la anomaly_trained_model/detector.h5
# Check if logs are being read
head -1 /var/log/app.log
# Test model manually
python -c "
from src.an.eval import BidirectionalLSTMAnomalyDetectorHandle
import numpy as np
detector = BidirectionalLSTMAnomalyDetectorHandle(
    'anomaly_trained_model/detector.h5'
)
seq = np.random.randn(10, 128)
score = detector.get_anomaly_score(seq)
print(f'Test score: {score}')
"
Performance Issues
Problem: Training is very slow
Diagnosis:
- Are you using a GPU? Run `nvidia-smi` during training
- What's the batch size? Larger batches = faster
- How large is your dataset? Very large = slow
- What's your model size? Larger models = slower
Solutions:
- Use GPU acceleration (Chapter 2)
- Increase batch_size in config
- Start with smaller dataset (first 100K lines)
- Reduce model sizes (embedding_size, hidden_size)
- Reduce epochs
Problem: High memory usage
Check:
# Monitor during training
watch -n 1 'ps aux | grep python'
watch -n 1 'free -h'
# Profile with Python (add these lines inside your script)
import tracemalloc
tracemalloc.start()
# ... run code ...
current, peak = tracemalloc.get_traced_memory()
print(f'Current: {current / 1024 / 1024:.1f}MB, Peak: {peak / 1024 / 1024:.1f}MB')
Solutions:
- Reduce batch size
- Reduce model size
- Use a streaming/generator approach (see the sketch below)
- Process data in smaller chunks
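A minimal generator sketch for the streaming approach (a hypothetical helper; the pipeline's data loader would need to consume it):
def read_in_chunks(path, chunk_lines=10000):
    # Yield fixed-size batches of lines so the full file never sits in memory
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= chunk_lines:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

# Example: count lines without loading the whole file into memory
total = sum(len(c) for c in read_in_chunks('/path/to/HDFS_1.log'))
print(f'{total} lines')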
Production Debugging
Docker Debugging
# Check container logs
docker logs deepsentry-live
# Attach to running container
docker exec -it deepsentry-live bash
# Run with verbose
docker run -it deepsentry:latest \
python src/live/main.py --verbose
# Check resource usage
docker stats deepsentry-live
Kubernetes Debugging
# Check pod status
kubectl describe pod deepsentry-live-xxx
# Get logs
kubectl logs deepsentry-live-xxx
# Exec into pod
kubectl exec -it deepsentry-live-xxx -- bash
# Follow logs in real time
kubectl logs -f deepsentry-live-xxx
Getting Help
When you're stuck:
- Check logs: Most errors are logged with helpful messages
- Enable verbose mode: Adds detailed debugging info
- Read the error message carefully: It usually tells you exactly what's wrong
- Search for similar issues: Check GitHub issues
- Isolate the problem: Try smaller dataset, simpler model, remove customizations
- Report with details: If reporting a bug, include logs, config, data samples