Troubleshooting: Debugging and Problem Solving

Common errors, how to diagnose them, and step-by-step solutions.

General Debugging Checklist

When something fails, work through this checklist first (a combined sanity script follows the list):

  1. Check the error message carefully (often tells you exactly what's wrong)
  2. Verify files exist: ls -la /path/to/file
  3. Check disk space: df -h
  4. Check memory: free -h
  5. Enable verbose mode: Add --verbose to commands
  6. Check logs in output directory
  7. Try with a smaller dataset first
  8. Check configuration file syntax (valid YAML)
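
Several of these checks fit into one quick sanity script. A minimal sketch, assuming an HDFS log under /path/to/data and the YAML config shipped in dockerconfig/ (adjust both paths for your setup):

# Quick pre-debugging sanity check
ls -la /path/to/data/HDFS_1.log   # does the input file exist?
df -h .                           # enough disk space?
free -h                           # enough memory?
# YAML syntax check (requires PyYAML)
python -c "import yaml; yaml.safe_load(open('dockerconfig/text_autoencoder_prepare_data.yml')); print('YAML OK')"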

Installation Issues

Error: ModuleNotFoundError: No module named 'tensorflow'

Cause: TensorFlow not installed or wrong Python environment.

Solution:


# Check which Python is active
which python
python --version

# If using venv, ensure it's activated
source venv/bin/activate
echo $VIRTUAL_ENV  # Should show venv path

# Reinstall requirements
pip install --upgrade pip
pip install -r requirements.txt

# Verify TensorFlow
python -c "import tensorflow; print(tensorflow.__version__)"
        

Error: ImportError: libcuda.so.11.0: cannot open shared object file

Cause: CUDA not installed or incompatible version.

Solution:


# Check if GPU is available
nvidia-smi  # Should show GPU info

# If GPU not found, use CPU-only TensorFlow
pip uninstall tensorflow -y
pip install tensorflow-cpu

# Or install the GPU build with bundled CUDA libraries
# (quote the brackets so shells like zsh don't expand them)
pip install 'tensorflow[and-cuda]'
        

Error: Docker: permission denied while trying to connect to the Docker daemon

Cause: Docker daemon not running or user not in docker group.

Solution:


# Start Docker daemon
sudo systemctl start docker

# Add user to docker group
# (log out and back in, or run newgrp, for the change to take effect)
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker ps
        

Data Preparation Errors

Error: FileNotFoundError: [Errno 2] No such file or directory: 'HDFS_1.log'

Cause: Log file not in the configured location.

Solution:


# Check if file exists
ls -la /path/to/HDFS_1.log

# Check config file for correct path
grep "raw_log_file:" dockerconfig/text_autoencoder_prepare_data.yml

# Verify {ROOT_DATA_DIR} is being replaced correctly
python -c "from src.config_loader import Config; \
  c = Config.load_from_file('dockerconfig/text_autoencoder_prepare_data.yml', \
  root_data_dir='/path/to/data'); \
  print(c.get('raw_log_file'))"
        

Error: ValueError: Expected 1D array, got 2D array instead

Cause: Log file format issue, missing messages, or malformed timestamps.

Solution:


# Check log file format (should be: YYMMDD HHMMSS MESSAGE)
head -5 /path/to/HDFS_1.log

# Count lines
wc -l /path/to/HDFS_1.log

# Check for empty lines
grep -c "^$" /path/to/HDFS_1.log

# Remove empty lines if any
sed -i '/^$/d' /path/to/HDFS_1.log

# Validate format with Python (checks the first 100 lines)
python -c "
with open('HDFS_1.log') as f:
    for i, line in enumerate(f):
        if i > 100:
            break
        if not line.strip():
            print(f'Empty line at {i}')
            continue
        parts = line.split(' ', 2)
        if len(parts) < 3:
            print(f'Bad format at line {i}: {line[:50]}')
"
        

Error: MemoryError during data preparation

Cause: Dataset too large to fit in memory.

Solution:


# Use a smaller dataset for testing
head -100000 /path/to/HDFS_1.log > /tmp/small.log

# Or increase system swap
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Check available memory
free -h
        

Training Issues

Error: RuntimeError: CUDA out of memory

Cause: GPU memory insufficient for batch size.

Solution:


# Reduce batch size in config
# Change batch_size from 32 to 16 or 8

# Or use CPU instead
DEEPSENTRY_GPU=0 python src/tx/train.py ...

# Or reduce model size
# Change embedding_size from 128 to 64
# Change lstm_hidden_size from 128 to 64
        
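If smaller batches still overflow, TensorFlow can also allocate GPU memory on demand instead of reserving it all at startup. A minimal sketch using the public tf.config API; in the real training script this must run before the first GPU operation:

# Enable on-demand GPU memory allocation (TensorFlow)
python -c "
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print(f'Memory growth enabled for {len(gpus)} GPU(s)')
"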

Error: NaN loss during training

Cause: Learning rate too high, bad data, or numerical instability.

Solution:


# Reduce learning rate in config
# Change learning_rate from 0.001 to 0.0001

# Check the prepared data for NaN or Inf tokens
# (match whole tokens: a plain substring test would flag every 'INFO' line)
python -c "
with open('prepared_data/train.txt') as f:
    for line in f:
        if any(t in ('nan', 'inf', '-inf') for t in line.lower().split()):
            print(f'Found: {line.strip()}')
"

# Check data statistics (the random array below is a placeholder;
# load your actual array instead)
python -c "
import numpy as np
data = np.random.randn(1000)  # placeholder: substitute your real data
print(f'Mean: {np.mean(data)}, Std: {np.std(data)}, Min: {np.min(data)}, Max: {np.max(data)}')
"
        
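Gradient clipping is another standard guard against exploding losses. Where exactly the optimizer is built in src/tx/train.py is an assumption here; the sketch just shows the Keras pattern:

# Hypothetical sketch: clip gradients when constructing the optimizer
import tensorflow as tf

# clipnorm=1.0 rescales any gradient whose norm exceeds 1.0,
# so a single bad batch can't push the loss to NaN
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)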

Error: Model validation loss not improving

Cause: Model is not learning. Could be learning rate, model size, or data quality.

Solution:


# Try different learning rates
# 0.0001 (very small) → 0.001 → 0.01 (very large)

# Increase model capacity
# embedding_size: 64 → 128 → 256
# encoder_hidden_size: 128 → 256

# Check if the dataset is too small
# (training generally needs thousands of samples to converge)

# Inspect the training log to diagnose
python -c "
with open('text_autoencoder_model/training_log.txt') as f:
    for line in f:
        print(line, end='')
"
        
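If training goes through Keras' fit() (an assumption about this codebase), two stock callbacks both expose and react to a stalled validation loss:

# Hypothetical sketch: standard Keras callbacks for a plateaued val_loss
import tensorflow as tf

callbacks = [
    # halve the learning rate after 3 epochs without improvement
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
    # stop after 10 stalled epochs and keep the best weights seen so far
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
]
# model.fit(x, y, validation_data=..., callbacks=callbacks)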

Encoding Issues

Error: Index out of range in encoder

Cause: New words in test data that weren't in training vocabulary.

Solution:


# Ensure test data uses same log format and source

# Or update vocab with unknown word handling
# (This requires code changes; consider retraining on combined data)

# Check vocabulary size
python -c "
import pickle
with open('prepared_data/vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)
    print(f'Vocabulary size: {len(vocab)}')
    print(f'Sample words: {list(vocab.items())[:10]}')
"
        
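If you do change the encoding code, the usual pattern maps unseen words to a reserved unknown token. A sketch assuming vocab is a plain word-to-id dict and that an '<unk>' entry exists (both are assumptions about this codebase):

# Hypothetical OOV handling: unseen words fall back to an <unk> id
import pickle

with open('prepared_data/vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)

unk_id = vocab.get('<unk>', 0)  # assumption: '<unk>' is reserved; otherwise id 0

def encode_words(words):
    # dict.get never raises, so new words can't trigger an index error
    return [vocab.get(w, unk_id) for w in words]

print(encode_words(['block', 'definitely_not_in_vocab']))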

Anomaly Detection Issues

Error: Sequence length mismatch

Cause: Training used different sequence_length than evaluation.

Solution:


# Check training config
grep "sequence_length:" dockerconfig/anomaly_detector_train.yml

# Check evaluation config
grep "sequence_length:" dockerconfig/anomaly_detector_eval.yml

# Must match. If not, retrain with correct length
        
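To compare the two values directly (assumes sequence_length is a top-level key in both files; requires PyYAML):

# Print both sequence_length values side by side
python -c "
import yaml
train = yaml.safe_load(open('dockerconfig/anomaly_detector_train.yml'))
ev = yaml.safe_load(open('dockerconfig/anomaly_detector_eval.yml'))
print('train:', train.get('sequence_length'))
print('eval: ', ev.get('sequence_length'))
"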

Error: No anomalies detected (all scores below threshold)

Cause: Model thinks everything is normal, or threshold is too high.

Debugging:


# Check anomaly score distribution
python -c "
import pickle
import numpy as np
with open('eval_results/anomaly_scores.pkl', 'rb') as f:
    scores = pickle.load(f)
    print(f'Min: {np.min(scores)}')
    print(f'Max: {np.max(scores)}')
    print(f'Mean: {np.mean(scores)}')
    print(f'Std: {np.std(scores)}')
    print(f'Median: {np.median(scores)}')
"

# Try lower threshold
# threshold: 2.5 → 1.5 or 1.0
        
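Rather than guessing, you can derive a candidate threshold from the observed distribution, e.g. flag roughly the top 1% of sequences (a starting point, not a rule):

# Data-driven threshold candidates from the score distribution
python -c "
import pickle
import numpy as np
with open('eval_results/anomaly_scores.pkl', 'rb') as f:
    scores = np.asarray(pickle.load(f))
print(f'p95: {np.percentile(scores, 95):.4f}')
print(f'p99: {np.percentile(scores, 99):.4f}')
"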

Error: Too many false positives (alerts for normal logs)

Cause: Threshold too low, model overfitting, or training data contaminated.

Solution:


# Increase threshold
# threshold: 1.0 → 2.5 or 3.0

# Check training data for anomalies
python -c "
# Manually inspect training logs
with open('prepared_data/train.txt') as f:
    for i, line in enumerate(f):
        if 'error' in line.lower() or 'fail' in line.lower():
            print(f'Line {i}: {line[:100]}')
        if i > 1000: break
"

# Retrain with cleaner data if needed
        

Live Monitoring Issues

Error: Process hangs, stops reading logs

Cause: File I/O stalled, network issue, or infinite loop.

Solution:


# Kill and restart
pkill -f "python src/live/main.py"

# Run with timeout (Linux)
timeout 3600 bash dockerrun/run_live_monitoring.sh

# Check if the log file is still being appended to (Ctrl-C to stop)
tail -f /var/log/app.log

# Check for permission issues
ls -la /var/log/app.log
        
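Before killing a hung process, it can be worth capturing where it is stuck. py-spy (a third-party tool, pip install py-spy) can dump the live Python stack; depending on ptrace settings it may need sudo:

# Dump the Python stack of the hung process (requires py-spy)
py-spy dump --pid $(pgrep -n -f "python src/live/main.py")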

Error: No alerts being produced

Cause: Models not loaded, wrong config, or all scores below threshold.

Solution:


# Check with verbose mode
python src/live/main.py \
  --config dockerconfig/live_monitoring_config.yml \
  --verbose

# Verify model paths exist
ls -la text_autoencoder_model/encoder.h5
ls -la anomaly_trained_model/detector.h5

# Check if logs are being read
head -1 /var/log/app.log

# Test model manually
python -c "
from src.an.eval import BidirectionalLSTMAnomalyDetectorHandle
import numpy as np
detector = BidirectionalLSTMAnomalyDetectorHandle(
    'anomaly_trained_model/detector.h5'
)
seq = np.random.randn(10, 128)
score = detector.get_anomaly_score(seq)
print(f'Test score: {score}')
"
        

Performance Issues

Problem: Training is very slow

Diagnosis:

  • Are you using GPU? Run nvidia-smi during training
  • What's the batch size? Larger batches generally run faster (within memory limits)
  • How large is your dataset? Very large = slow
  • What's your model size? Larger models = slower

Solutions:

  • Use GPU acceleration (Chapter 2); see the visibility check after this list
  • Increase batch_size in config
  • Start with smaller dataset (first 100K lines)
  • Reduce model sizes (embedding_size, hidden_size)
  • Reduce epochs
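
For the GPU check, nvidia-smi only proves the driver sees the card; also confirm TensorFlow sees it, since an empty list here means training is silently running on CPU:

# Empty list = TensorFlow is training on CPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"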

Problem: High memory usage

Check:


# Monitor during training
watch -n 1 'ps aux | grep python'
watch -n 1 'free -h'

# Profile with Python (a heredoc keeps this runnable straight from the shell)
python - <<'EOF'
import tracemalloc
tracemalloc.start()
# ... run the code you want to profile here ...
current, peak = tracemalloc.get_traced_memory()
print(f'Current: {current / 1024 / 1024:.1f}MB, Peak: {peak / 1024 / 1024:.1f}MB')
EOF
        

Solutions:

  • Reduce batch size
  • Reduce model size
  • Use a streaming/generator approach (sketch below)
  • Process data in smaller chunks
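
For the streaming approach, a generator keeps only one chunk of lines in memory at a time. A minimal sketch (the chunk size and path are illustrative):

# Hypothetical sketch: iterate a large log in fixed-size chunks
def iter_chunks(path, chunk_size=10_000):
    """Yield lists of at most chunk_size lines without loading the whole file."""
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

for chunk in iter_chunks('/path/to/HDFS_1.log'):
    pass  # prepare/encode this chunk, then let it be garbage-collected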

Production Debugging

Docker Debugging


# Check container logs
docker logs deepsentry-live

# Attach to running container
docker exec -it deepsentry-live bash

# Run with verbose
docker run -it deepsentry:latest \
  python src/live/main.py --verbose

# Check resource usage
docker stats deepsentry-live
        

Kubernetes Debugging


# Check pod status
kubectl describe pod deepsentry-live-xxx

# Get logs
kubectl logs deepsentry-live-xxx

# Exec into pod
kubectl exec -it deepsentry-live-xxx -- bash

# Stream logs continuously
kubectl logs -f deepsentry-live-xxx
        

Getting Help

When you're stuck:

  1. Check logs: Most errors are logged with helpful messages
  2. Enable verbose mode: Adds detailed debugging info
  3. Read the error message carefully: It usually tells you exactly what's wrong
  4. Search for similar issues: Check GitHub issues
  5. Isolate the problem: Try smaller dataset, simpler model, remove customizations
  6. Report with details: If reporting a bug, include logs, config, data samples