Live Monitoring: Real-Time Anomaly Detection
What is Live Monitoring?
After training your models, you want to continuously monitor new logs. Live monitoring:
- Reads incoming log entries in real-time
- Encodes each message using the trained text autoencoder
- Scores each sequence using the anomaly detector
- Alerts when anomaly scores exceed a threshold
- Maintains rolling statistics for adaptive thresholding
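In outline, that loop looks like the sketch below. This is a minimal illustration only: encode_message and score_sequence are toy stand-ins for the trained autoencoder and detector, not the actual DeepSentry code.
# Minimal sketch of the live-monitoring loop. encode_message and
# score_sequence are hypothetical stand-ins for the trained models;
# the real system loads them from disk and adapts its threshold.
import time
from collections import deque

SEQUENCE_LENGTH = 10      # must match the training configuration
THRESHOLD = 2.45          # placeholder; see "Adaptive Thresholding" below

def encode_message(msg: str) -> list[float]:
    """Toy embedding standing in for the text autoencoder."""
    return [float(len(msg))]

def score_sequence(vectors) -> float:
    """Toy score standing in for the anomaly detector."""
    avg = sum(v[0] for v in vectors) / len(vectors)
    return abs(vectors[-1][0] - avg)

def follow(path):
    """Yield lines appended to a file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)                      # start at end (tail_mode)
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(0.1)

window = deque(maxlen=SEQUENCE_LENGTH)
for message in follow("/var/log/myapp.log"):
    window.append(encode_message(message))
    if len(window) == SEQUENCE_LENGTH:
        score = score_sequence(window)
        label = "ANOMALY" if score > THRESHOLD else "Normal"
        print(f"{label} - score: {score:.2f} - {message!r}")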
Running Live Monitoring
Basic Execution
bash dockerrun/run_live_monitoring.sh
The live monitor starts and reads logs according to configuration. Output looks like:
[14:23:45] Normal - score: 0.12 - "Got block report from..."
[14:23:46] Normal - score: 0.09 - "Verification complete"
[14:23:47] ANOMALY - score: 3.45 - "Unexpected error" ⚠️
[14:23:48] Normal - score: 0.14 - "Retrying connection..."
Input Sources
Configure in live_monitoring_config.yml:
# From a file
log_file: /var/log/myapp.log
tail_mode: true # Start from end (watch for new entries)
# From stdin (pipe logs in)
log_file: /dev/stdin
# From a network socket: bridge it into a file with socat, then tail it
#   e.g. socat TCP-LISTEN:9999,fork OPEN:/var/log/netstream.log,creat,append
log_file: /var/log/netstream.log
Piping Logs in Real-Time
Forward logs from another source (set log_file: /dev/stdin so the monitor reads the pipe):
# From syslog
tail -f /var/log/syslog | bash dockerrun/run_live_monitoring.sh
# From application stderr
./myapp 2>&1 | bash dockerrun/run_live_monitoring.sh
# From remote server via SSH
ssh user@remote "tail -f /var/log/app.log" | bash dockerrun/run_live_monitoring.sh
Adaptive Thresholding
The live monitor doesn't use a fixed threshold. Instead, it maintains running statistics:
- Window: Last N scores (e.g., 100 most recent)
- Mean: Average of recent scores
- Std Dev: How much scores vary
- Threshold: mean + threshold_multiplier * std_dev (the multiplier defaults to 2.5)
Why this matters: Even if "normal" anomaly scores slowly increase (due to seasonal changes or system evolution), the threshold adapts.
Example
Hour 1: Scores are [0.1, 0.2, 0.15, 0.3, ...]. Mean=0.17, Std=0.08. Threshold = 0.17 + 2.5*0.08 = 0.37
Hour 2: Scores are [0.2, 0.25, 0.22, 0.4, ...]. Mean=0.27, Std=0.09. Threshold = 0.27 + 2.5*0.09 = 0.495
The threshold moved up because baseline scores increased. This prevents "threshold exhaustion", where a drifting baseline pushes every score over a fixed threshold and floods you with alerts.
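A minimal sketch of this rolling computation follows; window_size and the multiplier mirror the tuning parameters described later, and the real monitor's internals may differ (population std dev is an assumption here).
# Hedged sketch of the adaptive threshold described above.
from collections import deque
from statistics import mean, pstdev

class AdaptiveThreshold:
    def __init__(self, window_size=100, multiplier=2.5):
        self.scores = deque(maxlen=window_size)   # last N scores
        self.multiplier = multiplier

    def update(self, score: float) -> bool:
        """Record a score; return True if it exceeds the current threshold."""
        is_anomaly = len(self.scores) > 1 and score > self.threshold()
        self.scores.append(score)
        return is_anomaly

    def threshold(self) -> float:
        return mean(self.scores) + self.multiplier * pstdev(self.scores)

# Reproducing the Hour 1 numbers above (approximately, since the
# original example elides most of the scores):
t = AdaptiveThreshold()
for s in [0.1, 0.2, 0.15, 0.3]:
    t.update(s)
print(round(t.threshold(), 2))   # 0.37, matching the Hour 1 example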
Output and Alerting
Console Output
By default, the live monitor outputs to console:
[14:23:45] Normal - score: 0.21 (threshold: 2.45)
[14:23:46] ANOMALY - score: 3.67 (threshold: 2.45) ⚠️
[14:23:47] Normal - score: 0.18 (threshold: 2.45)
File Output
Configure in live_monitoring_config.yml:
output_file: /var/log/deepsentry-alerts.log
Anomalies are written with full context:
TIMESTAMP=2022-01-15T14:23:47Z
SCORE=3.67
THRESHOLD=2.45
MESSAGE="Unexpected error in database connection"
CONTEXT_WINDOW="Got block report... → Verification complete → Unexpected error"
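The KEY=value layout is straightforward to post-process. Here is a small parsing sketch; the field names are taken from the example above, and real alert files may carry more fields.
# Hypothetical parser for the alert records shown above.
def parse_alert(block: str) -> dict:
    """Turn one KEY=value alert block into a dict."""
    fields = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition("=")
        fields[key] = value.strip('"')   # drop surrounding quotes if present
    return fields

record = parse_alert('''TIMESTAMP=2022-01-15T14:23:47Z
SCORE=3.67
THRESHOLD=2.45
MESSAGE="Unexpected error in database connection"''')
print(record["SCORE"])   # "3.67"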
Integration with Monitoring Systems
Direct integration patterns:
# Send alerts to syslog (--line-buffered keeps grep from sitting on
# buffered output, so alerts flow through in real time)
bash dockerrun/run_live_monitoring.sh | \
  grep --line-buffered ANOMALY | \
  logger -t deepsentry -p user.alert

# Send to Prometheus (custom exporter)
bash dockerrun/run_live_monitoring.sh | \
  ./alert_to_prometheus.py

# Send to a monitoring webhook
bash dockerrun/run_live_monitoring.sh | \
  grep --line-buffered ANOMALY | \
  while read -r line; do
    curl -X POST https://alerts.example.com/webhook \
      -d "$line"
  done
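The alert_to_prometheus.py script referenced above is a custom exporter; a minimal version might look like the sketch below, built on the prometheus_client library (the metric names and port are assumptions, and prometheus_client must be installed).
#!/usr/bin/env python3
# Hypothetical sketch of an alert_to_prometheus.py bridge: reads monitor
# output on stdin and exposes counters for Prometheus to scrape.
import sys
from prometheus_client import Counter, start_http_server

ENTRIES = Counter("deepsentry_entries_total", "Log entries scored")
ANOMALIES = Counter("deepsentry_anomalies_total", "Entries flagged as anomalies")

start_http_server(8000)          # serves /metrics on port 8000
for line in sys.stdin:
    ENTRIES.inc()
    if "ANOMALY" in line:
        ANOMALIES.inc()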
Tuning Live Monitoring
Key Parameters
| Parameter | Default | Impact |
|---|---|---|
| threshold_multiplier | 2.5 | How many std devs above mean triggers alert. Higher = fewer alerts. |
| window_size | 100 | How many recent scores to use for statistics. Larger = more stable threshold. |
| sequence_length | 10 | Must match training. How many vectors in a sequence. |
| batch_interval | 1.0 (seconds) | How often to score new logs. Smaller = more frequent scoring. |
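Together, these map onto live_monitoring_config.yml roughly as follows (assuming the config keys match the parameter names in the table):
# Defaults from the table above, collected in one place
threshold_multiplier: 2.5   # std devs above mean that trigger an alert
window_size: 100            # scores kept for rolling statistics
sequence_length: 10         # must match the training configuration
batch_interval: 1.0         # seconds between scoring passes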
Tuning Strategy
Start with defaults, then adjust based on your alerts:
- Too many false positives: Increase threshold_multiplier (e.g., 3.0 or 3.5)
- Missing real anomalies: Decrease threshold_multiplier (e.g., 2.0 or 1.5)
- Threshold too volatile: Increase window_size (e.g., 200 or 500)
- Slow to adapt to changes: Decrease window_size (e.g., 50)
Production Deployment
Systemd Service
Run as a systemd service for automatic restart:
# /etc/systemd/system/deepsentry-live.service
[Unit]
Description=DeepSentry Live Anomaly Detection
After=network.target

[Service]
Type=simple
User=deepsentry
WorkingDirectory=/opt/deepsentry
# systemd requires an absolute path to the executable;
# the script argument resolves against WorkingDirectory
ExecStart=/bin/bash dockerrun/run_live_monitoring.sh
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl enable deepsentry-live
sudo systemctl start deepsentry-live
sudo systemctl status deepsentry-live
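Because systemd captures the service's stdout/stderr in the journal, you can also follow its output directly:
journalctl -u deepsentry-live -f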
Docker Container
Run live monitoring in Docker:
docker run -d \
  --name deepsentry-live \
  -v /var/log:/var/log:ro \
  -v /data/deepsentry:/data \
  --restart unless-stopped \
  deepsentry:latest \
  python src/live/main.py
Kubernetes Deployment
For cloud deployments:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepsentry-live
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepsentry-live
  template:
    metadata:
      labels:
        app: deepsentry-live
    spec:
      containers:
        - name: live
          image: deepsentry:latest
          volumeMounts:
            - name: logs
              mountPath: /var/log
              readOnly: true
            - name: models
              mountPath: /data/models
              readOnly: true
          env:
            - name: LOG_FILE
              value: /var/log/app.log
            - name: ANOMALY_MODEL
              value: /data/models/detector.h5
      volumes:
        - name: logs
          hostPath:
            path: /var/log
        - name: models
          configMap:
            name: deepsentry-models
Monitoring the Monitor
Keep an eye on the live monitoring process itself:
# Check that the process is running (pgrep avoids matching the grep itself)
pgrep -af deepsentry
# Monitor resource usage (streams until interrupted)
docker stats deepsentry-live
# Check for errors in the monitor's own output
tail -f /var/log/deepsentry-alerts.log | grep ERROR
# Count anomaly alerts so far (tail -f never terminates, so piping it
# to wc -l would never print; use a plain grep -c instead)
grep -c ANOMALY /var/log/deepsentry-alerts.log
Common Issues and Solutions
Problem: No alerts even for obvious anomalies
Check:
- Models are loaded correctly (check logs for errors)
- Sequence length matches training config
- Log format is correct (YYMMDD HHMMSS MESSAGE)
- The threshold isn't set too high (try lowering threshold_multiplier)
Problem: Too many false positive alerts
Solutions:
- Increase threshold_multiplier in config
- Increase window_size for more stable baseline
- Check that training data was clean (no anomalies)
- Verify that incoming live logs follow a distribution similar to the training logs
Problem: Memory or CPU usage is high
Optimizations:
- Reduce sequence_length if possible
- Use smaller models (reduce embedding_size)
- Increase batch_interval to score less frequently
- Use GPU acceleration if available
Next Steps
Once live monitoring is running:
- Set up alerting to send anomalies to your on-call team
- Create dashboards showing anomaly detection rate and latency
- Periodically retrain models with new data to stay current
- Investigate flagged anomalies to understand patterns