# Production Best Practices

## Deployment Architecture

### Recommended Architecture
```
              PRODUCTION DEPLOYMENT ARCHITECTURE

  LOG SOURCES (Various)
  ├─ Servers
  ├─ Applications
  └─ Devices
        │
        ▼
  ┌──────────────────────────────────────┐
  │        LOG AGGREGATION LAYER         │
  │    (Fluentd, Logstash, Filebeat)     │
  │  • Collection from all sources       │
  │  • Parsing and enrichment            │
  │  • Routing and filtering             │
  └──────────────────────────────────────┘
                     │
        ┌────────────┴─────┬───────────────────┐
        ▼                  ▼                   ▼
  ┌────────────┐  ┌──────────────────┐  ┌──────────────┐
  │  STORAGE   │  │    DEEPSENTRY    │  │  STREAMING   │
  │   LAYER    │  │  (Live Monitor)  │  │  PROCESSOR   │
  │            │  │                  │  │              │
  │  HDFS/S3   │  │ • Load models    │  │  Real-time   │
  │  Long-term │  │ • Score logs     │  │  processing  │
  │  storage   │  │ • Manage state   │  │              │
  └────────────┘  │ • Threshold mgmt │  └──────────────┘
                  │ • Alert          │
                  │   generation     │
                  └────────┬─────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │   ALERTING SYSTEMS   │
                ├──────────────────────┤
                │ • Slack webhooks     │
                │ • PagerDuty          │
                │ • Email notifications│
                │ • Syslog             │
                │ • Custom integrations│
                └──────────────────────┘

  MONITORING & OBSERVABILITY
  ├─ Prometheus metrics (throughput, latency, errors)
  ├─ Dashboards (Grafana, CloudWatch)
  ├─ Log aggregation (ELK, Splunk)
  └─ Incident management (on-call rotations)
```
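In this layout, DeepSentry consumes events from the aggregation layer rather than tailing files directly. One way to wire that up is to have the aggregator forward newline-delimited JSON over TCP; a minimal sketch of a consumer under that assumption (the port and field names are illustrative, not a DeepSentry API):

```python
import json
import socket

# Assumed setup: Fluentd/Logstash forwards NDJSON events to this port
HOST, PORT = "0.0.0.0", 5140

def log_events():
    """Yield parsed log events from the aggregation layer, one at a time."""
    # Handles a single connection for brevity; a real consumer would loop/accept
    with socket.create_server((HOST, PORT)) as server:
        conn, _ = server.accept()
        with conn, conn.makefile("r", encoding="utf-8", errors="replace") as stream:
            for line in stream:
                try:
                    yield json.loads(line)   # e.g. {"timestamp": ..., "message": ...}
                except json.JSONDecodeError:
                    continue                 # Skip malformed events

# for event in log_events():
#     score = detector.score(event["message"])   # detector from the live monitor
```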
## Security Considerations

### Sensitive Information in Logs

Logs often contain sensitive data (API keys, passwords, PII). Before training:
```python
# Sanitize logs before storage
import re

def sanitize_log(message):
    # Remove API keys
    message = re.sub(r'api_key=[^\s]+', 'api_key=REDACTED', message)
    # Remove passwords
    message = re.sub(r'password=[^\s]+', 'password=REDACTED', message)
    # Remove email addresses (pattern is approximate)
    message = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+', 'EMAIL_REDACTED', message)
    # Remove credit card numbers
    message = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', 'CC_REDACTED', message)
    return message

# Apply before preparation
with open("raw.log") as f_in, open("sanitized.log", "w") as f_out:
    for line in f_in:
        # Split on the first space: everything after the timestamp is the message
        timestamp, message = line.rstrip("\n").split(" ", 1)
        message = sanitize_log(message)
        f_out.write(f"{timestamp} {message}\n")
```
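A quick spot-check (the sample line below is made up for illustration) confirms the patterns behave as intended:

```python
# Hypothetical sample line, not from a real log
sample = "2022-01-15T10:00:00 login failed password=hunter2 user=alice@example.com"
print(sanitize_log(sample))
# -> 2022-01-15T10:00:00 login failed password=REDACTED user=EMAIL_REDACTED
```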
### Access Control

```bash
# Secure file permissions
chmod 700 /data/deepsentry                # Owner only
chmod 600 /data/deepsentry/models/*.h5    # Restrict model access
chmod 600 /data/deepsentry/logs/*.log     # Restrict logs

# Secure environment variables (don't hardcode credentials)
export DEEPSENTRY_ALERT_WEBHOOK_TOKEN="xxxxx"
# Or use secrets management: HashiCorp Vault, AWS Secrets Manager, etc.

# Audit logging
journalctl -u deepsentry-live | grep ALERT   # Review all alerts
```
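To keep the token out of source code, the monitor can read it from the environment at startup. A minimal sketch (the variable name matches the export above; the fail-fast behavior is a suggested convention, not required):

```python
import os

# Fail fast if the secret is missing, rather than starting with a bad token
try:
    WEBHOOK_TOKEN = os.environ["DEEPSENTRY_ALERT_WEBHOOK_TOKEN"]
except KeyError:
    raise SystemExit("DEEPSENTRY_ALERT_WEBHOOK_TOKEN is not set; refusing to start")
```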
## Reliability and High Availability

### Redundant Deployment

Run multiple instances for fault tolerance:
```yaml
# Kubernetes Deployment with multiple replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepsentry-live
spec:
  replicas: 3              # Run 3 instances
  strategy:
    type: RollingUpdate    # Gradual updates
    rollingUpdate:
      maxUnavailable: 1    # Keep 2+ running during updates
  template:
    # ... pod spec ...
```
### Health Checks

```dockerfile
# Docker health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import sys, requests; sys.exit(0 if requests.get('http://localhost:8080/health').status_code == 200 else 1)"
```

```yaml
# Kubernetes liveness probe (restart the container if unhealthy)
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

# Readiness probe (ready to accept traffic)
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```
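These probes assume the monitor exposes `/health` and `/ready` over HTTP. A minimal sketch using only the standard library (endpoint names match the probes above; the readiness condition is an assumed example of application state):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

MODELS_LOADED = False  # Flip to True once models are loaded (assumed app state)

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)  # Process is alive
        elif self.path == "/ready":
            # Ready only once models are loaded and scoring can begin
            self.send_response(200 if MODELS_LOADED else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # Keep probe traffic out of the application logs

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

In the live monitor this server would run in a background thread so scoring is never blocked by probe traffic.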
## Monitoring DeepSentry Itself

### Key Metrics to Track

| Metric | What It Means | Alert If |
|---|---|---|
| Throughput (msgs/sec) | How fast logs are being processed | Drops >50% from baseline |
| Latency (ms per message) | How long scoring takes | Exceeds 100 ms |
| Queue depth | Logs waiting to be processed | Continuously growing |
| Anomaly rate (%) | Percentage of logs flagged | Spikes above 10% or drops to 0% |
| Model load time (ms) | Time to load models from disk | Exceeds 5 seconds |
| Memory usage (MB) | RAM consumed by process | Exceeds 80% of available |
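The throughput trigger in the first row can be implemented with a rolling baseline; a minimal sketch (the window size and 50% factor are illustrative, not tuned values):

```python
from collections import deque

class ThroughputWatch:
    """Flag when the current rate falls below half the rolling baseline."""

    def __init__(self, window=60, drop_factor=0.5):
        self.samples = deque(maxlen=window)  # Recent msgs/sec samples
        self.drop_factor = drop_factor

    def check(self, msgs_per_sec):
        # Baseline is the mean of prior samples, before adding the new one
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(msgs_per_sec)
        # Alert only once there is a baseline to compare against
        return baseline is not None and msgs_per_sec < self.drop_factor * baseline
```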
### Export Metrics

```python
# Prometheus metrics in the live monitor
import time
from prometheus_client import start_http_server, Counter, Histogram, Gauge

# Define metrics
messages_processed = Counter(
    'deepsentry_messages_processed_total',
    'Total messages processed'
)
anomalies_detected = Counter(
    'deepsentry_anomalies_total',
    'Total anomalies detected'
)
processing_time = Histogram(
    'deepsentry_processing_seconds',
    'Time to process each message'
)
anomaly_score_gauge = Gauge(
    'deepsentry_last_anomaly_score',
    'Most recent anomaly score'
)

# Start metrics server (scrape endpoint on port 8000)
start_http_server(8000)

# Use in the main loop (stream, detector, and threshold come from the monitor)
for log_entry in stream:
    start = time.time()
    score = detector.score(log_entry)
    anomaly_score_gauge.set(score)

    elapsed = time.time() - start
    processing_time.observe(elapsed)
    messages_processed.inc()

    if score > threshold:
        anomalies_detected.inc()
```
## Model Management and Versioning

### Versioning Strategy

```
# Save models with version and metadata
models/
├── text_encoder_v1.0.h5
├── text_encoder_v1.0_metadata.json
├── anomaly_detector_v1.0.h5
├── anomaly_detector_v1.0_metadata.json
├── text_encoder_v1.1.h5
├── text_encoder_v1.1_metadata.json
└── ...
```

Each metadata file records how the model was produced:

```json
{
  "version": "1.0",
  "training_date": "2022-01-15",
  "training_data_days": 14,
  "training_samples": 1000000,
  "model_hash": "sha256:abc123",
  "architecture": {
    "embedding_size": 128,
    "encoder_output_size": 128,
    "lstm_hidden_size": 64
  },
  "validation_auc": 0.92,
  "notes": "Trained on HDFS logs from Jan 1-14"
}
```
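With this layout, a loader can resolve "latest version" from the metadata files instead of hardcoding paths. A minimal sketch (the filename parsing matches the tree above; the helper name `latest` is an assumption, not an existing API):

```python
import json
from pathlib import Path

def latest(models_dir="models", prefix="text_encoder"):
    """Return (model_path, metadata) for the highest version of a model family."""
    def version(p):
        # "text_encoder_v1.0_metadata" -> (1, 0)
        v = p.stem.split("_v")[-1].split("_")[0]
        return tuple(int(x) for x in v.split("."))

    metas = sorted(Path(models_dir).glob(f"{prefix}_v*_metadata.json"), key=version)
    if not metas:
        raise FileNotFoundError(f"no versioned {prefix} models in {models_dir}")

    meta_path = metas[-1]
    metadata = json.loads(meta_path.read_text())
    model_path = meta_path.with_name(meta_path.name.replace("_metadata.json", ".h5"))
    return model_path, metadata

model_path, meta = latest()
print(f"Loading {model_path} (validation AUC {meta['validation_auc']})")
```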
### A/B Testing New Models

```python
# Run the new model in parallel, in shadow mode (it never triggers alerts)
import numpy as np

class ABTester:
    def __init__(self, model_current, model_candidate):
        self.current = model_current
        self.candidate = model_candidate
        self.results = {
            'current': [],
            'candidate': []
        }

    def score(self, sequence):
        # Score with both models
        current_score = self.current.score(sequence)
        candidate_score = self.candidate.score(sequence)

        # Log results for later analysis
        self.results['current'].append(current_score)
        self.results['candidate'].append(candidate_score)

        # Only the current model's score drives alerts; candidate
        # scores are collected purely for offline comparison
        return current_score

    def compare_metrics(self):
        """Compare performance after collecting data"""
        return {
            'mean_diff': np.mean(self.results['candidate']) - np.mean(self.results['current']),
            'std_current': np.std(self.results['current']),
            'std_candidate': np.std(self.results['candidate']),
            'correlation': np.corrcoef(self.results['current'], self.results['candidate'])[0, 1]
        }
```
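In use, the tester drops in wherever the detector is scored today; a hypothetical wiring (the detector, stream, threshold, and `send_alert` names are placeholders):

```python
# Hypothetical wiring: names below are placeholders, not DeepSentry APIs
tester = ABTester(model_current=detector_v1, model_candidate=detector_v2)

for sequence in stream:
    score = tester.score(sequence)   # Alerts still keyed off the current model
    if score > threshold:
        send_alert(sequence, score)

# After a day or so of traffic, decide whether the candidate is safe to promote
print(tester.compare_metrics())
```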
## Retraining Strategy

### When to Retrain

Retrain models when:

- Performance degrades: AUC drops >10% on recent data
- Data distribution changes: mean/std of features shift significantly (see the drift sketch below)
- Anomaly rate spikes: normal baselines have changed
- System changes: new hardware, software updates, configuration changes
- Regular schedule: e.g., monthly retraining with recent data
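The distribution-shift trigger can be approximated with a simple two-sample check on recent scores; a minimal sketch (the z-threshold and window choices are illustrative, not tuned values):

```python
import numpy as np

def drifted(reference_scores, recent_scores, z_threshold=3.0):
    """Crude drift check: has the mean of recent scores moved more than
    z_threshold standard errors away from the reference window?"""
    ref = np.asarray(reference_scores)
    rec = np.asarray(recent_scores)
    # Standard error of the difference in means (assumes rough independence)
    se = np.sqrt(ref.var() / len(ref) + rec.var() / len(rec))
    if se == 0:
        return False
    z = abs(rec.mean() - ref.mean()) / se
    return z > z_threshold

# Example: compare the last hour of anomaly scores against last week's
# if drifted(last_week_scores, last_hour_scores): schedule_retraining()
```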
### Retraining Pipeline

```bash
#!/bin/bash
# Monthly retraining job
set -e  # Exit on error

DATA_DIR="/data/deepsentry"
ARCHIVE_DIR="/archive/models"
DATE=$(date +%Y%m%d)

echo "Starting retraining at $(date)"

# Collect recent data
echo "Collecting training data..."
find /var/log -name "*.log" -mtime -30 | \
    xargs cat | \
    sort > "$DATA_DIR/recent_logs.log"

# Validate data
echo "Validating logs..."
line_count=$(wc -l < "$DATA_DIR/recent_logs.log")
if [ "$line_count" -lt 100000 ]; then
    echo "ERROR: Not enough logs ($line_count < 100000)"
    exit 1
fi

# Run the full pipeline
echo "Preparing data..."
python src/tx/prepare.py --data-dir "$DATA_DIR"

echo "Training text encoder..."
python src/tx/train.py --data-dir "$DATA_DIR"

echo "Encoding dataset..."
python src/tx/encode.py --data-dir "$DATA_DIR"

echo "Training anomaly detector..."
python src/an/train.py --data-dir "$DATA_DIR"

# Evaluate on holdout test set
echo "Evaluating..."
python src/an/analysis.py --data-dir "$DATA_DIR"

# Check performance (metrics.json is JSON, so parse it as JSON)
AUC=$(python -c "import json; print(json.load(open('eval_results/metrics.json'))['AUC'])")
if (( $(echo "$AUC < 0.85" | bc -l) )); then
    echo "ERROR: AUC too low ($AUC < 0.85)"
    exit 1
fi

# Backup old models
echo "Backing up old models..."
mkdir -p "$ARCHIVE_DIR/$DATE"
cp -r "$DATA_DIR"/*_model "$ARCHIVE_DIR/$DATE/"

# Deploy new models
echo "Deploying new models..."
systemctl restart deepsentry-live

echo "Retraining complete at $(date)"
```
## Incident Response

### When DeepSentry Detects Anomalies

1. Alert: send a notification to the on-call team
2. Investigate: look at logs around the anomaly
3. Understand: what system state caused it?
4. Respond: take action or declare a false positive
5. Document: add to the runbook for future reference
### Runbook Template

```markdown
# Runbook: DeepSentry Anomaly Alert

## Alert: High Anomaly Score in HDFS Cluster

### Impact
- HDFS operations may be degraded
- Data replication might be slow
- File I/O operations could time out

### Immediate Steps (first 5 minutes)
1. Check the cluster health dashboard
2. Run: `hdfs fsck /`
3. Check DataNode logs for ERROR messages
4. Verify disk space: `df -h`

### Diagnosis (next 15 minutes)
1. Look at logs 5 minutes before the anomaly timestamp
2. Search for: ERROR, Exception, FAILED, timeout
3. Cross-reference with system metrics (CPU, memory, network)

### Mitigation
- If disk full: clear temporary files
- If nodes offline: restart services
- If network issue: check connectivity
- If unknown: escalate to the platform team

### Prevention
- Add trend monitoring for disk usage
- Set up node health checks
- Configure graceful degradation
```
## Cost Optimization

### Compute Optimization

- CPU: use smaller models or batch processing for non-real-time analysis
- GPU: use GPU only during training, CPU for inference
- Memory: stream data instead of loading it all at once (see the sketch below)
- Storage: archive old logs and models, keep recent ones online
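The memory point is mostly about avoiding read-everything patterns. A minimal sketch of streaming log lines in fixed-size batches (the batch size is illustrative, and `score_batch` is a hypothetical API):

```python
def stream_batches(path, batch_size=256):
    """Yield lists of log lines without loading the whole file into memory."""
    batch = []
    with open(path) as f:
        for line in f:               # The file is iterated lazily, line by line
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                        # Flush the final partial batch
        yield batch

# for batch in stream_batches("sanitized.log"):
#     scores = detector.score_batch(batch)   # score_batch is hypothetical
```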
### Cloud Cost Optimization

```bash
# AWS cost optimization examples

# Train on EC2 Spot Instances (often ~70% cheaper than On-Demand)
aws ec2 run-instances \
    --image-id ami-xxxxx \
    --instance-type p3.2xlarge \
    --instance-market-options '{"MarketType":"spot"}'

# Store models in S3 (cheap long-term storage)
aws s3 cp text_encoder.h5 s3://deepsentry-models/2022-01/

# Use CloudWatch for monitoring (built in, cheap)
# Use Lambda for infrequent retraining (pay per execution)

# Rough monthly cost estimate
# Inference: 0.1 CPU core + 2 GB RAM       = ~$20/month
# Training (1x/month): 1 GPU + 8 CPUs      = ~$200/month
# Storage: 1 GB models + 100 GB logs       = ~$5/month
# Total: ~$225/month for a typical deployment
```
## Checklist: Production Readiness

Before going to production:

- ☐ Security: logs sanitized, access controlled
- ☐ Monitoring: metrics exported, dashboards created
- ☐ Reliability: redundant deployment, health checks
- ☐ Performance: latency acceptable, throughput sufficient
- ☐ Testing: evaluated on representative data, AUC >0.85
- ☐ Alerting: integrations tested, on-call team aware
- ☐ Runbooks: incident response procedures documented
- ☐ Backups: models and configs versioned and backed up
- ☐ Documentation: team trained, procedures recorded
- ☐ Legal: data privacy and compliance reviewed
## Continuous Improvement

After deployment, keep improving:

- Analyze false positives: why did the model flag this? Can it be improved?
- Track metrics: detection rate, false positive rate, latency
- Retrain regularly: monthly or quarterly with the latest data
- Gather feedback: what did operators think of the alerts?
- Update baselines: retrain as systems evolve
- Expand scope: start with one system, then add others