Production Best Practices

Running DeepSentry reliably, securely, and maintainably in production environments.

Deployment Architecture

Recommended Architecture

PRODUCTION DEPLOYMENT ARCHITECTURE

LOG SOURCES (Various)
├─ Servers
├─ Applications
└─ Devices
        │
        ▼
┌──────────────────────────────────────┐
│ LOG AGGREGATION LAYER                │
│ (Fluentd, Logstash, Filebeat)        │
│ • Collection from all sources        │
│ • Parsing and enrichment             │
│ • Routing and filtering              │
└──────────────────────────────────────┘
        │
  ┌─────┴──────────────┬──────────────────┐
  │                    │                  │
  ▼                    ▼                  ▼
┌────────────┐ ┌──────────────────┐ ┌──────────────┐
│ STORAGE    │ │ DEEPSENTRY       │ │ STREAMING    │
│ LAYER      │ │ (Live Monitor)   │ │ PROCESSOR    │
│            │ │                  │ │              │
│ HDFS/S3    │ │ • Load models    │ │ Real-time    │
│ Long-term  │ │ • Score logs     │ │ processing   │
│ storage    │ │ • Manage state   │ │              │
└────────────┘ │ • Threshold mgmt │ └──────────────┘
               │ • Alert          │
               │   generation     │
               └────────┬─────────┘
                        │
                        ▼
             ┌──────────────────────┐
             │ ALERTING SYSTEMS     │
             ├──────────────────────┤
             │ • Slack webhooks     │
             │ • PagerDuty          │
             │ • Email notifications│
             │ • Syslog             │
             │ • Custom integrations│
             └──────────────────────┘

MONITORING & OBSERVABILITY
├─ Prometheus metrics (throughput, latency, errors)
├─ Dashboards (Grafana, CloudWatch)
├─ Log aggregation (ELK, Splunk)
└─ Incident management (on-call rotations)
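
The critical hand-off is from the aggregation layer to the DeepSentry live monitor. Below is a minimal sketch of that hand-off, assuming the aggregation layer forwards newline-delimited log lines over plain TCP (Fluentd and Logstash both support this); the port, scoring call, and alert hook are stand-ins, not a documented DeepSentry interface.


# Sketch: receive newline-delimited log lines from the aggregation layer
# over TCP and score each one. score_line() and send_alert() are stand-ins
# for the live monitor's real scoring and alerting calls.
import socketserver

THRESHOLD = 0.9  # hypothetical anomaly threshold

def score_line(line):
    # Stand-in for the detector's scoring call (e.g. detector.score(line))
    return 0.0

def send_alert(line, score):
    # Stand-in for the alerting integrations (Slack, PagerDuty, ...)
    print(f"ALERT score={score:.3f} line={line}")

class LogHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for raw in self.rfile:                               # one line per log entry
            line = raw.decode("utf-8", errors="replace").rstrip("\n")
            score = score_line(line)
            if score > THRESHOLD:
                send_alert(line, score)

if __name__ == "__main__":
    # Port 5140 is an assumption; match it to your forwarder's output config
    with socketserver.TCPServer(("0.0.0.0", 5140), LogHandler) as server:
        server.serve_forever()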

Security Considerations

Sensitive Information in Logs

Logs often contain sensitive data (API keys, passwords, PII). Before training:


# Sanitize logs before storage
import re

def sanitize_log(message):
    # Remove API keys
    message = re.sub(r'api_key=[^\s]+', 'api_key=REDACTED', message)
    
    # Remove passwords
    message = re.sub(r'password=[^\s]+', 'password=REDACTED', message)
    
    # Remove email addresses (partial)
    message = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+', 'EMAIL_REDACTED', message)
    
    # Remove credit card numbers
    message = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', 'CC_REDACTED', message)
    
    return message

# Apply before preparation
with open("raw.log") as f_in, open("sanitized.log", "w") as f_out:
    for line in f_in:
        timestamp, message = line.rstrip("\n").split(' ', 1)
        message = sanitize_log(message)
        f_out.write(f"{timestamp} {message}\n")
        

Access Control


# Secure file permissions
chmod 700 /data/deepsentry                   # Owner only
chmod 600 /data/deepsentry/models/*.h5       # Restrict model access
chmod 600 /data/deepsentry/logs/*.log        # Restrict logs

# Secure environment variables (don't hardcode credentials)
export DEEPSENTRY_ALERT_WEBHOOK_TOKEN="xxxxx"
# Or use secrets management: HashiCorp Vault, AWS Secrets Manager, etc.

# Audit logging
journalctl -u deepsentry-live | grep ALERT   # Review all alerts
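
Inside the live monitor, read the webhook token from the environment at runtime rather than from a file checked into version control. A minimal sketch, assuming a Slack-style webhook; the URL path is illustrative, and the variable name matches the export above.


# Sketch: read the alert webhook token from the environment instead of
# hardcoding it. The webhook URL structure is hypothetical.
import os
import requests

def post_alert(message):
    token = os.environ.get("DEEPSENTRY_ALERT_WEBHOOK_TOKEN")
    if not token:
        raise RuntimeError("DEEPSENTRY_ALERT_WEBHOOK_TOKEN is not set")
    requests.post(
        "https://hooks.slack.com/services/" + token,   # hypothetical webhook path
        json={"text": message},
        timeout=5,
    )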
        

Reliability and High Availability

Redundant Deployment

Run multiple instances for fault tolerance:


# Kubernetes with multiple replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepsentry-live
spec:
  replicas: 3                    # Run 3 instances
  strategy:
    type: RollingUpdate          # Gradual updates
    rollingUpdate:
      maxUnavailable: 1          # Keep 2+ running during updates
  template:
    # ... pod spec ...
        

Health Checks


# Docker health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import requests, sys; r = requests.get('http://localhost:8080/health'); sys.exit(0 if r.status_code == 200 else 1)"

# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

# Readiness probe (ready to accept traffic)
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
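
These probes assume the live monitor serves /health and /ready on port 8080. If your deployment does not expose them yet, the following is a minimal standard-library sketch; the readiness flag is a placeholder for real state such as "models loaded".


# Sketch: /health and /ready endpoints for the probes above, standard
# library only. Run the server in a background thread of the monitor process.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

MODELS_LOADED = False  # placeholder; set to True once models are loaded

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self._respond(200, b"ok")                        # process is alive
        elif self.path == "/ready":
            ready = MODELS_LOADED
            self._respond(200 if ready else 503, b"ready" if ready else b"loading")
        else:
            self._respond(404, b"not found")

    def _respond(self, code, body):
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

def start_probe_server(port=8080):
    server = HTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server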
        

Monitoring DeepSentry Itself

Key Metrics to Track

Metric                      What It Means                        Alert If
Throughput (msgs/sec)       How fast logs are being processed    Drops 50% from baseline
Latency (ms per message)    How long scoring takes               Exceeds 100ms
Queue depth                 Logs waiting to be processed         Continuously growing
Anomaly rate (%)            Percentage of logs flagged           Spikes >10% or drops to 0%
Model load time (ms)        Time to load models from disk        Exceeds 5 seconds
Memory usage (MB)           RAM consumed by process              Exceeds 80% available
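
A sketch of evaluating the throughput and anomaly-rate conditions from counters sampled on a fixed interval; the baseline throughput and the 60-second window are assumptions to tune for your environment.


# Sketch: check the throughput and anomaly-rate alert conditions from
# per-window counters. Baseline and window size are illustrative values.
def check_health(messages_in_window, anomalies_in_window,
                 window_seconds=60, baseline_throughput=1000.0):
    alerts = []
    throughput = messages_in_window / window_seconds
    if throughput < 0.5 * baseline_throughput:               # drops 50% from baseline
        alerts.append(f"throughput dropped to {throughput:.1f} msg/s "
                      f"(baseline {baseline_throughput:.1f})")
    if messages_in_window:
        anomaly_rate = 100.0 * anomalies_in_window / messages_in_window
        if anomaly_rate > 10.0 or anomaly_rate == 0.0:       # spikes >10% or drops to 0%
            alerts.append(f"anomaly rate at {anomaly_rate:.1f}%")
    return alerts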

Export Metrics


# Prometheus metrics in live monitor

import time

from prometheus_client import start_http_server, Counter, Histogram, Gauge

# Define metrics
messages_processed = Counter(
    'deepsentry_messages_processed_total',
    'Total messages processed'
)

anomalies_detected = Counter(
    'deepsentry_anomalies_total',
    'Total anomalies detected'
)

processing_time = Histogram(
    'deepsentry_processing_seconds',
    'Time to process each message'
)

anomaly_score_gauge = Gauge(
    'deepsentry_last_anomaly_score',
    'Most recent anomaly score'
)

# Start metrics server
start_http_server(8000)

# Use in main loop
for log_entry in stream:
    start = time.time()
    
    score = detector.score(log_entry)
    anomaly_score_gauge.set(score)
    
    elapsed = time.time() - start
    processing_time.observe(elapsed)
    messages_processed.inc()
    
    if score > threshold:
        anomalies_detected.inc()
        

Model Management and Versioning

Versioning Strategy


# Save models with version and metadata
models/
├── text_encoder_v1.0.h5
├── text_encoder_v1.0_metadata.json
├── anomaly_detector_v1.0.h5
├── anomaly_detector_v1.0_metadata.json
├── text_encoder_v1.1.h5
├── text_encoder_v1.1_metadata.json
└── ...

# Metadata includes
{
  "version": "1.0",
  "training_date": "2022-01-15",
  "training_data_days": 14,
  "training_samples": 1000000,
  "model_hash": "sha256:abc123",
  "architecture": {
    "embedding_size": 128,
    "encoder_output_size": 128,
    "lstm_hidden_size": 64
  },
  "validation_auc": 0.92,
  "notes": "Trained on HDFS logs from Jan 1-14"
}
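
A sketch of writing this metadata file next to a saved model, assuming the layout above; the hashing helper fills the model_hash field, and the extra dict carries the remaining fields.


# Sketch: write versioned metadata alongside a saved model file. Field names
# mirror the example above; paths and values are illustrative.
import hashlib
import json
from datetime import date

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def write_metadata(model_path, version, extra):
    metadata = {
        "version": version,
        "training_date": date.today().isoformat(),
        "model_hash": sha256_of(model_path),
        **extra,                                  # training_samples, architecture, ...
    }
    meta_path = model_path.replace(".h5", "_metadata.json")
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return meta_path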
        

A/B Testing New Models


# Run new model in parallel, shadow mode (don't alert)
class ABTester:
    def __init__(self, model_current, model_candidate):
        self.current = model_current
        self.candidate = model_candidate
        self.results = {
            'current': [],
            'candidate': []
        }
    
    def score(self, sequence):
        # Score with both models
        current_score = self.current.score(sequence)
        candidate_score = self.candidate.score(sequence)
        
        # Log results for analysis
        self.results['current'].append(current_score)
        self.results['candidate'].append(candidate_score)
        
        # Return both scores; alerts should be driven by the current
        # model's score only, the candidate runs in shadow mode
        return current_score, candidate_score
    
    def compare_metrics(self):
        """Compare performance after collecting data"""
        import numpy as np
        return {
            'mean_diff': np.mean(self.results['candidate']) - np.mean(self.results['current']),
            'std_current': np.std(self.results['current']),
            'std_candidate': np.std(self.results['candidate']),
            'correlation': np.corrcoef(self.results['current'], self.results['candidate'])[0, 1]
        }
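
A usage sketch of ABTester in shadow mode, using stand-in models with a score() method in place of the real current and candidate detectors.


# Sketch: run the candidate in shadow mode, then compare score distributions.
# The stub models only mimic a score() interface for illustration.
import random

class _StubModel:
    def __init__(self, bias):
        self.bias = bias
    def score(self, sequence):
        return random.random() + self.bias

tester = ABTester(_StubModel(0.0), _StubModel(0.05))

for _ in range(10_000):
    current_score, _ = tester.score("synthetic sequence")   # alert on current only
    # candidate scores are recorded inside ABTester for later comparison

print(tester.compare_metrics())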
        

Retraining Strategy

When to Retrain

Retrain models when:

  • Performance degrades: AUC drops >10% on recent data
  • Data distribution changes: Mean/std of features shift significantly (see the drift-check sketch after this list)
  • Anomaly rate spikes: Normal baselines have changed
  • System changes: New hardware, software updates, configuration changes
  • Regular schedule: E.g., monthly retraining with recent data
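
A minimal drift-check sketch for the distribution-shift condition, comparing recent feature statistics against the mean/std recorded at training time; the 20% tolerance is an assumption.


# Sketch: flag a distribution shift by comparing recent feature statistics
# against those captured at training time. Tolerance is illustrative.
import numpy as np

def distribution_shifted(recent_features, training_mean, training_std, tolerance=0.2):
    recent_mean = np.mean(recent_features, axis=0)
    recent_std = np.std(recent_features, axis=0)
    # Mean shift measured in units of the training standard deviation
    mean_shift = np.abs(recent_mean - training_mean) / (np.abs(training_std) + 1e-8)
    # Relative change in spread
    std_ratio = recent_std / (training_std + 1e-8)
    return bool(np.any(mean_shift > tolerance) or
                np.any(np.abs(std_ratio - 1.0) > tolerance))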

Retraining Pipeline


#!/bin/bash
# Monthly retraining job

set -e  # Exit on error

DATA_DIR="/data/deepsentry"
ARCHIVE_DIR="/archive/models"
DATE=$(date +%Y%m%d)

echo "Starting retraining at $(date)"

# Collect recent data
echo "Collecting training data..."
find /var/log -name "*.log" -mtime -30 | \
  xargs cat | \
  sort > $DATA_DIR/recent_logs.log

# Validate data
echo "Validating logs..."
line_count=$(wc -l < $DATA_DIR/recent_logs.log)
if [ $line_count -lt 100000 ]; then
    echo "ERROR: Not enough logs ($line_count < 100000)"
    exit 1
fi

# Run full pipeline
echo "Preparing data..."
python src/tx/prepare.py --data-dir $DATA_DIR

echo "Training text encoder..."
python src/tx/train.py --data-dir $DATA_DIR

echo "Encoding dataset..."
python src/tx/encode.py --data-dir $DATA_DIR

echo "Training anomaly detector..."
python src/an/train.py --data-dir $DATA_DIR

# Evaluate on holdout test set
echo "Evaluating..."
python src/an/analysis.py --data-dir $DATA_DIR

# Check performance
AUC=$(grep "AUC:" eval_results/metrics.json | cut -d: -f2)
if (( $(echo "$AUC < 0.85" | bc -l) )); then
    echo "ERROR: AUC too low ($AUC < 0.85)"
    exit 1
fi

# Backup old models
echo "Backing up old models..."
mkdir -p $ARCHIVE_DIR/$DATE
cp -r $DATA_DIR/*_model $ARCHIVE_DIR/$DATE/

# Deploy new models
echo "Deploying new models..."
systemctl restart deepsentry-live

echo "Retraining complete at $(date)"
        

Incident Response

When DeepSentry Detects Anomalies

  1. Alert: Send notification to on-call team
  2. Investigate: Look at logs around the anomaly
  3. Understand: What system state caused it?
  4. Respond: Take action or declare false positive
  5. Document: Add to runbook for future reference

Runbook Template


# Runbook: DeepSentry Anomaly Alert

## Alert: High Anomaly Score in HDFS Cluster

### Impact
- HDFS operations may be degraded
- Data replication might be slow
- File I/O operations could timeout

### Immediate Steps (first 5 minutes)
1. Check cluster health dashboard
2. Run: `hdfs fsck /`
3. Check DataNode logs for ERROR messages
4. Verify disk space: `df -h`

### Diagnosis (next 15 minutes)
1. Look at logs 5 minutes before anomaly timestamp
2. Search for: ERROR, Exception, FAILED, timeout
3. Cross-reference with system metrics (CPU, memory, network)

### Mitigation
- If disk full: clear temporary files
- If nodes offline: restart services
- If network issue: check connectivity
- If unknown: escalate to platform team

### Prevention
- Add trend monitoring for disk usage
- Set up node health checks
- Configure graceful degradation
        

Cost Optimization

Compute Optimization

  • CPU: Use smaller models or batch processing for non-real-time analysis
  • GPU: Use GPU only during training, CPU for inference
  • Memory: Stream data instead of loading all at once (see the streaming sketch after this list)
  • Storage: Archive old logs and models, keep recent ones online
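
For the memory point, a generator keeps only one log line in memory at a time regardless of file size; a minimal sketch follows, with an illustrative path and scoring call in the usage comment.


# Sketch: stream log lines lazily instead of reading the whole file into memory.
def stream_logs(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Usage: memory stays flat regardless of file size
# for line in stream_logs("/data/deepsentry/recent_logs.log"):
#     score = detector.score(line)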

Cloud Cost Optimization


# AWS cost optimization example

# Train on EC2 Spot instances (70% cheaper)
aws ec2 run-instances \
  --image-id ami-xxxxx \
  --instance-type p3.2xlarge \
  --instance-market-options '{"MarketType":"spot"}'

# Store models in S3 (cheap long-term storage)
aws s3 cp text_encoder.h5 s3://deepsentry-models/2022-01/

# Use CloudWatch for monitoring (built-in, cheap)
# Use Lambda for infrequent retraining (pay per execution)

# Estimate monthly cost
# Inference: 0.1 CPU core + 2GB RAM = ~$20/month
# Training (1x/month): 1 GPU + 8 CPU = ~$200/month
# Storage: 1GB models + 100GB logs = ~$5/month
# Total: ~$225/month for typical deployment
        

Checklist: Production Readiness

Before Going to Production:
  • ☐ Security: Logs sanitized, access controlled
  • ☐ Monitoring: Metrics exported, dashboards created
  • ☐ Reliability: Redundant deployment, health checks
  • ☐ Performance: Latency acceptable, throughput sufficient
  • ☐ Testing: Evaluated on representative data, AUC >0.85
  • ☐ Alerting: Integrations tested, on-call aware
  • ☐ Runbooks: Incident response procedures documented
  • ☐ Backups: Models and configs versioned and backed up
  • ☐ Documentation: Team trained, procedures recorded
  • ☐ Legal: Data privacy and compliance reviewed

Continuous Improvement

After deployment, continuously improve:

  • Analyze false positives: Why did the model flag this? Can we improve?
  • Track metrics: Detection rate, false positive rate, latency
  • Retrain regularly: Monthly or quarterly with latest data
  • Gather feedback: What did operators think of alerts?
  • Update baselines: As systems evolve, retrain
  • Expand scope: Start with one system, add others