Examples and Real-World Use Cases
Example 1: HDFS Cluster Monitoring (Reference Implementation)
Scenario
A large data center runs a Hadoop HDFS cluster with 100+ nodes. They want to detect failures and anomalies early.
Implementation
# Step 1: Collect 2 weeks of clean HDFS logs
ssh hadoop-master "tar -czf hdfs_logs.tar.gz /var/log/hdfs/"
scp hadoop-master:hdfs_logs.tar.gz .
# Step 2: Extract to data directory
mkdir -p /data/hdfs_logs
tar -xzf hdfs_logs.tar.gz -C /data/hdfs_logs/
# Step 3: Prepare data
python src/tx/prepare.py \
    --config dockerconfig/text_autoencoder_prepare_data.yml \
    --data-dir /data/hdfs_logs
# Step 4: Train models
bash dockerrun/run_text_autoencoder_train.sh
bash dockerrun/run_text_autoencoder_encode_dataset.sh
bash dockerrun/run_anomaly_detector_train.sh
# Step 5: Deploy live monitoring
nohup bash dockerrun/run_live_monitoring.sh > /var/log/deepsentry.log 2>&1 &
Results
- Baseline: 0.2 anomaly score for normal operations
- Detected failures: 3.5+ anomaly scores, 95% accuracy
- False positive rate: 2-3% under normal conditions
- Mean detection latency: 30 seconds after anomaly starts
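A minimal sketch of how these scores could drive alerting. It assumes the live monitor appends one "timestamp score" pair per line to a scores file; both that format and the file path are hypothetical placeholders, so adapt them to your actual monitoring output:
# alert_on_scores.py -- hedged sketch, not part of the project.
# Assumes one "timestamp score" pair per line (hypothetical output format).
ALERT_CUTOFF = 1.0  # well above the ~0.2 baseline, well below the 3.5+ failure scores

def check_scores(path="/data/hdfs_logs/anomaly_scores.txt"):
    alerts = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip lines that don't match the assumed format
            timestamp, score = parts
            if float(score) >= ALERT_CUTOFF:
                alerts.append((timestamp, float(score)))
    return alerts

if __name__ == "__main__":
    for ts, score in check_scores():
        print(f"ALERT {ts}: anomaly score {score:.2f}")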
Example 2: Microservices API Monitoring
Scenario
A SaaS platform has 20+ microservices writing to centralized logs. They want to detect service degradation and unusual request patterns.
Custom Setup
# Logs are in JSON format from API gateway
# {"timestamp":"2022-01-15T14:23:45Z","service":"user-api","level":"INFO","message":"Request received"}
# Custom log parser
# src/microservices_parser.py
import json
from datetime import datetime
class MicroservicesLogParser:
    def parse_message(self, line):
        data = json.loads(line)
        # fromisoformat() rejects a trailing "Z" on Python < 3.11, so normalize it first
        ts = datetime.fromisoformat(data['timestamp'].replace('Z', '+00:00'))
        timestamp = ts.strftime("%y%m%d %H%M%S")
        # Combine service and message for richer context
        message = f"{data['service']}: {data['message']}"
        return timestamp, message
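For a quick sanity check, the parser can be exercised on the sample gateway line shown in the comment above (sketch only):
# Sanity-check the parser on the sample gateway line.
parser = MicroservicesLogParser()
sample = '{"timestamp":"2022-01-15T14:23:45Z","service":"user-api","level":"INFO","message":"Request received"}'
timestamp, message = parser.parse_message(sample)
print(timestamp)  # 220115 142345
print(message)    # user-api: Request received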
# Training configuration (tuned for API logs)
# dockerconfig/text_autoencoder_train.yml
parameters:
  embedding_size: 32        # API logs are consistent
  encoder_output_size: 64
  epochs: 5                 # Less training data needed
  batch_size: 64
  bidirectional: false      # Logs are mostly sequential
# Anomaly detection config (APIs are fast)
parameters:
  sequence_length: 5        # Shorter sequences
  lstm_hidden_size: 32
  epochs: 8
  bidirectional: true       # Need context both ways
Results
- Detects service restarts and crashes within 10 seconds
- Catches unusual API request patterns (e.g., spike in 404s)
- Alerts on database connection failures
- Distinguishes between normal traffic spikes and actual anomalies
Example 3: Database Server Monitoring
Scenario
A PostgreSQL database server generates detailed query and transaction logs. The team needs to catch performance degradation and errors.
Log Extraction
# PostgreSQL logs go to syslog
# Prepend a timestamp in the expected YYMMDD HHMMSS format at ingestion time
awk '{print strftime("%y%m%d %H%M%S") " " $0}' /var/log/postgresql.log > /data/postgres.log
# Or directly from postgres logs with log_line_prefix configured
# log_line_prefix = '[%t] %u@%d %p: '
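If log_line_prefix is set as above, the %t timestamp can be rewritten into the expected YYMMDD HHMMSS prefix with a short Python pass. This is a hedged sketch: the regex assumes exactly the prefix shown, multi-line continuation lines are skipped, and the file paths are the ones used earlier in this example.
# convert_pg_logs.py -- sketch assuming log_line_prefix = '[%t] %u@%d %p: '
import re
from datetime import datetime

# %t renders timestamps like "2022-01-15 14:23:45 UTC"
LINE_RE = re.compile(r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+\] (?P<rest>.*)$")

def convert(in_path="/var/log/postgresql.log", out_path="/data/postgres.log"):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            m = LINE_RE.match(line)
            if not m:
                continue  # continuation lines (multi-line queries) are skipped in this sketch
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            dst.write(ts.strftime("%y%m%d %H%M%S") + " " + m.group("rest") + "\n")

if __name__ == "__main__":
    convert()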
Tuning for Databases
# Databases have very structured logs
# Tune for high accuracy
parameters:
  embedding_size: 64
  encoder_output_size: 128  # Richer representation
  epochs: 15                # More training for precision
  batch_size: 32
  bidirectional: true       # Transaction context matters
  dropout: 0.3              # Prevent overfitting
# Anomaly detection
parameters:
  sequence_length: 15       # Longer sequences for transaction context
  lstm_hidden_size: 128     # Larger capacity
  epochs: 20                # More training
  threshold_multiplier: 3.0 # Be conservative with alerts
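As a rough illustration of what threshold_multiplier means, assuming the detector derives its alert threshold as mean plus multiplier times standard deviation of the scores seen on clean data (check the detector code for the exact rule; the scores below are made-up placeholders):
import numpy as np

# Hypothetical anomaly scores observed on clean validation data.
clean_scores = np.array([0.18, 0.22, 0.20, 0.25, 0.19, 0.21])

threshold_multiplier = 3.0  # conservative: fewer, higher-confidence alerts
threshold = clean_scores.mean() + threshold_multiplier * clean_scores.std()
print(f"alert threshold = {threshold:.3f}")  # raising the multiplier raises the bar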
Detected Anomalies
- Queries taking longer than normal
- Unusual connection patterns
- Transaction deadlocks
- Index corruption or missing indexes
- Disk space issues
Example 4: Kubernetes Cluster Monitoring
Scenario
A Kubernetes cluster runs 50+ nodes and 500+ pods. The team wants to detect pod crashes, network issues, and resource exhaustion.
Log Aggregation
# Collect logs from kubelet
cat /var/log/kubelet.log | awk '{print strftime("%y%m%d %H%M%S") " " $0}' > k8s.log
# Or from container runtime (Docker)
journalctl -u docker --no-pager | \
awk '{print strftime("%y%m%d %H%M%S") " " $0}' > container.log
# Combine multiple sources
cat k8s.log container.log | sort -k1,2 > combined.log
K8s-Specific Configuration
# Kubernetes logs are verbose with lots of repetition
# Tune to filter noise
# Data preparation
parameters:
  root_dir: {ROOT_DATA_DIR}
  # Pre-filter unimportant logs
  min_word_count: 3         # Ignore very rare events
  max_vocab_size: 5000      # K8s has fewer unique patterns than you'd think

# Training
parameters:
  epochs: 12
  embedding_size: 48        # K8s logs are somewhat uniform

# Anomaly detection focuses on rare patterns
parameters:
  sequence_length: 8
  lstm_hidden_size: 64
  epochs: 15
  threshold_multiplier: 2.2 # K8s noise is high, be permissive
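To make the effect of min_word_count and max_vocab_size concrete, the pre-filtering amounts to roughly the following (an illustrative sketch, not the project's actual preprocessing code; the sample messages are invented):
from collections import Counter

def build_vocab(messages, min_word_count=3, max_vocab_size=5000):
    # Count token frequencies across all log messages.
    counts = Counter(tok for msg in messages for tok in msg.split())
    # Drop tokens rarer than min_word_count, then keep the most frequent ones.
    frequent = [(tok, n) for tok, n in counts.most_common(max_vocab_size) if n >= min_word_count]
    return {tok: idx for idx, (tok, _) in enumerate(frequent)}

vocab = build_vocab(["pod nginx-1 restarted", "pod nginx-2 restarted", "pod nginx-1 OOMKilled"])
print(vocab)  # rare tokens (seen fewer than 3 times) are excluded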
Detected Issues
- Pod CrashLoopBackOff patterns
- Node NotReady and MemoryPressure conditions
- Network plugin failures
- Scheduler backoff and binding failures
- Volume mount and storage errors
Example 5: Web Application Log Monitoring
Scenario
A Django/Flask web application generates logs from multiple processes. The team wants to detect:
- Internal server errors (500s)
- Authentication failures
- Database query timeouts
- Unusual request volumes
Application Instrumentation
# Flask example with structured logging
import logging
import json
from datetime import datetime
class LogFormatter(logging.Formatter):
    def format(self, record):
        timestamp = datetime.utcnow().strftime("%y%m%d %H%M%S")
        message = record.getMessage()
        return f"{timestamp} {message}"

# Configure app logging
handler = logging.FileHandler('app.log')
handler.setFormatter(LogFormatter())
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # the root logger defaults to WARNING, which would drop info() calls
logger.addHandler(handler)
# Log events with consistent format
logger.info("POST /api/users - 201 Created")
logger.error("Database connection timeout: 30s")
logger.warning("Authentication failed: IP 192.168.1.100")
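Before pointing the pipeline at app.log, it can be worth verifying that every line matches the expected YYMMDD HHMMSS message shape. This is a small standalone check, independent of the project code:
import re

# Expected shape: six-digit date, six-digit time, then the message text.
LINE_RE = re.compile(r"^\d{6} \d{6} .+")

with open("app.log") as fh:
    bad = [i for i, line in enumerate(fh, 1) if not LINE_RE.match(line)]

print(f"{len(bad)} malformed lines" if bad else "all lines match YYMMDD HHMMSS <message>")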
Baseline Configuration
# Web app logs are simple and consistent
parameters:
  embedding_size: 32        # Applications speak uniformly
  encoder_output_size: 48
  epochs: 8
  batch_size: 64

# Anomaly detection
parameters:
  sequence_length: 5        # HTTP requests are quick
  lstm_hidden_size: 32
  threshold_multiplier: 2.0 # Web patterns are distinctive
Example 6: Time-Series Sensor Data (Non-Traditional)
Scenario
An industrial monitoring system produces sensor logs. While these are not traditional text logs, the same approach works:
# Sensor logs look like:
220115 14:23:45 SENSOR_001: temperature=42.3°C status=OK
220115 14:23:46 SENSOR_002: pressure=101.2kPa status=OK
220115 14:23:47 SENSOR_001: temperature=42.5°C status=WARN
# They can be processed like text
# The text encoder learns that:
# - temperatures in 40-45°C range are normal
# - status=WARN is less common than OK
# - pressure values cluster around 101kPa
# The anomaly detector learns:
# - SENSOR_001 usually sees gradual temperature changes
# - Rapid jumps (e.g., 42.3 → 48.1) are anomalous
# - Status should not jump from OK to ERROR without WARN
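If the readings originate as structured records rather than text, they can be rendered into the same log shape before training. The sketch below assumes each record carries a Unix timestamp, a sensor id, a dict of readings, and a status; all names are illustrative:
from datetime import datetime, timezone

def record_to_log_line(ts, sensor_id, readings, status):
    # Render a structured sensor record as a text log line the encoder can consume.
    stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%y%m%d %H:%M:%S")
    fields = " ".join(f"{k}={v}" for k, v in readings.items())
    return f"{stamp} {sensor_id}: {fields} status={status}"

print(record_to_log_line(1642256625, "SENSOR_001", {"temperature": "42.3°C"}, "OK"))
# 220115 14:23:45 SENSOR_001: temperature=42.3°C status=OK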
Comparison: Configuration Across Use Cases
| Use Case | embedding_size | sequence_length | epochs | threshold_multiplier |
|---|---|---|---|---|
| HDFS (complex) | 128 | 20 | 15 | 2.5 |
| Microservices (fast) | 32 | 5 | 8 | 2.0 |
| Database (precise) | 64 | 15 | 20 | 3.0 |
| Kubernetes (noisy) | 48 | 8 | 12 | 2.2 |
| Web App (simple) | 32 | 5 | 8 | 2.0 |
Pattern: Getting Started with Your System
- Characterize your logs: How many messages per day? What's message complexity?
- Collect baseline: 2+ weeks of clean operations
- Start simple: Use default config, run one pipeline
- Evaluate: Check AUC and other metrics (see the evaluation sketch after this list)
- Tune: Adjust parameters based on results
- Iterate: Retrain as you learn more
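For the evaluation step above, ROC AUC can be computed with scikit-learn once anomaly scores exist for a set of log windows with known labels. A generic sketch with placeholder labels and scores:
from sklearn.metrics import roc_auc_score

# 1 = known anomalous window, 0 = normal window (placeholder labels and scores).
labels = [0, 0, 0, 1, 0, 1]
scores = [0.18, 0.22, 0.25, 3.60, 0.21, 4.10]

print(f"ROC AUC: {roc_auc_score(labels, scores):.3f}")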