Reference: Commands, APIs, and Module Documentation
Command-Line Interface (CLI)
Data Preparation
python src/tx/prepare.py --config dockerconfig/text_autoencoder_prepare_data.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR} in config
--verbose Debug output
Output files:
prepared_data/
├── train.txt # Training messages
├── val.txt # Validation messages
└── vocab.pkl # Word ID mapping
Text Autoencoder Training
python src/tx/train.py --config dockerconfig/text_autoencoder_train.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--epochs Override epochs in config
--batch-size Override batch size
--gpu Force GPU usage
--verbose Debug output
Output files:
text_autoencoder_model/
├── encoder.h5 # Trained encoder (for encoding)
├── decoder.h5 # Decoder (for inspection)
├── training_log.txt
└── metadata.json # Parameters used
Dataset Encoding
python src/tx/encode.py --config dockerconfig/text_autoencoder_dataset_encoder.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--batch-size Encoding batch size
--verbose Debug output
Output files:
encoded_dataset/
├── train_encoded.pkl # Training vectors (numpy array)
├── val_encoded.pkl # Validation vectors
└── test_encoded.pkl # Test vectors
Anomaly Detector Training
python src/an/train.py --config dockerconfig/anomaly_detector_train.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--epochs Override epochs
--batch-size Override batch size
--gpu Force GPU usage
--verbose Debug output
Output files:
anomaly_trained_model/
├── detector.h5 # Trained LSTM anomaly detector
├── training_log.txt
└── metadata.json
Anomaly Evaluation
python src/an/eval.py --config dockerconfig/anomaly_detector_eval.yml
# Or with ground truth labels:
python src/an/analysis.py --config dockerconfig/anomaly_detector_eval_labeled.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--threshold Anomaly score threshold
--output-dir Save results here
--verbose Debug output
Output files (unlabeled):
eval_results/
├── anomaly_scores.pkl # All scores for test data
├── threshold.json # Recommended threshold
└── statistics.json # Mean, std, min, max scores
Output files (labeled):
eval_results/
├── anomaly_scores.pkl
├── roc_curve.png # ROC curve plot
├── precision_recall.png # Precision-recall curve
├── metrics.json # AUC, accuracy, F1, etc.
└── confusion_matrix.png
Live Monitoring
python src/live/main.py --config dockerconfig/live_monitoring_config.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--log-file Log file to monitor
--output-file Where to write alerts
--threshold Override anomaly threshold
--verbose Debug output
--daemonize Run in background
Python APIs
Configuration Loader
from src.config_loader import Config
# Load from file
config = Config.load_from_file(
"dockerconfig/train.yml",
root_data_dir="/path/to/data"
)
# Get values with defaults
epochs = config.get("epochs", default=10)
embedding_size = config.get("embedding_size", default=64)
# Get file paths (resolves placeholders)
output_dir = config.get_path("output_model_dir")
# Get nested config
training_config = config.get_section("training")
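The placeholder expansion that get_path performs can be illustrated with plain string substitution; a minimal sketch of the idea (this is not the Config implementation, just the behavior it provides):

```python
def resolve_path(template, root_data_dir):
    """Expand the {ROOT_DATA_DIR} placeholder in a configured path."""
    return template.replace("{ROOT_DATA_DIR}", root_data_dir)

# A configured path like "{ROOT_DATA_DIR}/prepared_data/train.txt"
# resolves against the root data directory passed at load time.
print(resolve_path("{ROOT_DATA_DIR}/prepared_data/train.txt", "/data"))
```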
Text Encoding
from src.tx.train import TextEncoderHandle
from src.tx.encode import DatasetEncoder
# Load trained text encoder
encoder = TextEncoderHandle(model_path="/path/to/encoder.h5")
# Encode a message
vector = encoder.encode("BlockReport processing took 120ms")
# vector is np.array of shape (128,)
# Batch encoding
messages = ["Message 1", "Message 2", ...]
vectors = encoder.encode_batch(messages)
# vectors is np.array of shape (len(messages), 128)
# Dataset encoding
dataset_encoder = DatasetEncoder(
encoder=encoder,
vocab_path="/path/to/vocab.pkl"
)
vectors_list = dataset_encoder.encode_file("/path/to/logs.txt")
Anomaly Detection
from src.an.eval import BidirectionalLSTMAnomalyDetectorHandle
import numpy as np
# Load trained detector
detector = BidirectionalLSTMAnomalyDetectorHandle(
model_path="/path/to/detector.h5"
)
# Score a sequence of vectors
sequence = np.random.randn(10, 128) # 10 vectors of 128 dims
score = detector.get_anomaly_score(sequence)
# score is a float (reconstruction error)
# Batch scoring
sequences = [seq1, seq2, ...]
scores = detector.batch_score(sequences)
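A recommended threshold like the one written to threshold.json is typically derived from the score distribution. A minimal sketch of one common rule, mean plus a multiple of the standard deviation (an assumption about the statistic used, mirroring the threshold_multiplier parameter of the live monitor, not necessarily the project's exact formula):

```python
import numpy as np

def suggest_threshold(scores, multiplier=2.5):
    """Suggest an anomaly threshold as mean + multiplier * std of scores.

    multiplier mirrors the live monitor's threshold_multiplier; 2.5 is an
    illustrative default, not a project constant.
    """
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean() + multiplier * scores.std())

# Scores on mostly-normal data cluster low; outliers exceed the threshold.
normal_scores = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13]
print(suggest_threshold(normal_scores))
```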
Log Parsing
from src.raw_log_file_process import UtahLogDatasetParseTools
# Parse HDFS logs in Utah format
parser = UtahLogDatasetParseTools()
# Extract messages from raw logs
with open("HDFS_1.log") as f:
for timestamp, message in parser.parse_log_file(f):
print(f"{timestamp}: {message}")
# Load labels if available
labels = parser.load_labels("labels.txt")
# labels is dict: {block_id: (0 or 1)}
Live Anomaly Detection
from src.live.main import DataStream, AnomalyIdentifier
# Create log stream
stream = DataStream(
log_file="/var/log/app.log",
tail_mode=True,
batch_interval=1.0
)
# Create anomaly detector
detector = AnomalyIdentifier(
text_encoder="/path/to/encoder.h5",
anomaly_model="/path/to/detector.h5",
sequence_length=10,
threshold_multiplier=2.5,
window_size=100
)
# Process logs
for log_entry in stream:
is_anomaly, score = detector.check(log_entry)
if is_anomaly:
print(f"ANOMALY: {log_entry} (score: {score})")
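Internally, a detector like AnomalyIdentifier must buffer the last sequence_length encoded vectors before it can score anything. A minimal sketch of that sliding-window bookkeeping using collections.deque (the class name and scoring callback here are illustrative placeholders, not the project's implementation):

```python
import collections
import numpy as np

class SlidingWindowScorer:
    """Buffer encoded log vectors and score only once the window is full."""

    def __init__(self, score_fn, sequence_length=10):
        self.score_fn = score_fn
        # deque with maxlen drops the oldest vector automatically
        self.window = collections.deque(maxlen=sequence_length)

    def push(self, vector):
        """Add one encoded vector; return a score, or None if too few vectors."""
        self.window.append(vector)
        if len(self.window) < self.window.maxlen:
            return None  # not enough history yet
        return self.score_fn(np.stack(self.window))

# Toy score function: mean absolute value over the window.
scorer = SlidingWindowScorer(lambda seq: float(np.abs(seq).mean()),
                             sequence_length=3)
for v in [np.zeros(4), np.zeros(4), np.ones(4)]:
    result = scorer.push(v)
print(result)  # score of the first full window
```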
File Formats
Raw Log Format (Utah)
YYMMDD HHMMSS MESSAGE TEXT
220115 142345 Got block report from DataNode 192.168.1.100 with 50000 blocks
220115 142346 BlockReport processing took 120ms for 50000 blocks
220115 142347 Rescan of DataNode 192.168.1.100 finished
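A line in the YYMMDD HHMMSS layout can be split into a timestamp and message with the standard library alone; a minimal sketch, independent of the project's UtahLogDatasetParseTools:

```python
from datetime import datetime

def parse_utah_line(line):
    """Split 'YYMMDD HHMMSS MESSAGE TEXT' into (datetime, message)."""
    date_part, time_part, message = line.rstrip("\n").split(" ", 2)
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%y%m%d %H%M%S")
    return timestamp, message

ts, msg = parse_utah_line("220115 142345 Got block report from DataNode 192.168.1.100")
print(ts.isoformat(), msg)
```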
Labels File Format
block_id: anomaly_label
blk_123456789: 1 # Anomalous
blk_987654321: 0 # Normal
blk_111111111: 1 # Anomalous
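The labels format above can also be read without the project parser; a minimal sketch that tolerates the trailing comments shown in the example (the loader name is illustrative, not the project API):

```python
def load_label_lines(lines):
    """Parse 'block_id: label' lines into {block_id: int}, skipping comments."""
    labels = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        block_id, value = line.split(":", 1)
        labels[block_id.strip()] = int(value)
    return labels

labels = load_label_lines([
    "blk_123456789: 1  # Anomalous",
    "blk_987654321: 0  # Normal",
])
print(labels)
```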
Pickled Data Format
Encoded vectors and cached data use Python pickle format:
import pickle
import numpy as np
# Load encoded vectors
with open("encoded_dataset/train_encoded.pkl", "rb") as f:
vectors = pickle.load(f)
# vectors is numpy array of shape (num_samples, 128)
# Load vocabulary
with open("prepared_data/vocab.pkl", "rb") as f:
vocab = pickle.load(f)
# vocab is dict: {word: id}
Model Format (HDF5)
Neural network models are saved in Keras HDF5 format:
import tensorflow as tf
# Load text encoder
encoder = tf.keras.models.load_model("text_autoencoder_model/encoder.h5")
# Load anomaly detector
detector = tf.keras.models.load_model("anomaly_trained_model/detector.h5")
# Make predictions
output = encoder.predict(input_data)
Project Structure Reference
deepsentry/
├── src/ # Main source code
│ ├── tx/ # Text encoding pipeline
│ │ ├── prepare.py # Data preparation
│ │ ├── train.py # Train text autoencoder
│ │ └── encode.py # Encode full dataset
│ ├── an/ # Anomaly detection
│ │ ├── train.py # Train anomaly detector
│ │ ├── eval.py # Evaluate (unlabeled)
│ │ └── analysis.py # Evaluate (labeled)
│ ├── live/ # Live monitoring
│ │ └── main.py # Real-time detection
│ ├── config_loader.py # Configuration management
│ ├── raw_log_file_process.py # Log parsing
│ ├── time_tools.py # Time normalization
│ └── dataset_labels.py # Label handling
├── ta/ # Text autoencoder package
├── kad/ # Keras anomaly detector package
├── t/ # Tests
│ └── test_smoke.py
├── dockerconfig/ # YAML configuration files
├── dockerrun/ # Docker run scripts
├── docs/ # This documentation
├── Dockerfile # Docker image specification
├── requirements.txt # Python dependencies
└── README.md # Project overview
Dependencies
See requirements.txt:
tensorflow==2.4.1
numpy==1.21.0
scipy==1.7.0
pyyaml==5.4.1
scikit-learn==0.24.2
matplotlib==3.4.2
Environment Variables
Environment variables override values set in configuration files:
| Variable | Purpose |
|---|---|
| DEEPSENTRY_DATA_DIR | Root data directory (overrides {ROOT_DATA_DIR}) |
| DEEPSENTRY_VERBOSE | Set to 1 for debug output |
| DEEPSENTRY_GPU | Set to 0 to force CPU (default: auto-detect) |
| DEEPSENTRY_BATCH_SIZE | Override batch size for all modules |
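A module can apply this precedence with os.environ; a minimal sketch (the exact fallback order here — environment variable, then config value, then a default — is an assumption about how the overrides are meant to compose):

```python
import os

def resolve_batch_size(config_value, default=32):
    """Prefer DEEPSENTRY_BATCH_SIZE, then the config file, then a default."""
    env_value = os.environ.get("DEEPSENTRY_BATCH_SIZE")
    if env_value is not None:
        return int(env_value)
    if config_value is not None:
        return int(config_value)
    return default

os.environ["DEEPSENTRY_BATCH_SIZE"] = "128"
print(resolve_batch_size(config_value=64))  # env var wins
```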
Exit Codes
Python modules return the following exit codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error (see logs) |
| 2 | Invalid arguments or config |
| 3 | Missing input file or data |
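Callers such as the Docker run scripts can branch on these codes; a minimal sketch using subprocess (the failing command is a stand-in that simulates exit code 3, not a project module):

```python
import subprocess
import sys

# Simulate a module exiting with code 3 (missing input file or data).
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(3)"],
)
if result.returncode == 0:
    print("success")
elif result.returncode == 3:
    print("missing input file or data")
else:
    print(f"failed with code {result.returncode}")
```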