Reference: Commands, APIs, and Module Documentation
Command-Line Interface (CLI)
Data Preparation
python src/tx/prepare.py --config dockerconfig/text_autoencoder_prepare_data.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR} in config
--verbose Debug output
Output files:
prepared_data/
├── train.txt # Training messages
├── val.txt # Validation messages
└── vocab.pkl # Word ID mapping
Text Autoencoder Training
python src/tx/train.py --config dockerconfig/text_autoencoder_train.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--epochs Override epochs in config
--batch-size Override batch size
--gpu Force GPU usage
--verbose Debug output
Output files:
text_autoencoder_model/
├── encoder.h5 # Trained encoder (for encoding)
├── decoder.h5 # Decoder (for inspection)
├── training_log.txt
└── metadata.json # Parameters used
Dataset Encoding
python src/tx/encode.py --config dockerconfig/text_autoencoder_dataset_encoder.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--batch-size Encoding batch size
--verbose Debug output
Output files:
encoded_dataset/
├── train_encoded.pkl # Training vectors (numpy array)
├── val_encoded.pkl # Validation vectors
└── test_encoded.pkl # Test vectors
Anomaly Detector Training
python src/an/train.py --config dockerconfig/anomaly_detector_train.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--epochs Override epochs
--batch-size Override batch size
--gpu Force GPU usage
--verbose Debug output
Output files:
anomaly_trained_model/
├── detector.h5 # Trained LSTM anomaly detector
├── training_log.txt
└── metadata.json
Anomaly Evaluation
python src/an/eval.py --config dockerconfig/anomaly_detector_eval.yml
# Or with ground truth labels:
python src/an/analysis.py --config dockerconfig/anomaly_detector_eval_labeled.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--threshold Anomaly score threshold
--output-dir Save results here
--verbose Debug output
Output files (unlabeled):
eval_results/
├── anomaly_scores.pkl # All scores for test data
├── threshold.json # Recommended threshold
└── statistics.json # Mean, std, min, max scores
Output files (labeled):
eval_results/
├── anomaly_scores.pkl
├── roc_curve.png # ROC curve plot
├── precision_recall.png # Precision-recall curve
├── metrics.json # AUC, accuracy, F1, etc.
└── confusion_matrix.png
Live Monitoring
python src/live/main.py --config dockerconfig/live_monitoring_config.yml
Arguments:
--config YAML configuration file
--data-dir Override {ROOT_DATA_DIR}
--log-file Log file to monitor
--output-file Where to write alerts
--threshold Override anomaly threshold
--verbose Debug output
--daemonize Run in background
Python APIs
Configuration Loader
from src.config_loader import Config
# Load from file
config = Config.load_from_file(
"dockerconfig/train.yml",
root_data_dir="/path/to/data"
)
# Get values with defaults
epochs = config.get("epochs", default=10)
embedding_size = config.get("embedding_size", default=64)
# Get file paths (resolves placeholders)
output_dir = config.get_path("output_model_dir")
# Get nested config
training_config = config.get_section("training")
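The placeholder expansion that get_path performs can be illustrated with plain string substitution; a minimal sketch of the idea (this is not the Config implementation, just the behavior it provides):

```python
def resolve_path(template, root_data_dir):
    """Expand the {ROOT_DATA_DIR} placeholder in a configured path."""
    return template.replace("{ROOT_DATA_DIR}", root_data_dir)

# A configured path like "{ROOT_DATA_DIR}/prepared_data/train.txt"
# resolves against the root data directory passed at load time.
print(resolve_path("{ROOT_DATA_DIR}/prepared_data/train.txt", "/data"))
```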
Text Encoding
from src.tx.train import TextEncoderHandle
from src.tx.encode import DatasetEncoder
# Load trained text encoder
encoder = TextEncoderHandle(model_path="/path/to/encoder.h5")
# Encode a message
vector = encoder.encode("BlockReport processing took 120ms")
# vector is np.array of shape (128,)
# Batch encoding
messages = ["Message 1", "Message 2", ...]
vectors = encoder.encode_batch(messages)
# vectors is np.array of shape (len(messages), 128)
# Dataset encoding
dataset_encoder = DatasetEncoder(
encoder=encoder,
vocab_path="/path/to/vocab.pkl"
)
vectors_list = dataset_encoder.encode_file("/path/to/logs.txt")
Anomaly Detection
from src.an.eval import BidirectionalLSTMAnomalyDetectorHandle
import numpy as np
# Load trained detector
detector = BidirectionalLSTMAnomalyDetectorHandle(
model_path="/path/to/detector.h5"
)
# Score a sequence of vectors
sequence = np.random.randn(10, 128) # 10 vectors of 128 dims
score = detector.get_anomaly_score(sequence)
# score is a float (reconstruction error)
# Batch scoring
sequences = [seq1, seq2, ...]
scores = detector.batch_score(sequences)
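A recommended threshold like the one written to threshold.json is typically derived from the score distribution. A minimal sketch of one common rule, mean plus a multiple of the standard deviation (an assumption about the statistic used, mirroring the threshold_multiplier parameter of the live monitor, not necessarily the project's exact formula):

```python
import numpy as np

def suggest_threshold(scores, multiplier=2.5):
    """Suggest an anomaly threshold as mean + multiplier * std of scores.

    multiplier mirrors the live monitor's threshold_multiplier; 2.5 is an
    illustrative default, not a project constant.
    """
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean() + multiplier * scores.std())

# Scores on mostly-normal data cluster low; outliers exceed the threshold.
normal_scores = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13]
print(suggest_threshold(normal_scores))
```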
Log Parsing
from src.raw_log_file_process import UtahLogDatasetParseTools
# Parse HDFS logs in Utah format
parser = UtahLogDatasetParseTools()
# Extract messages from raw logs
with open("HDFS_1.log") as f:
for timestamp, message in parser.parse_log_file(f):
print(f"{timestamp}: {message}")
# Load labels if available
labels = parser.load_labels("labels.txt")
# labels is dict: {block_id: (0 or 1)}
Live Anomaly Detection
from src.live.main import DataStream, AnomalyIdentifier
# Create log stream
stream = DataStream(
log_file="/var/log/app.log",
tail_mode=True,
batch_interval=1.0
)
# Create anomaly detector
detector = AnomalyIdentifier(
text_encoder="/path/to/encoder.h5",
anomaly_model="/path/to/detector.h5",
sequence_length=10,
threshold_multiplier=2.5,
window_size=100
)
# Process logs
for log_entry in stream:
is_anomaly, score = detector.check(log_entry)
if is_anomaly:
print(f"ANOMALY: {log_entry} (score: {score})")
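Internally, a detector like AnomalyIdentifier must buffer the last sequence_length encoded vectors before it can score anything. A minimal sketch of that sliding-window bookkeeping using collections.deque (the class name and scoring callback here are illustrative placeholders, not the project's implementation):

```python
import collections
import numpy as np

class SlidingWindowScorer:
    """Buffer encoded log vectors and score only once the window is full."""

    def __init__(self, score_fn, sequence_length=10):
        self.score_fn = score_fn
        # deque with maxlen drops the oldest vector automatically
        self.window = collections.deque(maxlen=sequence_length)

    def push(self, vector):
        """Add one encoded vector; return a score, or None if too few vectors."""
        self.window.append(vector)
        if len(self.window) < self.window.maxlen:
            return None  # not enough history yet
        return self.score_fn(np.stack(self.window))

# Toy score function: mean absolute value over the window.
scorer = SlidingWindowScorer(lambda seq: float(np.abs(seq).mean()),
                             sequence_length=3)
for v in [np.zeros(4), np.zeros(4), np.ones(4)]:
    result = scorer.push(v)
print(result)  # score of the first full window
```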
File Formats
Raw Log Format (Utah)
YYMMDD HHMMSS MESSAGE TEXT
220115 142345 Got block report from DataNode 192.168.1.100 with 50000 blocks
220115 142346 BlockReport processing took 120ms for 50000 blocks
220115 142347 Rescan of DataNode 192.168.1.100 finished
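A line in the YYMMDD HHMMSS layout can be split into a timestamp and message with the standard library alone; a minimal sketch, independent of the project's UtahLogDatasetParseTools:

```python
from datetime import datetime

def parse_utah_line(line):
    """Split 'YYMMDD HHMMSS MESSAGE TEXT' into (datetime, message)."""
    date_part, time_part, message = line.rstrip("\n").split(" ", 2)
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%y%m%d %H%M%S")
    return timestamp, message

ts, msg = parse_utah_line("220115 142345 Got block report from DataNode 192.168.1.100")
print(ts.isoformat(), msg)
```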
Labels File Format
block_id: anomaly_label
blk_123456789: 1 # Anomalous
blk_987654321: 0 # Normal
blk_111111111: 1 # Anomalous
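The labels format above can also be read without the project parser; a minimal sketch that tolerates the trailing comments shown in the example (the loader name is illustrative, not the project API):

```python
def load_label_lines(lines):
    """Parse 'block_id: label' lines into {block_id: int}, skipping comments."""
    labels = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        block_id, value = line.split(":", 1)
        labels[block_id.strip()] = int(value)
    return labels

labels = load_label_lines([
    "blk_123456789: 1  # Anomalous",
    "blk_987654321: 0  # Normal",
])
print(labels)
```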
Pickled Data Format
Encoded vectors and cached data use Python pickle format:
import pickle
import numpy as np
# Load encoded vectors
with open("encoded_dataset/train_encoded.pkl", "rb") as f:
vectors = pickle.load(f)
# vectors is numpy array of shape (num_samples, 128)
# Load vocabulary
with open("prepared_data/vocab.pkl", "rb") as f:
vocab = pickle.load(f)
# vocab is dict: {word: id}
Model Format (HDF5)
Neural network models are saved in Keras HDF5 format:
import tensorflow as tf
# Load text encoder
encoder = tf.keras.models.load_model("text_autoencoder_model/encoder.h5")
# Load anomaly detector
detector = tf.keras.models.load_model("anomaly_trained_model/detector.h5")
# Make predictions
output = encoder.predict(input_data)
Project Structure Reference
deepsentry/
├── src/ # Main source code
│ ├── tx/ # Text encoding pipeline
│ │ ├── prepare.py # Data preparation
│ │ ├── train.py # Train text autoencoder
│ │ └── encode.py # Encode full dataset
│ ├── an/ # Anomaly detection
│ │ ├── train.py # Train anomaly detector
│ │ ├── eval.py # Evaluate (unlabeled)
│ │ └── analysis.py # Evaluate (labeled)
│ ├── live/ # Live monitoring
│ │ └── main.py # Real-time detection
│ ├── config_loader.py # Configuration management
│ ├── raw_log_file_process.py # Log parsing
│ ├── time_tools.py # Time normalization
│ └── dataset_labels.py # Label handling
├── ta/ # Text autoencoder package
├── kad/ # Keras anomaly detector package
├── t/ # Tests
│ └── test_smoke.py
├── dockerconfig/ # YAML configuration files
├── dockerrun/ # Docker run scripts
├── docs/ # This documentation
├── Dockerfile # Docker image specification
├── requirements.txt # Python dependencies
└── README.md # Project overview
Dependencies
See requirements.txt:
tensorflow==2.4.1
numpy==1.21.0
scipy==1.7.0
pyyaml==5.4.1
scikit-learn==0.24.2
matplotlib==3.4.2
Environment Variables
Environment variables override values set in configuration files:
| Variable | Purpose |
|---|---|
| DEEPSENTRY_DATA_DIR | Root data directory (overrides {ROOT_DATA_DIR}) |
| DEEPSENTRY_VERBOSE | Set to 1 for debug output |
| DEEPSENTRY_GPU | Set to 0 to force CPU (default: auto-detect) |
| DEEPSENTRY_BATCH_SIZE | Override batch size for all modules |
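A module can apply this precedence with os.environ; a minimal sketch (the exact fallback order here — environment variable, then config value, then a default — is an assumption about how the overrides are meant to compose):

```python
import os

def resolve_batch_size(config_value, default=32):
    """Prefer DEEPSENTRY_BATCH_SIZE, then the config file, then a default."""
    env_value = os.environ.get("DEEPSENTRY_BATCH_SIZE")
    if env_value is not None:
        return int(env_value)
    if config_value is not None:
        return int(config_value)
    return default

os.environ["DEEPSENTRY_BATCH_SIZE"] = "128"
print(resolve_batch_size(config_value=64))  # env var wins
```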
Exit Codes
Python modules return the following exit codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error (see logs) |
| 2 | Invalid arguments or config |
| 3 | Missing input file or data |
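Callers such as the Docker run scripts can branch on these codes; a minimal sketch using subprocess (the failing command is a stand-in that simulates exit code 3, not a project module):

```python
import subprocess
import sys

# Simulate a module exiting with code 3 (missing input file or data).
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(3)"],
)
if result.returncode == 0:
    print("success")
elif result.returncode == 3:
    print("missing input file or data")
else:
    print(f"failed with code {result.returncode}")
```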