Data and Log Formats

Understanding the data: log formats, file structures, datasets, and how to prepare your own logs.

Log Data Fundamentals

DeepSentry works with structured text logs. A log is a sequence of messages, each typically on a single line and in a consistent format. Examples:

2022-01-15 14:23:45 INFO Starting server on port 8080
2022-01-15 14:23:46 INFO Loading configuration from /etc/app.conf
2022-01-15 14:23:47 DEBUG Connecting to database at localhost:5432
2022-01-15 14:23:48 INFO Database connection established
2022-01-15 14:24:00 INFO Request received: GET /api/users/123
2022-01-15 14:24:00 INFO Request processed: 45ms
2022-01-15 14:24:01 INFO Request completed: 200 OK

Each message contains information about what the system did at a specific point in time. By analyzing sequences of these messages, DeepSentry learns the normal patterns of system behavior.

The HDFS Dataset (Built-in Example)

DeepSentry is validated on the HDFS-1 dataset, a collection of real logs from the Hadoop Distributed File System used at Yahoo. This dataset has become a benchmark for anomaly detection research because:

  • Real production logs: From actual large-scale systems, not synthetic
  • Labeled anomalies: Contains logs marked with anomaly ground truth
  • Large scale: Millions of log entries spanning weeks of operation
  • Diverse failures: Includes network failures, disk errors, timeouts, and other issues

HDFS Log Format

HDFS logs follow the Utah format (a standard in anomaly detection research):

YYMMDD HH:MM:SS MESSAGE TEXT

Example:

220115 14:23:45 HDFS-1 BlockReport processing took 120ms for 50000 blocks from 192.168.1.100
220115 14:23:46 HDFS-1 Got block report from DataNode 192.168.1.100 with 50000 blocks
220115 14:23:47 HDFS-1 Rescan of DataNode 192.168.1.100 finished
220115 14:23:48 HDFS-1 BlockReport verification complete

Components:

  • YYMMDD: Date (two-digit year, month, day)
  • HH:MM:SS: Time (hours, minutes, seconds)
  • MESSAGE TEXT: The actual log message (arbitrary string)
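The layout above is simple enough to parse with a regular expression. A minimal sketch (parse_line is an illustrative helper, not part of DeepSentry):

```python
import re
from datetime import datetime

# Matches "YYMMDD HH:MM:SS MESSAGE TEXT" as described above.
LINE_RE = re.compile(r"^(\d{6}) (\d{2}:\d{2}:\d{2}) (.+)$")

def parse_line(line):
    """Split one Utah-format line into (timestamp, message); None if malformed."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # malformed line; callers may skip or report it
    date, time, message = m.groups()
    ts = datetime.strptime(date + " " + time, "%y%m%d %H:%M:%S")
    return ts, message
```

Returning None for malformed lines lets a caller decide whether to skip noise or flag a formatting problem.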

Ground Truth Labels

The HDFS dataset includes labeled anomalies, linked to HDFS block IDs. The labels file maps each block ID to a 0/1 flag:

blk_123456789: 1  # Block 123456789 contains an anomaly
blk_987654321: 0  # Block 987654321 is normal
blk_111111111: 1  # Block 111111111 contains an anomaly

DeepSentry uses these labels to evaluate model performance: "of all the blocks marked as anomalous, how many did we detect?"

Data Organization

The pipeline expects data organized in specific directories:

Input Directory Structure

data/
├── raw_logs/
│   ├── HDFS_1.log         # Raw log file in Utah format
│   └── labels.txt         # Ground truth (optional, for evaluation)
└── dockerconfig/          # Configuration files (provided by DeepSentry)
    ├── train.yml
    ├── encode.yml
    ├── anomaly_train.yml
    └── ...

Output Directories (Created by Pipeline)

data/
├── prepared_data/
│   ├── train.txt          # Training messages
│   ├── val.txt            # Validation messages
│   └── vocab.pkl          # Learned vocabulary
│
├── text_autoencoder_model/
│   ├── encoder.h5         # Trained text encoder
│   └── metadata.json
│
├── encoded_dataset/
│   ├── train_encoded.pkl  # Encoded training vectors
│   ├── val_encoded.pkl    # Encoded validation vectors
│   └── test_encoded.pkl   # Encoded test vectors
│
├── anomaly_trained_model/
│   ├── detector.h5        # Trained anomaly detector
│   └── metadata.json
│
└── eval_results/
    ├── anomaly_scores.pkl
    ├── roc_curve.png
    ├── precision_recall.png
    └── metrics.json

Data Flow Through the Pipeline

┌──────────────────────────────────────────────────────────────┐
│                  DATA FLOW THROUGH PIPELINE                  │
└──────────────────────────────────────────────────────────────┘

Raw Logs                     Text Processing          Encoding
────────                     ───────────────          ────────

HDFS_1.log
"14:23:45 BlockReport
 processing took 120ms"  ──→   [PREPARE]  ──→   train.txt
"14:23:46 Got block..."                         val.txt
"14:23:47 Verification"                         vocab.pkl
"14:23:48 BlockReport"                             │
                                                   ▼
labels.txt (ground truth)                   [TEXT ENCODER]
blk_123:1                                          │
blk_456:0                                          ▼
blk_789:1                    train_encoded.pkl   output shape
                             val_encoded.pkl     (N, 128)
                             test_encoded.pkl    vectors

Anomaly Training                             Evaluation
────────────────                             ──────────

Encoded sequences        [ANOMALY            metrics.json
[v1,v2,...,v10]   ──→    TRAINING]   ──→     AUC: 0.92
[v2,v3,...,v11]          [LSTM]              ROC curve
...                                          performance analysis

Preparing Your Own Logs

To use DeepSentry with logs from your own systems:

Step 1: Export Your Logs

Export logs from your system to a text file, one log entry per line. The format must be:

YYMMDD HH:MM:SS MESSAGE TEXT

For example, from a web server:

220115 14:23:45 GET /api/users - 200 OK - 45ms
220115 14:23:46 POST /api/users - 201 Created - 120ms
220115 14:23:47 GET /api/users/123 - 200 OK - 23ms
220115 14:23:48 Database connection timeout
220115 14:23:49 Retry attempt 1 for user lookup
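Many systems emit ISO-style timestamps like the first example on this page, so a small conversion pass is often needed. A sketch under that assumption (to_utah is an illustrative helper; adapt it to whatever your exporter actually emits):

```python
from datetime import datetime

def to_utah(line):
    """Rewrite "YYYY-MM-DD HH:MM:SS message" as "YYMMDD HH:MM:SS message".

    Assumes an ISO-style date and time as the first two fields; adjust
    the split and format strings to match your own log shape.
    """
    date, time, rest = line.strip().split(" ", 2)
    ts = datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M:%S")
    return ts.strftime("%y%m%d %H:%M:%S") + " " + rest
```

The message text (including any level such as INFO) is kept verbatim, since the format allows an arbitrary string after the timestamp.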

Step 2: Ensure Clean Training Data

The training set should represent normal system behavior. If your training data contains anomalies, the model will learn wrong patterns.

Critical: Anomaly detection works by learning normal patterns. If your training data is contaminated with anomalies, the model will consider anomalies "normal" and fail to detect them.

Tips for clean data:

  • Use logs from a period when the system was known to be healthy
  • Remove obvious errors and failures from training data
  • Train on at least 1 week of logs (more is better)
  • Separate test data from training data (don't overlap)

Step 3: Split into Train/Test

Divide your logs into:

  • Training set (80%): Used to learn normal patterns. Should be clean.
  • Test set (20%): Used to evaluate performance. Can contain anomalies.

The prepare.py script does this automatically, taking the first 80% as training and last 20% as validation.
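The split is chronological rather than random. A sketch of the same idea (chronological_split is illustrative, not the actual prepare.py code):

```python
def chronological_split(lines, train_frac=0.8):
    """First 80% of messages for training, last 20% for validation.

    Order is preserved on purpose: shuffling would leak later system
    behavior into the training set and inflate evaluation results.
    """
    cut = int(len(lines) * train_frac)
    return lines[:cut], lines[cut:]
```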

File Size Considerations

Log files can be large. Here's what to expect:

File Type            Typical Size     Memory Impact
Raw logs (1 week)    100 MB - 1 GB    Streamed (low memory)
Prepared text data   50-500 MB        Loaded into memory
Encoded vectors      10-100 MB        Loaded into memory
Trained models       5-50 MB          Single model in memory

Tip: Even multi-GB log files can be processed thanks to streaming. The pipeline reads logs line-by-line without loading everything into memory at once.

Dataset Requirements Checklist

Before running the pipeline, ensure:
  • ☐ You have raw logs in Utah format (YYMMDD HH:MM:SS MESSAGE)
  • ☐ Training logs are from a known-healthy period
  • ☐ Test logs are separate from training data
  • ☐ You have at least 1 week of log data (more is better)
  • ☐ Logs are in a single text file, one entry per line
  • ☐ You have ground truth labels for evaluation (optional but recommended)