Data and Log Formats
Log Data Fundamentals
DeepSentry works with structured text logs. A log is a sequence of messages, each typically on a single line and following a consistent format. For example:
2022-01-15 14:23:45 INFO Starting server on port 8080
2022-01-15 14:23:46 INFO Loading configuration from /etc/app.conf
2022-01-15 14:23:47 DEBUG Connecting to database at localhost:5432
2022-01-15 14:23:48 INFO Database connection established
2022-01-15 14:24:00 INFO Request received: GET /api/users/123
2022-01-15 14:24:00 INFO Request processed: 45ms
2022-01-15 14:24:01 INFO Request completed: 200 OK
Each message records what the system did at a specific point in time. By analyzing sequences of these messages, DeepSentry learns the normal patterns of system behavior.
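As a rough illustration of what "structured" means here, a line like the examples above can be split into date, time, level, and message fields. The pattern and field names below are illustrative, not part of DeepSentry's API:

```python
import re

# Illustrative pattern for lines like:
#   2022-01-15 14:23:45 INFO Starting server on port 8080
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)

def parse_line(line):
    """Split one log line into date, time, level, and message fields."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized log line: " + repr(line))
    return match.groupdict()

record = parse_line("2022-01-15 14:23:45 INFO Starting server on port 8080")
```

Any line that does not match the expected shape is surfaced as an error rather than silently dropped, which makes format problems visible early.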
The HDFS Dataset (Built-in Example)
DeepSentry is validated on the HDFS-1 dataset, a collection of real logs from the Hadoop Distributed File System used at Yahoo. This dataset has become a benchmark for anomaly detection research because:
- Real production logs: From actual large-scale systems, not synthetic
- Labeled anomalies: Contains logs marked with anomaly ground truth
- Large scale: Millions of log entries spanning weeks of operation
- Diverse failures: Includes network failures, disk errors, timeouts, and other issues
HDFS Log Format
HDFS logs follow the Utah format (a standard in anomaly detection research):
YYMMDD HH:MM:SS MESSAGE TEXT
Example:
220115 14:23:45 HDFS-1 BlockReport processing took 120ms for 50000 blocks from 192.168.1.100
220115 14:23:46 HDFS-1 Got block report from DataNode 192.168.1.100 with 50000 blocks
220115 14:23:47 HDFS-1 Rescan of DataNode 192.168.1.100 finished
220115 14:23:48 HDFS-1 BlockReport verification complete
Components:
- YYMMDD: Date (two-digit year, month, day)
- HH:MM:SS: Time (hours, minutes, seconds, colon-separated)
- MESSAGE TEXT: The actual log message (arbitrary string)
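A line in this format can be parsed with the standard library alone. The helper name below is hypothetical; it simply mirrors the three components listed above:

```python
import re
from datetime import datetime

# Matches the "YYMMDD HH:MM:SS MESSAGE TEXT" layout shown above.
UTAH_LINE = re.compile(r"^(\d{6}) (\d{2}:\d{2}:\d{2}) (.*)$")

def parse_utah(line):
    """Return (timestamp, message) for a Utah-format line, or None."""
    match = UTAH_LINE.match(line.strip())
    if match is None:
        return None
    date_str, time_str, message = match.groups()
    timestamp = datetime.strptime(date_str + " " + time_str, "%y%m%d %H:%M:%S")
    return timestamp, message

ts, msg = parse_utah("220115 14:23:45 HDFS-1 BlockReport verification complete")
```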
Ground Truth Labels
The HDFS dataset includes labeled anomalies. Anomalies are linked to HDFS block IDs. A label file maps each block ID to an anomaly flag:
blk_123456789: 1 # Block 123456789 contains an anomaly
blk_987654321: 0 # Block 987654321 is normal
blk_111111111: 1 # Block 111111111 contains an anomaly
DeepSentry uses these labels to evaluate model performance, answering questions such as: "of all the blocks marked as anomalous, how many did we detect?" (i.e., recall).
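The label format above is simple enough to parse by hand. The sketch below (hypothetical helper names, not DeepSentry code) loads such a file and computes recall against a set of predicted anomalous blocks:

```python
def load_labels(lines):
    """Parse 'blk_...: 0/1' label lines into a dict; '#' starts a comment."""
    labels = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        block_id, flag = line.split(":", 1)
        labels[block_id.strip()] = int(flag)
    return labels

def recall(labels, predicted_anomalous):
    """Of all blocks labeled anomalous, what fraction did we flag?"""
    anomalous = {block for block, y in labels.items() if y == 1}
    return len(anomalous & set(predicted_anomalous)) / len(anomalous)

labels = load_labels([
    "blk_123456789: 1  # anomaly",
    "blk_987654321: 0  # normal",
    "blk_111111111: 1  # anomaly",
])
score = recall(labels, {"blk_123456789"})
```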
Data Organization
The pipeline expects data organized in specific directories:
Input Directory Structure
data/
├── raw_logs/
│ ├── HDFS_1.log # Raw log file in Utah format
│ └── labels.txt # Ground truth (optional, for evaluation)
└── dockerconfig/ # Configuration files (provided by DeepSentry)
├── train.yml
├── encode.yml
├── anomaly_train.yml
└── ...
Output Directories (Created by Pipeline)
data/
├── prepared_data/
│ ├── train.txt # Training messages
│ ├── val.txt # Validation messages
│ └── vocab.pkl # Learned vocabulary
│
├── text_autoencoder_model/
│ ├── encoder.h5 # Trained text encoder
│ └── metadata.json
│
├── encoded_dataset/
│ ├── train_encoded.pkl # Encoded training vectors
│ ├── val_encoded.pkl # Encoded validation vectors
│ └── test_encoded.pkl # Encoded test vectors
│
├── anomaly_trained_model/
│ ├── detector.h5 # Trained anomaly detector
│ └── metadata.json
│
└── eval_results/
├── anomaly_scores.pkl
├── roc_curve.png
├── precision_recall.png
└── metrics.json
Data Flow Through the Pipeline
Preparing Your Own Logs
To use DeepSentry with logs from your own systems:
Step 1: Export Your Logs
Export logs from your system in a text file, one log entry per line. The format must be:
YYMMDD HH:MM:SS MESSAGE TEXT
For example, from a web server:
220115 14:23:45 GET /api/users - 200 OK - 45ms
220115 14:23:46 POST /api/users - 201 Created - 120ms
220115 14:23:47 GET /api/users/123 - 200 OK - 23ms
220115 14:23:48 Database connection timeout
220115 14:23:49 Retry attempt 1 for user lookup
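If your exporter produces a different timestamp layout, a small conversion pass is usually enough. The sketch below assumes the source lines start with an ISO-style `YYYY-MM-DD HH:MM:SS` stamp; adapt the `strptime` pattern to whatever your system actually emits:

```python
from datetime import datetime

def to_utah(line):
    """Convert a line starting with 'YYYY-MM-DD HH:MM:SS' (an assumed
    source format) into the 'YYMMDD HH:MM:SS MESSAGE' layout shown above."""
    stamp, message = line[:19], line[20:]
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return ts.strftime("%y%m%d %H:%M:%S") + " " + message

converted = to_utah("2022-01-15 14:23:45 GET /api/users - 200 OK - 45ms")
```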
Step 2: Ensure Clean Training Data
The training set should represent normal system behavior. If your training data contains anomalies, the model will learn those anomalies as part of "normal" and fail to flag similar events later.
Tips for clean data:
- Use logs from a period when the system was known to be healthy
- Remove obvious errors and failures from training data
- Train on at least 1 week of logs (more is better)
- Separate test data from training data (don't overlap)
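One crude but practical way to apply the "remove obvious errors" tip is a keyword filter over the raw lines. The keyword list below is purely illustrative; tune it to your own log vocabulary:

```python
# Illustrative failure keywords; adjust to match your system's logs.
NOISY_KEYWORDS = ("error", "fatal", "exception", "timeout", "failed")

def looks_healthy(line):
    """Crude filter: drop lines that obviously describe failures."""
    lowered = line.lower()
    return not any(word in lowered for word in NOISY_KEYWORDS)

lines = [
    "220115 14:23:45 Request completed: 200 OK",
    "220115 14:23:48 Database connection timeout",
]
clean = [line for line in lines if looks_healthy(line)]
```

A filter like this only catches explicit failure messages; subtle anomalies (e.g., unusual message sequences) can still slip through, which is why the "known-healthy period" tip matters most.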
Step 3: Split into Train/Test
Divide your logs into:
- Training set (80%): Used to learn normal patterns. Should be clean.
- Test set (20%): Used to evaluate performance. Can contain anomalies.
The prepare.py script does this automatically, taking the first 80% as training and last 20% as validation.
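The split described above is chronological, not random, which keeps the held-out portion strictly later in time than the training portion. A minimal sketch of that logic:

```python
def split_logs(lines, train_fraction=0.8):
    """Chronological split: first 80% for training, last 20% held out,
    mirroring what the text says prepare.py does."""
    cut = int(len(lines) * train_fraction)
    return lines[:cut], lines[cut:]

train, held_out = split_logs(["entry %d" % i for i in range(10)])
```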
File Size Considerations
Log files can be large. Here's what to expect:
| File Type | Typical Size | Memory Impact |
|---|---|---|
| Raw logs (1 week) | 100 MB - 1 GB | Streamed (low memory) |
| Prepared text data | 50-500 MB | Loaded into memory |
| Encoded vectors | 10-100 MB | Loaded into memory |
| Trained models | 5-50 MB | Single model in memory |
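The "streamed (low memory)" row in the table works because a text file can be iterated line by line without loading it whole. As a sketch, counting entries this way keeps memory flat even for multi-gigabyte logs (the demo uses a small temporary file):

```python
import os
import tempfile

def count_entries(path):
    """Count log entries without loading the whole file into memory."""
    count = 0
    with open(path, "r", errors="replace") as handle:
        for _ in handle:  # iterating a file yields one line at a time
            count += 1
    return count

# Demo on a tiny temporary file; real raw logs can be hundreds of MB.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".log") as handle:
    handle.write("220115 14:23:45 one\n220115 14:23:46 two\n")
    demo_path = handle.name
n = count_entries(demo_path)
os.remove(demo_path)
```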
Dataset Requirements Checklist
- ☐ You have raw logs in Utah format (YYMMDD HH:MM:SS MESSAGE)
- ☐ Training logs are from a known-healthy period
- ☐ Test logs are separate from training data
- ☐ You have at least 1 week of log data (more is better)
- ☐ Logs are in a single text file, one entry per line
- ☐ You have ground truth labels for evaluation (optional but recommended)
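The format-related items on this checklist can be sanity-checked mechanically. The helper below (a hypothetical name, not part of DeepSentry) reports how many lines match the expected shape; a low ratio suggests the export step needs another look:

```python
import re

# Expected "YYMMDD HH:MM:SS MESSAGE" shape, one entry per line.
UTAH = re.compile(r"^\d{6} \d{2}:\d{2}:\d{2} .+$")

def check_dataset(lines):
    """Return (matching_lines, total_lines) for a list of log lines."""
    ok = sum(1 for line in lines if UTAH.match(line.strip()))
    return ok, len(lines)

ok, total = check_dataset([
    "220115 14:23:45 GET /api/users - 200 OK",
    "not a valid log line",
])
```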