Archiving

The following article exclusively pertains to a Graylog Operations feature or functionality. To learn more about obtaining an Operations license, please contact the Graylog Sales team.

Archiving allows you to configure a retention period and automatically delete older messages. Configuring a retention period helps you to control the costs of storage in Elasticsearch and allows you to store data for long periods of time to comply with HIPAA, PCI, or other requirements.

The Archiving functionality archives log messages until you need to re-import them into Graylog for analysis. You can instruct Graylog to automatically archive log messages to compressed flat files on the local file system, or to an S3-compatible object storage, before retention cleaning kicks in and messages are deleted from Elasticsearch. Archiving works through a REST call or the web interface, if you don’t want to wait until retention cleaning happens. Flat files are vendor-agnostic, so your data is always accessible.

Here are your options to manage and maintain archived files:

Move them to inexpensive storage or write them on tape.
Print them out if you need physical access to these files.
Move any selection of archived messages back into the Graylog archive folder, so that you can search through, and analyze, archived data through the web interface.

Installation

Archiving is part of the Graylog Operations plugin. Please check the Graylog Operations setup page for installation instructions.

Configuration

You can configure the Graylog Archive via the Graylog web interface without any configuration file changes.

In the web interface menu, navigate to "Enterprise/Archives" and click Configuration to adjust the configuration.

Archive Options

Here are the archive configuration options:

Configuration Options
Name	Description
Backend	Backend on the master node where the archive files will be stored.
Max Segment Size	Maximum size (in bytes) of archive segment files.
Compression Type	Compression type used to compress the archives.
Checksum Type	Checksum algorithm used to calculate the checksum for archives.
Restore index batch size	Elasticsearch batch size when restoring archive files.
Streams to archive	Streams included in the archive.

Backends

The archived indices are stored in a backend. You can choose either type:

File system
S3

File System Backend

Initial server startup triggers the creation of a backend. Data is then stored in /tmp/graylog-archive. If you want to store the data in a different path, you can create a new backend.

S3 Archiving Backend

The S3 Archiving Backend can be used to upload archives to an AWS S3 object storage service and is built to work with AWS. AWS S3 is also compatible with other S3 implementations, like MinIO, CEPH, Digital Ocean Spaces, etc.

On the "Archive" page:

Click the Manage Backends button on the top right.
Click Create Backend under Archive Backends. This routes to Edit archive backend configuration options.
Go to the Backend configuration section, and on the Backend Type dropdown, select S3.
Fill in the fields that best suit your choice of archive.

Name	Description
Title	A title to identify the backend.
Description	Description of the backend.
S3 Endpoint URL	Only configure this if not using AWS.
AWS Authentication Type	Choose access type from the dropdown menu.
AWS Assume Role (ARN)	An optional input for alternate authentication mechanisms.
Bucket Name	The name of the S3 bucket.
Spool Directory	Directory where archiving data is stored before it is uploaded.
AWS Region	Choose Automatic or configure the appropriate option.
S3 Output Base Path	Archives are stored under this path.

AWS Authentication Type

Graylog provides several options for granting access. You can:

Ude the Automatic authentication mechanism if you provide AWS credentials through your file system or process environment.
Enter credentials manually.

AWS Assume Role (ARN)

ARN is typically used for allowing cross-account access to a bucket. See ARN for further details.

Spool Directory

The archiving process needs the spool directory to store some temporary data before it can be uploaded to S3.

Ensure that the directory is writable and has sufficient space for 10 times the Max Segment Size. Make adjustments with the form mentioned in Configuration.

AWS Region

If you are not using AWS, you do not need to configure AWS Region.

Select the AWS region where your archiving bucket resides. If you select nothing, Graylog will use the region from your file system or process environment, if available.

S3 Output Base Path

S3 Output Base Path is a prefix to the file name that works similarly to a directory. Configure this to help organize your data.

You can use the following variable to construct a dynamic value for each archive to give it structure:

Variable	Description
index-name	Name of the index that gets archived.
year	Archival date year.
month	Archival date month.
day	Archival date day.
hour	Archival date hour.
minute	Archival date minute.
second	Archival date second.

AWS Security Permissions

When writing AWS Security policies, make them as restrictive as possible. It is a best practice to enable specific actions needed by the application rather than allowing all actions.

These permissions are required for Graylog to successfully make use of the S3 bucket:

Permission	Description
CreateBucket	Creates an S3 bucket.
HeadBucket	Determines if an action is useful and if you have permission to access it.
PutObject	Adds an object to a bucket.
CreateMultipartUpload	Initiates a multipart upload and returns an upload ID.
CompleteMultipartUpload	Completes a multipart upload by assembling previously uploaded parts.
UploadPart	Uploads a part in a multipart upload.
AbortMultipartUpload	Aborts a multipart upload.
GetObject	Retrieves objects from Amazon S3.
HeadObject	Retrieves metadata from an object without returning the object itself.
ListObjects	Returns some or all (up to 1,000) of the objects in a bucket with each request.
DeleteObjects	Enables you to delete multiple objects from a bucket using a single HTTP request.

Activate Backend

Configure your bucket, and click Save.

This routes you back to the Edit archive backend configuration page.

To activate the backend, follow the instructions below.

Click on the Configuration tab located in the top right-hand corner.
Under the Backend dropdown menu, select the backend you want to activate.
You can choose to change configurations or use the defaults provided.
Click the green Update configuration button at the bottom of the screen.
This will route you to the Archives screen.

Max Segment Size

When you archive an index, the archive job writes the data into segments. The Max Segment Size setting sets the size limit for each of these data segments to control the size of the segment files and process them with tools with a file-size limit.

Once the size limit is reached, a new segment file is started.

Example:

Copy

/path/to/archive/
  graylog_201/
    archive-metadata.json
    archive-segment-0.gz
    archive-segment-1.gz
    archive-segment-2.gz

Compression Type

Archives are compressed with gzip by default, but you can switch to a different compression type.

The selected compression type significantly impacts the time it takes to archive an index. Gzip, for example, is slow but has a great compression rate. Snappy and LZ4 are quicker, but the archives are larger.

Here is a comparison between the available compression algorithms with test data.

Compression Type Comparison
Type	Index Size	Archive Size	Duration
gzip	1 GB	134 MB	15 minutes, 23 seconds
Zstandard	1 GB	225.2MB	5min 55sec
Snappy	1 GB	291 MB	2 minutes, 31 seconds
LZ4	1 GB	266 MB	2 minutes, 25 seconds

Hint: Results may vary based on your data. Test the different compression types to find the best one for you.

Warning: The current implementation of LZ4 is not compatible with LZ4 CLI tools, so it is currently impossible to decompress LZ4 archives outside of Graylog.

Checksum Type

When Graylog writes archives, it also computes a CRC32 checksum over the files. You can select a different option to use a different checksum algorithm if needed.

To select the appropriate type of checksum, look to the use case. CRC32 and MD5 are quick to compute and are a reasonable choice to detect damaged files, but neither is suitable as protection against malicious changes in the files. Graylog also supports using SHA-1 or SHA-256 checksums, which can ensure the files were not modified, as they are cryptographic hashes.

The best way to choose a checksum type is to consider whether the necessary system tools to compute them later are installed (SHA-256 utility, for example), the speed of checksum calculation for larger files, and security considerations.

Restore Index Batch Size

The Batch Size setting controls the batch size for re-indexing archive data into Elasticsearch. When set to 1000, the restore job will re-index the archived data in document batches of 1000.

Use this setting to control the speed of the restore process and how much of a load it will generate on the Elasticsearch cluster. The higher the batch size, the faster the restore will progress, and the more load will be put on your Elasticsearch cluster beyond the normal message processing.

Tune this carefully to avoid any negative impact on your message indexing throughput and search speed.

Streams To Archive

The Streams to Archive setting is included in the archive and allows you to archive only important data, rather than everything that is brought into Graylog.

Hint: New streams are archived automatically. If you create a new stream and don’t want it to be archived, disable it in this configuration dialogue.

Bulk Restore

Archives can be restored in bulk, which can be useful in situations where extensive investigations or auditing requires searching across a larger time range.

To bulk restore archives in Graylog, navigate to the Graylog web interface and click on the Enterprise tab.
Select Archives from the dropdown menu.
Then, select the archives you intend to restore by clicking on the checkboxes, and click on the Bulk Actions button.
To complete, select Restore from the options list.

The selected archives will be displayed on the user interface after they have been restored.

Backends

You can use a backend to store archived data. Currently, Graylog only supports a single-file system backend type.

File System

The archived indices are stored in the Output base path directory. This directory needs to exist and be writable for the Graylog server process to store the files.

Hint: Because the archiving process runs on the master node, only the master node needs access to the Output base path directory.

We recommend housing the Output base path directory on a separate disk or partition to avoid any negative impacts on message processing when the archive fills up a disk.

Configuration Options
Name	Description
Title	A title to identify the backend.
Description	A description for the backend.
Output base path	Directory path where the archive files are stored.

Output base path
Use a simple directory path string or a template string as the output base path to build dynamic paths.

You can use a template string to store the archive data in a directory tree, which is based on the archival date.

Example:

Copy

# Template
/data/graylog-archive/${year}/${month}/${day}

# Result
/data/graylog-archive/2017/04/01/graylog_0

Available Template Variables
Name	Description
`${year}`	Archival date year (e.g. “2017”).
`${month}`	Archival date month (e.g. “04”).
`${day}`	Archival date day (e.g. “01”).
`${hour}`	Archival date hour (e.g. “23”).
`${minute}`	Archival date minute (e.g. “24”).
`${second}`	Archival date second (e.g. “59”).
`${index-name}`	Name of the archived index (e.g. “graylog_0”).

Index Retention

Graylog uses configurable index retention strategies to delete old indices. By default, indices can be closed or deleted if you have more than the configured limit.

The Graylog Archive offers a new index retention strategy that you can configure to automatically archive an index before closing or deleting it.

Index retention strategies can be configured in the system menu under System/Indices. Select an index set, and click Edit to change the index rotation and retention strategies.

As with the regular index retention strategies, you can configure a maximum number of Elasticsearch indices. When there are more indices than the configured limit, the oldest indices are archived in the backend and closed or deleted. You can also decide to do nothing (NONE ) after archiving an index. In that case, no cleanup of old indices will happen, and you can manage the archive yourself.