Setup

The Graylog Archive is a commercial feature that can be installed in addition to the Graylog open source server.

Installation

Archiving is part of the Graylog Operations plugin. Please check the Graylog Operations setup page for details on how to install it.

Configuration

The Graylog Archive can be configured via the Graylog web interface and does not require making any changes in the Graylog server configuration file.

In the web interface menu, navigate to “Operations/Archives” and click “Configuration” to adjust the configuration.

[Screenshot: archive configuration]

Archive Options

There are several options to configure archiving.

Configuration Options

  • Backend: Backend on the master node where the archive files will be stored.
  • Max Segment Size: Maximum size (in bytes) of archive segment files.
  • Compression Type: Compression type that will be used to compress the archives.
  • Checksum Type: Checksum algorithm that is used to calculate the checksum for archives.
  • Restore index batch size: Elasticsearch batch size when restoring archive files.
  • Streams to archive: Streams that should be included in the archive.

Backends

The archived indices are stored in a backend. You can choose between two backend types:

  • File system
  • S3

File System Backend

When the server starts up for the first time, a default backend is created that stores its data in /tmp/graylog-archive. You can create a new backend if you want to store the data under a different path.

S3 Archiving Backend

The S3 Archiving backend can be used to upload archives to an AWS S3 object storage service. It is built to work with AWS but should be compatible with other S3 implementations such as MinIO, Ceph, and DigitalOcean Spaces.

On the Archive page:

  1. Click the Manage Backends button at the top right.
  2. Click Create Backend under Archive Backends; this takes you to Edit archive backend configuration options.
  3. Go to the Backend configuration section and select S3 from the Backend Type dropdown.
  4. Fill out the form, completing the fields that best suit your choice of archive.

  • Title: A simple title to identify the backend.
  • Description: Longer description of the backend.
  • S3 Endpoint URL: Only configure this if you are not using AWS.
  • AWS Authentication Type: Choose the access type from the dropdown menu.
  • AWS Assume Role (ARN): Optional input for alternate authentication mechanisms.
  • Bucket Name: The name of the S3 bucket.
  • Spool Directory: Directory where archiving data is stored before being uploaded.
  • AWS Region: Choose Automatic or configure the appropriate region.
  • S3 Output Base Path: Archives will be stored under this path.

AWS Authentication Type

Graylog provides several options for granting access. You can:

  • use the Automatic authentication mechanism if you provide AWS credentials through your file system or process environment (see the example below), or
  • enter credentials manually.
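
For example, with the Automatic mechanism the credentials are picked up from the standard AWS SDK sources. A minimal sketch of supplying them through the process environment (all values shown are placeholders):

# Standard AWS SDK environment variables; values are placeholders
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=eu-west-1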

AWS Assume Role (ARN)

This is typically used for allowing cross-account access to a bucket. See the AWS documentation on ARNs for further details.

Spool Directory

The archiving process needs this directory to store some temporary data before it can be uploaded to S3.

This directory should be writable and have enough free space to fit ten times the Max Segment Size; for example, with a 1 GB Max Segment Size, allow at least 10 GB of free space. You can adjust the segment size in the form mentioned in Configuration.

AWS Region

Select the AWS region where your archiving bucket resides. If nothing is selected, Graylog will try to get the region from your file system or process environment.

If you are not using AWS, you do not need to configure this.

S3 Output Base Path

This is a prefix to the file name that works similarly to a directory. Configuring it will help you organize your data.

You can use the following variables to construct a dynamic value for each archive to give it structure:

  • index-name: Name of the index that gets archived
  • year: Archival date year
  • month: Archival date month
  • day: Archival date day
  • hour: Archival date hour
  • minute: Archival date minute
  • second: Archival date second
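
Example (assuming the same ${...} template syntax as the file system backend's Output base path shown later on this page):

# Template
graylog-archives/${year}/${month}/${index-name}

# Possible result
graylog-archives/2017/04/graylog_0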

AWS Security Permissions

When writing AWS security policies, make them as restrictive as possible: it is best practice to grant only the specific actions the application needs rather than allowing all actions.

These permissions are required for Graylog to make use of the S3 bucket:

  • CreateBucket: Creates an S3 bucket.
  • HeadBucket: Determines whether a bucket exists and whether you have permission to access it.
  • PutObject: Adds an object to a bucket.
  • CreateMultipartUpload: Initiates a multipart upload and returns an upload ID.
  • CompleteMultipartUpload: Completes a multipart upload by assembling previously uploaded parts.
  • UploadPart: Uploads a part in a multipart upload.
  • AbortMultipartUpload: Aborts a multipart upload.
  • GetObject: Retrieves objects from Amazon S3.
  • HeadObject: Retrieves metadata from an object without returning the object itself.
  • ListObjects: Returns some or all (up to 1,000) of the objects in a bucket with each request.
  • DeleteObjects: Deletes multiple objects from a bucket using a single HTTP request.
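
As a sketch, these operations could be granted with an IAM policy along the following lines, scoped to a single bucket (the bucket name my-graylog-archive is a hypothetical example; verify the action list against the AWS documentation before use). Note that several of the S3 operations above map onto fewer IAM actions: the multipart upload calls are covered by s3:PutObject (plus s3:AbortMultipartUpload), HeadBucket and ListObjects by s3:ListBucket, and HeadObject by s3:GetObject.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GraylogArchiveBucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::my-graylog-archive",
        "arn:aws:s3:::my-graylog-archive/*"
      ]
    }
  ]
}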

Activate Backend

After configuring your bucket, click Save.

This will bring you back to the Edit archive backend configuration page.

To activate the backend, you need to:

  1. Click on the Configuration tab located in the top right-hand corner.
  2. Under the Backend dropdown menu, select the backend you want to activate.
  3. Change the configuration values or use the defaults provided.
  4. Click the green Update configuration button at the bottom of the screen. This returns you to the Archives screen.

Max Segment Size

When archiving an index, the archive job writes the data into segments. The Max Segment Size setting sets the size limit for each of these data segments.

This allows control over the file size of the segment files and makes it possible to process them with tools that have a file size limit.

Once the size limit is reached, a new segment file will be started.

Example:

/path/to/archive/
  graylog_201/
    archive-metadata.json
    archive-segment-0.gz
    archive-segment-1.gz
    archive-segment-2.gz

Compression Type

Archives will be compressed with gzip by default. This option can be changed to use a different compression type.

The selected compression type has a big impact on the time it takes to archive an index. Gzip, for example, is comparatively slow but achieves a high compression ratio, while Snappy and LZ4 are much faster but produce larger archives.

Here is a comparison of the available compression algorithms, measured with test data.

Compression Type Comparison

Type        Index Size   Archive Size   Duration
gzip        1 GB         134 MB         15 minutes, 23 seconds
Zstandard   1 GB         225.2 MB       5 minutes, 55 seconds
Snappy      1 GB         291 MB         2 minutes, 31 seconds
LZ4         1 GB         266 MB         2 minutes, 25 seconds

Note

Results may vary based on your data! Test the different compression types to find the one that works best for you.

Warning

The current implementation of LZ4 is not compatible with LZ4 CLI tools, so decompressing the LZ4 archives outside of Graylog is currently not possible.
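
Gzip segments, by contrast, are standard gzip streams and can be read with common tools. Below is a minimal Python sketch for decompressing one outside of Graylog (the path is a placeholder, and the format of the decompressed records is not covered here):

import gzip
import shutil
import sys

# Decompress a gzip-compressed archive segment to stdout.
# The path is a placeholder; point it at a segment file in your backend.
with gzip.open("/path/to/archive/graylog_201/archive-segment-0.gz", "rb") as segment:
    shutil.copyfileobj(segment, sys.stdout.buffer)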

Checksum Type

When writing archives, Graylog computes a CRC32 checksum over the files. This option can be changed to use a different checksum algorithm.

The type of checksum you choose will depend on your use case. CRC32 and MD5 are quick to compute and are a reasonable choice for detecting damaged files, but neither is suitable as protection against malicious changes to the files. Graylog also supports SHA-1 and SHA-256 checksums, which are cryptographic hashes and can therefore be used to verify that the files were not modified.

When choosing a checksum type, consider whether the system tools to compute the checksums later on are installed (not all systems come with a SHA-256 utility, for example), the speed of checksum calculation for larger files, and your security requirements.
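
For example, a SHA-256 checksum can be recomputed and compared outside of Graylog. A minimal Python sketch (the file path and expected value are placeholders):

import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    # Stream the file in chunks so large segments need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "..."  # the checksum recorded for this segment
actual = sha256_of("/path/to/archive/graylog_201/archive-segment-0.gz")
print("OK" if actual == expected else "MISMATCH")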

Restore Index Batch Size

This setting controls the batch size for re-indexing archive data into Elasticsearch. When set to 1000, the restore job will re-index the archived data in document batches of 1000.

You can use this setting to control the speed of the restore process and also how much of a load it will generate on the Elasticsearch cluster. The higher the batch size, the faster the restore will progress and the more load will be put on your Elasticsearch cluster in addition to the normal message processing.

Make sure to tune this carefully to avoid any negative impact on your message indexing throughput and search speed!

Streams To Archive

This option can be used to select which streams should be included in the archive. This allows you to archive only important data instead of everything that comes into Graylog.

Note

New streams will be archived automatically. If you create a new stream and don’t want it to be archived, you have to disable it in this configuration dialog.

Backends

A backend stores the archived data. The file system backend described below is the one created by default; see the S3 Archiving Backend section above for the S3 option.

File System

The archived indices will be stored in the Output base path directory. This directory needs to exist and be writable for the Graylog server process so the files can be stored.

Note

Only the master node needs access to the Output base path directory because the archiving process runs on the master node.

We recommend putting the Output base path directory onto a separate disk or partition to avoid any negative impact on message processing if the archive fills up a disk.

[Screenshot: creating a new archive backend]

Configuration Options

  • Title: A simple title to identify the backend.
  • Description: Longer description for the backend.
  • Output base path: Directory path where the archive files should be stored.

Output base path

The output base path can either be a simple directory path string or a template string to build dynamic paths.

You could use a template string to store the archive data in a directory tree which is based on the archival date.

Example:

# Template
/data/graylog-archive/${year}/${month}/${day}

# Result
/data/graylog-archive/2017/04/01/graylog_0

Available Template Variables

  • ${year}: Archival date year (e.g. “2017”)
  • ${month}: Archival date month (e.g. “04”)
  • ${day}: Archival date day (e.g. “01”)
  • ${hour}: Archival date hour (e.g. “23”)
  • ${minute}: Archival date minute (e.g. “24”)
  • ${second}: Archival date second (e.g. “59”)
  • ${index-name}: Name of the archived index (e.g. “graylog_0”)

Index Retention

Graylog uses configurable index retention strategies to delete old indices. By default, indices are closed or deleted once there are more of them than the configured limit.

The Graylog Archive offers a new index retention strategy that you can configure to automatically archive an index before closing or deleting it.

Index retention strategies can be configured in the system menu under “System/Indices”. Select an index set and click “Edit” to change the index rotation and retention strategies.

[Screenshot: index retention configuration]

As with the regular index retention strategies, you can configure a maximum number of Elasticsearch indices. Once there are more indices than the configured limit, the oldest ones are archived into the backend and then closed or deleted. You can also decide to do nothing (NONE) after archiving an index; in that case, no cleanup of old indices happens and you will have to take care of that yourself.

