Extractors
  • 08 Sep 2022

Syslog (RFC3164, RFC5424) has been the de facto standard logging protocol since the 1980s. It was originally developed as part of the sendmail project. GELF is an updated log format for application logging.

Because syslog has a clear specification in its RFCs, it should be relatively easy to parse. Unfortunately, there are a lot of devices (especially routers and firewalls) that send logs that look like syslog but actually break several rules stated in the RFCs. We tried our best to write a parser that reads all of them, but we were not able to. Such loosely defined text messages usually break compatibility in the very first field, the date. Some devices leave out hostnames completely, some use localized time zone names (e.g. “MESZ” instead of “CEST”), and some omit the current year in the timestamp field.

There are devices out there that do not claim to send syslogs but have other completely different log formats that need to be parsed specifically.

We decided to not write custom message inputs and parsers for all those devices, formats, firmwares and configuration parameters out there but came up with the concept of Extractors introduced in the v0.20.0 series of Graylog.

Graylog Extractors Explained

Extractors let you instruct Graylog nodes how to extract data from any text in a received message (regardless of the format, and even if it is an already extracted field) into message fields. If you're a Graylog user, you may already know why structuring data into fields is important: full-text searches provide a great deal of possibilities for analysis, but the real power of log analytics is revealed when you can run queries like http_response_code:>=500 AND user_id:9001 to get all internal server errors triggered by a specific user.

Wouldn’t it be nice to be able to search for all blocked packets of a given source IP or to get a quick terms analysis of recently failed SSH login usernames? That is hard to do when all you have is a single long text message.

Attention

Graylog extractors only work in text fields and are not executed for numeric fields or anything other than a string.

Creating extractors is possible via either Graylog REST API calls or on the web interface via a wizard. Select a message input on the System -> Inputs page and hit Manage extractors in the actions menu. The wizard allows you to load a message to test your extractor configuration against. You can extract data using regular expressions, Grok patterns, substrings, or even by splitting the message into tokens using separator characters. The wizard looks like this and is pretty intuitive:

extractors1.png

You can also choose to apply so-called converters to the extracted value, for example to convert a string consisting of numbers to an integer or double value (important for range searches later), anonymize IP addresses, lower- or uppercase a string, build a hash value, and much more.

Import Extractors

The recommended way of importing extractors in Graylog is using Content Packs. The Graylog Marketplace provides access to many content packs that you can easily download and import into your Graylog setup.

You can still import extractors from JSON if you want to. Just copy the JSON extractor export into the import dialog of a message input of the fitting type (every extractor set entry in the directory tells you what type of input to spawn, e.g. syslog, GELF, or Raw/plaintext) and you are good to go. From now on, incoming messages will include the extracted fields with possibly converted values.

A message sent by Heroku and received by Graylog with the imported Heroku extractor set on a plaintext TCP input looks like this: (look at the extracted fields in the message detail view)

extractors2.png

Using Regular Expressions to Extract Data

Extractors support matching field values using regular expressions. Graylog uses the Java Pattern class to evaluate regular expressions.

For the individual elements of regular expression syntax, please refer to Oracle’s documentation; however, the syntax largely follows the familiar regular expression languages in widespread use today and will not be unfamiliar to most.

One question that is often raised is how to match a string in a case-insensitive manner. Java regular expressions are case sensitive by default. Certain flags, such as the one to ignore case, can be set either in the code or as an inline flag in the regular expression.

For example, to create an extractor that matches the browser name in the following user agent string:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36

the regular expression (applewebkit) will not match because it is case sensitive. In order to match the expression using any combination of upper and lowercase characters use the (?i) flag as such:

(?i)(applewebkit)

Most of the other flags supported by Java are rarely used in the context of matching stream rules or extractors, but if you need them, their use is documented on the same Javadoc page in Oracle's documentation. A related construct that uses the same (?...) syntax is the non-capturing group. These are parentheses which only group alternatives but do not make Graylog extract the data they match; they are written as (?:...).
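
The effect of the inline flag and of a non-capturing group can be tried outside Graylog. The following is a minimal sketch that uses Python's re module purely for illustration (Graylog itself evaluates patterns with the Java Pattern class); the constructs shown here, (?i) at the start of the pattern and (?:...), behave the same way in both engines. The third pattern is a made-up example, not one taken from Graylog.

```python
import re

user_agent = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
)

# Case sensitive by default: the lowercase pattern finds nothing.
sensitive = re.search(r"(applewebkit)", user_agent)

# The inline (?i) flag makes the whole pattern case insensitive.
insensitive = re.search(r"(?i)(applewebkit)", user_agent)

# Hypothetical pattern: the (?:...) group only lists alternatives,
# so the version number is still capture group 1.
version = re.search(r"(?i)(?:applewebkit|gecko)/([0-9.]+)", user_agent)

print(sensitive)             # None
print(insensitive.group(1))  # AppleWebKit
print(version.group(1))      # 537.36
```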

Using Grok Patterns to Extract Data

Graylog also supports extracting data using the popular Grok language to enable you to make use of your existing patterns.

Grok is a set of regular expressions that can be combined into more complex patterns, allowing for the naming of different parts of matched groups.

By using Grok patterns, you can extract multiple fields from a message field in a single extractor, which often simplifies specifying extractors.

Simple regular expressions are often sufficient to extract a single word or number from a log line, but if you know the entire structure of a line beforehand, for example the format of an access log or a firewall log, using Grok is advantageous.

A firewall log line could contain:

len=50824 src=172.17.22.108 sport=829 dst=192.168.70.66 dport=513

We can now create the following patterns on the System/Grok Patterns page in the web interface:

BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
IPV4 (?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9])
IP (?:%{IPV6}|%{IPV4})
DATA .*?

Then, in the extractor configuration, we can use these patterns to extract the relevant fields from the line:

len=%{NUMBER:length} src=%{IP:srcip} sport=%{NUMBER:srcport} dst=%{IP:dstip} dport=%{NUMBER:dstport} 

This adds the extracted fields to the log message, so Graylog can search each of them individually. Queries become more precise as a result: you can, for example, look specifically for packets that came from a given source IP, instead of also matching that address wherever else it appears in the message, such as in the destination IP.
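
To make the mechanics more concrete, here is a rough sketch of how a Grok expression can be expanded into an ordinary regular expression with named groups. It is written in Python for illustration only and is not how Graylog's Grok library is implemented. The NUMBER and IP definitions are deliberately simplified stand-ins (IPv4 only), the helper name grok_to_regex is hypothetical, and nested pattern references are not resolved.

```python
import re

# Simplified stand-ins for the patterns listed above (assumption: IPv4 only).
PATTERNS = {
    "NUMBER": r"[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)",
    "IP": r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}",
}

# Matches %{NAME} and %{NAME:field}.
GROK_REF = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def grok_to_regex(grok: str) -> str:
    """Replace each %{NAME:field} reference with a named capturing group."""
    def expand(ref):
        name, field = ref.group(1), ref.group(2)
        body = PATTERNS[name]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return GROK_REF.sub(expand, grok)


line = "len=50824 src=172.17.22.108 sport=829 dst=192.168.70.66 dport=513"
grok = ("len=%{NUMBER:length} src=%{IP:srcip} sport=%{NUMBER:srcport} "
        "dst=%{IP:dstip} dport=%{NUMBER:dstport}")

fields = re.search(grok_to_regex(grok), line).groupdict()
print(fields)
```

Each named reference becomes one field, which is why a single Grok extractor can populate several message fields at once.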

If the Grok pattern creates many fields, which can happen if you make use of heavily nested patterns, you can tell Graylog to skip certain fields (and the output of their subpatterns) by naming a field with the special keyword UNWANTED.

Let’s say you want to parse a line like:

type:44 bytes:34 errors:122

But you are only interested in the second number, bytes. You could use a pattern like:

type:%{BASE10NUM:type} bytes:%{BASE10NUM:bytes} errors:%{BASE10NUM:errors}

However, this would create three fields named type, bytes, and errors. Even leaving the first and last patterns unnamed would still create a field named BASE10NUM. To ignore fields but still require that they match, use UNWANTED:

type:%{BASE10NUM:UNWANTED} bytes:%{BASE10NUM:bytes} errors:%{BASE10NUM:UNWANTED}

This creates a single field called bytes while making sure the entire pattern matches.
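
The behavior can be imitated with a small sketch, again in Python and only as an illustration of the idea rather than of Graylog's implementation. Here a reference named UNWANTED is turned into a non-capturing group, so it must still match but produces no field. The BASE10NUM definition is a simplified stand-in, and the helper name expand is hypothetical.

```python
import re

# Simplified stand-in for BASE10NUM (assumption, not Graylog's exact pattern).
BASE10NUM = r"[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)"


def expand(ref: re.Match) -> str:
    """UNWANTED must still match but yields no field; other names are captured."""
    field = ref.group(1)
    if field == "UNWANTED":
        return f"(?:{BASE10NUM})"
    return f"(?P<{field}>{BASE10NUM})"


grok = ("type:%{BASE10NUM:UNWANTED} bytes:%{BASE10NUM:bytes} "
        "errors:%{BASE10NUM:UNWANTED}")
regex = re.sub(r"%\{BASE10NUM:(\w+)\}", expand, grok)

fields = re.fullmatch(regex, "type:44 bytes:34 errors:122").groupdict()
print(fields)  # {'bytes': '34'}

# A line missing the trailing part does not match at all.
incomplete = re.fullmatch(regex, "type:44 bytes:34")
print(incomplete)  # None
```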

If you already know the data type of the extracted fields, you can make use of the type conversion feature built into the Graylog Grok library. Going back to the earlier example:

len=50824 src=172.17.22.108 sport=829 dst=192.168.70.66 dport=513

We know that the content of the field len is an integer and would like to make sure it is stored with that data type, so we can later create field graphs with it or access the field’s statistical values, like average etc.

Grok directly supports converting field values by adding ;datatype at the end of the pattern, such as:

len=%{NUMBER:length;int} src=%{IP:srcip} sport=%{NUMBER:srcport} dst=%{IP:dstip} dport=%{NUMBER:dstport}

The currently supported data types, and their corresponding ranges and values, are:

Type      Range                 Example
byte      -128 … 127            %{NUMBER:fieldname;byte}
short     -32768 … 32767        %{NUMBER:fieldname;short}
int       -2^31 … 2^31 - 1      %{NUMBER:fieldname;int}
long      -2^63 … 2^63 - 1      %{NUMBER:fieldname;long}
float     32-bit IEEE 754       %{NUMBER:fieldname;float}
double    64-bit IEEE 754       %{NUMBER:fieldname;double}
boolean   true, false           %{DATA:fieldname;boolean}
string    Any UTF-8 string      %{DATA:fieldname;string}
date      See SimpleDateFormat  %{DATA:timestamp;date;dd/MMM/yyyy:HH:mm:ss Z}
datetime  Alias for date

There are many resources on the web with useful patterns, and one very helpful tool is the Grok Debugger, which allows you to test your patterns while you develop them.

Graylog uses Java Grok to parse and run Grok patterns.

Using the JSON Extractor

Since version 1.2, Graylog also supports extracting data from messages sent in JSON format.

Using the JSON extractor is easy: once a Graylog input receives messages in JSON format, you can create an extractor by going to System -> Inputs and clicking on the Manage extractors button for that input. Next, you need to load a message to extract data from and select the field containing the JSON document. The following page lets you add some extra information to tell Graylog how it should extract the information. Here's an example of how messages are extracted:

{"level": "ERROR", "details": {"message": "This is an example error message", "controller": "IndexController", "tags": ["one", "two", "three"]}}

Using default settings, the above message would be extracted into these fields:

Field               Value
details_tags        one, two, three
level               ERROR
details_controller  IndexController
details_message     This is an example error message

On the create extractor page you can also customize how to separate a list of elements, keys, and key/values. It is also possible to flatten JSON structures or expand them into multiple fields, as shown in the example above.
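
The mapping from a nested JSON document to flat field names can be sketched as follows. This is a Python illustration of the result shown in the example, not Graylog's implementation. The function name flatten and its parameters are hypothetical; the separators ("_" between keys, ", " between list elements) are inferred from the example output.

```python
import json


def flatten(obj: dict, prefix: str = "", key_sep: str = "_", list_sep: str = ", ") -> dict:
    """Turn nested JSON objects into flat field names joined by key_sep."""
    fields = {}
    for key, value in obj.items():
        name = f"{prefix}{key_sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects, carrying the parent name along.
            fields.update(flatten(value, name, key_sep, list_sep))
        elif isinstance(value, list):
            # Lists become a single string with a separator between elements.
            fields[name] = list_sep.join(str(item) for item in value)
        else:
            fields[name] = value
    return fields


raw = ('{"level": "ERROR", "details": {"message": "This is an example error message", '
       '"controller": "IndexController", "tags": ["one", "two", "three"]}}')
fields = flatten(json.loads(raw))
for name, value in fields.items():
    print(f"{name}: {value}")
```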

Automatically Extract All key=value Pairs

You may receive a message like this:

This is a test message with some key/value pairs. key1=value1 some_other_key=foo

You might want to extract all key=value pairs into Graylog message fields without having to specify all possible key names or even their order. This is how you can easily do this:

Create a new extractor of type “Copy Input” and select the field message to read from (or any other string field that contains key=value pairs). Configure the extractor to store the (copied) field value in the same field, in this case message. The trick is to add the “Key=Value pairs to fields” converter as the last step. Because we use the “Copy Input” extractor, the converter will run over the complete field you selected and convert all key=value pairs it can find.

This is a screenshot of the complete extractor configuration:

keyvalueconverter1.png
… and this is the resulting message:

keyvalueconverter2.png
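
Conceptually, the converter scans the text for key=value tokens and turns each one into a field. A minimal Python sketch of that idea is shown below. The regular expression is an assumption made for this illustration; the converter's actual tokenizing rules (for example, how it treats quoted values) may differ.

```python
import re

message = ("This is a test message with some key/value pairs. "
           "key1=value1 some_other_key=foo")

# Assumed rule for this sketch: a key is a run of word characters,
# a value is everything up to the next whitespace.
fields = dict(re.findall(r"(\w+)=(\S+)", message))
print(fields)  # {'key1': 'value1', 'some_other_key': 'foo'}
```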

Normalization

Many log formats are similar to each other, but not quite the same. In particular they often only differ in the names attached to pieces of information.

For example, consider different hardware firewall vendors, whose models log the destination IP in different fields of the message. Some use dstip, some dst and yet others use destination-address:

2004-10-13 10:37:17 PDT Packet Length=50824, Source address=172.17.22.108, Source port=829, Destination address=192.168.70.66, Destination port=513
2004-10-13 10:37:17 PDT len=50824 src=172.17.22.108 sport=829 dst=192.168.70.66 dport=513
2004-10-13 10:37:17 PDT length="50824" srcip="172.17.22.108" srcport="829" dstip="192.168.70.66" dstport="513"

You can use one or more non-capturing groups to specify the alternative field names and still extract the value with a capturing group in the regular expression. Remember that Graylog will extract data from the first matched group of the regular expression. An example of a regular expression matching the destination IP field of all the log messages above is:

(?:dst|dstip|[dD]estination\saddress)="?(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"?

This will only extract the IP address without caring about which of the three naming schemes was used in the original log message. This way you don’t have to set up three different extractors.
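
The pattern can be checked against the three sample lines with a short script. Python's re module is used here only for illustration (Graylog evaluates the pattern with Java); this particular expression uses no engine-specific syntax.

```python
import re

DST_IP = re.compile(
    r'(?:dst|dstip|[dD]estination\saddress)="?(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"?'
)

lines = [
    "2004-10-13 10:37:17 PDT Packet Length=50824, Source address=172.17.22.108, "
    "Source port=829, Destination address=192.168.70.66, Destination port=513",
    "2004-10-13 10:37:17 PDT len=50824 src=172.17.22.108 sport=829 "
    "dst=192.168.70.66 dport=513",
    '2004-10-13 10:37:17 PDT length="50824" srcip="172.17.22.108" srcport="829" '
    'dstip="192.168.70.66" dstport="513"',
]

# Group 1 is the only capturing group, so it always holds the IP address.
extracted = [DST_IP.search(line).group(1) for line in lines]
print(extracted)  # ['192.168.70.66', '192.168.70.66', '192.168.70.66']
```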

The standard date converter

Date parser converters for extractors allow you to convert extracted data into timestamps - usually used to set the timestamp of a message based on some date it contains. Let’s assume we get this message from a network device:

<131>: foo-bar-dc3-org-de01: Mar 12 00:45:38: %LINK-3-UPDOWN: Interface GigabitEthernet0/31, changed state to down

Extracting most of the data is not a problem and can be done easily. However, we need a date parser converter to use the date in the message (Mar 12 00:45:38) as the Graylog message timestamp.

Use a copy input extractor rule to select the timestamp and apply the Date converter with a format string:

MMM dd HH:mm:ss

(see the format string table below)

dateparser1.png
dateparser2.png
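
For readers who want to see what this format string does to the sample timestamp, here is a rough Python equivalent. It is an illustration only: Python's strptime directives (%b %d %H:%M:%S) stand in for the MMM dd HH:mm:ss format string, month names are assumed to be English, and because the device omits the year, the sketch takes the year as an explicit, assumed parameter and treats the time as UTC.

```python
from datetime import datetime, timezone


def parse_device_timestamp(text: str, assumed_year: int) -> datetime:
    """Parse a 'Mar 12 00:45:38' style timestamp that carries no year."""
    # The year is prepended because the device does not send one.
    parsed = datetime.strptime(f"{assumed_year} {text}", "%Y %b %d %H:%M:%S")
    # Assumption of this sketch: the device reports UTC.
    return parsed.replace(tzinfo=timezone.utc)


timestamp = parse_device_timestamp("Mar 12 00:45:38", assumed_year=2014)
print(timestamp.isoformat())  # 2014-03-12T00:45:38+00:00
```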

Standard date converter format string table

Symbol  Meaning                      Presentation  Examples
G       era                          text          AD
C       century of era (>=0)         number        20
Y       year of era (>=0)            year          1996
x       weekyear                     year          1996
w       week of weekyear             number        27
e       day of week                  number        2
E       day of week                  text          Tuesday; Tue
y       year                         year          1996
D       day of year                  number        189
M       month of year                month         July; Jul; 07
d       day of month                 number        10
a       halfday of day               text          PM
K       hour of halfday (0~11)       number        0
h       clockhour of halfday (1~12)  number        12
H       hour of day (0~23)           number        0
k       clockhour of day (1~24)      number        24
m       minute of hour               number        30
s       second of minute             number        55
S       fraction of second           millis        978
z       time zone                    text          Pacific Standard Time; PST
Z       time zone offset/id          zone          -0800; -08:00; America/Los_Angeles
'       escape for text              delimiter
''      single quote                 literal       '

The flexible date converter

Now imagine you have one of those devices that send messages that are not so easy to parse because they do not follow a strict timestamp format. Some network devices, for example, send days of the month without a padding 0 for the first 9 days. You'll have dates like Mar 9 and Mar 10 and end up having problems defining a parser string for that. Or maybe you have something really exotic, like last Wednesday as a timestamp. The flexible date converter accepts any text data and tries to build a date from it as best it can.

Examples:

  • Mar 12, converted at 12:27:00 UTC in the year 2014: 2014-03-12T12:27:00.000
  • 2014-3-12 12:27: 2014-03-12T12:27:00.000
  • Mar 12 2pm: 2014-03-12T14:00:00.000

Note that the flexible date converter will use UTC as a default time zone unless you have time zone information in the parsed text or you have configured another time zone when adding the flexible date converter to an extractor (see this comprehensive list of time zones available for the flexible date converter).

