LogMX Regular Expression Parsers

Describing your log format with a Regular Expression is straightforward if you already know Regular Expression syntax. If you do not, the following examples should be enough to learn the basics. To read more about Regular Expressions, see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html (or search the web, since this syntax is also used by UNIX tools, Perl, Python, PHP, and others).

To create a Regex (Regular Expression) Parser, give its Regular Expression, then choose the log entry fields that correspond to the groups of your Regex, according to your log format, as in the following examples.
You can also use your own user-defined log entry fields (e.g. "PID", "Server") in your log format, using the combobox for field selection. You do not have to handle line wrapping in the "Message" field in your Regular Expression: LogMX handles it automatically. However, if all your log entries span at least two lines of text because a new line is used between some fields, see Example 10.

Table of contents:

Example 1 - Simple and short format
Example 2 - Simple example using .*?
Example 3 - Character classes / Quantifiers
Example 4 - Using negation to reduce matching
Example 5 - Quoting special characters
Example 6 - Date format
Example 7 - Dealing with extra whitespace characters (e.g. padding)
Example 8 - Optional fields
Example 9 - Non capturing groups
Example 10 - Multiple lines format

Example 1 - Simple and short format

Log format	LEVEL-EMITTER-MESSAGE
Regular expression	(.*)-(.*)-(.*)
Example of log entries to be parsed	INFO-MyEmitter-My log message
	WARN-MyEmitter-Another message...
	...on two lines

Parentheses ( ) in a Regular Expression mean "capture a log entry field". That's why we use them for each log entry field: Level, Emitter, and Message. The two parentheses and the text between them are called a "group".
.* means "any character, 0 or more times, trying to match as many characters as possible".

Example 2 - Simple example using .*?

Log format	LEVEL-EMITTER-MESSAGE
Regular expression	(.*?)-(.*?)-(.*)
Example of log entry to be parsed	INFO-MyEmitter-Server [prod-srv-1] is DOWN

.*? means "any character, 0 or more times, but try to match as few characters as possible".
.* means "any character, 0 or more times, trying to match as many characters as possible".
The '?' in .*? is very important here. If we use .* for the Emitter field instead of .*?, the text "INFO-MyEmitter-Server [prod-srv-1] is DOWN" will be parsed this way:
Level: INFO
Emitter: MyEmitter-Server [prod-srv
Message: 1] is DOWN
Whereas using .*? for the Emitter field, it will be parsed as expected:
Level: INFO
Emitter: MyEmitter
Message: Server [prod-srv-1] is DOWN
This is because .* will match as many characters as possible and .*? will match as few characters as possible.
Note: when you can use either .*? or .*, always go for .*?, which will improve your parser's performance.
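The difference between the two forms can be checked outside LogMX with plain java.util.regex. The sketch below is only an illustration: the class name GreedyVsLazy and its parse helper are hypothetical, and it assumes the whole entry must match the expression (Matcher.matches()), which may differ from what LogMX does internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class GreedyVsLazy {
    // Returns {level, emitter, message}, or null if the line does not match.
    static String[] parse(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "INFO-MyEmitter-Server [prod-srv-1] is DOWN";
        // Lazy Emitter group: stops at the first '-' that lets the rest match.
        System.out.println(String.join(" | ", parse("(.*?)-(.*?)-(.*)", line)));
        // Greedy Emitter group: extends up to the last '-' in the line.
        System.out.println(String.join(" | ", parse("(.*?)-(.*)-(.*)", line)));
    }
}
```

The first call yields the Emitter "MyEmitter", the second yields "MyEmitter-Server [prod-srv", matching the two parsing results described above.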

Example 3 - Character classes / Quantifiers

Log format	LEVEL-EMITTER-ClientID:MESSAGE
Regular expression	(.*?)-(.*?)-(\d+):(.*)
Example of log entries to be parsed	INFO-MyEmitter-243:My log message
	WARN-Emitter 2-3:Another message

The token \d+ means "any digit, one or more times".
We have seen above that .* means "any character, 0 or more times". More precisely, . means "any character", and * means "0 or more times".
Then \d means "any digit" and + means "one or more times". That's why * and + are called "quantifiers". Other quantifiers are:

Quantifier	Meaning
?	0 or 1 occurrence
+	1 occurrence or more
*	0 occurrence or more
{n}	Exactly n occurrences
{n,}	n or more occurrences
{n,N}	n to N occurrences

These quantifiers can be followed by a ? to match as few characters as possible, as presented above (i.e. 0 for ? and *, 1 for +).

As for . and \d used in this example, they are called "character classes". Regular Expressions offer many other character classes:

Character class or token	Meaning
.	Any character
[abc]	Character a, b, or c
[^abc]	Any character except a, b, or c (negation)
[a-z]	Any character between a and z, inclusive
[a-zA-Z]	Any character between a and z, or between A and Z, inclusive
[a-zA-Z0-9]	Any character between a and z, or between A and Z, or between 0 and 9, inclusive
\d	Any digit: synonym of [0-9]
\D	Any non-digit: synonym of [^0-9]
\s	Any whitespace character: synonym of [ \t\n\x0B\f\r]
\S	Any non-whitespace character: synonym of [^\s]
\w	Any word character: synonym of [a-zA-Z_0-9]
\W	Any non-word character: synonym of [^\w]
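As a quick check of the quantifiers and character classes listed above, here is a minimal java.util.regex sketch. The class QuantifierDemo and its fields helper are hypothetical names used only for illustration, and the sketch assumes each entry must match the expression in full.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class QuantifierDemo {
    // Returns {level, emitter, clientId, message}, or null if there is no match.
    static String[] fields(String line) {
        Matcher m = Pattern.compile("(.*?)-(.*?)-(\\d+):(.*)").matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", fields("INFO-MyEmitter-243:My log message")));
        System.out.println(String.join(" | ", fields("WARN-Emitter 2-3:Another message")));

        // {n,N}: between 2 and 3 digits.
        System.out.println(Pattern.matches("\\d{2,3}", "243"));         // 3 digits
        System.out.println(Pattern.matches("\\d{2,3}", "3"));           // only 1 digit
        // [a-zA-Z]+ : one or more letters.
        System.out.println(Pattern.matches("[a-zA-Z]+", "MyEmitter"));
        // \w does not include the space character.
        System.out.println(Pattern.matches("\\w+", "Emitter 2"));
    }
}
```

In the second entry, \d+ forces the ClientID to be the digits right before the colon, so the Emitter is "Emitter 2" and the ClientID is "3".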

Example 4 - Using negation to reduce matching

Instead of using .*? as described above, you may prefer to use negation to reduce matching. In the following example, we use [^<]* to capture the Emitter field. This regular expression means "0 or more characters that are not <". Indeed, the Emitter field ends when the < character is encountered:

Log format	LEVEL-EMITTER<THREAD>MESSAGE
Regular expression	(.*?)-([^<]*)<([^>]*)>(.*)
Example of log entry to be parsed	INFO-MyEmitter<MyThread>My log message
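The negated classes can be tried with plain java.util.regex, as in the following sketch. NegationDemo and fields are hypothetical names, and full-entry matching (Matcher.matches()) is an assumption of this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class NegationDemo {
    // Returns {level, emitter, thread, message}, or null if there is no match.
    static String[] fields(String line) {
        // [^<]* stops before the first '<', [^>]* stops before the first '>'.
        Matcher m = Pattern.compile("(.*?)-([^<]*)<([^>]*)>(.*)").matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] f = fields("INFO-MyEmitter<MyThread>My log message");
        System.out.println(String.join(" | ", f));
    }
}
```

Because [^<]* can never run past a < character, the Emitter group cannot swallow part of the Thread field, whatever follows in the message.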

Example 5 - Quoting special characters

If your log format contains special characters such as ( ) [ ] | . * + ? ^ \ these characters must be preceded by the \ character, since these characters have a special meaning in Regular Expression syntax:

Log format	[LEVEL] (EMITTER) MESSAGE
Regular expression	\[(.*?)\] \((.*?)\) (.*)
Example of log entry to be parsed	[INFO] (MyEmitter) My log message

Note: if the log format contains several consecutive special characters, you can surround them with \Q … \E to avoid prefixing each character with \. Example for this log format: \[(.*?)\Q] (\E(.*?)\) (.*)
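The two ways of quoting can be compared with java.util.regex, as sketched below. QuotingDemo and parse are hypothetical names. Note that inside a Java string literal each \ of the expression must itself be doubled; this applies to Java source code only, not to an expression typed into a text field. Pattern.quote is a standard Java helper that wraps a literal in \Q … \E.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class QuotingDemo {
    // Every special character escaped individually.
    static final String ESCAPED = "\\[(.*?)\\] \\((.*?)\\) (.*)";
    // The run "] (" quoted as a whole with \Q ... \E.
    static final String QUOTED = "\\[(.*?)\\Q] (\\E(.*?)\\) (.*)";
    // Same as QUOTED, with the quoted run built by Pattern.quote.
    static final String BUILT = "\\[(.*?)" + Pattern.quote("] (") + "(.*?)\\) (.*)";

    // Returns {level, emitter, message}, or null if there is no match.
    static String[] parse(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "[INFO] (MyEmitter) My log message";
        System.out.println(Arrays.toString(parse(ESCAPED, line)));
        System.out.println(Arrays.toString(parse(QUOTED, line)));
        System.out.println(Arrays.toString(parse(BUILT, line)));
    }
}
```

All three expressions extract the same Level, Emitter, and Message from the sample entry.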

Example 6 - Date format

When your log format contains a date, you can optionally specify the format used for this date. If you do not, you will not be able to use date/time features such as elapsed time computation, calendar filter, or time statistics.

Log format	dd/MM/yy HH:mm:ss.SSS LEVEL EMITTER MESSAGE
Regular expression	(\S+ \S+) (.*?) (.*?) (.*)
Date format	dd/MM/yy HH:mm:ss.SSS
Example of log entries to be parsed	11/03/14 12:34:56.789 INFO MyEmitter My log message

To include fixed characters in the date format, surround them with single quotes ' ', as in this example:
dd MMMM yyyy, hh 'o''clock' a, zzzz
to match:
11 March 2014, 12 o'clock PM, Pacific Daylight Time
LogMX uses the Java date format syntax. See http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html if you want to describe more advanced date formats (e.g. time zones, fixed characters in date).
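To see how a captured date field and a date format work together, here is a minimal sketch using java.util.regex and java.text.SimpleDateFormat. DateFieldDemo and normalizedDate are hypothetical names, and the sketch uses the two-digit-year pattern dd/MM/yy because the sample entry writes the year as "14". How LogMX itself applies the date format is not shown here.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class DateFieldDemo {
    // Captures the date field, parses it, and re-formats it with a 4-digit year.
    static String normalizedDate(String line) {
        Matcher m = Pattern.compile("(\\S+ \\S+) (.*?) (.*?) (.*)").matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Line does not match: " + line);
        }
        try {
            SimpleDateFormat in = new SimpleDateFormat("dd/MM/yy HH:mm:ss.SSS", Locale.ENGLISH);
            Date date = in.parse(m.group(1));
            SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS", Locale.ENGLISH);
            return out.format(date);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad date: " + m.group(1), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalizedDate("11/03/14 12:34:56.789 INFO MyEmitter My log message"));
    }
}
```

With the yy pattern, SimpleDateFormat resolves a two-digit year relative to the current date (within 80 years before and 20 years after), so "14" is read as 2014. With yyyy, the same text would be read literally as year 14.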

Example 7 - Dealing with extra whitespace characters (e.g. padding)

If your log format pads fields with whitespace characters (to visually "align" fields in columns, for example), you can tell LogMX that one or more whitespace characters are used via \s+, as shown below:

Log format	LEVEL EMITTER MESSAGE
Regular expression	(.*?)\s+(.*?) (.*)
Example of log entries to be parsed	INFO          MyEmitter My log message
	CRITICALERROR MyEmitter My log message
	WARNING       MyEmitter My log message
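The effect of \s+ on padded entries can be checked with the sketch below. PaddingDemo and fields are hypothetical names, and the amount of padding in the sample lines is an assumption chosen to align the Emitter column.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class PaddingDemo {
    // Returns {level, emitter, message}, or null if there is no match.
    static String[] fields(String line) {
        // \s+ absorbs however many spaces pad the Level column.
        Matcher m = Pattern.compile("(.*?)\\s+(.*?) (.*)").matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "INFO          MyEmitter My log message",
            "CRITICALERROR MyEmitter My log message",
            "WARNING       MyEmitter My log message"
        };
        for (String line : lines) {
            System.out.println(String.join(" | ", fields(line)));
        }
    }
}
```

The Level field comes out without trailing spaces in all three cases, and the Emitter is "MyEmitter" each time.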

Example 8 - Optional fields

Some log entry fields can be optional. To make a field optional, use the ? quantifier on its group:

Log format	LEVEL EMITTER [ClientID]MESSAGE or LEVEL EMITTER MESSAGE
Regular expression	(.*?) (.*?) (\[\d+\])?(.*)
Example of log entries to be parsed	INFO MyEmitter My message
	INFO MyEmitter [243]My message

In this example, we capture ClientID in a user-defined field, including its surrounding [ ]. To see how to exclude part of an optional field, see the next example (Non capturing groups).
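In plain java.util.regex, an optional group that does not take part in the match is reported as null. The sketch below illustrates this; OptionalFieldDemo and clientId are hypothetical names, and how LogMX displays an absent field is not covered here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class OptionalFieldDemo {
    static final Pattern ENTRY = Pattern.compile("(.*?) (.*?) (\\[\\d+\\])?(.*)");

    // Returns the ClientID with its brackets, or null when the field is absent.
    static String clientId(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.matches() ? m.group(3) : null;
    }

    // Returns the Message field, or null if the line does not match.
    static String message(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        System.out.println(clientId("INFO MyEmitter [243]My message")); // field present
        System.out.println(clientId("INFO MyEmitter My message"));      // field absent
        System.out.println(message("INFO MyEmitter [243]My message"));
    }
}
```

Both sample entries match the same expression, and the Message is "My message" in each case.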

Example 9 - Non capturing groups

The non capturing group syntax is (?: ). It is often used when a group of text is optional and should not be captured (e.g. (?: )?), as in this example:

Log format	LEVEL EMITTER <ClientID> MESSAGE or LEVEL EMITTER MESSAGE
Regular expression	(.*?) (.*?) (?:<(\d+)> )?(.*)
Example of log entries to be parsed	INFO MyEmitter My message
	INFO MyEmitter <243> My message

In this example, we capture only ClientID in a user-defined field, without its surrounding < >.
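The following java.util.regex sketch shows that the (?: ) wrapper groups text without adding a captured field. NonCapturingDemo and its helpers are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class NonCapturingDemo {
    static final Pattern ENTRY = Pattern.compile("(.*?) (.*?) (?:<(\\d+)> )?(.*)");

    // Number of capturing groups: (?: ) itself is not counted.
    static int capturedFields() {
        return ENTRY.matcher("").groupCount();
    }

    // Returns the ClientID digits only, or null when the field is absent.
    static String clientId(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(capturedFields());
        System.out.println(clientId("INFO MyEmitter <243> My message"));
        System.out.println(clientId("INFO MyEmitter My message"));
    }
}
```

The expression has four capturing groups (Level, Emitter, ClientID, Message). The < > delimiters and the trailing space belong to the non capturing group, so they are matched but not captured.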

Example 10 - Multiple lines format

If your log format uses several lines of text for each log entry (e.g. XML file, or more generally, two entry fields that must be separated with a string containing a new line character), you can use \n as shown below:

Log format	[ENTRY]LEVEL[/LEVEL]
	EMITTER[/EMITTER]
	MESSAGE[/ENTRY]
Regular expression	\[ENTRY\](.*?)\[/LEVEL\]\n(.*?)\[/EMITTER\]\n(.*)\[/ENTRY\]
Example of log entry to be parsed	[ENTRY]INFO[/LEVEL]
	MyEmitter[/EMITTER]
	My message[/ENTRY]
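A multi-line entry can be matched the same way in plain java.util.regex, as sketched below. MultiLineDemo and fields are hypothetical names, and the sketch assumes the entry uses \n line endings and is matched as one string. How LogMX splits a file into entries is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LogMX.
class MultiLineDemo {
    // Returns {level, emitter, message}, or null if there is no match.
    static String[] fields(String entry) {
        Matcher m = Pattern.compile(
                "\\[ENTRY\\](.*?)\\[/LEVEL\\]\\n(.*?)\\[/EMITTER\\]\\n(.*)\\[/ENTRY\\]")
                .matcher(entry);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String entry = "[ENTRY]INFO[/LEVEL]\nMyEmitter[/EMITTER]\nMy message[/ENTRY]";
        System.out.println(String.join(" | ", fields(entry)));
    }
}
```

In java.util.regex, . does not match a line terminator unless Pattern.DOTALL is set, so each group stays on its own line and the explicit \n tokens are what join the three lines.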