Structuring Unstructured Data with GROK

Elastic (ELK) Stack Tips and Tricks for Transforming Log Data

If you’re using the Elastic (ELK) Stack and are interested in mapping custom Logstash logs to Elasticsearch, then this post is for you.

The ELK Stack is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Together, they form a log management platform.

Elasticsearch is a search and analytics engine.
Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch.
Kibana lets users visualize Elasticsearch data with charts and graphs.

Beats came later and is a lightweight data shipper. The introduction of Beats transformed the ELK Stack into the Elastic Stack, but that is beside the point.

This article focuses on Grok, which is a feature within Logstash that can transform your logs before they are forwarded to a stash. For our purposes, I will only talk about processing data from Logstash to Elasticsearch.

Grok

Grok is a filter within Logstash that is used to parse unstructured data into something structured and queryable. It sits on top of regular expressions (regex) and uses text patterns to match lines in log files.

As we will see in the following sections, using Grok makes a big difference when it comes to effective log management.

Without Grok your Log Data is Unstructured

A single log line in Kibana.

Without Grok, when logs are sent from Logstash to Elasticsearch and rendered in Kibana, each log line appears only in the message value.

Querying for meaningful information is difficult in this situation because all of the log data is stored under one key. It would be better if each part of the log message had its own field.

Log Data

Unstructured

localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0

If you take a closer look at the raw data, you can see that it’s actually made up of different parts, each separated by a space delimiter.

More experienced developers can probably guess what each part means and that it’s a log message from an API call. The meaning of each item is outlined below.

Structured

localhost == environment
GET == method
/v2/applink/5c2f4bb3e9fda1234edc64d == url
400 == response_status
46ms == response_time
5bc6e716b5d6cb35fc9687c0 == user_id
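To make the mapping above concrete, here is a minimal sketch in Python that splits the sample line on spaces and labels each part. It is only an illustration of the idea, not how Logstash does it, and the helper name split_log_line is hypothetical.

```python
# Field names follow the mapping listed above, in the order they appear in the log line.
FIELDS = ["environment", "method", "url", "response_status", "response_time", "user_id"]

def split_log_line(line):
    """Split a space-delimited log line into named fields (hypothetical helper)."""
    parts = line.split(" ")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))

line = "localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0"
record = split_log_line(line)
print(record["method"], record["response_status"])
```

A plain split like this breaks as soon as a field contains a space or a line has a different shape, which is part of why a pattern-based tool like Grok is the better fit.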

As the structured view shows, there is a consistent order to the seemingly unstructured log. The next step is to refine the raw data programmatically. This is where Grok shines.

Grok Patterns

Built In

Logstash comes with over 100 built-in patterns for structuring unstructured data. You should definitely take advantage of these when possible for common system logs like Apache, Linux, HAProxy, AWS, and so forth.

However, what happens when you have custom logs like the example above? You have to build your own custom Grok pattern.

Custom

It takes trial and error to build your own custom Grok pattern. I used the Grok Debugger and the Grok Patterns list to figure it out.

Please note that the syntax for Grok patterns is: %{SYNTAX:SEMANTIC}
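Since Grok sits on top of regex, it can help to see what a %{SYNTAX:SEMANTIC} token roughly turns into. The Python sketch below expands each token into a named regex group. The three pattern definitions are simplified stand-ins I wrote for illustration, not Logstash's actual definitions, and grok_to_regex is a hypothetical helper.

```python
import re

# Simplified stand-ins for a few Grok patterns (assumptions for illustration);
# the real definitions in Logstash's grok-patterns file are more thorough.
PATTERNS = {
    "WORD": r"\b\w+\b",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
    "URIPATH": r"(?:/[^\s/?#]*)+",
}

def grok_to_regex(grok):
    """Expand each %{SYNTAX:SEMANTIC} token into a named regex group."""
    def expand(token):
        syntax, semantic = token.group(1), token.group(2)
        return f"(?P<{semantic}>{PATTERNS[syntax]})"
    return re.sub(r"%\{(\w+):(\w+)\}", expand, grok)

regex = grok_to_regex("%{WORD:method} %{URIPATH:url} %{NUMBER:response_status}")
match = re.search(regex, "GET /v2/applink/5c2f4bb3e9fda1234edc64d 400")
print(match.groupdict())
```

SYNTAX picks the pattern that has to match, and SEMANTIC becomes the field name the matched text is stored under.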

The first thing I tried was the Discover tab in Grok Debugger. I thought it would be great if this tool could auto-generate the Grok pattern, but it wasn’t too helpful as it only found two matches.

Grok Debugger ‘Discover’ only matched 2 words

With that as a starting point, I began building my own pattern in Grok Debugger using the syntax found on Elastic’s GitHub page.

https://github.com/elastic/logstash/blob/v1.4.2/patterns/grok-patterns

After playing around with different syntaxes, I was finally able to structure the log data in the way I wanted to.

Structuring Unstructured Log Data with Grok Debugger

https://grokdebug.herokuapp.com/

localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0

%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id}

{"environment": [["localhost"]],"method": [["GET"]],"url": [["/v2/applink/5c2f4bb3e9fda1234edc64d"]],"response_status": [["400"]],"BASE10NUM": [["400"]],"response_time": [["46ms"]],"user_id": [["5bc6e716b5d6cb35fc9687c0"]]}
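Note that every value in the output above is a string. As a rough way to sanity-check a pattern outside of Logstash, the Python sketch below approximates the same match with a hand-written regex and then converts the numeric fields. The character classes are my own simplified assumptions rather than Logstash's exact definitions, and the response_time_ms field is a hypothetical addition.

```python
import re

# Hand-written approximation of the Grok pattern above; the character classes
# are simplified assumptions, not Logstash's exact definitions.
LOG_LINE = re.compile(
    r"^(?P<environment>\w+) "
    r"(?P<method>\w+) "
    r"(?P<url>/\S*) "
    r"(?P<response_status>\d+) "
    r"(?P<response_time>\w+) "
    r"(?P<user_id>[\w.-]+)$"
)

def parse(line):
    """Return the named fields of a log line, or None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Captures are strings by default; convert the numeric ones explicitly.
    fields["response_status"] = int(fields["response_status"])
    raw_time = fields["response_time"]
    if raw_time.endswith("ms") and raw_time[:-2].isdigit():
        # Hypothetical extra field holding the duration as an integer.
        fields["response_time_ms"] = int(raw_time[:-2])
    return fields

parsed = parse("localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0")
print(parsed)
print(parse("not a log line"))
```

In Logstash itself, Grok can do a similar conversion when you append a type to the token, for example %{NUMBER:response_status:int}.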

With the Grok pattern in hand and the data mapped, the final step is to add it to Logstash.

Update Logstash.conf

On the server where you installed the ELK stack, open the Logstash config file.

sudo vi /etc/logstash/conf.d/logstash.conf

Paste in the changes.

input {
  file {
    path => "/your_logs/*.log"
  }
}

filter {
  grok {
    match => { "message" => "%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id}" }
  }
}

output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
  }
}

After you save the changes, restart Logstash and check its status to make sure that it’s still working.

sudo service logstash restart
sudo service logstash status

Lastly, to make sure that the changes take effect, be sure to refresh the Elasticsearch index for Logstash in Kibana!

Refresh the Elasticsearch index for Logstash in Kibana

With Grok your Log Data is Structured!

Grok automatically structures unstructured logs

As we can see in the image above, Grok maps each part of the log line to its own field in Elasticsearch. This makes it easier to manage your logs and to quickly query for information. Instead of digging through log files to debug, you can simply filter by the field you’re looking for, like environment or url.
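Once the fields exist, the same kind of filtering can also be done outside of Kibana. As a rough sketch, the Python below builds a search request body that filters on the new fields. It only constructs the JSON and does not talk to Elasticsearch; build_filter_query is a hypothetical helper, and how exact the matching is depends on how your index maps these fields.

```python
import json

def build_filter_query(**fields):
    """Build an Elasticsearch bool query that requires every given field to match.

    Hypothetical helper: it only assembles the request body as a dict.
    """
    return {
        "query": {
            "bool": {
                "filter": [{"match": {name: value}} for name, value in fields.items()]
            }
        }
    }

body = build_filter_query(environment="localhost", response_status="400")
print(json.dumps(body, indent=2))
```

A body like this could be sent to the search endpoint of whichever index your Logstash output writes to.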

Try giving Grok expressions a shot! If you have another way of doing this or you have any problems with examples above, just drop a comment below to let me know.


Resources

https://www.elastic.co/blog/do-you-grok-grok

https://github.com/elastic/logstash/blob/v1.4.2/patterns/grok-patterns

https://grokdebug.herokuapp.com/