Filtering Logs With Miller
I occasionally monitor my /var/log/haproxy.log to see what the traffic on this blog looks like. I don’t have any analytics in place and don’t really think I want or need any. From what I’ve gathered in the logs, most of the traffic consists of attempts to find flaws in my server; the rest is from Google and other search-engine bots. I’m pretty sure I’m immune to most of the attacks because this is just a static site, but it never hurts to be sure.
I started by looking at the logs with vim, sometimes using grep to filter certain IPs for more detail. That doesn’t always work, though, and I miss the querying capabilities that the more advanced logging stacks (mostly ELK, or even Loki) provide. My next attempt was writing scripts to parse the log files into a SQLite database to make querying simpler, but then I found out about miller, which I’m definitely adding to my toolbelt.
The tool is very well documented, so I won’t go into more detail there, but here are just some commands I’ve been running to get more insight into my traffic.
To start, this is what my haproxy log lines look like:
Jun 3 23:41:01 sharmaso haproxy[59377]: a.b.c.d:59550 [03/Jun/2023:23:41:01.332] mynet~ mynet/<NOSRV> 0/-1/-1/-1/0 403 197 - - PR-- 1/1/0/0/0 0/0 {www.sharmaso.com} "GET / HTTP/1.1"
That is, it’s formatted as:
$date $username haproxy[$pid]: $client-ip:$client-port [$haproxy_date_time] $frontend $backend/node $haproxy_stats $status_code $bytes ...(unused)... {$hostname} "$http_verb /$http_path $http_version"
Miller works with a wide variety of formats, but for my purposes, I’m using the NIDX format, which the man page documents as:
NIDX: implicitly numerically indexed (Unix-toolkit style)
+---------------------+
| the quick brown | Record 1: "1" => "the", "2" => "quick", "3" => "brown"
| fox jumped | Record 2: "1" => "fox", "2" => "jumped"
+---------------------+
This will allow me to start processing the columns in the log lines with numeric indices.
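As a quick sanity check, you can make those implicit keys visible by reading a line as NIDX and writing it back out as JSON (a throwaway example, not from my logs):
echo 'the quick brown' | mlr --inidx --ifs ' ' --ojson cat
This prints a single record with keys "1", "2", and "3" mapped to the three words, matching the man-page sketch above.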
Using the prepipe option
Sometimes you might need to clean the input, and that’s where the prepipe option helps. Here, I’m removing the initial date field because I was having issues dealing with the day-of-month field: syslog pads single-digit days with an extra space, so Jun  3 (with its padding space) would produce 3 fields while Jun 10 would produce 2. There might be a better solution here, but this was the most straightforward.
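One such alternative, sketched here under the assumption that the timestamps always follow the standard syslog Mon dd HH:MM:SS shape, would be to strip the whole prefix with sed instead of counting colons:
--prepipe 'sed -E "s/^[[:alpha:]]{3} +[0-9]{1,2} [0-9:]{8} //"'
This removes the timestamp entirely, so the numeric field indices in the commands below would shift accordingly.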
mlr --prepipe 'cut -d: -f2-' \
--from /var/log/haproxy.log \
--n2c --opprint filter '$18 =~ "posts"'
: client-ip:54515 [05/Jun/2023:11:46:42.900] mynet~ blog_node/node 1/0/0/3/4 200 5683 - - ---- 1/1/0/0/0 0/0 {www.sharmaso.com} "GET /posts/security-tips/ HTTP/1.1"
: client-ip:24798 [05/Jun/2023:23:02:03.404] mynet~ blog_node/node 0/0/0/1/1 200 4123 - - ---- 3/3/0/0/0 0/0 {www.sharmaso.com} "GET /posts/firstpost/ HTTP/1.1"
With the command above, we read the file /var/log/haproxy.log, use the --n2c option to convert NIDX input to CSV, and output pretty-printed results via --opprint. The filter verb then keeps only the lines where the 18th column matches posts. That column holds the HTTP path, so we’re essentially looking for requests to my blog posts.
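The same idea works for any column. For instance, with this prepipe the HTTP status code lands in field 9 of my layout, so a quick sketch for isolating 403s and tallying them with Miller’s count verb would be:
mlr --prepipe 'cut -d: -f2-' --from /var/log/haproxy.log \
--n2c --opprint filter '$9 == 403' then count
Note that $9 == 403 relies on Miller’s type inference treating the field as a number.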
Using cut and put
Simply using the NIDX format can be a bit frustrating because you need to keep track of the column numbers. You can use the cut and put verbs to give the columns names instead.
For instance,
mlr --prepipe 'cut -d: -f4-' --from /var/log/haproxy.log \
--n2c --opprint cat \
then put '$client_ip_port=$2;$dt=$3;$backend=$4;$hostname=$14;$httpverb=$15;$httppath=$16;' \
then put '$client=joinv(mapselect(splitnvx($client_ip_port, ":"),1),"")' \
then cut -f client,backend,hostname,httpverb,httppath \
then filter '$httppath=~"posts"' \
then tail \
then put '$client="<some-ip>"'
The joinv(mapselect(splitnvx($client_ip_port, ":"),1),"") expression extracts the client from the client:port pair. I could have used a regex replacement there but decided to use this approach instead. Run mlr -f to list all the string functions provided by the tool.
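For comparison, the regex version I alluded to would be something like this sketch, using Miller’s sub function to drop the trailing :port:
then put '$client = sub($client_ip_port, ":[0-9]+$", "")'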
Either way, the full command outputs the relevant lines with the new column names, which can be further filtered if necessary.
backend hostname httpverb httppath client
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/short-posts/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/short-posts/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
mynet~ {www.sharmaso.com} "GET /posts/filter-logs-with-miller/ <some-ip>
Note that the last put is only there to mask the client IP; it isn’t otherwise necessary.
Getting some statistics
A common SQL query I run is select client, http_path, count(*) from table group by client, http_path to get counts for the fields I’m interested in. The Miller equivalent of that SQL is to append
then uniq -g client,httppath -c
which would produce
client httppath count
<some-ip> /posts/security-tips/ 2
<some-ip> /posts/security-tips/ 2
<some-ip> /posts/security-tips/ 2
<some-ip> /posts/security-tips/ 20
<some-ip> /posts/short-posts/ 6
<some-ip> /posts/firstpost/ 8
<some-ip> /posts/firstpost/ 2
<some-ip> /posts/security-tips/ 2
<some-ip> /posts/short-posts/ 2
<some-ip> /posts/security-tips/ 2
<some-ip> /posts/security-tips/ 2
<some-ip> /posts/firstpost/ 2
<some-ip> /posts/firstpost/ 2
<some-ip> /posts/security-tips/ 2
<some-ip> /posts/short-posts/ 2
<some-ip> /posts/filter-logs-with-miller/ 36
<some-ip> /posts/short-posts/ 2
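To see the most-requested paths first, the counts can be sorted numerically in the same pipeline; appending Miller’s sort verb does the trick:
then uniq -g client,httppath -c then sort -nr count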
See all the different verbs provided by miller here.