For the past week we’ve been seeing a massive increase in Fluent Bit log-parsing errors in Logs Explorer. They are reported at the node level, on the order of several thousand entries per hour.
This is on Kubernetes 1.25.8-gke.500 with e2-standard-4 nodes and ConfigMap version fluentbit-gke-config-v1.2.0 (the default GKE Fluent Bit v1.8.12 installation).
You can configure Fluent Bit to filter out certain information before it is logged.
1. Open the kubernetes/fluentbit-configmap.yaml file in an editor.
2. Uncomment the lines after ### sample log scrubbing filters and before ### end sample log scrubbing filters.
3. Change the name of the ConfigMap from fluent-bit-config to fluent-bit-config-filtered by editing the metadata.name field.
4. Save and close the file.
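The rename in step 3 is a one-line change in the ConfigMap metadata. A minimal sketch of the relevant fragment (the namespace and data contents here are illustrative, not taken from the actual GKE manifest):

```yaml
# kubernetes/fluentbit-configmap.yaml (fragment)
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config-filtered   # was: fluent-bit-config
  namespace: kube-system             # assumed namespace
data:
  # ... existing parser/filter sections, with the sample
  # log scrubbing filters uncommented ...
```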
To apply the update, change the DaemonSet to use the new ConfigMap that contains your chosen filters. Use the Kubernetes rolling-update feature and preserve the old version of the ConfigMap.
Thanks for the answer. Will this be reverted to the Google default with the next update?
To clarify, we have not made any changes ourselves and want to remain on the upstream version. We are simply confused as to why this problem suddenly appeared with the latest update.
Currently, this problem is generating about 16,000 log entries per hour on top of our normal cluster logs.
Just like Carsten, we have not made any changes ourselves. Could this be resolved in the next update?
We have one more cluster on v1.24.11-gke.1000 with fluent-bit ConfigMap version 1.1.5. We compared the glog parser that is failing, and both versions have the same Time_Format value. We can conclude, at least, that what has changed is the component emitting the date (the datetime format).
Even though we were able to create a copy of fluent-bit’s ConfigMap, we couldn’t update the DaemonSet. The DaemonSet and the original ConfigMap carry a label that delegates control over them to the Kubernetes Addon Manager, which can “block” any update attempt: we updated the spec many times, only to find the DaemonSet reverted to the original ConfigMap again.
What we were trying to change in the ConfigMap was glog’s Time_Format, from:
Time_Format %m%d %H:%M:%S.%L%z → Time_Format %m%d %H:%M:%S.%f
This should have been enough to fix the problem.
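For what it’s worth, the difference between the two formats can be illustrated with Python’s strptime, which supports %f for fractional seconds but has no %L token. This is just an illustrative sketch: the sample timestamp is made up, and it is Fluent Bit’s own strptime implementation that actually matters here.

```python
from datetime import datetime

# A glog-style timestamp: month/day, time, fractional seconds, no timezone.
ts = "0612 14:23:45.123456"  # hypothetical sample

# "%m%d %H:%M:%S.%f" parses the fractional seconds cleanly.
parsed = datetime.strptime(ts, "%m%d %H:%M:%S.%f")
print(parsed.microsecond)  # 123456

# A format that expects a trailing timezone offset (the %z in the original
# Time_Format) fails on input that has none, which mirrors the
# "invalid time format" errors in the logs.
try:
    datetime.strptime(ts, "%m%d %H:%M:%S.%f%z")
except ValueError as err:
    print("parse failed:", err)
```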
We read a bit about the Addon Manager, and apparently changing its configuration or behavior requires changing node images. The other alternative (to the best of our understanding) is to uninstall the managed fluent-bit and install a fully custom configuration.
At this point, we are wondering whether recreating our cluster with a previous GKE version would be the best option.
We’ve fixed this issue internally and will start rolling it out over the next couple of weeks.
If you have not opened a support case, I’d suggest doing so in order to better track the release for the fix.
We started seeing this issue on Friday (9 June), initially on version 1.25.8, without having made any changes to the cluster ourselves. We tried upgrading to 1.26.4 today to see if that would fix the problem, but it remains. We are getting 100,000+ warnings and errors per hour because of this.
The problem is easy to reproduce: simply create a new cluster in the console. Select Standard mode, static version 1.26.4-gke.1400, and leave everything else at default. After the cluster is created, the logs will start filling up with the “invalid time format” and “cannot parse” errors.
It might take quite some time before we are able to roll out a fix for this. In the meantime, the recommended action to take is to create an exclusion filter on the “_Default” Log Router Sink.
You can find the Log Router here. In the list, you’ll see a bucket/sink named “_Default”.
1. Click the three dots to the right and select “Edit Sink”.
2. Scroll down to “Choose logs to filter out of sink (optional)”.
3. Click “+ ADD EXCLUSION”.
4. Name it something like “FluentBit Parser Errors”.
5. In the filter box, enter the following query:
LOG_ID("fluentbit")
(jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))
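If you want to sanity-check the exclusion before applying it, the two message patterns can be tested locally against sample log lines. A small sketch with Python’s re module (Cloud Logging’s =~ uses RE2, for which re is a close stand-in on patterns this simple; the sample messages below are made up for illustration, only the regexes come from the filter above):

```python
import re

# The two message patterns used in the exclusion filter above.
cannot_parse = re.compile(r"^cannot parse .* after %L$")
invalid_time = re.compile(r"^invalid time format")

# Hypothetical examples of the kinds of messages this thread is about.
samples = [
    "cannot parse '45.123456789' after %L",    # hits the first pattern
    "invalid time format %m%d %H:%M:%S.%L%z",  # hits the second pattern
    "stream processor started",                # ordinary line, not excluded
]

for msg in samples:
    excluded = bool(cannot_parse.search(msg) or invalid_time.search(msg))
    print(f"excluded={excluded!s:5} {msg}")
```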
Is there any update on the fix for this being rolled out? I opened a support request, but they gave me the same recommendation to create an exclusion rule.
Sorry if this is a stupid question, but with a bug report open and this clearly being a system-level issue rather than a user error, shouldn’t we get a refund for the amount billed due to this error?
Our monthly bill increased by roughly 33%, and we were lucky: we only started getting this error spam on the 26th of June and noticed it just now.
I don’t want to think about what it would have cost if the error had started toward the beginning of the month.
If that’s not the case, can someone clarify why?
I really hope so; it’s just not fair to charge users for something that was not our fault. In my case I used Autopilot, the default option that is supposed to configure everything correctly out of the box.