Running a Kubernetes-based infrastructure is challenging and complex. Administrators often lament how complicated performance optimization and monitoring are, which can lead to problems in production. Additionally, even finely-tuned Kubernetes deployments can encounter sporadic issues.
When Kubernetes starts behaving in strange ways, digging into logs can help you uncover breadcrumbs. These contextual hints can help lead you to possible solutions.
In this guide, you’ll learn the basics of Kubernetes audit logging, as well as advice for how to set it up and choose an appropriate backend. You’ll also learn about best practices for getting the most value from the processes.
Logs are naturally central to audit logging; throughout Kubernetes’ runtime, the system is set up to track performance metrics, metadata, and any administrative activity that’s deemed significant. From the point of capture, these logs are “streamed” to their storage destinations, where they’re then available for retrospective analysis. Each is time-stamped for added context.
During the audit logging process, those with access to the historical log records look for any information that stands out or outwardly signals questionable activity, such as unexpected logins or system crashes. Effective auditing ultimately allows you to answer the following questions:

- What happened, and when did it happen?
- Who initiated it, and on what resource?
- Where was it observed, and from where was it initiated?
Logs exist locally on the filesystem, or they can be transported via API to an external storage location. However, auditing isn’t always easy because Kubernetes doesn’t log everything in its base configuration.
Logging everything also isn’t advisable, because mass logging generates plenty of noise that obscures useful data points. You’ll have to sift through a larger haystack before locating those informative “needles,” which are consequently easier to overlook.
The burden of capturing data at the appropriate audit level (which directly impacts the quantity of logged data) falls on administrators. Audit logging is particularly resource intensive on the server because context is stored for each request. When systemic changes are made within Kubernetes, associated changes in behavior often follow across the cluster.
That said, events can vary widely within Kubernetes, and it’s important to troubleshoot auditing to maintain strong security. While it’s understood that complex systems can go haywire of their own accord, bad actors or haphazard session activity are sometimes to blame. Accordingly, crashes, loops and memory leaks can indicate the existence of potential vulnerabilities.
Audit logging is also often crucial for compliance purposes. In sensitive industries that leverage microservices, closely detailing infrastructure activity helps organizations satisfy regulatory requirements such as HIPAA, PCI DSS or SOX. Regular security evaluations and disaster recovery are hallmarks of these regulations.
You know you need to audit regularly, but how do you do it? Crafting a clear-cut set of policies around auditing will ensure that Kubernetes captures the proper information to begin with.
Audit policies follow a specific structure, defined within the audit.k8s.io API group, which contains customizable rules dictating logging behavior. Kubernetes processes all events under this umbrella, and each action is compared against the policy’s rule list to determine its auditing level.
Policy rules are required under Kubernetes’ auditing guidelines, and admins should design their rules based on their unique deployments. A few things can determine this; for instance, your available server memory can place a resource restriction on log gathering over the course of the Kubernetes cluster’s existence. Logs can be dumped after a certain time interval, but accumulation can still create excess burden depending on the audit pacing. Accordingly, collecting logs of the deepest level may be unnecessary for your infrastructure and its purpose.
Here’s how these levels break down:

- None: events matching this rule aren’t logged.
- Metadata: metadata related to users, timestamps, resources, verbs, and more is logged; however, request and response bodies are not.
- Request: event metadata and the request body are logged, but the response body is excluded. This doesn’t apply to non-resource requests.
- RequestResponse: event metadata plus request and response bodies are logged. This doesn’t apply to non-resource requests.
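To make these levels concrete, here’s a minimal sketch of an audit policy file. The specific verbs and resources chosen here are illustrative, not a recommended production policy:

```yaml
# audit-policy.yaml — a minimal sketch; the rule choices are illustrative
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Don't log noisy read-only requests against endpoints
  - level: None
    verbs: ["get", "watch", "list"]
    resources:
      - group: ""              # "" denotes the core API group
        resources: ["endpoints"]
  # Log pod writes with metadata plus the request body
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["pods"]
  # Log Secret access at Metadata only, so secret payloads never land in logs
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Catch-all: everything else is logged at the Metadata level
  - level: Metadata
```

Because rules are evaluated in order and the first match wins, the catch-all rule belongs at the end.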
Stages are another way to granularly control your Kubernetes auditing strategy. With stages, each request can be recorded depending on where it is in the request-response lifecycle. The defined stages are:

- RequestReceived: generated as soon as the audit handler receives the request.
- ResponseStarted: generated once the response headers are sent, but before the response body, for long-running requests such as watch.
- ResponseComplete: generated when the response body has been completed.
- Panic: generated when a panic occurs.
Data is captured while Kubernetes handles API requests during runtime. For example, you can create a rule where Kubernetes logs an event only once a specific API response has completed. This is a simplistic example, but it highlights how levels (Metadata, for example) interact with stage rules (like ResponseComplete).
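In the policy file, stages are filtered out rather than opted into, via omitStages. A small sketch of the example above, where a request is recorded at the Metadata level only after its response is under way or complete:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage globally, so no event is emitted
# when a request first arrives — only at later stages
omitStages:
  - "RequestReceived"
rules:
  - level: Metadata
```

omitStages can also be set per rule, which lets you suppress stages for noisy resources while keeping them elsewhere.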
Audit policy rules live within a YAML configuration file, and this set of policies exists in Kubernetes as a Policy object with an ordered list of rules. But what if you want to target multiple resources with your rules? Defining GroupResources entries, using strings for resourceNames, lets you design unique specifications atop K8s’ stock logging configuration.
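A sketch of a rule scoped to a group resource and then narrowed to one named object; the ConfigMap name here is hypothetical:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log full request and response bodies, but only for one named object
  # ("payment-gateway-config" is a hypothetical example)
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["configmaps"]
        resourceNames: ["payment-gateway-config"]
  # All other ConfigMaps are logged at the lighter Metadata level
  - level: Metadata
    resources:
      - group: ""
        resources: ["configmaps"]
```

This pattern keeps deep logging focused on the few objects that warrant it, instead of paying the RequestResponse cost everywhere.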
Note that escalating your logging levels means logs will take longer to propagate; collection can take anywhere from a few seconds to several minutes, and that active collection carries an ongoing resource cost until it completes. That’s the price of gathering extra data, so weigh your server setup and application demands in parallel while designing these policies.
Additionally, the stage your business is in may determine your appetite for logging. A startup may be less audit-focused than a company with mature products, yet a strong audit log is useful for troubleshooting regardless of company size. If you’re running a homelab, you might simply be curious about what actions you took last weekend, whereas a fintech startup will find logging critical to staying SOC 2 compliant.
It’s important to consider where your logs are going in order to keep highly sensitive data locked down, all the more so when security concerns are elevated. If you have the necessary disk space, writing your Kubernetes audit logs to the local filesystem is one of Kubernetes’ default capabilities.
However, this approach might be problematic given K8s’ distributed nature; spreading logging data across stoppable nodes puts that data in a vulnerable position. That information might not be retrievable when disks become unavailable.
Alternatively, the webhook backend leverages an external HTTP API to receive events. When using this method, it is important to configure your networking tools effectively to help you avoid exposing the internal backend — keeping your logs and their repositories relatively safe.
This has its advantages. Webhooks ultimately make remote log access possible since these files end up residing in an external, cloud-based repository. These outside services can also afford you more storage space when you’re resource-strapped.
Remember to properly configure your backend, just as you would your auditing rules. This is done via flags passed directly to the kube-apiserver. You can choose a logging path, determine how long logs are retained, and define the maximum number of retained files or their maximum size in megabytes, as desired. This helps prevent logs from becoming too burdensome.
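A sketch of those flags as they might appear in a kube-apiserver static pod manifest; the file paths and retention values are illustrative:

```yaml
# Excerpt from a kube-apiserver static pod manifest
# (commonly /etc/kubernetes/manifests/kube-apiserver.yaml on a control plane node)
spec:
  containers:
  - command:
    - kube-apiserver
    - --audit-policy-file=/etc/kubernetes/audit-policy.yaml  # which rules to apply
    - --audit-log-path=/var/log/kubernetes/audit/audit.log   # log backend destination
    - --audit-log-maxage=30      # days to retain old log files
    - --audit-log-maxbackup=10   # maximum number of retained files
    - --audit-log-maxsize=100    # max file size in MB before rotation
```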
Before you determine where your logs should end up, find out whether your control plane runs the kube-apiserver as a pod. If so, you must mount a hostPath to the location of your policy files and log files; otherwise, your records won’t persist.
Within your YAML configuration file, mount the appropriate volume designated for auditing, including its name, path, and read-only status. Once you’ve completed this step, simply finish configuring your hostPath, and you’re good to go.
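A sketch of those mounts, again assuming the illustrative paths used above:

```yaml
# Excerpt: making the policy file and log directory available inside
# the kube-apiserver pod, backed by hostPath volumes on the node
spec:
  containers:
  - name: kube-apiserver
    volumeMounts:
    - name: audit-policy
      mountPath: /etc/kubernetes/audit-policy.yaml
      readOnly: true             # the policy only needs to be read
    - name: audit-log
      mountPath: /var/log/kubernetes/audit/
      readOnly: false            # the API server must write logs here
  volumes:
  - name: audit-policy
    hostPath:
      path: /etc/kubernetes/audit-policy.yaml
      type: File
  - name: audit-log
    hostPath:
      path: /var/log/kubernetes/audit/
      type: DirectoryOrCreate    # create the directory on the node if absent
```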
The process for webhooks is slightly different since the logging mechanism has changed. While kube-apiserver flags are still used, you’ll center on two unique flags: --audit-webhook-config-file and --audit-webhook-initial-backoff. Respectively, these define the path to your webhook configuration file and the time to wait before retrying a failed request. Retry timeouts are especially important to consider because the webhook communicates with the cluster via HTTP connections.
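The webhook configuration file uses the kubeconfig format. A sketch, where the backend hostname and certificate paths are placeholders for your own environment:

```yaml
# Webhook config (kubeconfig format) referenced by --audit-webhook-config-file;
# the server URL and certificate paths are illustrative
apiVersion: v1
kind: Config
clusters:
- name: audit-backend
  cluster:
    server: https://audit.example.com/events        # external HTTPS endpoint
    certificate-authority: /etc/kubernetes/pki/audit-ca.crt
users:
- name: kube-apiserver
  user:
    client-certificate: /etc/kubernetes/pki/audit-client.crt
    client-key: /etc/kubernetes/pki/audit-client.key
contexts:
- name: audit
  context:
    cluster: audit-backend
    user: kube-apiserver
current-context: audit
```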
Consider batching events together for processing to improve logging performance. Webhook requests are automatically throttled, and slashing the number of attempted API calls can be beneficial.
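Batching has its own family of kube-apiserver flags. A sketch of tuning them (the values shown are illustrative):

```yaml
# Excerpt from a kube-apiserver static pod manifest: batching-related flags
spec:
  containers:
  - command:
    - kube-apiserver
    - --audit-webhook-mode=batch               # buffer events and send them in groups
    - --audit-webhook-batch-max-size=400       # events per batch
    - --audit-webhook-batch-max-wait=30s       # flush even if the batch isn't full
    - --audit-webhook-batch-throttle-qps=10    # average batches sent per second
    - --audit-webhook-batch-throttle-burst=15  # short bursts allowed above that rate
```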
Finally, think about how many requests you make per second and how your response stages (ResponseStarted and ResponseComplete) might impact server activity during log capture. While the default parameters are typically sufficient, you may have to tune them to manage API server load. The last thing you want is to create a bottleneck or induce a server crash, which carries much larger consequences.
While Kubernetes’ baked-in logging capabilities are relatively powerful, your team might desire more flexibility. The standard method for logging in Kubernetes includes writing to standard output and standard error streams.
However, cluster, node, and pod behaviors can influence the viability of full-fledged, native logging within your ecosystem. Container crashes, pod evictions, and node death can undermine logging or log access. You can use cluster-level logging to circumvent this and sidestep complex dependencies. Unfortunately, that decision requires establishing another backend and another storage solution; piggybacking off Kubernetes’ native storage mechanisms can be hit or miss.
Open source tools can provide this functionality comprehensively. The third-party vendor can maintain their own backend infrastructure while they support the streaming of logs. Finding a tool with included logs storage is useful for keeping everything under one roof. If possible, you should try to find one that offers integrations with popular storage services — preferably avoiding lock-in.
You can opt for a streaming sidecar container to capture vital audit logs; however, this can consume significant resources, and logs written to files isolated from the kubelet can’t be accessed via the kubectl logs command. You can avoid this problem by having the streaming sidecar write to its own standard output and standard error streams. Set up this sidecar so it can read logs from multiple sources and print whatever it retrieves.
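A minimal sketch of this pattern: the application writes to a file on a shared volume, and the sidecar tails that file to its own stdout, where kubectl logs can reach it. Image choices and names here are illustrative:

```yaml
# Sidecar streaming sketch: read logs after it runs with
#   kubectl logs app-with-log-streamer -c log-streamer
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-streamer
spec:
  containers:
  - name: app
    image: busybox
    # Simulate an app that only writes logs to a file
    args: [/bin/sh, -c, 'while true; do date >> /var/log/app/app.log; sleep 1; done']
    volumeMounts:
    - name: varlog
      mountPath: /var/log/app
  - name: log-streamer
    image: busybox
    # Stream the file to stdout so the kubelet can collect it
    args: [/bin/sh, -c, 'tail -n+1 -F /var/log/app/app.log']
    volumeMounts:
    - name: varlog
      mountPath: /var/log/app
  volumes:
  - name: varlog
    emptyDir: {}    # shared scratch volume that lives with the pod
```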
Creating multiple logging streams concurrently can be beneficial and, surprisingly, incur little overhead, which is why many external logging agents (like Datadog’s) operate this way. However, beware that audit logs flowing from multiple sources to multiple locations can become very difficult to manage at scale, especially when tracking down individual data points. Fragmentation can undermine logging efforts and is inefficient; it’s best to strive for consolidation wherever possible.
Logs can contain sensitive information tied to infrastructure security. Keeping them under lock and key is important — especially since they can reveal key infrastructure configuration details. Kubernetes administrators should enjoy access to these logs, though not every employee whose role touches IT should necessarily have similar privileges.
Dictating access based on job role or to admin-controlled service accounts can prevent problematic data leaks and stop unauthorized users from deleting or copying log files. Restricting access makes it easier to track internal activity based on those logs as well. By handing out keys to limited numbers of individuals, the potential for abuse is lowered significantly.
When audit logs are created, the files resulting from that process automatically adhere to a predetermined structure. Keeping this structure consistent inherently makes logs more readable for computers and humans alike.
Each entry should follow a standardized layout and should include the same breadth of information wherever possible. It’s useful to standardize numerical details, like timestamps, and to include a unique ID in every entry. Unique IDs are especially helpful because you can tie log entries to requests and tickets, which makes remediation and auditing that much easier.
Finally, it’s common to run logging agents on every active node in your cluster. This ubiquitous deployment means the agent can run as a DaemonSet, which ensures all nodes run a copy of a given pod.
Unfortunately, these collection daemons do have a resource cost, since they complete their own tasks alongside core infrastructure work. It’s important to cap their processing and memory consumption to prevent performance issues and server stress, and running these daemons in their own containers isolates them from application containers.
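A sketch of a node-level logging agent as a DaemonSet with those caps in place; the namespace, agent name, and image are placeholders for whatever collector you run:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent          # placeholder name
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
      - name: log-agent
        image: example.com/log-agent:1.0   # placeholder image
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:          # cap CPU and memory so the agent can't starve workloads
            cpu: 200m
            memory: 256Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true   # the agent only needs to read node logs
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```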
As you can see, there are numerous considerations at play when it comes to establishing audit logging. From security concerns to efficiency goals, configurations are critical to crafting smooth and reliable logging. While Kubernetes gives users plenty of recourse, it’s also possible to follow best practices while using external tools.
Access plays a central role in this equation. The beauty of Kubernetes (and microservices) is that infrastructure management can occur from anywhere, so monitoring activity and Kubernetes behaviors at all times is paramount. We’ve covered a lot of machine-generated logs and events, but it’s just as important to capture how humans interact with the Kubernetes API and kubectl. Remoteler is an open source tool that admins use to provide short-lived kubeconfigs, offer complete auditing of each user’s kubectl activity, and even record kubectl exec sessions.