Osquery and Splunk - a logging pipeline

Note: This is a fairly lengthy post, so if you’re just looking for the punchline, I suggest you start reading about 50% of the way down the page. Anyway…

Over the weekend (actually two weeks ago now, since I started writing this) a question was posted in the Osquery Slack channel which touched on a lot of the issues I see from people first coming to Osquery, or trying to decide how they will approach getting visibility into their servers or workstations.

The question, from user @Matt_Brown, was this:

I am currently rolling out log consolidation with SIEM capabilities (splunk)
and it's truly unclear to me what the difference between Osquery and 
configuring splunk clients with scheduled powershell code?
or pushing DSC, and tailing the results of the LCM reported drift

What struck me about these questions was that I’ve heard variations of them many times before, but they remain extremely relevant and also encompass a huge body of knowledge, which can make finding the answers quite daunting.

Matt also had one further criterion, which really resonated with me:

I want to be all leanish, and not double up on my products

This is a really important point and something that security and IT teams alike often overlook. Users end up with 10 different agents on their endpoints sucking up tons of valuable system resources, or 30% of a server’s capacity ends up being dedicated just to monitoring it.

This one single question actually hides a very complex problem. Abstracted, it reads something like this: how do I get data about what is happening on my endpoints into a system where I can analyze it, without doubling up on tools?

Which seems pretty simple. However, if broken into the pieces required to accomplish all this, it becomes a much more complex issue. Frequently, the functionality of several components is bundled together in a single installed application or endpoint agent, which leads to a lot of confusion about which systems are responsible for which actions.

It is helpful to break these pieces out so we can examine each one a bit more clearly. It is also important to note that for most components there is a large selection of applications that can perform the desired function, each with its own pros and cons.

Knowing each component enables you to choose or replace only the ones you need, rather than bloating a system with 10 different overlapping applications.

Components

Data/Telemetry producer/Generator

This is the primary function of Osquery: to generate information about the endpoint it is installed on. There are many other products that do this as well, or you can write your own code to do the same job. This component is simply concerned with producing data from an endpoint that can be analyzed.
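
To make that concrete, here is a minimal sketch of an osquery config with a single scheduled query (the query name and interval are just illustrative). osqueryd runs the SQL on the given interval and emits any changes in the results as log events:

    {
      "schedule": {
        "listening_ports": {
          "query": "SELECT pid, port, protocol, address FROM listening_ports;",
          "interval": 300
        }
      }
    }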

Data Analysis/Aggregation

This component is used to collect and analyze the data generated by the data producers. This can be as simple as a collection of files or as complicated as an ML pipeline doing automated analysis of events.

Configuration management

An often overlooked or forgotten component of an endpoint system, configuration management is an important piece of the puzzle for managing what happens on endpoints. Configuration management is often found bundled together with a different component, causing many users confusion as to which piece is actually doing the configuring.

Log/Data/Event forwarders

This is the system component responsible for getting the data generated on an endpoint to the data analysis component. It is often combined with or baked into another component, but it is important to realize that forwarding is a component in its own right and can be done in many different ways.

Osquery Management Server

An Osquery management server is a component specifically for those utilizing Osquery as their data producer. An Osquery management server IS NOT STRICTLY NECESSARY FOR USING OSQUERY. However, a management server can make an Osquery deployment much more flexible and gives the operator a mechanism to directly interact with the Osquery agents deployed in their fleet.


Now that we have the components outlined and a solid understanding of what they do, let’s take a look at our specific example question and where things fall from a component perspective.

Osquery

  • Data producer
  • can also be used to ship logs (log forwarder)

Splunk

  • Data Analysis/Aggregation

Splunk Universal Forwarder

  • Log forwarder (tails logs)
  • Configuration Management (ships powershell used to configure system)

Powershell (DSC)

  • Configuration management
  • Data producer (logs differentials from state drift via the LCM)

Kolide Fleet

  • Osquery Management Server
  • Log Forwarder (depending on configuration)

Component        | Data producer | Data Analysis | Config Management | Log Forwarder | Osquery Management Server
-----------------|---------------|---------------|-------------------|---------------|--------------------------
Osquery          | x             |               |                   | x             |
Splunk           |               | x             |                   |               |
Splunk Forwarder |               |               | x                 | x             |
Powershell       | x             |               | x                 |               |
Kolide Fleet     |               |               |                   | x             | x

Discussion

Examining our components and the table above, we can see that there’s actually a fair amount of overlap between applications and their component roles. This matters because such overlap is EXTREMELY common when dealing with logging pipelines and data collection for endpoints. Most off-the-shelf solutions you can buy will try to package many or all of these components into a single binary, or platform, to alleviate the load on their end users.

When building our own pipeline, however, it is important to realize that we are not stuck with components that have been bundled together for convenience. In fact, bundled components can often become a hindrance as we have less control over each component and are forced to make tradeoffs between what fits our environment and how the bundled components function.

This makes our goal of building an end-to-end system fairly clear, if not entirely simple. Ideally, we will pick the minimum set of components that provide the functionality we need, without compromising on the capabilities of each component.

Let’s walk through our components and discuss a bit of what each one provides, how well it performs its intended function and how they all fit together.

  1. Data producer: Powershell vs Osquery

    Powershell: In this scenario, I will have to admit an enormous bias towards Osquery. Powershell has come a very, VERY long way and is now a very competent tool for a lot of things in Windows environments. However, in this case we would be leaning heavily on the logs produced by the LCM to gather data about current state and state drift. There are a few issues with this. In order for state drift to be tracked, the state must first be described and managed via DSC. This means that any state we have not configured will not be logged, or even noticed. It also means that anything we wish to monitor requires that we describe the state we wish it to be in. This is not itself a bad thing, but it quickly becomes a matter of time and effort: there are many, many system components that could be in any number of acceptable states, or that we do not care to set, but would simply like to monitor.

    Osquery: Osquery was made for exactly this purpose: to watch the state of a system and report back on it. This means that a MASSIVE amount of work has already been dedicated to building monitoring for the various components of an operating system. Most, if not all, of these things could be recreated using Powershell, but you would have to write all of it from scratch: an extremely daunting task, and one which I would imagine would take several engineering-years of work to bring in line with Osquery (a quick illustration follows below).

    Winner: Osquery
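
    To make the contrast concrete, here is the kind of state Osquery can report on out of the box with a single SQL query, with nothing declared in DSC beforehand (the filter is just an example):

        SELECT name, display_name, status, start_type, path
        FROM services
        WHERE start_type = 'AUTO_START';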

  2. Data Analysis: Splunk

    There’s actually not much to say here that would not require an entire blog (or book). For our purposes, Splunk is an excellent choice for data aggregation and analysis. At the point where this debate is meaningful and can be truly discussed, you should have an expert or two on staff who can speak to the nuances of Splunk vs other logging and aggregation systems.

    Winner: Splunk
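
    As a rough sketch of what the analysis side looks like once Osquery’s JSON results are being indexed and parsed into fields, here is a simple Splunk search; the index and sourcetype names are my own placeholders and will depend on how you configure your inputs:

        index=osquery sourcetype=osquery:results name=listening_ports action=added
        | stats count by columns.port, columns.pid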

  3. Configuration Management: Powershell vs Splunk UEF

    Splunk Universal Forwarder (UEF): This is an interesting use case. I have used the Splunk UEF extensively, but had never heard of this functionality. After some digging, I found that this is a relatively new feature, introduced in Splunk 6.3 (read about it here). It seems to be an attempt to let the Splunk UEF become a sort of config management/script runner process as well as a log forwarding mechanism. After reading up on this and testing it out a bit, it seems to me to be the worst of both worlds. The capability provides a very limited method for adding some scripts to your endpoint via the Splunk UEF and using the UEF to schedule the runs of those scripts. I can absolutely understand why this might sound appealing, and I imagine Splunk has tricked quite a few people into thinking this is a fully-baked solution and they can just use the UEF for all their needs. It is, however, a prime example of packing too many features into a single application. You get none of the benefits of a real configuration management system (centralized management, desired state, version control, etc.).

    Powershell DSC: Powershell Desired State Configuration, on the other hand, is precisely what we’re looking for in a configuration management system. It can be centrally managed, has a robust ecosystem built around it, and is well on its way to becoming the de facto infrastructure-as-code platform for Windows. It is well supported, can be version controlled, can support multiple environments with different states, and operates with a declarative methodology (a minimal example follows below). If you’re working in an entirely Windows environment, this is probably the best option out there. If you’re in a mixed environment, it may be worth considering Chef, Puppet, SaltStack, etc.

    Winner: Powershell DSC
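
    For reference, a minimal DSC configuration looks something like the sketch below; the resources chosen (keeping the osqueryd service running and pinning a registry value) are only examples of declared state that the LCM will then enforce and report drift against:

        Configuration OsqueryBaseline {
            Import-DscResource -ModuleName PSDesiredStateConfiguration

            Node 'localhost' {
                # Example: keep the Osquery agent running as a service
                Service Osqueryd {
                    Name        = 'osqueryd'
                    State       = 'Running'
                    StartupType = 'Automatic'
                }

                # Example: hold a registry value in a known state
                Registry DisableSMB1 {
                    Key       = 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters'
                    ValueName = 'SMB1'
                    ValueType = 'Dword'
                    ValueData = '0'
                    Ensure    = 'Present'
                }
            }
        }

        # Compile the MOF and hand it to the LCM to apply
        OsqueryBaseline -OutputPath .\OsqueryBaseline
        Start-DscConfiguration -Path .\OsqueryBaseline -Wait -Verbose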

  4. Log Forwarders: Osquery vs. Splunk UEF vs. Kolide Fleet

    Osquery: Osquery comes with a wide variety of built-in log forwarders, and this is one of the few places I would actually recommend combining functionality. Log shipping is a relatively mundane task, so the real answer here is to use whatever works best for your environment. I recommend using Osquery’s built-in log shipping if you can, as it eliminates the need for yet another agent on your endpoint (a sample flagfile follows this item).

    Splunk UEF: The Splunk UEF is quite competent at its job and generally tends to “just work”. One thing to keep in mind is that the Splunk UEF reads data from disk, meaning you pay for disk IO twice: once when Osquery writes its logs to disk, and again when the UEF reads them to ship them off. Not a huge deal, but something to consider.

    Kolide Fleet (or other management server): The Kolide Fleet logger is actually just Osquery’s built-in TLS logger plugin, meaning the Osquery agent sends all logs over TLS to the management server, which is in turn responsible for either analyzing the events or forwarding them on to their final destination. This has the added benefit of giving you (possibly) one less system to manage. However, depending on your data volume, or where your data’s final destination is, this may actually add complexity instead of removing it.

    Winner: Osquery (by a very slim margin).

    Note: It is also worth mentioning that Osquery’s extension ecosystem is ideally suited for building custom loggers, allowing Osquery to ship directly to ANY system. See this excellent blog post by @Groob on how you can do this with the golang Osquery bindings.
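
    In practice, whichever forwarder wins, it is all configured through Osquery flags. For the filesystem logger (which is what the Splunk UEF scenario later in this post relies on), a minimal flagfile might look like this; the path assumes a default Windows install, so adjust it for your environment:

        # osquery.flags (logging-related flags only)
        --logger_plugin=filesystem
        --logger_path=C:\Program Files\osquery\log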

  5. Management Server: Kolide

    Kolide: Kolide wins by default here, being the only management server in the running. Kolide is an excellent product and a great place to start (and possibly finish) for management servers. There is a plethora of other management servers out there as well (disclosure: I have authored one of them), but for the purposes of this blog, Kolide is a great choice and nothing more need be said.

    Winner: Kolide

Bringing them together

Going strictly by our winners, we would have a pipeline that looks like this:

Powershell DSC (configuration) -> Osquery (data producer) -> Osquery (log forwarder) -> Splunk (analysis)

However, because we are working with Splunk and aren’t going to write a Splunk plugin for Osquery (I may write another blog post on this if there’s interest; the code is ~65% done), the Splunk UEF makes more sense for our given architecture. It’s easy to use and configure, and will ship our logs off of our hosts very reliably.

Our final pipeline looks like this:

Powershell DSC (configuration) -> Osquery (data producer) -> Splunk UEF (log forwarder) -> Splunk (analysis)
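
On the Splunk side, wiring the UEF into this pipeline is mostly a matter of telling it to monitor Osquery’s results log. A minimal inputs.conf sketch is below; the index and sourcetype names are my own choices, and the path assumes a default Windows install:

    [monitor://C:\Program Files\osquery\log\osqueryd.results.log]
    sourcetype = osquery:results
    index = osquery
    disabled = false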

If the capabilities of a management server are desirable, our scenario can work several ways. The most straightforward, and my recommended approach, is simply to add Kolide to the final pipeline, leaving an already solid architecture intact while adding capabilities. It would also be possible to have Osquery use its TLS logger plugin to send logs directly to Kolide and then have Kolide forward them on to Splunk.
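
For the TLS option, the change is almost entirely on the Osquery side. Here is a sketch of the relevant flags, assuming a hypothetical Fleet server at fleet.example.com:8080 (the hostname, certificate and enroll secret paths are placeholders):

    --tls_hostname=fleet.example.com:8080
    --tls_server_certs=C:\Program Files\osquery\fleet.pem
    --enroll_secret_path=C:\Program Files\osquery\enroll_secret.txt
    --enroll_tls_endpoint=/api/v1/osquery/enroll
    --config_plugin=tls
    --config_tls_endpoint=/api/v1/osquery/config
    --logger_plugin=tls
    --logger_tls_endpoint=/api/v1/osquery/log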

Conclusion

In this particular scenario, there ends up being a fairly straightforward path to creating a solid and easily managed log pipeline for Osquery. However, depending on your environment, the desired capabilities and existing requirements, this can be far more complicated and require very in-depth knowledge of a lot of different technologies. I highly recommend writing down a list of requirements for your pipeline and the possible technologies you want to use, and then creating a table similar to the one I used above to break down their capabilities. The end goal should be to create a pipeline that meets all your essential requirements with the absolute minimum of capability overlap.

If you find yourself with more than one or two components overlapping, it may be a sign that you need to re-architect your plan into something simpler.

Hopefully this helps someone reason through their goals a bit more, or provides a good starting point. Please feel free to reach out to me either on the Osquery Slack or on Twitter @PansyMcCoward.

Cheers, and Happy Querying!