Data Driven Analysis – My Approach

So today my thoughts have been centered around data driven analysis. What is it? Why should it be used? How do you implement it? In this post I want to cover these topics, along with a PowerShell script I use to collect and parse data from Windows event logs, whether offline or live. I will also take you through the start of my approach when analyzing this data inside of a Jupyter Notebook: I will load Sysmon logs from three compromised devices and take you on a data driven hunt looking for evil. Let’s begin.

First things first, what is data driven analysis? In this type of analysis, decisions are made based on the interpretation of the data, so the data is the centerpiece of your analysis. The integrity of your data is of the utmost importance; if you cannot trust the data you are analyzing, you have already lost. We will take facts from our data, sprinkle in domain expertise and critical thinking, and try to disprove our hypothesis, which we will define before we start our hunt.

Now, why should we use data driven analysis for our hunting expeditions? To me it is just another tool in the toolbelt. Same goes for intelligence driven analysis. These are tools to be used and they both have their place. The biggest reason is that there is no better data to derive intelligence from than the data in your environment. External intelligence feeds have their place for hunting threats, but the intelligence pulled from data in your environment is supreme. The insights you can pull from your data can tell you all sorts of things, such as:

  1. Do I have visibility gaps?
  2. Is there noise that I can eliminate from my logs?
  3. Can I answer the questions posed by hypotheses generated by my hunt team?

How would you go about implementing a data driven approach? This in and of itself could comprise a whole book. For the sake of this blog post our approach will consist of:

  1. We need data, so we have Sysmon installed on our endpoints.
  2. We need a way to collect that data, so we will use PowerShell to gather it and output it to a JSON file.
  3. We need a way to analyze the data, so we will use Python and Jupyter Notebooks to parse and explore it in order to answer the questions posed by our hypothesis.

So now that we have a basic understanding of what data driven analysis is, let’s start digging into our data. In this scenario we have KAPE files from three potentially compromised machines.

Our mission will be to parse the Sysmon logs into JSON, load them up into Jupyter Notebooks, and answer our simple hypothesis: Are attackers running any processes from user space in our environment? Now that we have our logs and our hypothesis, let’s use our PowerShell to parse out these logs and load them up.

The PowerShell script I will be using is based on one created by Endgame that I have tweaked a little bit for my purposes. Here is a link to their script: and here is a link to the script I tweaked:

We first need to dot source our script, which will load up its functions so we can use them. Once you dot source the script, you can cd into the Function: drive and list its contents to see the functions that were loaded. The one we will be using is Get-LatestLogsFromPath.

We can now parse our logs using this command: Get-LatestLogsFromPath -Path </path/to/sysmon_log> | ConvertTo-Json | Out-File -Encoding ASCII -FilePath sysmon-data.json

Here is a helper script that takes a text document with the computer names and will call our function to parse out our Sysmon files: 

foreach ($comp in (gc comp.txt)) {
    Get-LatestLogsFromPath -Path </path/$comp/sysmon_logs> | ConvertTo-Json | Out-File -Encoding ASCII -FilePath "$comp-sysmon-data.json"
}
This assumes your Sysmon logs are in directories named after each host, such as server1, server2, and so on.

Now I have some Sysmon files named server1-sysmon-data.json, server2-sysmon-data.json and server3-sysmon-data.json. We can now load our files into Jupyter Notebooks and start conducting our analysis. If you need to install Jupyter Notebooks I would refer you to the Anaconda website, which can be found here:

To load our data we first need to import our libraries and set some options so we can see our data and load our json into a dataframe.

Now we can take a quick look at the dataframe to see what all we have: how many rows and columns there are, as well as the names of the columns.
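As a minimal sketch of the load-and-inspect step: in the real notebook you would point pd.read_json at the three server*-sysmon-data.json files; here two inline records stand in for that output so the snippet runs standalone, and the field names (EventId, Computer, Image) are assumptions to check against your own JSON.

```python
import io
import pandas as pd

# Widen the display so wide Sysmon records are not truncated
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# In the notebook this would be the files written by the helper script, e.g.:
# frames = [pd.read_json(f'{c}-sysmon-data.json') for c in ('server1', 'server2', 'server3')]
# sysmon_df = pd.concat(frames, ignore_index=True)
# Two illustrative records stand in for that output here:
sample = io.StringIO('''[
  {"EventId": 1, "Computer": "server1", "Image": "C:\\\\Windows\\\\System32\\\\cmd.exe"},
  {"EventId": 3, "Computer": "server2", "Image": "C:\\\\Windows\\\\System32\\\\svchost.exe"}
]''')
sysmon_df = pd.read_json(sample)

# shape gives (rows, columns); columns lists the fields we can pivot on
print(sysmon_df.shape)
print(list(sysmon_df.columns))
```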

From here I always like to start by taking a look at the processes to get a feel for what is being run on the system, looking for things that seem out of place. To do this we will carve out Sysmon event ID 1 – Process Create – into a new dataframe named sysmon_proc. Calling dropna(axis=1) drops the columns containing null values, which helps clean up fields used only by other event types.
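That carve can be sketched as follows; the column names and the stand-in frame are assumptions (in the notebook, sysmon_df comes from the JSON load above), so match them to your own data.

```python
import pandas as pd

# Stand-in for the loaded Sysmon frame; column names are illustrative
sysmon_df = pd.DataFrame([
    {"EventId": 1, "Image": r"C:\Windows\System32\cmd.exe", "DestinationIp": None},
    {"EventId": 1, "Image": r"C:\Users\bob\Downloads\ResizeFormToFit.exe", "DestinationIp": None},
    {"EventId": 3, "Image": r"C:\Windows\System32\svchost.exe", "DestinationIp": "8.8.8.8"},
])

# Keep only Event ID 1 (Process Create) rows, then drop columns with null
# values -- fields that only apply to other event types fall away
sysmon_proc = sysmon_df[sysmon_df['EventId'] == 1].dropna(axis=1)
print(sysmon_proc.columns.tolist())
```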

From here I like to use a technique called stacking: counting the number of times each value appears and looking for outliers. This technique can be especially powerful at scale. If there is a process that is only running on a handful of systems out of potentially thousands, it should be looked into. I like to use this technique for parent and child processes.
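Stacking is essentially a frequency count, which pandas can do with value_counts. This is a sketch with hypothetical rows; the ParentImage column name is an assumption based on standard Sysmon fields.

```python
import pandas as pd

# Hypothetical process-create rows from several hosts
sysmon_proc = pd.DataFrame({
    "Computer": ["server1", "server1", "server2", "server3", "server3"],
    "ParentImage": [
        r"C:\Windows\explorer.exe",
        r"C:\Windows\explorer.exe",
        r"C:\Windows\explorer.exe",
        r"C:\Users\bob\Downloads\ResizeFormToFit.exe",  # the rare outlier
        r"C:\Windows\explorer.exe",
    ],
})

# Stack (count) the parent processes; the rare values at the bottom
# of the list are the leads worth chasing
counts = sysmon_proc['ParentImage'].value_counts()
print(counts)
```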

So in my parent processes, after doing a little analysis, I see a pair of executables that have been run from user space, one from the Downloads folder and the other from AppData\Local\Temp, which gets my attention.

These two processes will be my first actual pivots into the data. I am going to want to know their hashes, whether they have any child processes, what LogonGuid is associated with them, and which computers they were run on. Always Be Pivoting!!! So now let’s drill down into ResizeFormToFit.exe and see where it takes us. For this I am going to use a regex on my ParentCommandLine column looking for the word resize.
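That pivot can be sketched with a case-insensitive str.contains match; the sample rows are hypothetical, and na=False keeps rows with a missing ParentCommandLine from matching or raising.

```python
import pandas as pd

# Hypothetical rows standing in for sysmon_proc
sysmon_proc = pd.DataFrame({
    "Computer": ["server1", "server3"],
    "ParentCommandLine": [
        r'"C:\Windows\explorer.exe"',
        r'"C:\Users\bob\Downloads\ResizeFormToFit.exe"',
    ],
    "CommandLine": [
        r'"C:\Windows\System32\notepad.exe"',
        r'ping google.com',
    ],
})

# Regex match on the parent command line; case=False catches Resize/resize,
# na=False treats missing values as non-matches
hits = sysmon_proc[sysmon_proc['ParentCommandLine'].str.contains('resize', case=False, na=False)]
print(hits[['Computer', 'CommandLine']])
```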

Right away this looks bad. It appears to be doing a network connectivity check to Google before it runs. From here I want to know what VirusTotal knows about this hash.

Well from the looks of it we have a problem. It appears we have Bazar running in our environment. I will also take a look at this executable’s LogonGuid to see what else it is associated with and start looking for network connections, lateral movement and persistence techniques.
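The LogonGuid pivot can be sketched like this: every event tied to the same logon session shares the GUID, so filtering on it surfaces related process creates, network connections, and so on. The GUID value and rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical events; "{guid-1}" stands in for the suspicious session's GUID
sysmon_df = pd.DataFrame({
    "EventId": [1, 3, 1],
    "LogonGuid": ["{guid-1}", "{guid-1}", "{guid-2}"],
    "Image": [
        r"C:\Users\bob\Downloads\ResizeFormToFit.exe",
        r"C:\Users\bob\Downloads\ResizeFormToFit.exe",
        r"C:\Windows\System32\cmd.exe",
    ],
})

# Pull back everything associated with the suspicious logon session
suspect = sysmon_df[sysmon_df['LogonGuid'] == "{guid-1}"]
print(suspect)
```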

Based on what has been found, it is definitely time to roll directly into incident response. I would continue to pivot through the data to create indicators for incident response to start sweeping the environment.

This is only the very beginning of the approach I take when I want to conduct data driven analysis. I hope I have given you some ideas you can use and that it was somewhat helpful. 

Until next time…

Happy Hunting,

