Edit 29-06-2023: Based on your feedback on the project, I created a ‘novpn’ version. This version does not require most of the prerequisites and is therefore even easier to deploy. Instructions, code, and topology can be found here: WesSec/VelociDeploy-o-Matic at novpn (github.com)

Introduction/Preface

Welcome again to my blog. You probably got here via my repository WesSec/VelociDeploy-o-Matic (github.com) :)) or a link on my socials. This post describes the code I wrote and the thought process behind it. If you just want the instructions to get the project up and running, check out the GitHub page here.

Incident response is a fast-moving world. Back in the day you would physically go to the customer and rip out the network cables to mitigate, pull out the hard drives to clone, and wait for ages for FTK to finish indexing the image, loading up terabytes of data you won’t need for your analysis. This process is very slow, and one day, when you ask the customer where the server is so it can be cloned, it turns out to be either in the cloud or on another continent. Because of this development, DFIR (let’s use this term from now on) needed to adapt. With awesome tools being released such as CyLR, Dissect, and Velociraptor, a world of flexible analysis opened up. No more FedExed drives: simply upload only the files needed to the (s)FTP server, probably parse them with Plaso, and then analyze further.

A few years back, a tool called Velociraptor entered the space and rapidly became the de facto standard for artifact collection and analysis. The documentation says: “Just run velociraptor.exe gui bro” and you’re good to go. While that is indeed a way to set it up, when using Velociraptor in a more professional way there is much more to do and think of: hosting, firewalls, configuration, access management, separation and integrity of data, etc. For me, this resulted in not making as much use of Velociraptor as I wanted, as setting up an environment while being busy with crisis management or the actual intake (the customer wants to see results as quickly as possible) was too time-consuming.

After hours, I came up with the idea to automate the process of Velociraptor deployment. My CISO at the time forbade me to run this all on on-prem hardware, so cloud it was. Lots of spaghetti code was developed, but eventually I got it working: a single command at hand that would deploy a new Velociraptor instance for me with my requirements. Perfect!

Shortly after this, I left the company I worked for, leaving this codebase on the shelf. I cleaned it up a bit and it’s now ready to share with the world, hoping it makes someone as happy as it does me. If you need any help deploying or modifying this script/process for your environment, please reach out via one of my socials.

This blog is split up into a few sections. First I’ll start with the plan for the infrastructure, then I’ll walk you through the files which are used for the deployment. After a VM is deployed, it needs to be provisioned and Velociraptor has to be configured, which I will describe in the second part of this blog.

Once again, if you’re here for installation instructions, follow the README on GitHub.

Criteria

I set the following criteria. The result is probably not compliant with every ISO/CIS/whatever standard, but it works, and you can decide for yourself whether you find it safe enough to use:

  • The GUI (Analyst interface) of Velociraptor may not be exposed to the internet.
  • Authentication should be done via Azure AD SSO.
  • Endpoints (Customer machines) should be able to connect from all over the world.
  • The server/VM IP should be fixed (for firewall reasons).
  • There should be a separation of data for instances/engagement.
  • There should be an easy overview of costs per ‘instance/engagement’.
  • Deployment (and destroying) should be easy.
The idea, visualized

Infrastructure plan

At the time of starting this journey I had a little experience in Azure, but that was self-taught; by now I have some (security-related) Azure certifications, and it’s safe to say I understand a bit of the ecosystem. The idea is that a single resource group exists with resources used throughout the whole project, and that there will be destroyable resource groups that are deployed during an engagement and destroyed afterward. The project mainly focuses on automating the latter, as the persistent one is a one-time setup and easily deployed using an ARM template found in the GitHub repo.

Persistent Resource Group idea

There are a few components that will be ‘fixed/permanent’: the DNS zone and the VPN. The DNS zone will hold all A records for the instances, and the VPN gateway stays up across engagements. According to Azure, this is the topology once deployed:

DNS Zone

The DNS zone will act as an umbrella for the Velociraptor instances. Each instance will get 2 (randomly generated) subdomains under the main domain. Why 2? I’ll explain that later.

There will be two DNS records, one for the frontend (where the clients’ machines connect to) and one for the GUI (analyst interface). The public one will resolve to the public IP address of the Velociraptor instance, while the gui. record will resolve to the internal IP (reachable via VPN, 10.0.0.0/8 range). This also makes it easier for an analyst to determine which environment he/she is working in when handling multiple cases at the same time.

VPN (Azure VPN Gateway)

A VPN is used to connect analysts to the GUI interface. The VpnGw1 tier is used so that users can authenticate to the VPN using AAD SSO, allowing conditional access and all sorts of other security benefits. Analyst machines connect via the Point-to-Site configuration, and all Velociraptor instances are connected to a specific subnet inside the VPN address space. I do not think this is the place to discuss the whole setup of the VPN; I basically followed this guide by Microsoft. For lab environments it works really well. I’m not sure about larger enterprise environments, but that was not my scope while developing this.

The Vnet has the following subnets configured:

The VPN clients will get a 10.0.2.x address, and Velociraptor instances will be connected to the Velociraptor_instances subnet and receive a 10.0.1.x address. By doing this we can configure nginx to only allow connections to the GUI from the VPN clients. It is an easy additional layer of security.

A nice visualization made by Azure

App Registration

Not a resource, but very important: an app registration is created in Azure, which is used for authenticating analysts to the Velociraptor instance. The app registration is also a one-time setup and is handled in the README on GitHub. Permission to access a Velociraptor instance is still handled inside Velociraptor itself (by creating users). If this user exists in AAD and can pass all challenges for the app registration, the user is allowed access to Velociraptor.

Instance resource group idea

This group will be destroyable. Each time we run the script, a new resource group is created and Velociraptor is deployed. This ensures separation of data and easy cost management per engagement, and is just easier to code. It is a pretty standard VM setup if you ask me. The NSG will only allow SSH from a specified IP address (for debugging purposes); ports 80/443 are opened to any source but can also be restricted later on.

The VM

I used a basic Azure-provided Ubuntu Server image; the reason for this is that I know my way around Ubuntu when I need to debug. Any other distro should work as long as you make sure the dependencies are in there.

I selected a Standard_B2s tier VM as it was the best bang for the buck at the time. A Standard LRS 30 GB disk was sufficient for me to develop on, but this is all easily configurable; I will get into that later.

Note: lately I have experienced some issues with the image used; a redeployment often fixed this.

As mentioned earlier, in the network configuration we connect the NIC to the Velociraptor_instances subnet.

Infrastructure Deployment (Terraform)

At first, I clicked my environment together in the Azure portal, quickly realizing that this should be automated; that’s how I found my way into Terraform. The whole project is a chain of Terraform (for infra), Ansible (for provisioning), and Bash (for stuff not supported by the first two).

I split the Terraform files up into segments for easier management, but it’s not necessary. In the repository, we have:

  • main.tf
  • variables.tf
  • data.tf
  • dns.tf
  • netsec.tf
  • networking.tf
  • output.tf
  • providers.tf
  • ssh.tf

I will briefly go through the files and what they do. If you find this boring, I’m sorry; you can of course skip parts.

main.tf

Main is the main file for Terraform; the core parts, such as the VM and the call to the Ansible playbook after deployment, are defined here.

In this file we generate a ‘random pet‘, which will be used for identifying the instance. A random string of 8 characters is also rendered; this will be our subdomain for reaching a Velociraptor instance.

With this information, a resource group is created with the name IR_<random pet>. Inside this resource group, a VM is deployed with the tier, image, and disk size described above. It’s very awesome that you can define things like the username, computer name, and even a public key at this stage, so we make use of that.
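
A minimal sketch of what this could look like in HCL (the resource names, variables, and the exact image reference are illustrative rather than copied from the repo; I reuse the same names in the later sketches so they line up):

  resource "random_pet" "instance" {
    length = 2
  }

  resource "random_string" "subdomain" {
    length  = 8
    special = false
    upper   = false
  }

  resource "azurerm_resource_group" "instance" {
    name     = "IR_${random_pet.instance.id}"
    location = var.location
  }

  resource "azurerm_linux_virtual_machine" "velociraptor" {
    name                  = "vm-${random_pet.instance.id}"
    resource_group_name   = azurerm_resource_group.instance.name
    location              = azurerm_resource_group.instance.location
    size                  = "Standard_B2s"
    admin_username        = var.admin_username
    network_interface_ids = [azurerm_network_interface.velociraptor.id]

    # Public key generated in ssh.tf is injected at creation time
    admin_ssh_key {
      username   = var.admin_username
      public_key = tls_private_key.instance.public_key_openssh
    }

    os_disk {
      caching              = "ReadWrite"
      storage_account_type = "Standard_LRS"
      disk_size_gb         = 30
    }

    # Illustrative Ubuntu Server image reference; the repo may pin a different one
    source_image_reference {
      publisher = "Canonical"
      offer     = "0001-com-ubuntu-server-jammy"
      sku       = "22_04-lts-gen2"
      version   = "latest"
    }
  }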

After the VM creation, we also define the provisioner that runs after the infrastructure has been created.
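
One way to wire this up is a null_resource with a local-exec provisioner that fires once the VM exists; this is a sketch under that assumption, and the playbook path and extra variables passed to Ansible are illustrative:

  resource "null_resource" "provision" {
    # Re-run the provisioning when the VM is replaced
    triggers = {
      vm_id = azurerm_linux_virtual_machine.velociraptor.id
    }

    provisioner "local-exec" {
      environment = {
        ANSIBLE_HOST_KEY_CHECKING = "False"
      }
      command = <<-EOT
        ansible-playbook -i '${azurerm_public_ip.velociraptor.ip_address},' \
          -u ${var.admin_username} \
          --private-key ${local_file.ssh_key.filename} \
          -e "fqdn=${random_string.subdomain.result}.${var.dns_zone_name}" \
          playbook.yml
      EOT
    }
  }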

In the next chapter, I’ll go deeper into the specifics of the provisioning.

variables.tf

The variables file contains all variables which can be user-defined. Because a full setup is done, we need a lot of information: the Azure app registration ID and secret, email addresses for accounts, the preferred region (which is still hardcoded in some places, sorry), etc. should all reside here.
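
A few examples of what such variable definitions could look like (names and defaults are illustrative):

  variable "location" {
    description = "Azure region to deploy the instance in"
    type        = string
    default     = "westeurope"
  }

  variable "admin_username" {
    description = "Admin user created on the VM"
    type        = string
    default     = "velociadmin"
  }

  variable "allowed_ssh_ip" {
    description = "Public IP address allowed to reach SSH on the instance"
    type        = string
  }

  variable "app_registration_client_id" {
    description = "Client ID of the Azure app registration used for SSO"
    type        = string
  }

  variable "app_registration_client_secret" {
    description = "Client secret of the app registration"
    type        = string
    sensitive   = true
  }

  variable "dns_zone_name" {
    description = "Name of the DNS zone in the persistent resource group"
    type        = string
  }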

data.tf

Contains data sources for the resources that already exist in the ‘other’ (persistent) resource group. Terraform needs to know about a resource in order to use it, and data blocks are used for that. By doing this we can link the instances to the right subnet.
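
Roughly like this (the zone, subnet, and variable names are placeholders for whatever exists in your persistent resource group):

  # Existing resources in the persistent resource group
  data "azurerm_dns_zone" "persistent" {
    name                = var.dns_zone_name
    resource_group_name = var.persistent_resource_group
  }

  data "azurerm_subnet" "velociraptor_instances" {
    name                 = "Velociraptor_instances"
    virtual_network_name = var.persistent_vnet_name
    resource_group_name  = var.persistent_resource_group
  }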

dns.tf

Surprise, this file does stuff with DNS. It will create an A record in the DNS zone of the persistent management resource group for both the GUI and the frontend subdomain. Both will point to the public IP at this stage. The reasoning behind this is that in order for LetsEncrypt to generate certificates, the names have to resolve to the VM. After the certificates have been generated (in the provisioning step), the GUI record is changed to the internal IP.
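
Something along these lines (I am assuming here that the GUI record is simply gui.<random subdomain>; the exact naming may differ in the repo):

  resource "azurerm_dns_a_record" "frontend" {
    name                = random_string.subdomain.result
    zone_name           = data.azurerm_dns_zone.persistent.name
    resource_group_name = var.persistent_resource_group
    ttl                 = 300
    records             = [azurerm_public_ip.velociraptor.ip_address]
  }

  resource "azurerm_dns_a_record" "gui" {
    # Points at the public IP during deployment so the certificate can be
    # issued; setup.sh later flips this record to the internal VPN address
    name                = "gui.${random_string.subdomain.result}"
    zone_name           = data.azurerm_dns_zone.persistent.name
    resource_group_name = var.persistent_resource_group
    ttl                 = 300
    records             = [azurerm_public_ip.velociraptor.ip_address]
  }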

netsec.tf

This file defines the network security group and security rules. We allow SSH from the IP address defined in variables.tf and open up web traffic ports to the big scary world. In the provisioning step, we will make use of NGINX for further handling of the traffic.
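
Sketched out, the rules could look like this (priorities, names, and the rule split are arbitrary choices of mine):

  resource "azurerm_network_security_group" "instance" {
    name                = "nsg-${random_pet.instance.id}"
    location            = azurerm_resource_group.instance.location
    resource_group_name = azurerm_resource_group.instance.name
  }

  # SSH only from the admin IP defined in variables.tf
  resource "azurerm_network_security_rule" "ssh" {
    name                        = "allow-ssh-admin"
    priority                    = 100
    direction                   = "Inbound"
    access                      = "Allow"
    protocol                    = "Tcp"
    source_port_range           = "*"
    destination_port_range      = "22"
    source_address_prefix       = var.allowed_ssh_ip
    destination_address_prefix  = "*"
    resource_group_name         = azurerm_resource_group.instance.name
    network_security_group_name = azurerm_network_security_group.instance.name
  }

  # Web ports open to the big scary world; nginx filters further
  resource "azurerm_network_security_rule" "web" {
    name                        = "allow-http-https"
    priority                    = 110
    direction                   = "Inbound"
    access                      = "Allow"
    protocol                    = "Tcp"
    source_port_range           = "*"
    destination_port_ranges     = ["80", "443"]
    source_address_prefix       = "*"
    destination_address_prefix  = "*"
    resource_group_name         = azurerm_resource_group.instance.name
    network_security_group_name = azurerm_network_security_group.instance.name
  }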

networking.tf

We still need some networking for the rest. In this file we define a public IP for the Velociraptor VM to be reachable on. A NIC is also created, which connects the VM to the Velociraptor_instances subnet discussed earlier. In this file, the network security group defined in netsec.tf is linked to the NIC.
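
A sketch of those three pieces, again with illustrative names:

  resource "azurerm_public_ip" "velociraptor" {
    name                = "pip-${random_pet.instance.id}"
    location            = azurerm_resource_group.instance.location
    resource_group_name = azurerm_resource_group.instance.name
    allocation_method   = "Static"   # fixed IP, one of the criteria
  }

  resource "azurerm_network_interface" "velociraptor" {
    name                = "nic-${random_pet.instance.id}"
    location            = azurerm_resource_group.instance.location
    resource_group_name = azurerm_resource_group.instance.name

    ip_configuration {
      name                          = "internal"
      subnet_id                     = data.azurerm_subnet.velociraptor_instances.id
      private_ip_address_allocation = "Dynamic"
      public_ip_address_id          = azurerm_public_ip.velociraptor.id
    }
  }

  resource "azurerm_network_interface_security_group_association" "velociraptor" {
    network_interface_id      = azurerm_network_interface.velociraptor.id
    network_security_group_id = azurerm_network_security_group.instance.id
  }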

ssh.tf

This file defines a private key which is generated and saved locally; it can be used for SSH connections to the server.
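
For example, using the tls and local providers (the local key path is an assumption on my side):

  resource "tls_private_key" "instance" {
    algorithm = "RSA"
    rsa_bits  = 4096
  }

  resource "local_file" "ssh_key" {
    content         = tls_private_key.instance.private_key_pem
    filename        = "${path.module}/ssh_keys/${random_pet.instance.id}.pem"
    file_permission = "0600"
  }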

output.tf

Outputs are values printed to the terminal after a successful deployment, such as the domain and IP of the deployed instance.
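
For instance (the output names are illustrative):

  output "frontend_fqdn" {
    value = "${random_string.subdomain.result}.${var.dns_zone_name}"
  }

  output "gui_fqdn" {
    value = "gui.${random_string.subdomain.result}.${var.dns_zone_name}"
  }

  output "public_ip" {
    value = azurerm_public_ip.velociraptor.ip_address
  }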

providers.tf

This file defines the providers needed by Terraform; it declares that we do some stuff with Azure. During development the provider was updated and I had to change some things, so it’s nice that you can pin a version here for debugging reasons.
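
The version pin could look like this (the version constraints shown are just an example):

  terraform {
    required_providers {
      azurerm = {
        source  = "hashicorp/azurerm"
        # Pinning the version prevents a provider update from breaking things
        version = "~> 3.0"
      }
      random = {
        source = "hashicorp/random"
      }
      tls = {
        source = "hashicorp/tls"
      }
    }
  }

  provider "azurerm" {
    features {}
  }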


Now that the boring stuff is over, once all the files above are fired off, a VM is created:

DNS records set:

and they’re connected to the VPN subnet too 🙂

Using the SSH key we can log in to the machine. Awesome!


“All fun, but you can do this with a few clicks in the Azure portal too.” That is correct, but a few clicks added up a thousand times is a lot of clicks, especially for the next part, where we incorporate the properties of the setup into the configuration of Velociraptor.

VM + Velociraptor Provisioning (Ansible)

In this part, I’ll walk through the steps of provisioning (basically the Ansible playbook). Ansible is an ‘Infrastructure as Code’ utility in which we define the desired configuration, so whenever it is run on a fresh VM we end up with the same configuration. This is very useful in terms of repeatability and fast deployment; Ansible can configure a lot quicker than you can by hand ;). Ansible integrates with a lot of stuff, so you should be able to get the provisioning to work on machines on other platforms/distros too.

The Ansible playbook is executed directly from the Terraform deployment after the VM is created, eliminating manual steps. Because there were a few settings I was unable to set in Ansible, setup.sh contains a few extra lines of code for handling these settings (step 14).

I’ll briefly elaborate on each step/task from the playbook, just in case you need to debug something, or are still here purely out of general interest.

The Ansible command is run with the IP/username/ssh key/variables of the instance created by Terraform and executes the following:

  1. Wait for the ssh host to respond
    Directly after deployment, the ssh service might not be available yet, so we wait for max 2 minutes for it to wake up
  2. Gather Facts
    This command gathers some ‘facts’ from the remote server. Ansible makes use of these variables while provisioning.
  3. Run apt-update & apt-upgrade
    We update all packages, as the image might not be very recent and we don’t want any vulnerable software running on it. In addition, an apt-update makes sure we get the correct versions for the packages to be installed in the next step.
  4. Install required packages
    We install the following packages:
    nginx > to be used as a reverse proxy
    jq > a JSON processor needed for grabbing the latest Velociraptor version
    certbot > Needed for certificate generation
    python3-certbot-nginx > Which handles automatic configuration of nginx when running certbot.
  5. Stop NGINX
    For generating the certificate we use the HTTP challenge. Because this challenge has to be served on port 80/443, we need to stop nginx to prevent port conflicts.
  6. Patch the nginx config
    On our provisioning machine there is a Jinja template under templates/nginx.j2. Using this template we can render an nginx config on the VM with variables such as the FQDN and put it in the right location.
  7. Generate certificates
    We ask certbot to generate a certificate for both the frontend and the GUI endpoints and pass the nginx argument; by doing this it will automatically add the certificate locations to the nginx config which was created in the previous step.
  8. Download the latest Velociraptor release binary
    By using some spaghetti we call a shell command on the VM, which will download the latest Velociraptor release from GitHub.
    In addition, we make the velociraptor binary executable by setting the +x flag.
  9. Copy our velociraptor config properties to the server
    Just like in step 6, we generate a file on the VM containing our Velociraptor config properties, but unlike step 6 we do not generate the final config. Instead, we create a file called velopatch.json; this file will be used in the following steps to take the default Velociraptor config and patch it with the values found in velopatch.json. This function is not very well documented, but very useful for automation. velopatch.json will contain information set in variables.tf and generated throughout the process.
  10. Generate a velociraptor server config
    We generate the config file for the server side of Velociraptor, merging in the patch file velopatch.json. If something is broken in your config, I suggest checking/editing velopatch.json and rerunning this command by hand.
  11. Generate a velociraptor client config
    Same as step 10, but for the client side. Currently the file stays on the server, but it can be retrieved using SSH for local usage.
  12. Create a service for Velociraptor
    To run Velociraptor, we create a systemd service
  13. Start Velociraptor Service
    We start the velociraptor service created in the previous step.
  14. Other steps in the setup.sh file
    After running the Ansible playbooks, there are a few settings left that I was unable to set with Ansible (or it was just too difficult to find and I took a shortcut)
    • We change the GUI DNS record to the local VPN one. We do this so that the GUI is reachable by visiting the GUI URL in the browser, but it is not publicly resolvable. Because we also define an allowed range in the nginx config, this should be safe.
      Please take into consideration that certbot will not be able to renew the certificate for the GUI domain, as the DNS record will not match anymore. Because in my experience an environment only stays online for a few weeks while a LetsEncrypt certificate is valid for 90 days, this risk was accepted.
    • We add the callback URL of the instance to the Azure app registration. Azure only accepts an array of all currently valid callback URIs, so we first fetch all the URIs, append our new instance, and push them back to Azure.

If all of these steps complete successfully, the instance is up and running and ready for use.

Destroying the environment

One of the criteria was that data should be separated. An instance can be manually removed by deleting the resource group of that instance; terraform destroy will simply remove all resources created, basically doing the same thing.

After the destruction, cleanup.sh is run. Just like with setup.sh, there is a leftover that needs to be handled, which is removing the callback URI from the app registration.

Cost handling

Per active instance there is a resource group, and each resource group has its own cost analysis in Azure, which can be used for handling costs. During testing, I did not spend more than a few bucks a week on instance resources.

The Azure VPN Gateway will be a bit more expensive; it is currently rated at €130 per month plus normal Azure data charges. IMO these are very negligible costs for IR engagements. By modifying the scripts you can also make the GUI available directly on the internet (or behind an app proxy), which may or may not fit your risk profile.

Outro

Hopefully, you now understand a bit more of what happens in the code. Again, if you are still looking for a step-by-step guide of which commands to run to get this project running, please check the GitHub README.

I’d like to thank you for your time. For me, this was a project that had been shelved for a long time, mainly because continuing on it required me to fully deploy the permanent stack again, which takes up to an hour. During this project, I learned a lot about Infrastructure as Code and deployments on Azure.

If you like this idea and want to implement it but need some help, please reach out to me; I’ll be glad to help.
