AWS Forensics – How a credential leak led to scanning 1000s of AWS EBS volumes

Introduction

A Friday wouldn’t be complete without an incident involving our DFIR team. However, this was no ordinary incident; someone’s AWS credentials had been leaked and accessed from an unfamiliar location.

Furthermore, it wasn’t just the client’s AWS interface that the compromised credentials had access to, but also several other accounts. Given these circumstances, my task was clear: “We need to confirm if the instances have been tampered with, here are the IOCs, and I want you to scan all of the instances in the accounts.”

To determine whether the instances had been tampered with or whether any residual elements remained, we had to scan all 1,000 disks dispersed across multiple accounts, not just 10 or 20 instances with a single disk each.
Obtaining and scanning every EBS volume used by an instance in each affected account was crucial.

The entire process has 5 main steps:

  • Acquisition
  • Pre-processing
  • Mounting
  • Scanning
  • Reporting

Apart from acquisition, all of the other steps were done on a single host; more on this later.

Acquisition

Using automation is the way to go since manually going through each account and processing each EBS volume can be tiresome and time-consuming. In addition, automation can help us to be more efficient and accurate in our scanning process, which is crucial when dealing with large-scale incidents like this one.

These images were acquired using the AWS SDK for Python (boto3).
Boto3 has two distinct levels of APIs:

  • Client APIs provide one-to-one mappings to the underlying HTTP API operations; the API response data can be accessed in JSON format
  • Resource APIs provide resource objects and collections to access attributes and perform actions.

Acquisition process

  • Authenticate to the target account(s) (see the sketch after this list)
  • Poll all volumes
  • Save all the volume ids to a list
  • Start a workflow for processing
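
The first step, authenticating to each target account, is typically done by assuming a cross-account role with STS. The sketch below is a minimal example of that idea; the role name IRAuditRole, the session name, and the region are assumptions and will differ per environment.

import boto3

def client_for_account(account_id, region="eu-west-1"):
    """Assume a role in the target account and return an EC2 client for it.
    'IRAuditRole' is a placeholder for whatever cross-account role the
    investigation uses."""
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/IRAuditRole",
        RoleSessionName="ir-acquisition",
    )["Credentials"]

    return boto3.client(
        "ec2",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )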

To optimize the handling of these actions, I chose to employ asynchronous operations. Processing each volume one by one would be counterproductive since they are independent of one another and can be handled simultaneously.

We do need to take the AWS API rate limits into account, particularly the request rate limit that is applied to API queries.

Here is an example code snippet in Python to query all EBS volumes in an AWS account:

import boto3

# Create a Boto3 client for interacting with AWS
client = boto3.client('ec2')

# Use the client to get a list of all EBS volumes in the account
response = client.describe_volumes()

# Iterate through each volume and print its details
for volume in response['Volumes']:
    print(f"Volume ID: {volume['VolumeId']}")
    print(f"Size: {volume['Size']} GB")
    print(f"Availability Zone: volume['AvailabilityZone']}")

This code uses the Boto3 library to create a client for interacting with AWS, then calls the describe_volumes method to get a list of EBS volumes in the account. It then iterates through each volume and prints its details, such as its ID, size, and availability zone. This is a simple example; note that describe_volumes results can be paginated, so for accounts with many volumes a boto3 paginator is the safer option, and the loop can be extended to perform more complex operations on each volume as needed.
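
Since boto3 itself is synchronous, one way to process many volumes concurrently while respecting the API rate limits is to combine a thread pool with botocore’s built-in retry modes. The snippet below is a rough sketch of that idea; the snapshot_volume helper is only a placeholder for the real per-volume workflow.

from concurrent.futures import ThreadPoolExecutor
import boto3
from botocore.config import Config

# 'adaptive' retries let botocore back off automatically when AWS throttles us
ec2 = boto3.client("ec2", config=Config(retries={"max_attempts": 10, "mode": "adaptive"}))

def snapshot_volume(volume_id):
    # Placeholder for the real per-volume workflow (snapshot, tag, share, ...)
    return ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"IR acquisition of {volume_id}",
    )["SnapshotId"]

volume_ids = [v["VolumeId"] for v in ec2.describe_volumes()["Volumes"]]

# Process the volumes concurrently; a modest pool size keeps us under the rate limit
with ThreadPoolExecutor(max_workers=8) as pool:
    snapshot_ids = list(pool.map(snapshot_volume, volume_ids))

print(snapshot_ids)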

The workflow for acquisition consists of the following steps:

  1. Query all EBS volumes used by instances in affected accounts
  2. For each volume, create a snapshot of the volume
  3. Wait for the snapshot to be ready
  4. Share the snapshot with the appropriate IR account
  5. Create a volume of the shared snapshot

To ensure proper tracking and reporting of the snapshot creation, it is important to apply relevant tags at the time of creation. These tags can be customized based on the specific needs of the investigation.

In my experience, certain tags are crucial for effective management of large-scale investigations. These tags not only aid in tracking the progress of the investigation, but also provide valuable information for reporting purposes; a sketch combining the snapshot workflow with this tagging follows the list below.

  • Source ARN
  • AWS account
  • Status
  • Device Name
  • Instance Name
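
To illustrate how steps 2 to 4 of the workflow and the tagging can fit together, here is a minimal sketch. The IR account ID, the tag keys, and their values are placeholders and should be adapted to your own naming convention; step 5 (creating a volume from the shared snapshot) is then performed from the IR account itself.

import boto3

ec2 = boto3.client("ec2")
IR_ACCOUNT_ID = "111111111111"  # placeholder for the IR account that receives the snapshots

def acquire_volume(volume):
    """Snapshot a volume, tag it for tracking, wait until it is ready,
    and share it with the IR account."""
    tags = [
        {"Key": "SourceARN", "Value": volume["VolumeId"]},   # placeholder value
        {"Key": "Status", "Value": "acquired"},
        # AWS account, device name and instance name tags would be added here as well
    ]
    snapshot = ec2.create_snapshot(
        VolumeId=volume["VolumeId"],
        Description="IR acquisition",
        TagSpecifications=[{"ResourceType": "snapshot", "Tags": tags}],
    )

    # Step 3: wait for the snapshot to complete (large volumes need a patient waiter)
    ec2.get_waiter("snapshot_completed").wait(
        SnapshotIds=[snapshot["SnapshotId"]],
        WaiterConfig={"Delay": 30, "MaxAttempts": 120},
    )

    # Step 4: share the snapshot with the IR account
    ec2.modify_snapshot_attribute(
        SnapshotId=snapshot["SnapshotId"],
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=[IR_ACCOUNT_ID],
    )
    return snapshot["SnapshotId"]

# Step 5 then runs in the IR account, e.g. ec2.create_volume(SnapshotId=..., AvailabilityZone=...)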

Pre-processing

To process the acquired EBS volumes effectively, we use boto3 to poll for them as they arrive. Since pre-processing and acquisition are independent, we periodically fetch the volumes that have become available in our account. By doing this, we can automate the next steps in our workflow and ensure that the acquired volumes are processed in a timely and efficient manner.

In order to extract the volume ids and volume tags from the data returned by Boto3, we can use a simple Python script.

Here’s an example script:

import boto3

# set up the AWS session
session = boto3.Session()

# create the EC2 client
ec2_client = session.client('ec2')

# get a list of all the volumes in the account
volumes = ec2_client.describe_volumes()

# create empty lists to store the volume ids and tags
volume_ids = []
volume_tags = []

# loop through the volumes and extract the ids and tags
for volume in volumes['Volumes']:
    volume_id = volume['VolumeId']
    # collect the full tag list for the volume (empty if the volume is untagged)
    volume_tag = volume.get('Tags', [])
    volume_ids.append(volume_id)
    volume_tags.append(volume_tag)

# print the list of volume ids and tags
print('Volume IDs:', volume_ids)
print('Volume Tags:', volume_tags)

The preceding script uses the describe_volumes method of the EC2 client to retrieve a list of all the EBS volumes in the account. It then loops through the volumes and extracts the volume id and volume tag(s) for each one. The volume ids and tags are added to separate lists, which can then be used for further processing. The command line used to initiate the scanning depends on the tags that were extracted during pre-processing.
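
Because the scan command line depends on those tag values, it can help to convert the boto3 tag list into a dictionary first. A small helper along these lines works; the tag key names used here are assumptions and must match whatever tags were applied during acquisition.

def tags_to_dict(volume):
    """Convert the boto3 tag list ([{'Key': ..., 'Value': ...}, ...]) into a plain dict."""
    return {t["Key"]: t["Value"] for t in volume.get("Tags", [])}

# Example: pick out the tags that drive the scan command
for volume in volumes['Volumes']:
    tags = tags_to_dict(volume)
    print(volume['VolumeId'], tags.get("InstanceName"), tags.get("DeviceName"))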

Mounting

As time goes on, volumes arrive in the intended IR account; in order to be scanned, they must be attached and mounted on one or more EC2 instances. The virtualization type of the instance where you want to mount the EBS volumes determines what restrictions apply:

  • Paravirtual
  • HVM

HVM was the standard for our Linux instances, which gave us the following options for device names:

  • /dev/sd[a-z]
  • /dev/xvd[b-c][a-z]

AWS provides more information on the specific device mappings available here.
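
Attaching a volume to the scanning instance means picking a device name from the ranges above that is not already in use. A minimal sketch, assuming the instance ID of the scanning host is known (the instance ID and volume ID below are placeholders):

import string
import boto3

ec2 = boto3.client("ec2")
SCAN_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder for the scanning EC2 instance

def next_free_device(instance_id):
    """Return the first /dev/sd[f-p] style device name not already in use."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    used = {m["DeviceName"] for r in reservations
            for i in r["Instances"]
            for m in i["BlockDeviceMappings"]}
    for letter in string.ascii_lowercase[5:16]:  # f .. p
        candidate = f"/dev/sd{letter}"
        if candidate not in used:
            return candidate
    raise RuntimeError("no free device names left on this instance")

device = next_free_device(SCAN_INSTANCE_ID)
ec2.attach_volume(VolumeId="vol-xxxxxxx", InstanceId=SCAN_INSTANCE_ID, Device=device)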

I encountered a variety of disk layouts and file systems as I went through them; this depends entirely on your environment.

  • 1 or N partition(s)
  • 1 or N partition(s) with LVM
  • 1 or N partition(s) with xfs and/or ext4
  • ….

I needed to use a combination of subprocess calls and shell commands to make this possible, because I was unable to locate any suitable Python wrappers that could recognize these details; a sketch of that detection and mounting logic follows the example paths below.

Who would have guessed that, despite the cloud’s standardization principles, alternative configurations that stray from the norm can still be found? In order to mount each disk safely and later retrieve the scanning results for our reporting, we must set up a suitable directory structure during this stage.

For example:

/ir-cases/case-xxxxx/evidence/mount/vol-xxxxxxx/
/ir-cases/case-xxxxx/evidence/mount/vol-xxxxxxx/part1/vg-xxxx
/ir-cases/case-xxxxx/evidence/mount/vol-xxxxxxx/vg-xxxx
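
Once a volume is attached, subprocess and shell commands such as lsblk can reveal which partitions and filesystems are present, and each one can then be mounted read-only under the case structure shown above. The sketch below only covers the straightforward cases; LVM layouts additionally need the volume groups activated (vgchange -ay) and a deeper walk of the lsblk tree.

import json
import os
import subprocess

def mount_volume(device, volume_id, case_dir="/ir-cases/case-xxxxx/evidence/mount"):
    """Discover partitions/filesystems on an attached device with lsblk
    and mount each one read-only under the case directory structure."""
    lsblk = subprocess.run(
        ["lsblk", "-J", "-o", "NAME,FSTYPE", device],
        capture_output=True, text=True, check=True,
    )
    tree = json.loads(lsblk.stdout)["blockdevices"][0]

    mounts = []
    # fall back to the device itself when there are no partitions
    for idx, child in enumerate(tree.get("children", [tree]), start=1):
        if not child.get("fstype"):
            continue  # skip partition tables, LVM PVs without a filesystem, etc.
        target = os.path.join(case_dir, volume_id, f"part{idx}")
        os.makedirs(target, exist_ok=True)
        # mount read-only and without updating timestamps, to preserve the evidence
        subprocess.run(
            ["mount", "-o", "ro,noatime", f"/dev/{child['name']}", target],
            check=True,
        )
        mounts.append(target)
    return mounts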

Scanning

We use the excellent THOR APT scanner to scan all of the EBS volumes; it has a large built-in YARA rule library and enables us to quickly add our own YARA rules.

Once all of the following checks have been completed, we are able to start the scanning process (a small sketch of these checks follows the list):

  • Are there any device names available for mounting?
  • Does the disk have one or multiple partitions?
  • Is the mount path available or do we need to create it?
  • What is the filesystem of the disk or the partitions?
  • Was the mount successful and the files are available?
  • Was the disk successfully attached to the EC2 instance?
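
A small helper that runs through the last two of these checks before launching a scan could look like this sketch (attachment state via boto3, mount state via the standard library):

import os
import boto3

ec2 = boto3.client("ec2")

def ready_to_scan(volume_id, mount_path):
    """Return True only if the volume is attached and its mount point is live."""
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    attached = any(a["State"] == "attached" for a in volume.get("Attachments", []))
    mounted = os.path.ismount(mount_path) and bool(os.listdir(mount_path))
    return attached and mounted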

Thanks to THOR APT Scanner, we can distinguish the scanning output for each of these volumes, notably the JSON, TXT, and HTML reports; a full list of output options can be found here.

Reporting

As mentioned in the scanning section, the output parameters used when executing the scan decide the naming convention of the report files and where they are placed.
All the output needs to be placed in an S3 bucket, which can easily be done with the following AWS CLI command:

aws s3 sync /ir-cases/case-xxxx/evidence/ s3://ir-cases/case-xxxx/evidence

It also helps for later usage to gzip the different output files separately, namely the JSON and HTML scan reports.
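
Compressing the individual reports before running the sync can be done with a few lines of Python; a minimal sketch using the standard gzip module (the case path mirrors the placeholder above):

import glob
import gzip
import shutil

# gzip the JSON and HTML reports individually, producing one .gz per report
reports = glob.glob("/ir-cases/case-xxxx/evidence/**/*.json", recursive=True)
reports += glob.glob("/ir-cases/case-xxxx/evidence/**/*.html", recursive=True)

for report in reports:
    with open(report, "rb") as src, gzip.open(report + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)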

Scaling exercise

I made the decision to continue with a single mounting and scanning system throughout the development of this code. It would have been difficult to deal with a distributed system design right away.

I was conscious that this decision would have a negative effect on the code’s portability and reusability, but time was of the essence, so the damage was kept to a minimum. A distributed system requires some kind of service to process messages, and SQS from AWS is that service.

SQS is used to build distributed applications with decoupled components without having to deal with the overhead of creating and maintaining message queues.

Standard and FIFO: In a standard queue, messages are delivered with best-effort ordering and might not arrive in the order they entered the queue, while a FIFO queue uses first-in-first-out delivery and guarantees the order.

At-Least-Once Delivery: A message in the queue is delivered at least once; message delivery is guaranteed and no message is lost.

Visibility Timeout: Multiple components can work on a single queue. SQS uses a lock mechanism: while one component is processing a message, it is hidden from the other components. Upon successful processing, the message is deleted from the queue. If the message processing fails, it stays in the queue and becomes visible to the other components again.

More importantly, SQS also supports dead-letter queues. This exercise showed that there were quite a few cases where volumes could not be processed, and the DLQ addresses this problem by informing us of unconsumed messages.

Our setup has one producer, which starts the workflow for all targeted volumes, and multiple consumers that process the volumes by consuming the messages in the queue.
All scanning reports are stored locally on the machine and later synced to an S3 case folder.
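
In such a setup, the producer pushes one message per volume onto the queue, and each consumer receives, processes, and deletes messages; messages that repeatedly fail end up in the dead-letter queue. A minimal sketch, assuming the queue (and its DLQ redrive policy) already exist; the queue URL and the process_volume helper are placeholders.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/111111111111/ir-volumes"  # placeholder

def produce(volume_ids):
    # one message per volume to be mounted and scanned
    for volume_id in volume_ids:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"volume_id": volume_id}))

def consume():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            volume_id = json.loads(msg["Body"])["volume_id"]
            process_volume(volume_id)  # mount + scan + report, defined elsewhere
            # only delete on success; failed messages become visible again and
            # eventually land in the dead-letter queue
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])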

Lessons Learned

Automation is a double-edged sword

Automation, APIs, and all the other tools at our disposal are fantastic to have, but if you’re not careful, they can have negative effects. On the plus side, they make large-scale work easily accessible; the disadvantages are less obvious.

The acquisition API had several flaws: some snapshots/images timed out and their volume IDs were not added to the list of IDs, which was difficult to notice while collecting 1,000 volumes. The API was not designed to be utilized at such a scale.

By using some async/await magic I was able to eventually resolve the issue.

Scanning without proper timeouts and error handling

That’s all there is to it, right? We create the command with the proper output parameters, point it in the direction of the mount path, and start a subprocess for it.

The defect was minor but inconvenient: the subprocess needed to be run with appropriate timeout settings, because scanning a 3 TB drive can occasionally take quite some time. Without correct error handling, or without saving the scan ID for later use, this can lead to issues.

More consideration should have been paid to appropriate exception-handling policies, especially since THOR APT provides support for resuming a scan that has been interrupted.
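
In hindsight, wrapping the scanner subprocess in an explicit timeout with basic exception handling would have avoided most of the pain. A rough sketch of that idea; the actual THOR command line and a sensible timeout value depend entirely on your setup.

import subprocess

def run_scan(command, timeout_seconds=4 * 3600):
    """Run the scanner as a subprocess with a generous timeout and basic error handling."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_seconds, check=True)
        return result.stdout
    except subprocess.TimeoutExpired:
        # large disks (e.g. 3 TB) can exceed the timeout; record it so the scan can be resumed
        print(f"scan timed out after {timeout_seconds}s: {command}")
    except subprocess.CalledProcessError as exc:
        print(f"scan failed with exit code {exc.returncode}: {exc.stderr}")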

Compute time in the cloud

A specific code implementation had a significant impact on the cloud bill. Who was the perpetrator of this entire ordeal you ask?

When THOR starts up, all of its corresponding libraries, files, and YARA rules are loaded before scanning the target path. Depending on the number of files, this setup takes between 5 and 10 seconds to complete before executing the scan.

This is acceptable for typical drives with a single partition, but not when the image has several partitions or volume groups.

If every disk has five volume groups, each mounted on a different path, and every THOR scan requires ten seconds of setup, then across 1,000 volumes that adds up to 5 × 1,000 × 10 = 50,000 seconds, or roughly 14 hours, of compute time spent on scan setup alone.

By setting the target to the parent path of each volume’s mount paths and scanning all five volume groups in a single run, the setup overhead drops to 1,000 × 10 = 10,000 seconds, or under three hours, which is a substantial difference in the cloud.
Luckily, the THOR APT scanner supports this feature with the lab license.

Conclusion

The main takeaway here is that proper coding practices are necessary when dealing with cloud resources, regardless of your background, and that automation can be both your ally and your enemy when dealing with large-scale issues.
