Learning Python for Forensics

Developing our first forensic script – usb_lookup.py

Now that we've gotten our feet wet writing our first Python script, let's write our first forensic script. During forensic investigations, it is not uncommon to see references to external devices by their vendor ID (VID) and product ID (PID) values; these values are represented by four hexadecimal characters. In cases where the vendor and product name are not identified, the examiner must look up this information. One such location for this information is the webpage http://linux-usb.org/usb.ids. For example, on this webpage, we can see that a Canon EOS Digital Rebel has a vendor ID of 0x04A9 and a product ID of 0x3084. We will use this data source when attempting to identify the vendor and product names, using the defined identifiers.

First, let's look at the data source we're going to parse. The file lists USB vendors and, for each vendor, a set of USB products. Each vendor or product line consists of a four-character hexadecimal ID followed by a name. Tabs distinguish vendor lines from product lines: products are indented by one tab under their parent vendor. As a forensic developer, you will come to love patterns and data structures, as it is a happy day when data follows a strict set of rules. Because of this, we will be able to preserve the relationship between the vendor and its products in a simple manner. Here is a hypothetical sample illustrating the structure of the data source:

0001 Vendor Name  
    0001 Product Name 1  
    0002 Product Name 2  
    ...
    000N Product Name N

This script, named usb_lookup.py, takes a VID and PID supplied by the user and returns the appropriate vendor and product names. Our program uses the urllib2 module to download the usb.ids database to memory and create a dictionary of vendor IDs and their products. If a vendor and product combination is not found, error handling will inform the user of any partial results and exit the program gracefully.

The main() function contains the logic to download the usb.ids file, store it in memory, and create the USB dictionary using the getRecord() helper function. The structure of the USB dictionary is somewhat complex and involves mapping a vendor ID to a list, containing the name of the vendor as the first element, and a product dictionary as the second element. This product dictionary maps product IDs to their names. The following is an example of the USB dictionary containing two vendors, VendorId_1 and VendorId_2, each mapped to a list containing the vendor name, and a dictionary for any product ID and name pairs:

usbs = {VendorId_1: [VendorName_1, {ProductId_1: ProductName_1, ProductId_2: ProductName_2, ProductId_N: ProductName_N}], VendorId_2: [VendorName_2, {}], etc.}

It may be tempting to just search for VID and PID in the lines and return the names rather than creating this dictionary that links vendors to their products. However, products can share the same ID across different vendors, which could result in mistakenly returning a product from a different vendor. With our method, we can be sure that the product belongs to the associated vendor.
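To see why the nesting matters, here is a small sketch using made-up entries (the IDs, names, and the lookup() helper are hypothetical and not part of usb_lookup.py). Two vendors reuse product ID 0001, and only a lookup that goes through the vendor first keeps them apart:

```python
# Hypothetical data: two vendors that both use product ID '0001'.
usbs = {
    'aaaa': ['Vendor A', {'0001': 'Widget', '0002': 'Gadget'}],
    'bbbb': ['Vendor B', {'0001': 'Gizmo'}],
}


def lookup(usb_dict, vid, pid):
    """Return (vendor name, product name) for a VID/PID pair."""
    # Unpack the two-element list: vendor name, then product dictionary.
    vendor_name, products = usb_dict[vid]
    return vendor_name, products[pid]


# The same product ID resolves to a different product under each vendor.
result_a = lookup(usbs, 'aaaa', '0001')
result_b = lookup(usbs, 'bbbb', '0001')
print(result_a)
print(result_b)
```

A flat search for the string 0001 across all lines would have no way of telling these two products apart.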

Once the USB dictionary has been created, the searchKey() function is responsible for querying the dictionary for a match. It first checks that the user supplied two arguments, VID and PID, before continuing execution of the script. Next, it searches for a VID match in the outermost dictionary. If VID is found, the innermost dictionary is searched for the responsive PID. If both are found, the resolved names are printed to the console. See the following code:

001 import urllib2
002 import sys
...
009 def main():
...
043 def getRecord(record_line):
...
057 def searchKey(usb_dict):
...
092 if __name__ == '__main__':
093     main()

For larger scripts, such as this, it is helpful to view a diagram that illustrates how these functions are connected. Fortunately, a library named code2flow, available on GitHub (https://github.com/scottrogowski/code2flow.git), exists to automate this process for us. The following schematic illustrates the interactions between the functions: main() first calls and receives data from getRecord(), and then calls searchKey(). There are other libraries that can create similar flow charts. However, this library does a great job of creating a simple and easy-to-understand flow chart:

[Figure: code2flow flow chart showing main() calling getRecord() and then searchKey()]

Understanding the main() function

Let's start by examining the main() function, which is called on line 93 as seen in the previous code block. As discussed previously, the if statement on line 92 is a common and preferred way of calling our starting function. The condition only evaluates to True if the script is run directly by the user. If, for example, the script were imported, main() would not be called automatically. However, the function could still be called explicitly, just like any other function from an imported module. The main() function begins as follows:

009 def main():
010     """
011     The main function opens the URL and parses the file before searching for the
012     user supplied vendor and product IDs.
013     :return: Nothing.
014     """

On lines 15 through 20, we create our initial variables. The url variable stores the URL containing the USB data source. We use the urlopen() function from the urllib2 module to create a file-like object from our online source, which we can iterate over line by line. We will use a lot of string operations, such as startswith(), isdigit(), islower(), and count(), to parse the file structure and store the parsed data in the usbs dictionary. The curr_id variable, defined as an empty string on line 20, will be used to keep track of which vendor is currently being processed by our script:

015     url = 'http://www.linux-usb.org/usb.ids'
016     usbs = {}
017     # The urlopen function creates a file-like object and can be treated similarly.
018     usb_file = urllib2.urlopen(url)
019     # The curr_id variable is used to keep track of the current vendor being processed into the dictionary.
020     curr_id = ''
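As a rough sketch of what "file-like" means here, the snippet below substitutes an in-memory io.BytesIO object for the network response (an assumption made so that the example runs offline; the content is made up) and iterates over it line by line, just as the script does with usb_file. In Python 3, where urllib2 no longer exists and the equivalent call is urllib.request.urlopen(), the response yields bytes, so each line would need decoding before string checks such as startswith('#') are applied:

```python
import io

# Stand-in for the object returned by urlopen(); the content is made up.
fake_response = io.BytesIO(
    b"# comment line\n0001 Vendor Name\n\t0001 Product Name 1\n"
)

lines = []
for raw_line in fake_response:
    # A response object yields bytes in Python 3, so decode before parsing.
    # The latin-1 codec is an assumption; it maps every byte to a character,
    # so decoding cannot fail on unexpected bytes.
    lines.append(raw_line.decode('latin-1'))

print(lines)
```

Each element keeps its trailing newline, which is why the script compares blank lines against '\n' and strips lines before parsing them.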

On line 22, we begin to iterate through the USB data we retrieved with urllib2. Starting with line 24, we create a conditional clause to identify vendor, product, and trivial lines. Trivial lines, which our script skips, are those that start with a pound symbol (checked with startswith()) or are blank. If a line is important, we check whether it is a vendor or product line as follows:

022     for line in usb_file:
023         # Any commented line or blank line should be skipped.
024         if line.startswith('#') or line == '\n':
025             pass

On line 28, we check that the line does not start with a tab and that its first character is either a digit or a lowercase letter. We perform this check because there are entries that do not follow the general structure seen in the first three quarters of the data source. Upon further inspection, the lines starting with an uppercase character are inconsequential to our current task, so we disregard them:

026         else:
027             # Lines that are not tabbed are vendor lines.
028             if not(line.startswith('\t')) and (line[0].isdigit() or line[0].islower()):
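The condition on line 28 can be tried in isolation. The following sketch wraps it in a hypothetical helper, is_vendor_line(), and applies it to a few illustrative lines (the sample strings are assumptions modeled on the structure described above, not verbatim quotes from usb.ids):

```python
def is_vendor_line(line):
    """Apply the same test as line 28 of usb_lookup.py."""
    # Not indented with a tab, and the first character is a digit or lowercase.
    return not line.startswith('\t') and (line[0].isdigit() or line[0].islower())


checks = {
    '04a9 Canon, Inc.\n': is_vendor_line('04a9 Canon, Inc.\n'),
    'a001 Some Vendor\n': is_vendor_line('a001 Some Vendor\n'),
    '\t3084 Some Product\n': is_vendor_line('\t3084 Some Product\n'),
    'C 00 Some Class\n': is_vendor_line('C 00 Some Class\n'),
}

for sample, outcome in checks.items():
    print(repr(sample), outcome)
```

The lowercase test matters because a hexadecimal ID may begin with a letter, while the uppercase-prefixed line is rejected.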

If the line is what we have defined as a vendor line, we call the getRecord() function on line 30 and pass the current line. The line is first stripped of any newline characters to avoid any potential errors in the getRecord() function. Note that we assign two variables, id and name, to capture the returned output from the getRecord() function. This is because the function returns a tuple of two values, which Python unpacks into the two variables. Consider the following code:

029                 # Call the getRecord helper function to parse the Vendor Id and Name.
030                 id, name = getRecord(line.strip())

On line 31, we set the curr_id variable to the current vendor ID. We add our id key to the usbs dictionary and map the key to a list containing the name of the vendor and an empty dictionary, which we will populate with product information. This completes the logic necessary to handle vendor lines. Now, let's take a look at the following code to see how product lines are handled:

031                 curr_id = id
032                 usbs[id] = [name, {}]

A line is a product line if it starts with a tab and contains exactly one tab character. There are lines in the dataset that contain more than one tab, but again, they belong to the last quarter of the data source and do not contribute to our current goal. If we do encounter a product line, we send it for processing on line 36 by passing it to the getRecord() function.

On line 37, we update the usbs dictionary by adding the id key to the embedded dictionary within the list mapped to the curr_id key. With so many brackets following our usbs variable, you might ask, "Didn't you tout Python as highly readable?" Yep. You'll grow accustomed to accessing items in lists and dictionaries in this manner, but let's break it down one step at a time.

The first index, curr_id, is our current vendor ID. By the time we reach a product line, its vendor line will already have been encountered and added as a key to the usbs dictionary. Indexing with curr_id gives us the list that contains the vendor name and the product dictionary at indices 0 and 1, respectively. We want to add our product ID and name to the product dictionary, so we specify index 1 of the list. Then, it is a matter of setting the id key to our name value. See the following code:

033             # Lines with one tab are product lines
034             elif line.startswith('\t') and line.count('\t') == 1:
035                 # Call the getRecord helper function to parse the Product Id and Name.
036                 id, name = getRecord(line.strip())
037                 usbs[curr_id][1][id] = name
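To see the vendor and product branches working together, here is a minimal offline sketch that runs the same logic over an in-memory list instead of the downloaded file. The function names get_record() and parse_usb_lines() and the sample lines are assumptions introduced for illustration:

```python
def get_record(record_line):
    """Split a stripped line into its ID and name at the first space."""
    split = record_line.find(' ')
    return record_line[:split], record_line[split + 1:]


def parse_usb_lines(lines):
    """Build {vendor_id: [vendor_name, {product_id: product_name}]}."""
    usbs = {}
    curr_id = ''
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith('#') or line == '\n':
            continue
        # Untabbed lines starting with a digit or lowercase letter are vendors.
        if not line.startswith('\t') and (line[0].isdigit() or line[0].islower()):
            vendor_id, name = get_record(line.strip())
            curr_id = vendor_id
            usbs[vendor_id] = [name, {}]
        # Lines with exactly one leading tab are products of the current vendor.
        elif line.startswith('\t') and line.count('\t') == 1:
            product_id, name = get_record(line.strip())
            usbs[curr_id][1][product_id] = name
    return usbs


sample = [
    '# made-up sample\n',
    '\n',
    '0001 Vendor Name\n',
    '\t0001 Product Name 1\n',
    '\t0002 Product Name 2\n',
    '\t\t00 Doubly tabbed line\n',
]
usbs = parse_usb_lines(sample)
print(usbs)
```

The doubly tabbed line matches neither branch and is silently dropped, which mirrors how the script treats the entries it does not need.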

Finally, we call the searchKey() function to look for the user-supplied vendor and product IDs in our usbs dictionary. The call is indented so that it falls outside the for loop and is not made until all lines have been processed, as follows:

039     # Pass the usbs dictionary and search for the supplied vendor and product id match.
040     searchKey(usbs)

This takes care of the logic in the main() function. Now, let's take a look at the getRecord() function called on lines 30 and 36 to determine how we separate the id and name for vendor and product lines.

Exploring the getRecord() function

This helper function, defined on line 43, takes the vendor or product line and returns its ID and name. The following is an example of one of the vendor lines in the file and what our function outputs:

Input from the usb.ids file: "04e8 Samsung Electronics Co., Ltd"
getRecord() output: "04e8", "Samsung Electronics Co., Ltd"

We accomplish this by using string slicing. In the previous example, the vendor ID, 04e8, is separated from the name by one space. Specifically, it will be the first space on the left side of the string. We use the find() string method, which searches from left to right in a string for a given substring. We give this method a space as its input to find the index in the string of the first space. In the previous example, our split variable would contain a value of 4. Knowing this, we know that the ID is everything to the left of the space. Our name is everything to the right of the space. We add one to our split variable for the name in order to not capture the space before the name:

043 def getRecord(record_line):
044     """
045     The getRecord function splits the ID and name on each line.
046     :param record_line: The current line in the USB file.
047     :return: record_id and record_name from the USB file.
048     """
049     # Find the first instance of a space -- should be right after the id
050     split = record_line.find(' ')
051     # Use string slicing to split the string in two at the space.
052     record_id = record_line[:split]
053     record_name = record_line[split + 1:]
054     return record_id, record_name
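The slicing approach relies on exactly one space separating the ID from the name. As a hedged alternative, not the method used in usb_lookup.py, the sketch below compares it with split(None, 1), which splits once on any run of whitespace. The function names are hypothetical, and the second input line is a made-up variant with two spaces:

```python
def get_record(record_line):
    """Slice at the first space, as getRecord() does."""
    split = record_line.find(' ')
    return record_line[:split], record_line[split + 1:]


def get_record_split(record_line):
    """Alternative: split once on any run of whitespace."""
    record_id, record_name = record_line.split(None, 1)
    return record_id, record_name


single = '04e8 Samsung Electronics Co., Ltd'
double = '04e8  Samsung Electronics Co., Ltd'  # two spaces after the ID

sliced_single = get_record(single)
sliced_double = get_record(double)
split_double = get_record_split(double)

print(sliced_single)
print(sliced_double)
print(split_double)
```

With a single space both approaches agree. With two spaces, the slicing version leaves a leading space on the name, whereas the split version does not.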

Interpreting the searchKey() function

The searchKey() function, originally called on line 40 of the main() function, is where we search for the user-supplied vendor and product IDs and display the results to the user. In addition, all of our error handling logic is contained within this function.

Let's practice accessing nested lists and dictionaries. We discussed this in the main() function; however, it pays to actually practice rather than take our word for it. Accessing nested structures requires us to use multiple indices rather than just one. For example, let's create a list and map it to key_1 in a dictionary. To access elements of the nested list, we supply key_1 to access the list and then a numerical index to access an element of the list:

>>> inner_list = ['a', 'b', 'c', 'd']
>>> print inner_list[0]
a
>>> outer_dict = {'key_1': inner_list}
>>> print outer_dict['key_1']
['a', 'b', 'c', 'd']
>>> print outer_dict['key_1'][3]
d
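The same stepwise indexing applies to the usbs structure, with one more level because the list itself holds a dictionary. The sketch below uses a made-up entry and also shows what happens when a key is absent, which is the situation the error handling in searchKey() has to cover:

```python
# Hypothetical entry in the same shape as the usbs dictionary.
usbs = {'0001': ['Vendor Name', {'0001': 'Product Name 1'}]}

entry = usbs['0001']          # the list: [vendor name, product dictionary]
vendor_name = entry[0]        # index 0 holds the vendor name
products = entry[1]           # index 1 holds the product dictionary
product_name = usbs['0001'][1]['0001']  # all three steps chained together

# Asking a dictionary for a key it does not contain raises a KeyError.
try:
    usbs['ffff']
    missing_vendor = False
except KeyError:
    missing_vendor = True

print(vendor_name)
print(product_name)
print(missing_vendor)
```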

Now, let's switch gears, back to the task at hand, and leverage our new skills to search our dictionary to find vendor and product IDs. The searchKey() function is defined on line 57 and takes our parsed USB data as its only input as follows:

057 def searchKey(usb_dict):
058     """
059     The searchKey function looks for the user supplied vendor and product IDs against
060     the USB dictionary that was parsed in the main and getRecord functions.
061     :param usb_dict: The dictionary containing our list of vendors and products to query against.
062     :return: Nothing.
063     """

On line 66, we use a try and except block to assign the values at indices 1 and 2 of the sys.argv list to variables. If the user does not supply both values, Python raises an IndexError, in which case we print a message asking for the missing input and exit the script with an error (traditionally, non-zero exit codes indicate an error) as follows:

064     # Accept user arguments for Vendor and Product Id. Error will be thrown if there are less
065     # than 2 supplied arguments. Any additional arguments will be ignored
066     try:
067         vendor_key = sys.argv[1]
068         product_key = sys.argv[2]
069     except IndexError:
070         print 'Please provide the vendor Id and product Id separated by spaces.'
071         sys.exit(1)
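The behavior of this block can be explored without running the script from a shell. The sketch below moves the same indexing into a hypothetical parse_args() helper that accepts an argument list shaped like sys.argv and returns None instead of exiting, an assumption made so that several cases can be tried in one run:

```python
def parse_args(argv):
    """Return (vendor_key, product_key), or None if either is missing."""
    try:
        # Index 0 is the script name, so the user's values start at index 1.
        return argv[1], argv[2]
    except IndexError:
        return None


both = parse_args(['usb_lookup.py', '0951', '1643'])
only_vendor = parse_args(['usb_lookup.py', '0951'])
extra = parse_args(['usb_lookup.py', '0951', '1643', 'ignored'])

print(both)
print(only_vendor)
print(extra)
```

As the comment on line 65 notes, any arguments beyond the first two are simply ignored.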

We use another try and except block on line 75 to find our vendor in the USB dictionary. When a dictionary is queried for a key that is not present, Python raises a KeyError. If this happens, the vendor ID does not exist in our USB dictionary, and we print a message to the user saying so. Because this isn't technically an error, just a case of our data source being incomplete, we exit with a status of zero. We have the following code:

073     # Search for the Vendor Name by looking for the Vendor Id key in usb_dict. The zeroth index
074     # of the list is the Vendor Name. If Vendor Id is not found, exit the program.
075     try:
076         vendor = usb_dict[vendor_key][0]
077     except KeyError:
078         print 'Vendor Id not found.'
079         sys.exit(0)

In a similar manner, if the product key is not found under its vendor key, we print out this result to the user. In this scenario, we can at least print the vendor information to the console before exiting, as follows:

081     # Search for the Product Name by looking in the product dictionary in the first index of
082     # the list. If the Product Id is not found print the Vendor Name as a partial match and exit.
083     try:
084         product = usb_dict[vendor_key][1][product_key]
085     except KeyError:
086         print 'Vendor: {}\nProduct Id not found.'.format(vendor)
087         sys.exit(0)

If both our vendor and product IDs were found, we print this data to the user. On line 89, we use the string format() function to print the vendor and product names separated by a newline character as follows:

088     # If both the Vendor and Product Ids are found, print their respective names to the console.
089     print 'Vendor: {}\nProduct: {}'.format(vendor, product)
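The three outcomes of searchKey() can be exercised side by side with a variant that returns its message rather than printing it and exiting. This is a sketch under that assumption, using a hypothetical search_key() function and made-up dictionary entries, not the script's actual function:

```python
def search_key(usb_dict, vendor_key, product_key):
    """Return the message that searchKey() would print for the given IDs."""
    try:
        vendor = usb_dict[vendor_key][0]
    except KeyError:
        return 'Vendor Id not found.'
    try:
        product = usb_dict[vendor_key][1][product_key]
    except KeyError:
        # Partial match: the vendor is known, the product is not.
        return 'Vendor: {}\nProduct Id not found.'.format(vendor)
    return 'Vendor: {}\nProduct: {}'.format(vendor, product)


# Hypothetical data in the same shape as the usbs dictionary.
usbs = {'0001': ['Vendor Name', {'0001': 'Product Name 1'}]}

full_match = search_key(usbs, '0001', '0001')
partial_match = search_key(usbs, '0001', '9999')
no_match = search_key(usbs, 'ffff', '0001')

print(full_match)
print(partial_match)
print(no_match)
```

Returning a value instead of calling sys.exit() is a design choice made here only to keep the example testable; the script itself exits after printing.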

Running our first forensic script

The usb_lookup script requires two arguments: the vendor and product IDs of the USB device of interest. We can find this information by looking at the HKLM\SYSTEM\%CurrentControlSet%\Enum\USB registry key on a suspect system. For example, supplying the vendor, 0951, and product, 1643, from the subkey VID_0951&PID_1643, results in our script identifying the device as a Kingston DataTraveler G3.
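Pulling the two IDs out of a subkey name by hand is easy to get wrong, so a small helper can do it. The sketch below is an assumption-laden illustration: the vid_pid_from_subkey() function is hypothetical, and the lowercasing step assumes that the data source stores its IDs in lowercase, as the 04e8 example suggests, so that an uppercase ID from the registry would otherwise fail to match a dictionary key:

```python
import re

# Matches the VID_xxxx&PID_xxxx pattern found in USB subkey names.
SUBKEY_PATTERN = re.compile(r'VID_([0-9A-Fa-f]{4})&PID_([0-9A-Fa-f]{4})')


def vid_pid_from_subkey(subkey):
    """Return (vid, pid) in lowercase, or None if the pattern is absent."""
    match = SUBKEY_PATTERN.search(subkey)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


kingston = vid_pid_from_subkey('VID_0951&PID_1643')
canon = vid_pid_from_subkey('VID_04A9&PID_3084')
unrelated = vid_pid_from_subkey('ROOT_HUB')

print(kingston)
print(canon)
print(unrelated)
```

The resulting pair can then be passed to the script as its two command-line arguments.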

Our data source is not an all-inclusive list, and if you supply a vendor or a product ID that does not exist in the data source, our script will print that the ID was not found. The full code for this and all of our scripts can be downloaded from https://packtpub.com/books/content/support.