machine learning

Handling IP Addresses in Machine Learning Datasets

As I was working on a small research project that employs a typical cybersecurity dataset, I came across this question. How should we handle IP addresses in these datasets? Here are my thoughts about this.

Many datasets used in research conducted in cybersecurity contain IP addresses, such as source IP address, and/or destination IP address. In most of these datasets, IPv4 addresses are expressed in the usual dotted decimal format (ex 192.168.0.1). Most programming languages, like Python, will go crazy when they see a number with three decimal points and hence the question; how do we handle them?

A question that I would like to pause before addressing the IP address handling problem; how useful are IP addresses in training models?

If you’re training a model to detect intrusions, what impact does the source IP address has on your model’s decision to classify the packet/stream as malicious? The source IP address of a single packet might not be of great help to recognize the packet as malicious. However, if that is combined with other packets maybe coming from the same source, there might be some value in knowing the source IP address. In a similar conclusion, the destination IP address of a single packet might not help in the proper classification process, but combined with packets coming from many other sources to the same destination might flag a DDoS attack. This, makes classical classification algorithms incapable of properly classifying traffic based on features relevant to a single packet.

There are other scenarios where some or all the IP addresses in the dataset do not really contribute to the classification process. On the contrary, these addresses can confuse the classifier and through the results into the wrong class. In a dataset that contains IP addresses

Handling them as a String (really??)

If you choose to handle them in your model as strings, then their actual contribution to the training of your model is close to nothing. I advise highly against this because it doesn’t help the classification process by any means.

Handling them as Decimal numbers

Handling IP addresses as numbers can take one of the following two paths:

Path 1

Convert the IP address to a binary numbers, and then convert the binary number to an integer. IP addresses can easily be converted into binary numbers from their original dotted-decimal format as each octet can be converted into an 8-bit binary representation. Then, you combine these 4 8-bit octets into a single 32-bit number, and covert that to a decimal number.

The code below shows a way of doing this in python:

Assuming that the IP addresses are held in a list named “array”,

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
for pnt in range(len(array)):
x = array[pnt]
z = 0
parts = x.split('.')
z = (int(parts[0]) << 24) + (int(parts[1]) << 16) + (int(parts[2]) << 8) + int(parts[3])
array[pnt]= z
for pnt in range(len(array)): x = array[pnt] z = 0 parts = x.split('.') z = (int(parts[0]) << 24) + (int(parts[1]) << 16) + (int(parts[2]) << 8) + int(parts[3]) array[pnt]= z
for pnt in range(len(array)):
    x = array[pnt]
    z = 0
    parts = x.split('.')
    z = (int(parts[0]) << 24) + (int(parts[1]) << 16) + (int(parts[2]) << 8) + int(parts[3])
    array[pnt]= z

Keeping in mind that in most cases, the IP address itself is not an indicator of an attack, and most of the time it doesn’t hold a value in the detection process. However, as we explained earlier, the relationship between different IP addresses might help the classification process for some types of attacks. The relationship between IP addresses might indicate whether the attacker is an internal actor, or an external actor as well.  With that in mind, this method maintains this relationship between IP addresses, as the system would be able to learn that IP addresses that are close to each other can be within the same network. This can help in recognizing attacks coming from botnets within the same IP cluster, for example.

Path 2

In this path, we strip the dots from the dotted-decimal and simply add zeros to the left to complete each octet to exactly three digits. This means that the IP address 192.168.1.14 becomes 192168001014. This python code shows how this can be done

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
for pnt in range(len(array)):
x = array[pnt, 0]
z = ''
parts = x.split('.')
for prt in range(3)
if len(parts[prt])=3:
z = parts[prt] + z
elif len(parts[prt]=2:
z = '0'+ parts[prt] + z
else:
z = '00' + parts[prt] + z
array[pnt]= int(z)
for pnt in range(len(array)): x = array[pnt, 0] z = '' parts = x.split('.') for prt in range(3) if len(parts[prt])=3: z = parts[prt] + z elif len(parts[prt]=2: z = '0'+ parts[prt] + z else: z = '00' + parts[prt] + z array[pnt]= int(z)
for pnt in range(len(array)):
    x = array[pnt, 0]
    z = ''
    parts = x.split('.')
    for prt in range(3)
        if len(parts[prt])=3:
            z = parts[prt] + z
        elif len(parts[prt]=2:
            z = '0'+ parts[prt] + z
        else:
            z = '00' + parts[prt] + z
        array[pnt]= int(z)

 

This method also maintains the relationship between IP addresses and helps the recognition of adjacent nodes and networks.

One Hot Encoding

In this method, IP addresses get encoded into representative codes instead of being dealt with as values (whether string, or integer ones). When using this method, each IP address would be represented as a code (such as a binary code). This way, you can identify each packet coming from, or going to, a particular IP address. However, this representation will eliminate the relationship between these IP addresses. This means that your system would not be able to identify the relationship between IP addresses of two hosts within the same network.

Although the relationship is not maintained between IP addresses in this representation, it can be helpful in recognizing data flows, and doing statistics analysis of data originating or landing in a specific node. Even if the system doesn’t know what the exact IP address in a packet, it can easily calculate the amount of traffic this node is sending or receiving, or the amount of data going from a particular source to a particular destination.

For example, if your dataset contains 110 unique IP addresses, these will be replaced with 7-bit codes. The first IP address appearing in the dataset would be name ‘0000000’. Consequently, and mention of the same IP address will be encoded as ‘0000000’. The second IP address, and all of it’s consequent mentions, would be replaced with ‘0000001’ and so on.

In python this can be done through the use of SKLearn Label Binarizer.

Final Remarks

As we discussed earlier, IP addresses are not always helpful to include within the training process. In any way you choose to represent them, they can hinder the system’s learning and have a damaging impact to the system’s outcome. A general rule of thumb is that do not include IP addresses in your training data unless you recognize a tangible benefit from it. If you’re implementing a system to detect denial of service attacks, IP addresses can help. If your system needs to differentiate internal from external traffic, you need to include IP addresses. Just look at how the IP address impacts the attack detection process and make a decision.

If you’ve used other ways of handling IP addresses, I’d love to hear from you. Perhaps, I’d include it in this article as an update. Let me know what you think in the comments section below, or shoot me an email.