Rather than classifying the intent of a sample by reviewing its inner workings with static or dynamic analysis, this is a new algorithmic approach to attributing samples. Specifically, by reviewing common characteristics and visualizing these as Network Topologies.

Now this is new territory for me, so as ever - take what I say with a Thanos-sized handful of salt. I'll try my best.

This is where data science comes into play. There are many methods by which a Network Topology can be created, for example:

  • Network TopologyA can be created from AttributeA of the samples
  • Network TopologyB can be created from AttributeB of the samples instead.

Unlike the traditional association of "Networks", as in a LAN, in the context of data science a Network is the formal term for an applied method or algorithm (or model) - for example, a Neural Network.

Classification via Bipartite Networks

A Bipartite Network is, at its core, identification via the partitioning of nodes into two groups. In this case, for Malware, that means identifying shared attributes between malware samples - for example, Hostnames or IP Addresses contacted.

It can be used to quickly visualize various things, such as:

  1. Apparently different samples being written by the same Author(s)
    • Visualizing that different samples actually contact similar servers.
  2. A variant of a pre-existing type of Malware.
  3. A trend of perhaps compromised service(s), which may be discovered based upon a recurring domain pattern

To illustrate one example of using a Bipartite Network, we can use these two groups to contain the following:

Group #1: Contacted Domains (perhaps a set of Botnet Command and Control Servers owned by one group)
Group #2: Apparently different Malware Samples
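As a rough sketch of how those two groups could be wired together, here's a minimal example using Python's networkx library. The sample names and domains are made up purely for illustration:

```python
import networkx as nx

# Hypothetical data: which C2 domains each sample was observed contacting.
# Every name here is invented for the sake of the example.
contacted = {
    "sample_a": ["evil-c2.example", "update-srv.example"],
    "sample_b": ["evil-c2.example"],
    "sample_c": ["update-srv.example", "cdn-relay.example"],
}

G = nx.Graph()
# Group #1: domains (bipartite=0), Group #2: samples (bipartite=1).
# Edges only ever cross the partition - that's what makes it bipartite.
for sample, domains in contacted.items():
    G.add_node(sample, bipartite=1)
    for domain in domains:
        G.add_node(domain, bipartite=0)
        G.add_edge(sample, domain)

print(G.number_of_nodes())  # 6 (3 samples + 3 domains)
print(G.number_of_edges())  # 5 (one edge per observed contact)
```

Note how the shared domain `evil-c2.example` immediately links `sample_a` and `sample_b` - exactly the "apparently different samples contacting similar servers" case above.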

I used a provided diagram from the Book as a basis, and modified it to visualize a bit better.

Visualizing this Analysis, and the Associated Problems

The problem is, by using this Network to classify samples, we very quickly create a huuuuge Network Topology. Whilst it's great to show all the data, what's the point if we can't actually see it or realistically use it?!

This is where we can use weights to narrow down the size of a Network Topology, for a more realistic or quicker analysis.

For example, we can define a weight in this case where:

  • Only visualise / render a node if it connects to at least two of the grouped domains

Or... simply:

  • Render a sample if it connects to any of the grouped domains

A weight is quite literally just that - how much weight a sample has. The more connections to different grouped domains, the higher the weight; a sample with fewer connections to grouped domains will have a lower weight.

Relate it to a traditional points system: the more points you score, the higher your weight; the fewer points you score, the lower your weight.
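To make that concrete, here's a hedged sketch (again with made-up sample and domain names) where a sample's weight is simply its degree - the number of grouped domains it contacts - and we apply the "at least two domains" rule from above:

```python
import networkx as nx

# Toy bipartite graph: samples on one side, grouped C2 domains on the other.
G = nx.Graph()
G.add_edges_from([
    ("sample_a", "evil-c2.example"),
    ("sample_a", "update-srv.example"),
    ("sample_b", "evil-c2.example"),
    ("sample_c", "update-srv.example"),
    ("sample_c", "cdn-relay.example"),
])

samples = {"sample_a", "sample_b", "sample_c"}

# A sample's "weight" here is just its degree:
# how many grouped domains it connects to.
weights = {s: G.degree(s) for s in samples}

# Only render samples that connect to at least two grouped domains.
heavy = {s for s, w in weights.items() if w >= 2}
print(sorted(heavy))  # ['sample_a', 'sample_c'] - sample_b gets pruned
```

Dropping the single-connection `sample_b` shrinks the rendered topology while keeping the nodes that actually share infrastructure.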

All a Network Topology consists of is simply the rendering of samples (nodes) based upon their weights.

Like a traditional Network Topology, we want these nodes to be rendered based upon their proximity (via hop-count) to each other. It's illogical to plot two objects that are only 2 hops apart as if they were no closer than objects 15 hops away.
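Hop-count itself is cheap to compute - it's just the shortest-path length between nodes. A tiny sketch, using a made-up chain of nodes:

```python
import networkx as nx

# A toy topology: A-B-C-D-E in a chain. Hop-count (shortest-path length)
# is the graph's own notion of proximity, independent of any 2-D drawing.
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])

hops = dict(nx.shortest_path_length(G))
print(hops["A"]["B"])  # 1 hop - should be drawn close to A
print(hops["A"]["E"])  # 4 hops - should be drawn far from A
```

The rendering problem is then squashing those hop-counts into 2-D space without lying about them - which, as below, can't always be done.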

Ultimately, this problem - distortion - cannot be solved, only minimized. Once more than three nodes of differing hop-counts are introduced, a node may appear physically closer to another than it actually is.

Actual proximity of Nodes

Take this for example: this is the true hop-count between nodes, and it perfectly represents their geographical or logically-connected proximity.

However, distortion is the mis-visualization where nodes appear either geographically or logically connected closer than they actually are.

Notice how Node A is rendered as being the same distance from Node E and Node D, whilst geographically, they are not at all.

Combatting this Mis-Visualization

Rather than visualizing nodes based purely upon distance (hop-count), why not use the principles of how vector images work, where the image scales using maths to determine where the lines need to be redrawn - rather than PNGs, where the locations of the pixels are static?

To apply this further, why not just define a fixed amount of space between nodes, regardless of their actual proximity - forcing every Node to be plotted an equal distance away from each other, no matter how many nodes are visualized? E.g. enforcing a padding of 5% between each Node.
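One off-the-shelf way to sketch that equal-spacing idea: networkx's circular_layout ignores hop-count proximity entirely and spaces every node equally around a circle. The chain graph below is just a toy example:

```python
import math
import networkx as nx

G = nx.path_graph(6)  # a simple chain of 6 nodes: 0-1-2-3-4-5

# circular_layout discards proximity and places nodes at equal angular
# spacing around a circle - the "fixed padding" idea, taken to an extreme.
pos = nx.circular_layout(G)

# Every pair of neighbouring positions on the circle is the same
# distance apart, regardless of the nodes' actual hop-counts.
nodes = list(G.nodes())
dists = [
    math.dist(pos[nodes[i]], pos[nodes[(i + 1) % len(nodes)]])
    for i in range(len(nodes))
]
print(all(abs(d - dists[0]) < 1e-6 for d in dists))  # True
```

The trade-off is obvious: you eliminate distortion between neighbours by throwing away proximity information entirely, which is why layouts like this suit small, evenly-connected graphs better than sprawling malware topologies.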

Phew... That was heavy, and we've barely just gotten started. Read Pt II, where we begin to create these Network Topologies, and explore the various tools to do so - I am bricking it.