We're publishing a new tool to diff virtual machine snapshots and view the diffs in-browser. It lets you see every file and process that’s changed between two points in time, and lets us finally answer the question “what happens on my computer when I do X?”. We’re releasing the prototype today on GitHub, and we hope you like it! ☀️☀️ This has nothing to do with AI.
vmdiff is like git diff but for the whole computer
Pop quiz, what happens on your computer when you install a new piece of software? What does this do?
What happens when you run software? When you change a setting?
What happens when you run this command? Does anyone know for sure?
What happens if you take your hands off the keyboard, go outside, and forget that computers were ever inflicted upon your life? Does anything change on your computer?
How do you know?
These are some of the embarrassingly simple questions I asked myself, and the answer I found was that nobody knows.
Sure, when you install software, you know some things that change, from e.g. the logs the installer shows you, telling you which files it’s put where. But can you say for sure that’s everything that changed? What about if the installer changes another file that it doesn’t log? What if it just quietly adds write permissions to some of your directories? What if it sets some environment variables that are used by another program?
It is really hard right now to say:
“Here is everything that happens when you run this program.”
Currently we’re at:
“Here is some of what happens when you run this program.
Maybe that’s everything? Who can truly say, the universe is so mysterious ✨”
So I wonder what secrets are hiding in the shadows 👀
In the last episode of Icarus Labs, our protagonist discovered that a whole lot of software developers had installed Docker Desktop on their macOS laptops, without knowing that this also installed a Linux virtual machine on their computer. Normally that would just feel like a bit of a violation of the sanctity of one’s MacBook, but this Linux virtual machine also happened to allow us to hide malware very convincingly.
That’s why we’re here today. I thought to myself:
Wow, lots of people actually installed this software without knowing it’s a Linux virtual machine. I wonder what else I’ve installed that has secret stuff in it?
In finding the Docker virtual machine, I had to manually rummage through my files, trying to guess where the Docker app stored its config, where it wrote files, which files belonged to it, and so on, like a caveman. It took a long time. But we are no longer in caveman times, you see. We have blog posts, for example.
This blog post, then, is about the tool I made to mass-produce blog posts like the last one.
I decided I was sick of this chaotic and lawless world, and wanted to see a comprehensive list of everything that changed on my computer between two points in time. I wanted the sweet reassurance of being able to say “it’s not on my list, so it didn’t happen”.
I figured the simplest way to test it out was to use a virtual machine.
First I’d use the built-in feature of the VM software (e.g. VMware) to take a snapshot (a saved state of the disk and memory), then make my changes (e.g. running software, changing a setting, anything), then take another snapshot.
I then wanted to see the difference between the snapshots.
“Like git diff”, I thought to myself.
So I of course googled “diff virtual machine snapshot”, and found….. nothing.
What do you mean nobody’s done this before? Nobody’s wanted a complete list of what happens when you run a program on your computer? Like, it’s okay if I, just some person, don’t know everything that happens when I run software on my computer. But nobody knows? You live like this? The Great Barrier Reef is dying and you don’t know what happens when you install something on your computer?
By which I mean I uh couldn’t believe that this seemingly basic thing hadn’t been done yet, so I tried making it.
Skipping several months of understanding virtual machines in a way that was far too intimate for me, I ended up with a good-enough prototype.
Accepts two virtual machine snapshots (vmdk and vmem files)
Diffs all files on both disks, line by line (including deleted files). If it’s not in the list, it didn’t happen
Diffs memory (running processes, command lines, and environment variables) on Windows
Diffs also available to search/process via terminal as local directories (think grep)
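To make the "line by line, including deleted files" idea concrete, here is a minimal sketch in Python. It is not how vmdiff works internally. It assumes the two snapshots' disks have already been mounted or extracted as ordinary directories, and the function names (list_files, read_lines, diff_trees) are made up for illustration.

```python
import difflib
from pathlib import Path


def list_files(root: Path) -> set[str]:
    """Return every regular file under root, as paths relative to root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def read_lines(path: Path) -> list[str]:
    """Read a file as text lines; undecodable bytes are replaced, not fatal."""
    return path.read_text(errors="replace").splitlines(keepends=True)


def diff_trees(before: Path, after: Path) -> dict[str, list[str]]:
    """Map each added, deleted, or modified file to its unified diff lines."""
    old, new = list_files(before), list_files(after)
    result = {}
    for rel in sorted(old | new):
        # A file missing on one side is treated as empty on that side,
        # so deleted files show up as all-minus diffs and new files as all-plus.
        old_lines = read_lines(before / rel) if rel in old else []
        new_lines = read_lines(after / rel) if rel in new else []
        if rel in old and rel in new and old_lines == new_lines:
            continue  # unchanged file: "not in the list, so it didn't happen"
        result[rel] = list(
            difflib.unified_diff(
                old_lines, new_lines, fromfile=f"before/{rel}", tofile=f"after/{rel}"
            )
        )
    return result
```

Calling diff_trees(Path("before"), Path("after")) on two such directories (both paths are placeholders) returns only the files that differ, which is the whole point: anything absent from the result did not change in content.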
Hmmm, I’ve always thought Docker was a bit strange. Why do you have to install an extra app that lives in your taskbar? Why can’t you install it with a package manager? Let’s find out.
Get a Virtual Machine
Take a “before” snapshot
Install Docker Desktop (or whatever you want to test)
Take an “after” snapshot
Now you have two snapshots, and that’s all you need 😎😎
vmdiff
Point vmdiff at the two snapshots, and thennnn
Point your browser at the delightful new website on localhost:5000 to seeee
Hmmm, what’s this?
What’s DockerDesktop.vhdx? It turns out that .vhdx is a format for virtual machine disk files. This might lead you to the surprising but incredibly real conclusion that there was a virtual machine on your computer. Who knows what kind of implications that might have?
You can also see how the running processes have changed between the two snapshots, like this:
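The idea behind a process diff can be sketched in a few lines of Python. This is an illustration only, not vmdiff's implementation: it assumes the process lists have already been pulled out of the two memory snapshots by some other tool, and the Process type, the diff_processes function, and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Process(NamedTuple):
    pid: int
    name: str
    cmdline: str


def diff_processes(before, after):
    """Return (started, exited) processes, comparing by pid, name and command line."""
    old, new = set(before), set(after)
    started = sorted(new - old)  # present only in the "after" snapshot
    exited = sorted(old - new)   # present only in the "before" snapshot
    return started, exited


# Hypothetical sample data, not taken from a real snapshot.
before = [Process(4, "System", ""), Process(100, "old.exe", "old.exe")]
after = [Process(4, "System", ""), Process(200, "new.exe", "new.exe --example")]

started, exited = diff_processes(before, after)
print("started:", started)
print("exited:", exited)
```

Comparing the whole (pid, name, command line) tuple means a process that was restarted with different arguments shows up as one exit plus one start, rather than being silently treated as unchanged.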
Maybe you want to see every file that mentions “docker”, sorted by frequency? The tree of diffs is also available in plaintext in a local directory, so you can grep, find, sort, etc.
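If you'd rather do the "every file that mentions docker, sorted by frequency" search from a script than from a shell pipeline, something like the following would do it. It is a sketch under assumptions: the layout of the diff directory is not specified here, so the code simply walks whatever directory it is given, and count_mentions is a made-up helper name.

```python
from pathlib import Path


def count_mentions(root: Path, needle: str) -> list[tuple[int, str]]:
    """Return (count, relative path) for files mentioning needle, most mentions first."""
    needle = needle.lower()
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Case-insensitive match; undecodable bytes are replaced rather than raising.
        text = path.read_text(errors="replace").lower()
        count = text.count(needle)
        if count:
            hits.append((count, path.relative_to(root).as_posix()))
    # Highest count first, then alphabetical so the order is stable.
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))
```

Usage would look like count_mentions(Path("diffs"), "docker"), where "diffs" is a placeholder for wherever the plaintext diff tree lives on your machine.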
You’ll notice that it’s not just the contents of the file that are diffed, but also the metadata, e.g. permissions, timestamps, or whatever “extended file attributes” the OS wants to put on them. For example, here’s a diff showing the macOS “quarantine” attribute changing:
Quarantine is used to enable features such as showing you popup boxes that say “this file was downloaded from the internet, are you sure you want to open it?”
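Diffing metadata works the same way as diffing contents: collect the fields on each side and report the ones that differ. Here is a minimal sketch in Python, assuming both versions of the file are reachable as ordinary paths. It covers only permissions, size, and modification time; extended attributes such as quarantine are left out because reading them is platform-specific. The function names are hypothetical, and this is not vmdiff's own code.

```python
import os
import stat


def metadata(path: str) -> dict[str, object]:
    """Collect a few metadata fields without following symlinks."""
    info = os.lstat(path)
    return {
        "mode": stat.filemode(info.st_mode),  # e.g. "-rw-r--r--"
        "size": info.st_size,
        "mtime_ns": info.st_mtime_ns,
    }


def diff_metadata(before: str, after: str) -> dict[str, tuple[object, object]]:
    """Return {field: (old value, new value)} for every field that changed."""
    old, new = metadata(before), metadata(after)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}
```

A result such as a changed "mode" with identical "size" and "mtime_ns" is exactly the kind of quiet permission change that a content-only diff would miss.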
I mean would you look at this?
It's got tables.
vmdiff for?
I don’t even know all the uses. There are so many reasons why someone might want to know everything that’s changed on a computer between two points in time.
Here are some examples of when people might use it:
Security researchers
To analyse what software does (my use case)
e.g. “How does this program store its config data/authentication cookies/assets?”
Analyse what malware does, of course
Existing malware sandboxes list what the malware does (e.g. syscalls, network requests)
vmdiff lists the result of running the malware
Both are good, just different
Software engineers
To ensure nothing has changed on a computer e.g. software testing
To diagnose and debug complex problems, when you really need more information on what’s going on inside your computer to figure out why it’s happening
If that simple diagram somehow didn’t answer all your questions, more details are on GitHub.
I’m not going to be working on/maintaining vmdiff for at least 12 months, maybe ever
I’d love for someone to steal this genius idea, either forking the prototype, or making their own
If you ever need to find out everything that happens on your computer when you do something, the prototype is available on GitHub.
Alex Hope