When talking about the protection of intellectual property, virtual machines and custom instruction set can play a very important role in the field. One of the ways to protect your algorithm from curious eyes is to use code obfuscation techniques. These can range from a simple instruction reordering to a more sophisticated control flow modifications and added layer of custom instruction set, or a combination of both. Recently I’ve watched a video about virtual machines for code obfuscation from RECON video archives. The speaker said that his implementation is available for download at his website for those who want to experiment; however, I could not find it so I’ve decided to implement my own compiler and virtual machine (vm).
The code of the compiler and testing version of the vm is available for download from github. I am not going to explain how they are implemented but I will explain how one can create his custom instruction set by extending the compiler and how to write code that compiler understands.
The compiler was written in python because it makes the work with text and strings trivial. There are three files: compiler.py, vmc.py and myis.py. The compiler.py is basically the class that provides all the logic: lexical analysis, compiling, linking, etc. The parser in the compiler does not support comments. So, if you are going to write anything longer than my factorial example you may get lost in the code. The compiler only supports 7 types of tokens where each token is separated by a white space:
- instruction – an instruction (see myis.py instructions variable).
- register – a register (see myis.py registers variable).
- @register – a register that holds memory address and will be dereferenced by the virtual machine.
- immediate – an immediate value. Syntax: int(<number>) where <number> can be a number in any format supported by python.
- label – a label. Syntax <name>:
- reference – any name that does not fall under the instruction and register categories. Usually a label without a colon.
- string – a sequence of bytes. Syntax str(<string>) where <string> is any string supported by python.
A sample program that computes factorial can be found at github with the rest of the code; the syntax should become more clear by looking at it. However, you should be aware when using int() and str(); whatever will be inside the parenthesis is going to be evaluated by the python interpreter. Furthermore, int() and str() cannot take an argument with white space. For example, instead of str(“hello world”) you should use str(“hello\x20world”).
First let’s look how one can extend the instruction set by modifying myis.py.
registers = ["eax", "ebx", "ecx", "edx", "esp", "ebp", "esi", "edi"]
instructions = {
"push" : [{"opcode" : 0x00, "format" : "<cc", "params" : ["reg"]}],
"pop" : [{"opcode" : 0x01, "format" : "<cc", "params" : ["reg"]}],
"mov" : [{"opcode" : 0x02, "format" : "<ccI", "params" : ["reg", "imm"]},
{"opcode" : 0x03, "format" : "<ccc", "params" : ["reg", "reg"]},
{"opcode" : 0x04, "format" : "<ccc", "params" : ["reg", "@reg"]},
{"opcode" : 0x05, "format" : "<ccc", "params" : ["@reg", "reg"]},
{"opcode" : 0x06, "format" : "<ccI", "params" : ["reg", "ref"]}
],
...
}
Above you can see an excerpt from myis.py which contains two variables: registers and instructions. The registers variable is a list containing all valid registers that tokenizer will recognize. If the order of registers is changed the compiler will generate different instruction opcodes because it refers to registers by their position in the array. The instruction variable is a map table (dictionary in python terms) where each keyword is a name of an instruction mapping to a list that contains one or more dictionaries. Each inner dictionary has three keys:
- opcode – an opcode number.
- format – an instruction encoding format. See python struct.
- params – a list of parameters (in exact order) the instruction takes. Valid values are: reg, @reg, ref, imm, str. See tokens above for more details.
Sometimes an instruction can have multiple dictionaries as mov does in the above example. This should be only the case when instruction can take different set of arguments; then the compiler will iterate through till it finds the right match. There is also a special instruction defined in the myis.py – emit. The emit takes a string as an argument and emits bytes into the code. This can be useful if you want to make code overlap, reserve space for variables, or just manually inject bytes in the code. When using references as arguments the address of the reference will be translated to an offset relative to the instruction pointer. Simply by modifying myis.py you can introduce new instructions to the compiler; then write the code and compile it with vmc.py.
Once you have the bytecode the final step is to write a vm that will read it and execute the encoded instructions. Among the source files I have included a very basic sample vm that will execute the factorial program. The implementation is pretty straightforward.
Hopefully I have covered everything and didn’t miss any parts of the compiler’s logic. The source code I provide is highly experimental and it should have tons of bugs; nevertheless, it should give a good starting base to those who would like to write their own compiler and a vm.