# Ghidra BSim plugin for Dalvik, and exporting to MISP and BSimVis
Ghidra supports Dalvik decompilation and lifts Dalvik bytecode to its P-code intermediate representation, which makes it possible to build BSim feature vectors.
The APK archive has to be unpacked before the files are analyzed: the batch import of an APK archive does not seem to detect the .dex files inside automatically.
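
Since an APK is a regular ZIP archive, the unpacking step can be scripted. The sketch below is one possible way to do it with Python's standard `zipfile` module; the helper name `extract_dex_files` and the choice to keep only top-level `classes*.dex` entries are assumptions made for illustration, not part of Ghidra or BSimVis.

```python
import zipfile
from pathlib import Path


def extract_dex_files(apk_path, out_dir):
    """Extract every top-level classes*.dex entry of an APK into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # An APK is a ZIP archive, so the standard zipfile module can read it.
    with zipfile.ZipFile(apk_path) as apk:
        for name in sorted(apk.namelist()):
            # Multidex APKs ship classes.dex, classes2.dex, classes3.dex, ...
            is_top_level = "/" not in name
            if is_top_level and name.startswith("classes") and name.endswith(".dex"):
                target = out_dir / name
                target.write_bytes(apk.read(name))
                extracted.append(target)
    return extracted
```

The returned paths can then be imported into Ghidra one by one, for example `extract_dex_files("Andr.PegasusB.apk", "unpacked/")`.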

Here are the feature vector and the P-code intermediate representation of a `classes.dex` from a Pegasus APK malware sample.

The Ghidra analyzers run for quite some time because of the Java boilerplate code and the size of this malware (15 min of analysis for one 7 MB .dex file holding 19,676 functions).
## 1. Misp-ghidra
I recommend selecting a subset of the functions because of upload limits (definitely an issue to be fixed), or raising the upload limit in MISP.


## 2. BSimVis
Same thing for BSimVis, upload the files using:
```bash
uv run bsimvis upload ./classes.dex ./classes_variations.dex
```
Here are similar Java functions matching:

### Comparing 4 APK Pegasus / Chrysaor samples
I analyzed 4 Pegasus samples found in different sources (Malware-bazaar and TheZoo). One did not correlate with the others.
The two main blobs of functions (blue and orange) are the correlation between two Pegasus samples found in 2021 (one comes from Malware-bazaar and one from the TheZoo repo).

The third one is a Pegasus sample from 2023. The functions matching between the 2021 and 2023 samples are cryptographic (XOR and base64) code in the `seC.dujmehn` package (these functions are reported in https://malware.news/t/mobile-malware-analysis-part-3-pegasus/75176).
One issue with Java and Android APIs could be the extensive use of strings, which are abstracted away in our function similarity. There is a lot of noise when comparing functions that use the same coding pattern (API requests, string building).
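
To make the noise problem concrete, here is a toy model of what happens once constants are abstracted away. It is only a sketch: real BSim features are hashes derived from the P-code data flow and control flow, whereas the `features` and `cosine` helpers and the two example "functions" below are hypothetical and only mimic the idea of dropping string values before comparing vectors.

```python
import math
from collections import Counter


def features(ops):
    """Toy feature extraction: keep the operation kinds, drop the operands."""
    # Each op is (kind, operand). String constants and call targets are
    # discarded, so only the shape of the code contributes to the vector.
    return Counter(kind for kind, _operand in ops)


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values())
    norm_b = sum(v * v for v in b.values())
    norm = math.sqrt(norm_a * norm_b)
    return dot / norm if norm else 0.0


# Two unrelated functions that both build a string with the same pattern.
build_url = [("const_str", "https://"), ("call", "StringBuilder.append"),
             ("const_str", "/api"), ("call", "StringBuilder.append"),
             ("call", "toString"), ("return", None)]
build_log = [("const_str", "user="), ("call", "StringBuilder.append"),
             ("const_str", ";"), ("call", "StringBuilder.append"),
             ("call", "toString"), ("return", None)]

print(cosine(features(build_url), features(build_log)))
```

In this model the two functions get identical vectors even though they do unrelated things, which is the kind of false match that string-heavy Java code tends to produce.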
### Code reuse in the same binary
BSimVis allows you to look at cross-binary similarity, but also at code similarities inside a single binary.
Looking at internal code reuse in the TheZoo Pegasus sample:
`8d4b77fa3546149f25bd17357d41fbf0 Andr.PegasusB.apk`

We see that, again in the `LseC.dujmehn.Cutyq` package, there is a lot of code reuse for base64 string manipulation. Using one of these functions as the similarity filter, we find 1448 functions that share this code pattern (with different string keys) in this binary alone.
# Experiments
## 1. Comparing C to Java in BSimVis?
I also tried comparing Java and C functions to see whether any correlations show up. I asked Gemini to build a Java and a C program with functions that have similar data flow, with a bit of complexity to get interesting vectors to compare.
### Some results
In the first test, the result is not that conclusive (28% similarity), because a switch statement does not have the same data flow in compiled C and in Java Dalvik.

However, changing the switch to if/else statements does bring the similarity up to 61%.
Here is an x86 ELF, an EXE and a Java Dalvik function matching:

### Test on a cryptographic function (Feistel and S-boxes)
Here I asked Gemini to translate a cryptographic function from C to Java, which is something a malware developer could do. The similarity falls to 19.0% for these functions:
#### C cryptographic function
```C
#include <stdint.h>

// `sbox` is a 256-entry substitution table defined elsewhere in the program (not shown).
uint32_t sim_target(uint32_t l, uint32_t r, uint32_t k0, uint32_t k1, uint32_t k2, uint32_t k3) {
    uint32_t subkeys[32];
    uint32_t seed = k0 ^ k1 ^ k2 ^ k3 ^ 0x12345678;
    // Stage 1: Key Expansion (LFSR + Mixed Arithmetic)
    for (int i = 0; i < 32; i++) {
        seed = (seed >> 1) ^ (-(seed & 1u) & 0xD0000001u); // 32-bit LFSR
        // Note: when i % 31 == 0 the right shift below is by 32 bits,
        // which is undefined behavior for a uint32_t in C.
        uint32_t mix = (k0 << (i % 31)) | (k1 >> (32 - (i % 31)));
        subkeys[i] = seed ^ mix ^ (k2 + k3);
    }
    // Stage 2: 32-Round Feistel Network
    for (int i = 0; i < 32; i++) {
        uint32_t temp = r;
        // Complex Round Function F
        uint32_t f = (r << 4) ^ (r >> 5);
        f += subkeys[i];
        f ^= (subkeys[31 - i] >> 3);
        // Substitution Layer (S-Box)
        f = ((uint32_t)sbox[f & 0xFF]) |
            ((uint32_t)sbox[(f >> 8) & 0xFF] << 8) |
            ((uint32_t)sbox[(f >> 16) & 0xFF] << 16) |
            ((uint32_t)sbox[(f >> 24) & 0xFF] << 24);
        // Conditional data-dependent manipulation
        if (f % 2 == 0) {
            f = (f << 1) | (f >> 31);
        } else {
            f ^= 0xAAAAAAAA;
        }
        r = l ^ f;
        l = temp;
    }
    // Final Mixing
    return l ^ r ^ subkeys[0] ^ subkeys[31];
}
```
#### Java translation
```java
public static int simTarget(int l, int r, int k0, int k1, int k2, int k3) {
    int[] subkeys = new int[32];
    int seed = k0 ^ k1 ^ k2 ^ k3 ^ 0x12345678;
    // Stage 1: Key Expansion
    for (int i = 0; i < 32; i++) {
        // Unsigned right shift for LFSR to match uint32_t behavior
        int bit = seed & 1;
        seed = (seed >>> 1) ^ (bit != 0 ? 0xD0000001 : 0);
        int mix = (k0 << (i % 31)) | (k1 >>> (32 - (i % 31)));
        subkeys[i] = seed ^ mix ^ (k2 + k3);
    }
    // Stage 2: 32-Round Feistel Network
    for (int i = 0; i < 32; i++) {
        int temp = r;
        // Complex Round Function F
        int f = (r << 4) ^ (r >>> 5);
        f += subkeys[i];
        f ^= (subkeys[31 - i] >>> 3);
        // Substitution Layer
        f = (SBOX[f & 0xFF]) |
            (SBOX[(f >>> 8) & 0xFF] << 8) |
            (SBOX[(f >>> 16) & 0xFF] << 16) |
            (SBOX[(f >>> 24) & 0xFF] << 24);
        // Conditional data-dependent manipulation
        // Important: In Java, % 2 on negative numbers can be -1.
        // But for unsigned-like behavior, checking the last bit is better.
        if ((f & 1) == 0) {
            f = (f << 1) | (f >>> 31);
        } else {
            f ^= 0xAAAAAAAA;
        }
        r = l ^ f;
        l = temp;
    }
    return l ^ r ^ subkeys[0] ^ subkeys[31];
}
```
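
The Java version has to emulate `uint32_t` with a signed `int`, which is why every right shift became `>>>`. The small Python model below sketches the two Java shift operators on 32-bit values to show the difference; the helper names `to_int32`, `java_shr` and `java_ushr` are made up for this illustration. It also models the fact that Java masks an `int` shift distance to its low five bits, so `k1 >>> 32` leaves `k1` unchanged.

```python
MASK32 = 0xFFFFFFFF


def to_int32(x):
    """Wrap an arbitrary Python int to a signed 32-bit value, like a Java int."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def java_shr(x, n):
    """Model of Java `x >> n` on int: arithmetic shift, sign bit is replicated."""
    return to_int32(x) >> (n & 31)


def java_ushr(x, n):
    """Model of Java `x >>> n` on int: logical shift, zeros are shifted in."""
    return to_int32((x & MASK32) >> (n & 31))


seed = 0x80000000  # top bit set, so it is negative as a Java int
print(hex(java_shr(seed, 1) & MASK32))   # arithmetic shift keeps the sign bit
print(hex(java_ushr(seed, 1) & MASK32))  # logical shift matches uint32_t
print(hex(java_ushr(0x12345678, 32)))    # distance 32 is masked to 0
```

Only the logical shift reproduces what the C code does on a `uint32_t`, so a translation that used `>>` instead would compute different values.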
## 2. What about C# CIL .NET?
Out of curiosity, I checked whether the same applies to the C# CIL language, but Ghidra does not support it at all, and there doesn't seem to be any community plugin for CIL either.