# Introduction to BSim

As you've reverse engineered software, you've likely asked the following questions:

- Which libraries were statically linked into this executable?
- Does this executable share some code with another executable that I've analyzed?
- What are the differences between version 1 and version 2 of a given executable?
- Does this executable share code with another executable in a large collection of binaries?
- Was this function pulled from an open-source library?

BSim is intended to help with these questions (and others) by providing a way to search collections of binaries for similar, but not necessarily identical, functions.

# How Does BSim Work?

The idea behind BSim is to generate a *feature vector* for each function in a binary.
The vectors are generated by Ghidra's decompiler.
Each feature represents a small piece of data flow and/or control flow of the associated function.
The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features.
Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features.

BSim vectors are compared using *cosine similarity*.
Suppose two functions, ``foo`` and ``bar``, are built from the same or similar source code.
Differences in compilers, target architectures, and/or small changes to the source code typically yield vectors for ``foo`` and ``bar`` that are close but not identical.
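
As a rough illustration of the comparison step, the following Python sketch computes the cosine similarity of two sparse feature vectors. The dictionary representation (feature id mapped to a weight) and all feature ids and weights are made up for this example; BSim's actual feature encoding and weighting are not shown here.

```python
import math
from typing import Mapping


def cosine_similarity(u: Mapping[int, float], v: Mapping[int, float]) -> float:
    """Cosine of the angle between two sparse vectors (feature id -> weight)."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical vectors: foo and bar share two of their three features.
foo_vec = {0x1A2B: 2.0, 0x3C4D: 1.0, 0x5E6F: 1.0}
bar_vec = {0x1A2B: 2.0, 0x3C4D: 1.0, 0x7081: 1.0}
unrelated_vec = {0x9999: 1.0, 0xAAAA: 3.0}

print(cosine_similarity(foo_vec, bar_vec))
print(cosine_similarity(foo_vec, unrelated_vec))
```

Values near 1 indicate that two functions share most of their features; values near 0 indicate that they share few or none.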

BSim vectors can be stored in a dedicated database.
BSim databases intended to hold large[^1] numbers of vectors maintain an index based on *locality-sensitive hashing*.
The index drastically reduces the number of vector comparisons needed and allows for rapid retrieval of results.
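
To illustrate why such an index saves comparisons, here is a minimal Python sketch of one well-known locality-sensitive hashing scheme for cosine similarity, random-hyperplane hashing. This is an assumption chosen for illustration, not a description of BSim's actual index: the dimension, the number of bits, and all names (`make_hyperplanes`, `lsh_key`, `LshIndex`) are hypothetical.

```python
import random
from collections import defaultdict

DIM = 8        # illustrative vector dimension
NUM_BITS = 4   # illustrative number of hyperplanes (bits per bucket key)


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw random hyperplane normals; a fixed seed keeps the keys repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


class LshIndex:
    """Groups vectors into buckets so a query only inspects its own bucket."""

    def __init__(self, hyperplanes):
        self.hyperplanes = hyperplanes
        self.buckets = defaultdict(list)

    def add(self, name, vector):
        self.buckets[lsh_key(vector, self.hyperplanes)].append((name, vector))

    def candidates(self, vector):
        """Return the stored (name, vector) pairs that share the query's bucket."""
        return self.buckets.get(lsh_key(vector, self.hyperplanes), [])


planes = make_hyperplanes(NUM_BITS, DIM)
index = LshIndex(planes)
index.add("foo", [1.0, 2.0, 0.0, 0.5, 0.0, 3.0, 1.0, 0.0])
```

A query is then compared exactly only against the vectors returned by `candidates`, rather than against every stored vector. Practical schemes typically use several hash tables so that near neighbors separated by one unlucky hyperplane are still found.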

[^1]: Creating a database requires a *database template*, which determines the specifics of the index. Currently, Ghidra provides a *medium* template, intended for databases holding up to 10 million unique vectors, and a *large* template, intended for databases holding up to 100 million unique vectors.

Querying a function ``foo`` against a BSim database typically yields a number of potential matches.
Each match can be compared to ``foo`` in a side-by-side view, and certain information (such as the function name) can be quickly copied from a match to ``foo``.

We frequently call the BSim vector of a function its *BSim signature*, or just its *signature* when the context is clear.

# Why "BSim"?

We can think of each feature as representing a small piece of the *behavior* of a function, analogous to a snippet of source code.
Functions whose BSim vectors are close typically have many features in common; that is, they have *similar behavior*.
Hence the name "BSim": **B**ehavioral **Sim**ilarity.

# BSim Clients, BSim Databases, and Ghidra Projects

Using BSim involves the following components:

- A *BSim Client*, i.e., an instance of Ghidra with the BSim plugin enabled.
    - This is where the reverse engineering happens.
- A *BSim Database*, which stores the BSim signatures.
    - Also stores some metadata about each function and its containing executable.
        - In particular, stores the ghidra:// URL of the associated Ghidra program.
    - Does not store disassembly or decompiled functions.
- A *Ghidra Project*, which stores the analyzed programs used to populate the BSim database.
    - Given a BSim match, the BSim client can use the ghidra:// URL to retrieve a program from a Ghidra project for side-by-side comparisons.
    - Note that a single BSim database can reference multiple Ghidra projects.
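
Because the database stores only a ghidra:// URL for each program, a client has to split that URL into a server, a repository, and a path before it can fetch the program. The sketch below shows the idea with Python's standard `urllib.parse`. The URL layout (`ghidra://host[:port]/repository/folder/program`), the example host, port, and names are assumptions for illustration, and `split_ghidra_url` is a hypothetical helper, not part of Ghidra's API.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class ProgramLocation(NamedTuple):
    host: str
    port: Optional[int]
    repository: str
    program_path: str


def split_ghidra_url(url: str) -> ProgramLocation:
    """Split a server-style ghidra:// URL into its parts (assumed layout)."""
    parts = urlsplit(url)
    if parts.scheme != "ghidra" or not parts.hostname:
        raise ValueError(f"not a ghidra:// server URL: {url!r}")
    # Assumption: the first path segment names the repository and the rest
    # is the folder path plus program name inside that repository.
    repository, _, remainder = parts.path.lstrip("/").partition("/")
    if not repository:
        raise ValueError(f"no repository in URL: {url!r}")
    return ProgramLocation(parts.hostname, parts.port, repository, "/" + remainder)


# Hypothetical URL as it might be stored alongside a function's metadata.
location = split_ghidra_url(
    "ghidra://ghidra.example.org:13100/firmware_repo/vendor/libfoo.so"
)
print(location)
```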

# Database Backends

There are three supported database backends for BSim:

1. PostgreSQL

    - The Ghidra distribution includes the source for PostgreSQL, a PostgreSQL plugin for BSim, and a build script.
    - Populated from shared Ghidra projects (i.e., requires a Ghidra server).
    - The database server is not supported on Windows (there is no restriction on clients).

2. Elasticsearch

    - The ``BSimElasticPlugin`` extension contains an Elasticsearch plugin for BSim.
    - This plugin must be installed into an existing Elasticsearch database.
    - Populated from shared Ghidra projects.

3. H2

    - Simplest way to use BSim:
        - Backed by files on the user's machine (no database server needs to be installed).
        - Can be created and populated quickly.
        - Supported on all platforms.
    - Does not support large collections of binaries or multiple users.
    - Can be populated from non-shared (local) or shared Ghidra projects.
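
The constraints listed above can be summarized as a small decision helper. The Python sketch below is only an illustration: `Requirements` and `candidate_backends` are hypothetical names, and since the list states no platform restriction for Elasticsearch, the sketch assumes none.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Requirements:
    shared_project: bool        # programs live in a shared project on a Ghidra server
    server_on_windows: bool     # the database server would have to run on Windows
    large_or_multi_user: bool   # large collection of binaries or multiple users


def candidate_backends(req: Requirements) -> List[str]:
    """Return the backends whose stated constraints fit the requirements."""
    candidates = []
    # PostgreSQL: populated from shared projects; server not supported on Windows.
    if req.shared_project and not req.server_on_windows:
        candidates.append("PostgreSQL")
    # Elasticsearch: populated from shared projects (existing installation assumed).
    if req.shared_project:
        candidates.append("Elasticsearch")
    # H2: local or shared projects, any platform, but not large or multi-user.
    if not req.large_or_multi_user:
        candidates.append("H2")
    return candidates


# A single analyst with a local project on a Windows machine.
print(candidate_backends(Requirements(False, True, False)))
```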

Next Section: [Starting Ghidra and Enabling BSim](BSimTutorial_Enabling.md)