diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_BSim_Command_Line.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_BSim_Command_Line.html deleted file mode 100755 index 5daff89a45..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_BSim_Command_Line.html +++ /dev/null @@ -1,83 +0,0 @@ -
The bsim
command-line utility, located in the support
directory of a Ghidra distribution, is used to create, populate, and manage BSim databases.
-It works for all BSim database backends.
-This utility offers a number of commands, many of which have several options.
-In this section, we cover only a small subset of the possibilities.
Running bsim
with no arguments will print a detailed usage message.
The first step is to create signature files from the binaries in the Ghidra project. -Signature files are XML files which contain the BSim signatures and metadata needed by the BSim server.
- -Important: It’s simplest to exit Ghidra before performing the next steps, because:
-postgres_object_files
project open in Ghidra, signature generation will fail.
-Non-shared projects are locked when open, and the lock will prevent the signature-generating process from accessing the project.
- -To generate the signature files, execute the following commands in a shell (adjust as necessary for Windows).
- -cd <ghidra_install_dir>/support
-mkdir ~/bsim_sigs
-./bsim generatesigs ghidra:/<ghidra_project_dir>/postgres_object_files --bsim file:/<database_dir>/example ~/bsim_sigs
-
-
-The ghidra:/ argument is the local project which holds the analyzed binaries.
-Note that there is only one forward slash in the URL for a local project.
-The --bsim argument is the URL of the BSim database.
-This command does not add any signatures to the database, but it does query the database for its settings.
- -Now, we commit the signatures to the BSim database with the following command (still in the support directory).
./bsim commitsigs file:/<database_dir>/example ~/bsim_sigs
-
-
-Once the signatures have been committed, start Ghidra again.
- -We continue to use the database example
, so this step isn’t necessary for the exercises.
However, if we hadn’t created example
using CreateH2BSimDatabaseScript.java
, we could have used the following command:
./bsim createdatabase file:/<database_dir>/example medium_nosize
-
-medium_nosize is a database template.
-The createdatabase command can also be used to create a BSim database on a PostgreSQL or Elasticsearch server, provided the servers are configured and running.
-See the “BSim” entry in the Ghidra help for details.
- -It’s worth a brief note about Executable Categories and Function Tags, although they are not used in any of the following exercises.
- -A BSim database can record user-defined metadata about an executable (executable categories) or about a function (function tags). -Categories and tags can then be used as filter elements in a BSim query. -For example, you could restrict a BSim query to search only in executables of the category “OPEN_SOURCE” or to functions which have been tagged “COMPRESSION_FUNCTIONS”.
- -Executable categories in BSim are implemented using program properties, and function tags in BSim correspond to function tags in Ghidra. Properties and tags both have uses in Ghidra which are independent of BSim. -So, if we want a BSim database to record a particular category or tag, we must indicate that explicitly.
- -For example, to inform the database that we wish to record the ORIGIN category, we would execute the command:
- -./bsim addexecategory file:/<database_dir>/example ORIGIN
-
-
-Executable categories can be added to a program using the script SetExecutableCategoryScript.java
.
Next Section: Evaluating Matches and Applying Information
diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Basic_Queries.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_Basic_Queries.html deleted file mode 100755 index 331432a81c..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Basic_Queries.html +++ /dev/null @@ -1,200 +0,0 @@ -In this section, we demonstrate some applications of our BSim database.
- -In order to query the database, you must register it with Ghidra:
- -example.mv.db
Before presenting the exercises, we describe the general mechanics of querying a BSim database.
- -There are a number of ways to initiate a BSim query, including:
- -For these cases, the function(s) being queried depend on the current selection.
-If there is no selection, the function containing the current address is queried.
-If there is a selection, all functions whose entry points are within the selection are queried.
-An easy way to query all functions in a program is to select all addresses with Ctrl-A
in the Listing window and then initiate a BSim query.
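The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not Ghidra code: the `Function` record, the `functions_to_query` helper, and the simplification that every function body is a single contiguous address range are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Function:
    name: str
    entry: int  # entry point address
    end: int    # last address of the body, inclusive (assumed contiguous)

def functions_to_query(functions, current_address, selection=None):
    """Return the functions a BSim query would cover.

    selection is a list of inclusive (low, high) address ranges, or None.
    """
    if selection:
        # With a selection: every function whose entry point lies inside it.
        return [f for f in functions
                if any(lo <= f.entry <= hi for lo, hi in selection)]
    # Without a selection: the function containing the current address.
    return [f for f in functions if f.entry <= current_address <= f.end]

# Hypothetical program layout.
program = [
    Function("main", 0x1000, 0x10FF),
    Function("helper", 0x1100, 0x11FF),
    Function("init", 0x1200, 0x12FF),
]

print([f.name for f in functions_to_query(program, 0x1150)])
print([f.name for f in functions_to_query(program, 0x1150, [(0x10F0, 0x1200)])])
```

Note that in the second call `main` is not queried even though part of its body is selected, because its entry point lies outside the selection.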
It is also possible to initiate a BSim query from the Decompiler window. -Simply right-click on a function name token and select BSim… to query the corresponding function. -This action is available on the name token in the decompiled function’s signature as well as tokens corresponding to names of callees.
- -All of these actions bring up the BSim Search Dialog.
- -From the BSim Search Dialog, you can
- -To query a registered BSim database, select that server from the BSim Server drop-down.
- -Similarity and confidence are scores used to evaluate the relationship between two vectors. -The respective fields in the dialog set lower bounds for these values for the matches returned by BSim.
- -Confidence is used to judge the significance of a match. -For example, many executables contain a function which simply returns a constant value. -Given two executables, each with such a function, the similarity score between the corresponding BSim vectors will be 1.0. -However, the confidence score of the match will be quite low, indicating that it is not very significant that the two executables “share” this code.
- -In general, setting the thresholds involves a tradeoff: lower values mean that the database is more likely to return legitimate matches with significant differences, but also more likely to return matches which simply happen to share some features by chance. -The results of a BSim query can be sorted by the similarity and/or confidence of each match, so a common practice is to set the thresholds relatively low and to examine the matches in descending sort order.
- -The Matches per Function bound controls the number of results returned for a single function. -Note that in large collections, certain small or common functions might have substantial numbers of identical matches.
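To make the interaction between the two thresholds and the Matches per Function bound concrete, here is a minimal Python sketch. It is not how the BSim server is implemented. The `Match` record, the `filter_matches` helper, the choice to rank by similarity first, and all scores shown are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Match:
    queried: str       # function being queried
    matched: str       # function found in the database
    similarity: float
    confidence: float

def filter_matches(matches, min_similarity, min_confidence, max_per_function):
    """Apply lower bounds, then keep the best few matches per queried function."""
    kept = {}
    ranked = sorted(matches,
                    key=lambda m: (m.similarity, m.confidence),
                    reverse=True)
    for m in ranked:
        if m.similarity < min_similarity or m.confidence < min_confidence:
            continue
        bucket = kept.setdefault(m.queried, [])
        if len(bucket) < max_per_function:
            bucket.append(m)
    return kept

# Hypothetical results: a trivial "return constant" function matches with
# similarity 1.0 but very low confidence.
results = [
    Match("FUN_00101000", "return_zero", 1.0, 2.0),
    Match("FUN_00102000", "expandargv", 0.91, 48.0),
    Match("FUN_00102000", "buildargv", 0.72, 15.0),
]

for queried, ms in filter_matches(results, 0.7, 10.0, 5).items():
    print(queried, [(m.matched, m.similarity, m.confidence) for m in ms])
```

Raising the confidence bound drops the trivial match while leaving the more significant ones in place, which is the tradeoff the thresholds are meant to control.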
- -Filters are discussed in BSim Filters.
- -Click the Search button in the dialog to perform a query.
- -After successfully issuing a query, you will also see a Search Function(s) action (without the ellipsis) in certain contexts. -This will perform a BSim query on the selected functions using the same parameters as the last query (skipping the BSim Search Dialog).
- -The database example
contains vectors from a Linux executable used by Ghidra’s GNU demangler.
-Ghidra ships with several other versions of this executable.
-We use these different versions to demonstrate some of the capabilities of BSim.
Note: Use the default query settings and autoanalysis options for the exercises unless otherwise specified.
- -<ghidra_install_dir>/GPL/DemanglerGnu/os/win_x86_64/demangler_gnu_v2_41.exe
.
- demangler_gnu_v2_41
but compiled with Visual Studio instead of GCC.demangler_gnu_v2_41
.example
for matches to the function at 140006760
.Note: We cover the Decompiler Diff View in greater detail and discuss the various “Apply” actions in Evaluating Matches and Applying Information.
- -<ghidra_install_dir>/GPL/DemanglerGnu/os/linux_x86_64/demangler_gnu_v2_24
.
- example
.expandargv
in demangler_gnu_v2_24
and issue a BSim query.<ghidra_install_dir>/GPL/DemanglerGnu/src/demangler_gnu_v2_24/c/argv.c
<ghidra_install_dir>/GPL/DemanglerGnu/src/demangler_gnu_v2_41/c/argv.c
<ghidra_install_dir>/GPL/DemanglerGnu/os/mac_arm_64/demangler_gnu_v2_41
.
- example
but compiled for a different architecture._expandargv
and issue a BSim query.
-In the decompiler diff view of the single match, what differences do you see regarding memmove
and memcpy
?
- Q: If you set the similarity and confidence thresholds to 0.0, will a BSim query return all of the functions in the database?
- -A: No, because
-Next Section: Ghidra from the Command Line
- diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Creating_Database_From_GUI.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_Creating_Database_From_GUI.html deleted file mode 100755 index bbd0342634..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Creating_Database_From_GUI.html +++ /dev/null @@ -1,38 +0,0 @@ -This section explains how to create and populate an H2-backed BSim database from the Ghidra GUI.
- -To create a BSim database, first create a directory on your file system to contain the database.
- -Next, perform the following steps from the Ghidra Code Browser:
- -CreateH2BSimDatabaseScript.java
.We now populate the database with an executable which is contained in the Ghidra distribution.
- -<ghidra_install_dir>/GPL/DemanglerGnu/os/linux_x86_64/demangler_gnu_v2_41
using the default analysis options.AddProgramToH2BSimDatabaseScript.java
on this program.
- example.mv.db
in the database directory.Next Section: Basic BSim Queries
- diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Enabling.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_Enabling.html deleted file mode 100755 index 2504edf8aa..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Enabling.html +++ /dev/null @@ -1,22 +0,0 @@ -To begin the tutorial, perform the following steps:
- -To enable BSim, perform the following steps:
- -Configure
link of the BSim
entry.BSimSearchPlugin
is checked.Next Section: Creating and Populating a BSim Database from the GUI
- diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Evaluating_Matches.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_Evaluating_Matches.html deleted file mode 100755 index be7a5a6f1c..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Evaluating_Matches.html +++ /dev/null @@ -1,135 +0,0 @@ -Summarizing what we’ve created over the last few sections, we now have:
-postgres
).We now demonstrate using BSim to help reverse engineer postgres
.
-While doing this, we’ll showcase some of the features available in the decompiler diff view.
Import and analyze the stripped postgres
executable into the tutorial project, then perform the following steps:
postgres
via Ctrl-A
in the Listing.example
.
- grouping_planner
as the matching function.
-The corresponding function in postgres
should have a default name.double
argument differ between the functions?
- For matches with a fair number of differences, the decompiler diff panel can get pretty colorful. -Furthermore, as you click around, tokens will gain and lose highlights of various colors. -It’s worth giving a brief explanation of when highlighting happens and what the different colors mean. -Some terminology: if you click on a token in a decompiler panel, that token becomes the focused token.
- -The colors:
- -By default, scrolling in the diff window is synchronized. -This means that scrolling within one window will also scroll within the other window. -In the decompiler diff window, scrolling works by matching one line in the left function with one line in the right function. -The two functions are aligned using those lines. -Initially, the functions are aligned using the functions’ signatures.
- -As you click around in either function, the “aligning lines” will change. -If the focused token has a match, the scrolling is re-centered based on the lines containing the matched tokens. -If the focused token does not have a match, the functions will be aligned using the closest token to the focused token which does have a match.
- -Synchronized scrolling can be toggled using the and
icons in the toolbar.
If you are satisfied with a given match, you might want to apply information about the matching function to the queried function. -For example, you might want to apply the name or signature of the function. -There are some subtleties which determine how much information is safe to apply. -Hence there are three actions available under the Apply From Other menu when you right-click in the left panel:
- -Warning: You should be absolutely certain that the datatypes are exactly the same before applying signatures and data types. -If there have been any changes to a datatype’s definition, you could end up bringing incorrect datatypes into a program, even using BSim matches with 1.0 similarity. -Applying full data types is also problematic for cross-architecture matches.
- -There are similarly-named actions available on rows of the Function Matches table in the BSim Search Results window. -The Status column contains information about which rows have had their matches applied.
- -The token matching algorithm matches a function call in one program to a function call in another by considering the data flow into and out of the CALL
instruction, but it does not do anything with the bodies of the callees.
-However, given a matched pair of calls, you can bring up a new comparison window for the callees with the Compare Matching Callees action.
Ctrl-F
.FUN_
and search for matched function calls where the callee in the left window has a default name and the callee in the right window has a non-default name.The function shown in a panel is controlled by a drop-down menu at the top of the panel. -This can be useful when you’d like to evaluate multiple matches to a single function.
- -Exercise:
- -postgres
, each of which has exactly two matches.
-Select the corresponding four rows in the matches table and perform the Compare Functions action.In the next section, we discuss the Executable Results table.
- -Next Section: From Matching Functions to Matching Executables
-Having debug information isn’t necessary to use BSim (as we’ve seen in a previous exercise), but it is convenient. Note that applying debug information can change BSim signatures, which can negatively impact matching between functions with debug information and functions without it. ↩
-In this section, we discuss the Executable Results table. -Each row of this table corresponds to one executable in the database. -The information in one row is an aggregation of all of the function-level matches into that row’s executable. -Your Executable Results table from the previous query should look similar to the following:
- -If you select a single row in the table and right-click on it, you will see the following actions:
- -foo
has 2 or more matches into a given executable, it still only contributes 1 to the function count).
-What position is demangler_gnu_v2_41
?
- foo
has more than one match into a given executable, only the one with the highest (function-level) confidence contributes to the (executable-level) confidence score.
-Sort the Executable results by descending confidence and observe that demangler_gnu_v2_41
is now much further down the list.
- demangler_gnu_v2_41
and apply the filter action.
-Sort the filtered function matches by descending confidence.
-Starting at the top, examine some of the matches and convince yourself that the given explanation is correct.
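The aggregation rules described in this exercise (a queried function counts once per executable, and only its highest-confidence match contributes to the executable-level confidence) can be sketched as follows. This is an illustrative Python sketch, not BSim's code. In particular, combining the per-function best confidences by summing them is an assumption, as are the helper name and the sample numbers.

```python
from collections import defaultdict

def aggregate_by_executable(matches):
    """matches: iterable of (queried_function, executable, confidence)."""
    # executable -> {queried_function: best confidence seen so far}
    best = defaultdict(dict)
    for func, exe, conf in matches:
        prev = best[exe].get(func)
        if prev is None or conf > prev:
            best[exe][func] = conf
    return {
        exe: {
            # Each queried function counts once, however many matches it has.
            "function_count": len(funcs),
            # Assumption: executable-level confidence is the sum of the bests.
            "confidence": sum(funcs.values()),
        }
        for exe, funcs in best.items()
    }

# Hypothetical function-level matches.
function_matches = [
    ("foo", "demangler_gnu_v2_41", 10.0),
    ("foo", "demangler_gnu_v2_41", 25.0),  # second match for foo, same executable
    ("bar", "demangler_gnu_v2_41", 5.0),
    ("foo", "other.o", 3.0),
]

for exe, row in aggregate_by_executable(function_matches).items():
    print(exe, row)
```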
- From this exercise, we see that unrelated functions can be duplicates of each other, either because they are small or because they perform a common generic action. -Keep in mind that such functions can “pollute” the results of a blanket query. -In the next section, we demonstrate a technique to restrict queries to functions which are more likely to have meaningful matches.
- -Next Section: Overview Queries
diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Filters.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_Filters.html deleted file mode 100755 index 282901dfcc..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Filters.html +++ /dev/null @@ -1,21 +0,0 @@ -There are a number of filters that can be applied to BSim queries, involving names, architectures, compilers, ingest dates, user-defined executable categories, and other attributes.
- -Filters can be applied server-side or client-side.
-Server-side filters affect the query results sent to Ghidra from a BSim server and can be applied using the Filters drop-down in the BSim Search dialog.
-Client-side filters apply to the BSim Search results table and can be added and removed at will using the Filter Results icon.
-However, to “undo” a server-side filter, you have to issue another BSim query without the filter.
postgres
and bring up the BSim Search dialog.demangler_gnu_v2_41
as the name to exclude.demangler_gnu_v2_41
is not in the list of executables with matches.Next Section: Scripting and Visualization
diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Ghidra_Command_Line.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_Ghidra_Command_Line.html deleted file mode 100755 index ca7ef85e74..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Ghidra_Command_Line.html +++ /dev/null @@ -1,53 +0,0 @@ -For the remaining exercises, we need to populate our BSim database with a number of binaries. -We’d like a consistent set of binaries for the tutorial, but we don’t want to clutter the Ghidra distribution with dozens of additional executables. -Fortunately, the BSim plugin includes a script for building the PostgreSQL backend, and that build process creates hundreds of object files. -So we can just build PostgreSQL and harvest the object files we need.
- -Note: For the tutorial, we continue to use the H2 BSim backend. -We do not run any PostgreSQL code, we simply analyze some files produced when building PostgreSQL.
- -Note that these files must be built on a machine running Linux. -Windows users can build these files in a Linux virtual machine.
- -To build the files, execute the following commands in a shell: 1
- -cd <ghidra_install_dir>/Features/BSim
-export CFLAGS="-O2 -g"
-./make-postgres.sh
-mkdir ~/postgres_object_files
-cd build
-find . -name 'p*o' -size +100000c -size -700000c -exec cp {} ~/postgres_object_files/ \;
-cd os/linux_x86_64/postgresql/bin
-strip -s postgres
-
-
-To continue on Windows, transfer the ~/postgres_object_files
directory and the stripped postgres
executable to your Windows machine.
Now that we have the executables, we can analyze them with the headless analyzer2. -The headless analyzer is distinct from BSim, but using it is the only feasible way to analyze substantial numbers of binaries.
- -To analyze the files in Linux, execute the following commands in a shell.
- -cd <ghidra_install_dir>/support
-./analyzeHeadless <ghidra_project_dir> postgres_object_files -import ~/postgres_object_files/*
-
-(On Windows, use analyzeHeadless.bat
and adjust paths accordingly.)
This will create a local Ghidra project called postgres_object_files
in the directory <ghidra_project_dir>
.
Next Section: BSim from the Command Line
- -You may need to install additional packages and/or change some build options in order for PostgreSQL to build successfully. The error messages are generally informative. See the comments in make-postgres.sh
. ↩
The headless analyzer has its own documentation: <ghidra_install_dir>/support/analyzeHeadlessREADME.html
. ↩
As you’ve reverse engineered software, you’ve likely asked the following questions:
- -BSim is intended to help with these questions (and others) by providing a way to search collections of binaries for similar, but not necessarily identical, functions.
- -The idea behind BSim is to generate a feature vector for each function in a binary. -The vectors are generated by Ghidra’s decompiler. -Each feature represents a small piece of data flow and/or control flow of the associated function. -The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features. -Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features.
- -BSim vectors are compared using cosine similarity.
-Discrepancies between the vectors for foo
and bar
which are caused by differences in compilers, target architectures, and/or small changes to the source code typically result in vectors which are close but not identical.
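As a rough illustration of the comparison described above, the following Python sketch computes the cosine similarity of two sparse feature vectors. The dictionary representation, the integer keys standing in for feature hashes, and the weights are assumptions made for this example. They are not meant to reflect how BSim stores or weights features internally.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Hypothetical vectors: foo and bar share two of their three features.
foo = {0x1A2B: 1.0, 0x3C4D: 2.0, 0x5E6F: 1.0}
bar = {0x1A2B: 1.0, 0x3C4D: 2.0, 0x7A8B: 1.0}

print(cosine_similarity(foo, foo))
print(cosine_similarity(foo, bar))
```

Two vectors that differ in only a few features score close to, but below, 1.0, which matches the "close but not identical" behavior described in the text.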
BSim vectors can be stored in a dedicated database. -BSim databases intended to hold large1 numbers of vectors maintain an index based on locality-sensitive hashing. -The index drastically reduces the number of vector comparisons needed and allows for rapid retrieval of results.
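To give a feel for why a locality-sensitive hashing index reduces the number of comparisons, here is a small Python sketch of one generic LSH scheme for cosine similarity (random hyperplanes). It is not a description of BSim's actual index: the scheme, the dense-list vectors, and every function name below are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes in `dim` dimensions (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(x * h for x, h in zip(vector, plane)) >= 0.0
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Bucket named vectors by their LSH key."""
    index = defaultdict(list)
    for name, vec in vectors.items():
        index[lsh_key(vec, hyperplanes)].append(name)
    return index

def candidates(index, query, hyperplanes):
    """Only vectors in the query's bucket need a full similarity comparison."""
    return index.get(lsh_key(query, hyperplanes), [])

# Hypothetical 4-dimensional vectors.
planes = make_hyperplanes(dim=4, bits=8, seed=1)
stored = {"foo": [1.0, -2.0, 0.5, 3.0], "bar": [-1.0, 2.0, -0.5, -3.0]}
index = build_index(stored, planes)
print(candidates(index, [2.0, -4.0, 1.0, 6.0], planes))
```

Vectors pointing in similar directions tend to land in the same bucket, so a query is compared against a small candidate set rather than every stored vector. Practical indexes typically use several such tables to reduce the chance of missing a near neighbor.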
- -Querying foo
against a BSim database typically yields a number of potential matches.
-Each individual match for foo
can be compared to foo
in a side-by-side view, and certain information (such as function name) can be quickly copied from a match to foo
.
We frequently call BSim vectors the BSim signature of a function, or just the signature when the context is clear.
- -We can think of each feature as representing a small piece of the behavior of a function, analogous to a snippet of source code. -Functions whose BSim vectors are close typically have many features in common, that is, they have similar behavior. -Hence the name “BSim”: Behavioral Similarity.
- -Using BSim involves the following components:
- -There are three supported database backends for BSim:
- -PostgreSQL
- -Elasticsearch
- -BSimElasticPlugin
extension contains an Elasticsearch plugin for BSim.H2
- -Next Section: Starting Ghidra and Enabling BSim
- -Creating a database requires a database template, which determines the specifics of the index. Currently, Ghidra provides a medium template, intended for databases holding up to 10 million unique vectors, and a large template, intended for databases holding up to 100 million unique vectors. ↩
-An Overview Query queries a BSim database for the number of matches to each function in an executable. -The matching functions themselves are not returned. -Similarity and Confidence thresholds can be set for an Overview Query, but there is no “Matches per Function” bound and no filters can be applied.
- -To perform an Overview Query, select BSim -> Perform Overview… from the Code Browser.
- -postgres
using the default query thresholds.
-You should see the following result:
-Using the hit count column, it is possible to exclude functions with large numbers of matches.
- -demangler_gnu_v2_41
is far down the list.Suppose foo
and bar
have the same number of hits in the Overview table.
-There are two possibilities:
foo
and bar
have distinct feature vectors which happen to have the same number of matches.foo
and bar
have the same feature vector.An optional column, Vector Hash, can be used to distinguish between these two cases.
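The way a vector hash separates the two cases can be sketched in Python. The row layout, the `group_by_vector_hash` helper, and the hash values are hypothetical and only mirror the idea of the Vector Hash column.

```python
from collections import defaultdict

def group_by_vector_hash(rows):
    """rows: iterable of (function_name, hit_count, vector_hash).

    Returns {hit_count: {vector_hash: [function names]}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for name, hits, vec_hash in rows:
        groups[hits][vec_hash].append(name)
    return groups

# Hypothetical Overview rows: three functions with the same hit count.
overview = [
    ("foo", 12, 0xAAAA),
    ("bar", 12, 0xAAAA),  # same hash as foo: same feature vector
    ("baz", 12, 0xBBBB),  # same hit count, different feature vector
]

for hits, by_hash in group_by_vector_hash(overview).items():
    for vec_hash, names in by_hash.items():
        print(hits, hex(vec_hash), names)
```

Functions listed under the same hash share a feature vector. Functions with equal hit counts but different hashes merely happen to have the same number of matches.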
- -Shift-C
or right-click and perform the Compare Selected Functions action.Next Section: Queries and Filters
diff --git a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Scripting.html b/GhidraDocs/GhidraClass/BSim/BSimTutorial_Scripting.html deleted file mode 100755 index c2d3c4c2d1..0000000000 --- a/GhidraDocs/GhidraClass/BSim/BSimTutorial_Scripting.html +++ /dev/null @@ -1,23 +0,0 @@ -Finally, we briefly mention a few other topics related to BSim.
- -There are a number of example scripts in the BSim
script category, which demonstrate how to interact with BSim programmatically.
Finally, if you’d like to see the particular BSim features in a function, you can use the BSim Feature Visualizer. -This plugin allows you to highlight regions of the decompiled code corresponding to a particular feature and to display a graph representing the feature.
- -To use this plugin, first enable the BSimFeatureVisualizerPlugin
via File -> Configure from the Code Browser.
-You can then bring it up via BSim -> BSim Feature Visualizer.
This is the end of the tutorial.
- - diff --git a/GhidraDocs/GhidraClass/BSim/README.html b/GhidraDocs/GhidraClass/BSim/README.html deleted file mode 100755 index e6b4c0082f..0000000000 --- a/GhidraDocs/GhidraClass/BSim/README.html +++ /dev/null @@ -1,24 +0,0 @@ -BSim is a Ghidra plugin for finding structurally similar functions in (potentially large) collections of binaries. -It is based on Ghidra’s decompiler and can find matches across compilers, architectures, and/or small changes to source code.
- -This tutorial demonstrates how to create a small BSim database and walks through some typical use cases.
- -Detailed information about BSim can be found in the “BSim” entry of the Ghidra Help.
- -Next Section: Introduction to BSim
diff --git a/GhidraDocs/build.gradle b/GhidraDocs/build.gradle index 97cded2550..076d90275e 100644 --- a/GhidraDocs/build.gradle +++ b/GhidraDocs/build.gradle @@ -55,4 +55,8 @@ rootProject.assembleMarkdownToHtml { from ("${this.projectDir}/InstallationGuide.md") { into "docs" } + from ("${this.projectDir}/GhidraClass/BSim") { + include "*.md" + into "docs/GhidraClass/BSim" + } }