mirror of
https://github.com/NationalSecurityAgency/ghidra.git
synced 2025-10-04 02:09:44 +02:00
Candidate release of source code.
This commit is contained in:
parent
db81e6b3b0
commit
79d8f164f8
12449 changed files with 2800756 additions and 16 deletions
Ghidra/Features/Decompiler/src/decompile/cpp/docmain.hh
/* ###
 * IP: GHIDRA
 * REVIEWED: YES
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
/** \mainpage Decompiler Analysis Engine
|
||||
|
||||
\section toc Table of Contents

  - \ref overview
  - \ref capabilities
  - \ref design
  - \ref workflow
  - \ref ghidraimpl
  - \subpage sleigh
  - \subpage coreclasses
  - \subpage termrewriting
|
||||
|
||||
\section overview Overview

  Welcome to the \b Decompiler \b Analysis \b Engine. It is a
  complete library for performing automated data-flow analysis
  on software, starting from the binary executable. This
  documentation is geared toward understanding the source code.
  It starts with a brief discussion of the library's capabilities
  and then moves directly into the design of the decompiler and
  the main code workflow.

  The library provides its own Register
  Transfer Language (RTL), referred to internally as \b p-code,
  which is designed specifically for reverse engineering
  applications. The disassembly of processor-specific machine-code
  languages, and subsequent translation into \b p-code, forms
  a major sub-system of the decompiler. There is a processor
  specification language, referred to as \b SLEIGH, which is
  dedicated to this translation task, and there is a corresponding
  section in the documentation for the classes and methods used
  to implement this language in the library (see \subpage sleigh).
  This piece of the code can be built as a standalone binary
  translation library, for use by other applications.

  To get up to speed quickly on the details of the source
  and the decompiler's main data structures, see the
  documentation page describing the core classes and methods
  (\ref coreclasses).

  Finally, there is a documentation page summarizing the
  simplification rules used in the core decompiler analysis
  (\ref termrewriting).
|
||||
|
||||
\section capabilities Capabilities
|
||||
|
||||
\section design Design

  The main design elements of the decompiler come straight
  from standard \e Compiler \e Theory data structures and
  algorithms. This should come as no surprise, as both
  compilers and decompilers are concerned with translating
  from one coding language to another. They both follow a
  general workflow:

  - Parse/tokenize input language.
  - Build abstract syntax trees in an intermediate language.
  - Manipulate/optimize syntax trees.
  - Map intermediate language to output language constructs.
  - Emit final output language encoding.

  With direct analogs to (forward engineering) compilers, the
  decompiler uses:

  - A Register Transfer Language (RTL) referred to as \b p-code.
  - Static Single Assignment (SSA) form.
  - Basic blocks and Control Flow Graphs.
  - Term rewriting rules.
  - Dead code elimination.
  - Symbol tables and scopes.

  Despite these similarities, the differences between a
  decompiler and a compiler are substantial and run throughout
  the entire process. These all stem from the fact that, in
  general, descriptive elements and the higher-level
  organization of a piece of code can only be explicitly
  expressed in a high-level language. So the decompiler,
  working with a low-level language as input, can only infer
  this information.

  The features mentioned above all have a decompiler-specific
  slant to them, and there are other tasks that the decompiler
  must perform that have no real analog in a compiler.
  These include:

  - Variable merging (vaguely related to register coloring)
  - Type propagation
  - Control flow structuring
  - Function prototype recovery
  - Expression recovery
|
||||
|
||||
\section workflow Main Workflow

  Here is an outline of the decompiler workflow.

  -# \ref step0
  -# \ref step1
  -# \ref step2
  -# \ref step3
  -# \ref step4
  -# \ref step5
     - \ref step5a
     - Adjust p-code in special situations.
     - \ref step5b
     - \ref step5c
     - \ref step5d
     - \ref step5e
     - \ref step5f
  -# \ref step6
  -# \ref step7
  -# \ref step8
  -# \ref step9
  -# \ref step10
  -# \ref step11
  -# \ref step12
  -# \ref step13
  -# \ref step14
|
||||
|
||||
\subsection step0 Specify Entry Point

  The user specifies a starting address for a particular function.

\subsection step1 Generate Raw P-code

  The p-code generation engine is called \b SLEIGH. Based on a
  processor specification file, it maps binary encoded
  machine instructions to sequences of p-code operations.
  P-code operations are generated for a single machine
  instruction at a specific address. The control flow
  through these p-code operations is followed to determine
  if control falls through, or if there are jumps or calls.
  A work list of new instruction addresses is kept and is
  continually revisited until there are no new instructions.
  After the control flow is traced, additional changes may
  be made to the p-code.

  -# Position-independent code (PIC) constructions are checked
     for, now that the extent of the function is known. If a
     call is to a location that is still within the function,
     the call is changed to a jump.
  -# Functions which are marked as inlined are filled in
     at this point, before basic blocks are generated.
     P-code for the inlined function is generated
     separately and control flow is carefully set up to
     link it in properly.
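
  The work-list tracing described above can be sketched as follows.
  This is only an illustration under simplifying assumptions, not the
  decompiler's own flow-following code: \c ToyInstr, \c FlowKind, and
  \c traceFlow are hypothetical names, and each instruction is reduced
  to a length, a flow kind, and an optional destination.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified model of one decoded machine instruction.
enum class FlowKind { FallThrough, Jump, CondJump, Call, Return };

struct ToyInstr {
  uint32_t length;  // size of the instruction in bytes
  FlowKind kind;    // how control leaves the instruction
  uint32_t target;  // jump or call destination (ignored for other kinds)
};

// Follow control flow from entry and return every instruction address reached.
// Call targets are not followed, since they belong to other functions.
std::set<uint32_t> traceFlow(const std::map<uint32_t, ToyInstr> &program, uint32_t entry) {
  std::set<uint32_t> visited;
  std::vector<uint32_t> worklist{entry};
  while (!worklist.empty()) {
    uint32_t addr = worklist.back();
    worklist.pop_back();
    auto iter = program.find(addr);
    if (iter == program.end()) continue;         // nothing decoded at this address
    if (!visited.insert(addr).second) continue;  // already processed
    const ToyInstr &ins = iter->second;
    assert(ins.length > 0);
    bool fallsThrough = (ins.kind != FlowKind::Jump && ins.kind != FlowKind::Return);
    bool branches = (ins.kind == FlowKind::Jump || ins.kind == FlowKind::CondJump);
    if (fallsThrough) worklist.push_back(addr + ins.length);
    if (branches) worklist.push_back(ins.target);
  }
  return visited;
}
```

  The loop ends when the work list is empty, which corresponds to the
  point where no new instruction addresses are discovered.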
|
||||
|
||||
\subsection step2 Generate Basic Blocks and the CFG

  Basic blocks are generated on the p-code instructions
  (\e not the machine instructions) and a control flow graph
  of these basic blocks is generated. Control flow is
  normalized so that there is always a unique start block
  with no other blocks falling into it. In the case of
  subroutines which have branches back to their very first
  machine instruction, this requires the creation of an
  empty placeholder start block that flows immediately into
  the block containing the p-code for the first instruction.
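
  As a rough sketch of the idea, the code below finds the ops that start
  a basic block and detects when a placeholder start block is needed.
  The names \c ToyOp, \c findLeaders, and \c needsPlaceholderStart are
  hypothetical, and ops are reduced to "branches to an index or not";
  returns and indirect branches are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for a p-code op: only its control-flow effect is kept.
struct ToyOp {
  bool isBranch;       // true if the op may transfer control to target
  std::size_t target;  // index of the destination op (used only if isBranch)
};

// Return the indices of the ops that start a basic block (the "leaders").
std::set<std::size_t> findLeaders(const std::vector<ToyOp> &ops) {
  std::set<std::size_t> leaders;
  if (ops.empty()) return leaders;
  leaders.insert(0);  // the entry op always starts a block
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (!ops[i].isBranch) continue;
    assert(ops[i].target < ops.size());
    leaders.insert(ops[i].target);                  // a branch destination starts a block
    if (i + 1 < ops.size()) leaders.insert(i + 1);  // so does the op after a branch
  }
  return leaders;
}

// True if some op branches back to the entry op, in which case a separate
// empty start block is needed to keep the entry free of incoming edges.
bool needsPlaceholderStart(const std::vector<ToyOp> &ops) {
  for (const ToyOp &op : ops)
    if (op.isBranch && op.target == 0) return true;
  return false;
}
```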
|
||||
|
||||
\subsection step3 Inspect Sub-functions

  -# Addresses of direct calls are looked up in the
     database and any parameter information is
     recovered.
  -# If there is information about an indirect call,
     parameter information can be filled in and the
     indirect call can be changed to a direct call.
  -# Any call for which no prototype is found has a
     default prototype set for it.
  -# Any global or default prototype recovered at this
     point can be overridden locally.
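
  The priority order implied by the list above can be sketched as a
  simple lookup. Everything here is an assumption made for illustration:
  \c ToyProto, \c ProtoMap, and \c resolvePrototype are hypothetical, and
  local overrides are assumed to be keyed by call-site address.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, heavily reduced function prototype.
struct ToyProto {
  std::string model;  // name of the calling convention
  int numParams;      // number of known parameters, -1 if unknown
};

using ProtoMap = std::map<uint32_t, ToyProto>;

// Pick the prototype for one call, in decreasing order of priority:
// a local override for the call site, the database entry for the callee,
// and finally the default prototype.
ToyProto resolvePrototype(uint32_t callSite, uint32_t callee,
                          const ProtoMap &localOverrides,
                          const ProtoMap &database,
                          const ToyProto &defaultProto) {
  auto over = localOverrides.find(callSite);
  if (over != localOverrides.end()) return over->second;
  auto known = database.find(callee);
  if (known != database.end()) return known->second;
  return defaultProto;
}
```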
|
||||
|
||||
\subsection step4 Adjust/Annotate P-code

  -# The context database is searched for known values of
     memory locations coming into the function. These
     are implemented by inserting p-code \b COPY
     instructions that assign the correct value to the
     correct memory location at the beginning of the
     function.
  -# The recovered prototypes may require that extra
     p-code is injected at the call site so that certain
     actions of the call are explicit to the analysis
     engine.
  -# Other p-code may be inserted to indicate changes a
     call makes to the stack pointer. It's possible that
     the change to the stack pointer is unknown. In this
     case \b INDIRECT p-code instructions are inserted to
     indicate that the state of the stack pointer is
     unknown at that point, preparing for the extrapop
     action.
  -# For each p-code call instruction, extra inputs are
     added to the instruction either corresponding to a
     known input for that call, or in preparation for the
     prototype recovery actions. If the (potential)
     function input is located on the stack, a temporary
     is defined for that input and a full p-code \b LOAD
     instruction, with accompanying offset calculation,
     is inserted before the call to link the input with
     the (currently unknown) stack offset. Similarly,
     extra outputs are added to the call instructions
     either representing a known return value, or in
     preparation for parameter recovery actions.
  -# Each p-code \b RETURN instruction for the current
     function is adjusted to hide the use of the return
     address and to add an input location for the return
     value. The return value is considered an input to
     the \b RETURN instruction.
|
||||
|
||||
\subsection step5 The Main Simplification Loop

\subsubsection step5a Generate SSA Form

  This is very similar to forward engineering
  algorithms. It uses a fairly standard phi-node
  placement algorithm based on the control flow dominator
  tree and the so-called dominance frontier. A standard
  renaming algorithm is used for the final linking of
  variable defs and uses. The decompiler has to take
  into account partially overlapping variables and guard
  against various aliasing situations, which are
  generally more explicit to a compiler. The decompiler
  SSA algorithm also works incrementally. Many of the
  stack references in a function cannot be fully resolved
  until the main term rewriting pass has been performed
  on the register variables. Rather than leaving stack
  references as associated \b LOAD s and \b STORE s, when
  the references are finally discovered, they are
  promoted to full variables within the SSA tree. This
  allows full copy propagation and simplification to
  occur with these variables, but it often requires one or
  more additional passes to fully build the SSA tree.
  Local aliasing information and aliasing across
  subfunction calls can be annotated in the SSA structure
  via \b INDIRECT p-code operations, which hold the
  information that the output of the \b INDIRECT is derived
  from the input by some indirect (frequently unknown)
  effect.
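
  For readers unfamiliar with dominance-frontier phi placement, here is a
  compact textbook-style sketch. It is not the decompiler's incremental
  implementation. It assumes blocks are numbered in reverse postorder, that
  every block is reachable, and that block 0 is an entry with no incoming
  edges. The function names are hypothetical, and the dominator computation
  uses the well-known iterative scheme of Cooper, Harvey, and Kennedy.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Control flow graph given as predecessor lists; block 0 is the entry.
using PredLists = std::vector<std::vector<int>>;

// Immediate dominator of every block (the entry is its own dominator).
std::vector<int> immediateDominators(const PredLists &preds) {
  const int n = static_cast<int>(preds.size());
  assert(n > 0);
  std::vector<int> idom(n, -1);
  idom[0] = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    for (int b = 1; b < n; ++b) {
      int best = -1;
      for (int p : preds[b]) {
        if (idom[p] < 0) continue;  // predecessor not processed yet
        if (best < 0) { best = p; continue; }
        int a = p, c = best;        // walk both up to their common dominator
        while (a != c) {
          while (a > c) a = idom[a];
          while (c > a) c = idom[c];
        }
        best = a;
      }
      if (best != idom[b]) { idom[b] = best; changed = true; }
    }
  }
  return idom;
}

// Dominance frontier of every block.
std::vector<std::set<int>> dominanceFrontiers(const PredLists &preds,
                                              const std::vector<int> &idom) {
  std::vector<std::set<int>> df(preds.size());
  for (int b = 0; b < static_cast<int>(preds.size()); ++b) {
    if (preds[b].size() < 2) continue;  // only join points appear in frontiers
    for (int p : preds[b])
      for (int runner = p; runner != idom[b]; runner = idom[runner])
        df[runner].insert(b);
  }
  return df;
}

// Blocks that need a phi-node for a variable defined in defBlocks.
std::set<int> phiBlocks(const std::vector<std::set<int>> &df,
                        const std::set<int> &defBlocks) {
  std::set<int> phis;
  std::vector<int> worklist(defBlocks.begin(), defBlocks.end());
  while (!worklist.empty()) {
    int b = worklist.back();
    worklist.pop_back();
    for (int f : df[b])
      if (phis.insert(f).second) worklist.push_back(f);  // a phi is a new definition
  }
  return phis;
}
```

  For a diamond-shaped graph (0 to 1 and 2, both to 3), a definition in
  block 1 places a phi-node in block 3. For a loop, a definition in the
  loop body places a phi-node in the loop header.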
|
||||
|
||||
\subsubsection step5b Eliminate Dead Code

  Dead code elimination is essential to the decompiler
  because a large percentage of machine instructions have
  side effects on machine state, such as the setting of
  flags, that are not relevant to the function at a
  particular point in the code. Dead code elimination is
  complicated by the fact that it's not always clear which
  variables are temporaries, locals, or globals. Also,
  compilers frequently map smaller (1-byte or 2-byte)
  variables into bigger (4-byte) registers, and
  manipulation of these registers may still carry around
  leftover information in the upper bytes. The
  decompiler detects dead code down to the bit, in order
  to appropriately truncate variables in these
  situations.
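
  One way to picture bit-level dead code detection is a backward pass
  that tracks, for each value, a mask of the bits that are actually read
  later. The sketch below is an assumption-laden illustration, not the
  decompiler's code: \c ToyOpcode and \c consumedInput are hypothetical,
  values are 32 bits wide, and the second operand of each op is a constant.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opcodes operating on 32-bit values.
enum class ToyOpcode { Copy, AndConst, ShiftLeft, ShiftRight, Add };

// Given the bits of an op's output that are read later (consumedOut), return
// the bits of its input that can influence those bits. constOperand is the
// constant mask or shift amount; it is ignored for Copy and Add.
uint32_t consumedInput(ToyOpcode opc, uint32_t consumedOut, uint32_t constOperand) {
  switch (opc) {
  case ToyOpcode::Copy:
    return consumedOut;
  case ToyOpcode::AndConst:
    return consumedOut & constOperand;  // masked-off bits never matter
  case ToyOpcode::ShiftLeft:
    return (constOperand >= 32) ? 0u : (consumedOut >> constOperand);
  case ToyOpcode::ShiftRight:           // logical shift
    return (constOperand >= 32) ? 0u : (consumedOut << constOperand);
  case ToyOpcode::Add: {
    // Carries only travel upward, so bit i of a sum depends on bits 0..i of
    // each input: smear the highest consumed bit down to bit 0.
    uint32_t mask = consumedOut;
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    return mask;
  }
  }
  return 0xFFFFFFFFu;  // unknown op: conservatively assume every bit is needed
}

// A value none of whose bits are consumed is dead.
bool isDead(uint32_t consumedMask) { return consumedMask == 0; }
```

  For example, if only the low byte of a 4-byte sum is ever read, only the
  low byte of each input matters, so the upper three bytes can be treated
  as dead and the variable truncated to one byte.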
|
||||
|
||||
\subsubsection step5c Propagate Local Types

  The decompiler has to infer high-level type information
  about the variables it analyzes, as this kind of
  information is generally not present in the input
  binary. Some information can be gathered about a
  variable based on the instructions it is used in (e.g.,
  if it is used in a floating-point instruction). Other
  information about type might be available from header
  files or from the user. Once this is gathered, the
  preliminary type information is allowed to propagate
  through the syntax trees so that related types of other
  variables can be determined.
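
  A very reduced sketch of the propagation step follows. It assumes a
  hypothetical four-value type set (\c ToyType) and only propagates across
  copy-like relations, where both variables must share a type. Real type
  propagation works over the full syntax tree with far richer types.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class ToyType { Unknown, Int, Float, Pointer };

// Spread known types across copy-like relations until nothing changes.
// types[i] is the current guess for variable i; each pair in copies names two
// variables that must share a type. A conflict between two different known
// types is left untouched in this sketch.
void propagateTypes(std::vector<ToyType> &types,
                    const std::vector<std::pair<std::size_t, std::size_t>> &copies) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto &edge : copies) {
      assert(edge.first < types.size() && edge.second < types.size());
      ToyType &a = types[edge.first];
      ToyType &b = types[edge.second];
      if (a == ToyType::Unknown && b != ToyType::Unknown) {
        a = b;
        changed = true;
      } else if (b == ToyType::Unknown && a != ToyType::Unknown) {
        b = a;
        changed = true;
      }
    }
  }
}
```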
|
||||
|
||||
\subsubsection step5d Perform Term Rewriting

  The bulk of the interesting simplifications happen in
  this section. Following Formal Methods style term
  rewriting, a long list of rules is applied to the
  syntax tree. Each rule matches some potential
  configuration in a portion of the syntax tree, and
  after the rule matches, it specifies a sequence of edit
  operations on the syntax tree to transform it. Each
  rule can be applied repeatedly and in different parts
  of the tree if necessary. So even a small set of rules
  can cause a large transformation. The set of rules in
  the decompiler is extensive and is tailored to specific
  reverse engineering needs and compiler constructs. The
  goal of these transformations is not to optimize as a
  compiler would, but to simplify and normalize for
  easier understanding and recognition by human analysts
  (and follow-on machine processing). Typical examples
  of transforms include copy propagation, constant
  propagation, collecting terms, cancellation of
  operators and other algebraic simplifications, undoing
  multiplication and division optimizations, and commuting
  operators.
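
  To make the match-then-edit idea concrete, here is a toy rewriting
  engine over a tiny expression tree. The \c Expr type, the three rules, and
  the driver are all hypothetical and far simpler than the decompiler's
  rule set; they only show how a few local rules, applied repeatedly, can
  normalize a whole tree.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

// A tiny expression tree standing in for the syntax tree.
struct Expr {
  enum Kind { Const, Var, Add, Mult } kind;
  uint32_t value;    // used by Const
  std::string name;  // used by Var
  ExprPtr lhs, rhs;  // used by Add and Mult
};

ExprPtr constant(uint32_t v) { return std::make_shared<Expr>(Expr{Expr::Const, v, "", nullptr, nullptr}); }
ExprPtr variable(const std::string &n) { return std::make_shared<Expr>(Expr{Expr::Var, 0, n, nullptr, nullptr}); }
ExprPtr binary(Expr::Kind k, ExprPtr a, ExprPtr b) { return std::make_shared<Expr>(Expr{k, 0, "", a, b}); }

bool isBinary(const ExprPtr &e) { return e->kind == Expr::Add || e->kind == Expr::Mult; }

// Rule: c op x => x op c (commute so that constants sit on the right).
bool ruleConstToRight(ExprPtr &e) {
  if (!isBinary(e) || e->lhs->kind != Expr::Const || e->rhs->kind == Expr::Const) return false;
  std::swap(e->lhs, e->rhs);
  return true;
}

// Rule: c1 op c2 => c3 (constant folding, modulo 2^32).
bool ruleFoldConstants(ExprPtr &e) {
  if (!isBinary(e) || e->lhs->kind != Expr::Const || e->rhs->kind != Expr::Const) return false;
  uint32_t a = e->lhs->value, b = e->rhs->value;
  e = constant(e->kind == Expr::Add ? a + b : a * b);
  return true;
}

// Rule: x + 0 => x and x * 1 => x.
bool ruleIdentity(ExprPtr &e) {
  if (!isBinary(e)) return false;
  uint32_t identity = (e->kind == Expr::Add) ? 0u : 1u;
  if (e->rhs->kind != Expr::Const || e->rhs->value != identity) return false;
  ExprPtr kept = e->lhs;  // hold the child before replacing its parent
  e = kept;
  return true;
}

// Apply every rule once, bottom-up; report whether anything changed.
bool simplifyOnce(ExprPtr &e) {
  assert(e != nullptr);
  bool changed = false;
  if (isBinary(e)) {
    changed |= simplifyOnce(e->lhs);
    changed |= simplifyOnce(e->rhs);
  }
  changed |= ruleConstToRight(e);
  changed |= ruleFoldConstants(e);
  changed |= ruleIdentity(e);
  return changed;
}

// Repeat until no rule matches anywhere in the tree.
void simplify(ExprPtr &e) { while (simplifyOnce(e)) {} }

std::string show(const ExprPtr &e) {
  if (e->kind == Expr::Const) return std::to_string(e->value);
  if (e->kind == Expr::Var) return e->name;
  return "(" + show(e->lhs) + (e->kind == Expr::Add ? " + " : " * ") + show(e->rhs) + ")";
}
```

  With these rules, the tree for (1 * x) + (2 + 3) simplifies to (x + 5):
  the commuting rule exposes the multiplication by one to the identity
  rule, while constant folding collapses the right-hand side.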
|
||||
|
||||
\subsubsection step5e Adjust Control Flow Graph

  The decompiler can recognize, among other things:
  - unreachable code
  - unused branches
  - empty basic blocks
  - redundant predicates

  It will remove branches or blocks in order to
  simplify the control flow.
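
  Two of these adjustments, removing a branch that is never taken and
  then finding the blocks that became unreachable, can be sketched as
  below. \c ToyBlock and both functions are hypothetical, and the sketch
  assumes the predicate value of a conditional branch is already known
  when \c knownCondition is non-negative.

```cpp
#include <cassert>
#include <vector>

struct ToyBlock {
  std::vector<int> succs;  // for a conditional branch: {false edge, true edge}
  int knownCondition;      // -1 if unknown, otherwise 0 (always false) or 1 (always true)
};

// Drop the never-taken edge of every conditional branch whose predicate is known.
void pruneConstantBranches(std::vector<ToyBlock> &blocks) {
  for (ToyBlock &blk : blocks) {
    if (blk.succs.size() == 2 && blk.knownCondition >= 0) {
      assert(blk.knownCondition <= 1);
      int taken = blk.succs[blk.knownCondition];
      blk.succs = {taken};
    }
  }
}

// Mark every block reachable from the entry block (index 0).
std::vector<bool> reachableBlocks(const std::vector<ToyBlock> &blocks) {
  std::vector<bool> seen(blocks.size(), false);
  if (blocks.empty()) return seen;
  std::vector<int> stack{0};
  seen[0] = true;
  while (!stack.empty()) {
    int cur = stack.back();
    stack.pop_back();
    for (int next : blocks[cur].succs) {
      if (!seen[next]) {
        seen[next] = true;
        stack.push_back(next);
      }
    }
  }
  return seen;
}
```

  Blocks left unmarked after pruning are candidates for removal.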
|
||||
|
||||
\subsubsection step5f Recover Control Flow Structure

  The decompiler recovers higher-level control flow
  objects like loops, \b if/\b else blocks, and \b switch
  statements. The entire control flow of the function is
  built up hierarchically with these objects, allowing it
  to be expressed naturally in the final output with the
  standard control flow constructs of the high-level
  language. The decompiler recognizes common high-level
  unstructured control flow idioms, like \e break, and can
  use node-splitting in some situations to undo compiler
  flow optimizations that prevent a structured
  representation.
|
||||
|
||||
\subsection step6 Perform Final P-code Transformations

  During the main simplification loop, many p-code
  operations are normalized in specific ways for the term
  rewriting process that aren't necessarily ideal for the
  final output. This phase does transforms designed to
  enhance readability of the final output. A simple example
  is that all subtractions (\b INT_SUB) are normalized to be an
  addition on the two's complement in the main loop. This
  phase would convert any remaining additions of this form
  back into a subtraction operation.
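
  The arithmetic behind this example can be checked with a few lines of
  code. The helper names are hypothetical, values are assumed to be
  32 bits wide, and treating any constant with the top bit set as negative
  is a simplification chosen for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// a - b written as an addition on the two's complement of b (modulo 2^32).
uint32_t subAsAdd(uint32_t a, uint32_t b) {
  return a + (~b + 1u);
}

// Print "var + c" in a more readable form: if c has its top bit set, show it
// as a subtraction of the negated constant instead.
std::string printAddConst(const std::string &var, uint32_t c) {
  if (c & 0x80000000u) {
    uint32_t negated = 0u - c;  // two's complement negation
    return var + " - " + std::to_string(negated);
  }
  return var + " + " + std::to_string(c);
}
```

  For instance, an addition of the constant 0xFFFFFFFC reads more
  naturally when printed as a subtraction of 4.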
|
||||
|
||||
\subsection step7 Exit SSA Form and Merge Low-level Variables (phase 1)

  The static variables of the SSA form need to be merged
  into complete high-level variables. The first part of
  this is accomplished by formally exiting SSA form. The
  SSA phi-nodes and indirects are eliminated either by
  merging the input and output variables or inserting extra
  \b COPY operations. Merging must guard against a high-level
  variable holding different values (in different memory
  locations) at the same time. This is similar to register
  coloring in compiler design.
|
||||
|
||||
\subsection step8 Determine Expressions and Temporary Variables

  A final determination is made of what the final output
  expressions are going to be, by determining which
  variables in the syntax tree will be explicit and which
  represent temporary variables. Certain terms must
  automatically be explicit, such as constants, inputs,
  etc. Other variables are forced to be explicit because
  they are read too many times or because making them implicit
  would propagate another variable too far. Any variables
  remaining are marked implicit.
|
||||
|
||||
\subsection step9 Merge Low-level Variables (phase 2)

  Even after the initial merging of variables in phase 1,
  there are generally still too many for normal C code. So
  the decompiler does additional, more speculative merging.
  It first tries to merge the inputs and outputs of copy
  operations, and then the inputs and outputs of more
  general operations. Finally, merging is attempted on
  variables of the same type. Each potential merge is
  subject to register coloring restrictions.
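
  The restriction on each merge amounts to an interference test: two
  variables can share a name only if they never hold live values at the
  same time. The sketch below models this with hypothetical types
  (\c LiveRange, \c HighVar) and a deliberately crude notion of liveness,
  namely half-open intervals over a linear ordering of operations.

```cpp
#include <cassert>
#include <vector>

// Hypothetical live range of one low-level variable: the half-open interval
// [start, stop) of op indices over which its value is needed.
struct LiveRange {
  int start;
  int stop;
};

// A high-level variable under construction: the ranges of all its members.
struct HighVar {
  std::vector<LiveRange> cover;
};

// True if the two variables are ever live at the same time.
bool intersects(const HighVar &a, const HighVar &b) {
  for (const LiveRange &r : a.cover) {
    for (const LiveRange &s : b.cover) {
      assert(r.start <= r.stop && s.start <= s.stop);
      if (r.start < s.stop && s.start < r.stop) return true;
    }
  }
  return false;
}

// Merge b into a only if they never interfere; report whether the merge happened.
bool tryMerge(HighVar &a, const HighVar &b) {
  if (intersects(a, b)) return false;
  a.cover.insert(a.cover.end(), b.cover.begin(), b.cover.end());
  return true;
}
```

  Under this model, candidate pairs could be tried in the order given
  above (copy-related first, then other operations, then same-typed
  variables), with each attempt going through the same test.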
|
||||
|
||||
\subsection step10 Add Type Casts

  Type casts are added to the code so that the final output
  will be syntactically legal.

\subsection step11 Establish Function's Prototype

  The register/stack locations being used to pass parameters
  into the function are analyzed in terms of the parameter
  passing convention being used so that appropriate names
  can be selected and the prototype can be printed with the
  input variables in the correct order.

\subsection step12 Select Variable Names

  The high-level variables, which are now in their final
  form, have names assigned based on any information
  gathered from their low-level elements and the symbol
  table. If no name can be identified from the database, an
  appropriate name is generated automatically.

\subsection step13 Do Final Control Flow Structuring

  -# Order separate components
  -# Order switch cases
  -# Determine which unstructured jumps are breaks
  -# Insert labels for remaining unstructured jumps

\subsection step14 Emit Final C Tokens

  Following the recovered function prototype, the recovered
  control flow structure, and the recovered expressions, the
  final C tokens are generated. Each token is annotated
  with its syntactic meaning, for later syntax
  highlighting. Most tokens are also annotated with the
  address of the machine instruction with which they are
  most closely associated. This is the basis for the
  machine/C code cross highlighting capability. The tokens
  are passed through a standard Oppen pretty-printing
  algorithm to determine the final line breaks and
  indenting.
|
||||
|
||||
|
||||
*/