/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/** \mainpage Decompiler Analysis Engine
\section toc Table of Contents
- \ref overview
- \ref capabilities
- \ref design
- \ref workflow
- \ref ghidraimpl
- \subpage sleigh
- \subpage coreclasses
- \subpage termrewriting
\section overview Overview
Welcome to the \b Decompiler \b Analysis \b Engine. It is a
complete library for performing automated data-flow analysis
on software, starting from the binary executable. This
documentation is geared toward understanding the source code.
It starts with a brief discussion of the library's capabilities
and moves immediately into the design of the decompiler and
its main code workflow.
The library provides its own Register
Transfer Language (RTL), referred to internally as \b p-code,
which is designed specifically for reverse engineering
applications. The disassembly of processor specific machine-code
languages, and subsequent translation into \b p-code, forms
a major sub-system of the decompiler. There is a processor
specification language, referred to as \b SLEIGH, which is
dedicated to this translation task, and there is a corresponding
section in the documentation for the classes and methods used
to implement this language in the library (See \subpage sleigh).
This piece of the code can be built as a standalone binary
translation library, for use by other applications.
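
To make this concrete, consider how a single x86 instruction
might translate. An ADD instruction affects several status
flags, so it expands into multiple p-code operations. The
listing below is a simplified sketch of such a translation,
not the exact output of the x86 SLEIGH specification:

\code{.unparsed}
ADD EAX,EBX    --->    CF = INT_CARRY EAX,EBX
                       OF = INT_SCARRY EAX,EBX
                       EAX = INT_ADD EAX,EBX
                       SF = INT_SLESS EAX,0
                       ZF = INT_EQUAL EAX,0
\endcode
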
For getting up to speed quickly on the details of the source
and the decompiler's main data structures,
there is a specific documentation page describing the core
classes and methods (See \subpage coreclasses).
Finally, there is a documentation page summarizing the
simplification rules used in the core decompiler analysis
(See \subpage termrewriting).
\section capabilities Capabilities
\section design Design
The main design elements of the decompiler come straight
from standard \e Compiler \e Theory data structures and
algorithms. This should come as no surprise, as both
compilers and decompilers are concerned with translating
from one coding language to another. They both follow a
general work flow:
- Parse/tokenize input language.
- Build abstract syntax trees in an intermediate language.
- Manipulate/optimize syntax trees.
- Map intermediate language to output language constructs.
- Emit final output language encoding.
With direct analogs to (forward engineering) compilers, the
decompiler uses:
- A Register Transfer Language (RTL) referred to as \b p-code.
- Static Single Assignment (SSA) form.
- Basic blocks and Control Flow Graphs.
- Term rewriting rules.
- Dead code elimination.
- Symbol tables and scopes.
Despite these similarities, the differences between a
decompiler and a compiler are substantial and run throughout
the entire process. These all stem from the fact that, in
general, descriptive elements and the higher-level
organization of a piece of code can only be explicitly
expressed in a high-level language. So the decompiler,
working with a low-level language as input, can only infer
this information.
The features mentioned above all have a decompiler specific
slant to them, and there are other tasks that the decompiler
must perform that have no real analog with a compiler.
These include:
- Variable merging (vaguely related to register coloring)
- Type propagation
- Control flow structuring
- Function prototype recovery
- Expression recovery
\section workflow Main Work Flow
Here is an outline of the decompiler work flow.
-# \ref step0
-# \ref step1
-# \ref step2
-# \ref step3
-# \ref step4
-# \ref step5
- \ref step5a
- Adjust p-code in special situations.
- \ref step5b
- \ref step5c
- \ref step5d
- \ref step5e
- \ref step5f
-# \ref step6
-# \ref step7
-# \ref step8
-# \ref step9
-# \ref step10
-# \ref step11
-# \ref step12
-# \ref step13
-# \ref step14
\subsection step0 Specify Entry Point
The user specifies a starting address for a particular function.
\subsection step1 Generate Raw P-code
The p-code generation engine is called \b SLEIGH. Based on a
processor specification file, it maps binary encoded
machine instructions to sequences of p-code operations.
P-code operations are generated for a single machine
instruction at a specific address. The control flow
through these p-code operations is followed to determine
if control falls through, or if there are jumps or calls.
A work list of new instruction addresses is kept and is
continually revisited until there are no new instructions.
After the control flow is traced, additional changes may
be made to the p-code.
     -# PIC (position-independent code) constructions are
        checked for, now that the extent of the function is
        known. If a call is to a location that is still
        within the function, the call is changed to a jump.
-# Functions which are marked as inlined are filled in
at this point, before basic blocks are generated.
P-code for the inlined function is generated
separately and control flow is carefully set up to
link it in properly.
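
The tracing loop itself is the classic work-list pattern.
The sketch below is illustrative only; the names and types
are hypothetical stand-ins for the SLEIGH engine and the
decompiler's own flow-following classes:

\code{.cpp}
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

struct DecodedInstr {
  bool fallsThrough;              // does control continue to the next address?
  uint64_t nextAddr;              // address of the following instruction
  std::vector<uint64_t> targets;  // jump/call destinations discovered
};

// Decode every instruction reachable from 'entry', revisiting the work
// list until no new addresses appear. 'decode' stands in for SLEIGH.
void traceFlow(uint64_t entry,
               const std::function<DecodedInstr(uint64_t)> &decode,
               std::set<uint64_t> &visited)
{
  std::vector<uint64_t> worklist{entry};
  while (!worklist.empty()) {
    uint64_t addr = worklist.back();
    worklist.pop_back();
    if (!visited.insert(addr).second) continue;  // already decoded
    DecodedInstr di = decode(addr);
    if (di.fallsThrough) worklist.push_back(di.nextAddr);
    for (uint64_t t : di.targets) worklist.push_back(t);
  }
}
\endcode
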
\subsection step2 Generate Basic Blocks and the CFG
Basic blocks are generated on the p-code instructions
(\e not the machine instructions) and a control flow graph
of these basic blocks is generated. Control flow is
normalized so that there is always a unique start block
with no other blocks falling into it. In the case of
subroutines which have branches back to their very first
machine instruction, this requires the creation of an
empty placeholder start block that flows immediately into
the block containing the p-code for the first instruction.
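
A minimal sketch of this entry normalization, using
hypothetical block types rather than the decompiler's own
classes:

\code{.cpp}
#include <vector>

struct BasicBlock {
  std::vector<BasicBlock *> in;   // predecessor blocks
  std::vector<BasicBlock *> out;  // successor blocks
};

// Guarantee a unique start block with no incoming edges. If the current
// entry has predecessors (e.g. a branch back to the first instruction),
// insert an empty placeholder block that flows immediately into it.
BasicBlock *normalizeEntry(BasicBlock *entry)
{
  if (entry->in.empty()) return entry;  // already a clean start block
  BasicBlock *placeholder = new BasicBlock();
  placeholder->out.push_back(entry);
  entry->in.push_back(placeholder);
  return placeholder;
}
\endcode
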
\subsection step3 Inspect Sub-functions
-# Addresses of direct calls are looked up in the
database and any parameter information is
recovered.
-# If there is information about an indirect call,
parameter information can be filled in and the
indirect call can be changed to a direct call.
-# Any call for which no prototype is found has a
default prototype set for it.
-# Any global or default prototype recovered at this
point can be overridden locally.
\subsection step4 Adjust/Annotate P-code
-# The context database is searched for known values of
memory locations coming into the function. These
are implemented by inserting p-code \b COPY
instructions that assign the correct value to the
correct memory location at the beginning of the
function.
-# The recovered prototypes may require that extra
p-code is injected at the call site so that certain
actions of the call are explicit to the analysis
engine.
-# Other p-code may be inserted to indicate changes a
        call makes to the stack pointer. It's possible that
the change to the stack pointer is unknown. In this
case \b INDIRECT p-code instructions are inserted to
indicate that the state of the stack pointer is
unknown at that point, preparing for the extrapop
action.
-# For each p-code call instruction, extra inputs are
added to the instruction either corresponding to a
known input for that call, or in preparation for the
prototype recovery actions. If the (potential)
function input is located on the stack, a temporary
is defined for that input and a full p-code \b LOAD
instruction, with accompanying offset calculation,
is inserted before the call to link the input with
        the (currently unknown) stack offset. Similarly,
extra outputs are added to the call instructions
either representing a known return value, or in
preparation for parameter recovery actions.
-# Each p-code \b RETURN instruction for the current
function is adjusted to hide the use of the return
address and to add an input location for the return
value. The return value is considered an input to
the \b RETURN instruction.
\subsection step5 The Main Simplification Loop
\subsubsection step5a Generate SSA Form
This is very similar to forward engineering
algorithms. It uses a fairly standard phi-node
placement algorithm based on the control flow dominator
tree and the so-called dominance frontier. A standard
renaming algorithm is used for the final linking of
variable defs and uses. The decompiler has to take
into account partially overlapping variables and guard
against various aliasing situations, which are
generally more explicit to a compiler. The decompiler
SSA algorithm also works incrementally. Many of the
stack references in a function cannot be fully resolved
until the main term rewriting pass has been performed
on the register variables. Rather than leaving stack
references as associated \b LOAD s and \b STORE s, when
the references are finally discovered, they are
promoted to full variables within the SSA tree. This
allows full copy propagation and simplification to
     occur with these variables, but it often requires one
     or more additional passes to fully build the SSA tree.
Local aliasing information and aliasing across
subfunction calls can be annotated in the SSA structure
     via \b INDIRECT p-code operations, which hold the
information that the output of the \b INDIRECT is derived
from the input by some indirect (frequently unknown)
effect.
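
The phi-node placement itself follows the standard
(non-pruned) algorithm: for every block defining a variable,
phi-nodes are inserted along its dominance frontier, and each
inserted phi counts as a new definition. A compact sketch,
with illustrative types rather than the decompiler's own
classes:

\code{.cpp}
#include <map>
#include <set>
#include <string>
#include <vector>

typedef int Block;

// domFrontier : block -> its dominance frontier
// defSites    : variable -> blocks containing a definition of it
// phiNodes    : (output) block -> variables needing a phi-node there
void placePhiNodes(const std::map<Block, std::set<Block> > &domFrontier,
                   const std::map<std::string, std::set<Block> > &defSites,
                   std::map<Block, std::set<std::string> > &phiNodes)
{
  for (const auto &entry : defSites) {
    std::vector<Block> worklist(entry.second.begin(), entry.second.end());
    std::set<Block> hasPhi;  // blocks already given a phi for this variable
    while (!worklist.empty()) {
      Block b = worklist.back();
      worklist.pop_back();
      auto it = domFrontier.find(b);
      if (it == domFrontier.end()) continue;
      for (Block d : it->second) {
        if (hasPhi.insert(d).second) {
          phiNodes[d].insert(entry.first);
          worklist.push_back(d);  // the phi is itself a new definition
        }
      }
    }
  }
}
\endcode
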
\subsubsection step5b Eliminate Dead Code
Dead code elimination is essential to the decompiler
because a large percentage of machine instructions have
side-effects on machine state, such as the setting of
flags, that are not relevant to the function at a
     particular point in the code. Dead code elimination is
     complicated by the fact that it's not always clear which
     variables are temporaries, locals, or globals. Also,
     compilers frequently map smaller (1-byte or 2-byte)
     variables into bigger (4-byte) registers, and
     manipulation of these registers may still carry
     leftover information in the upper bytes. The
     decompiler detects dead code down to the individual
     bit, in order to appropriately truncate variables in
     these situations.
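
The bit-level tracking works by propagating a mask of
consumed bits backward from each use. A toy sketch for a few
operation kinds; the real engine handles every p-code opcode:

\code{.cpp}
#include <cstdint>

enum OpCode { COPY, INT_AND, INT_LEFT };  // illustrative subset

// Given the mask of output bits that some consumer actually reads, compute
// which bits of the (first) input can influence them. Bits outside the
// returned mask are dead, and the variable can be truncated accordingly.
uint64_t neededInputBits(OpCode opc, uint64_t outMask, uint64_t otherConst)
{
  switch (opc) {
  case COPY:
    return outMask;               // bits pass straight through
  case INT_AND:
    return outMask & otherConst;  // bits cleared by the constant are dead
  case INT_LEFT:
    return outMask >> otherConst; // shifted-out low bits were never needed
  }
  return ~(uint64_t)0;            // unknown op: conservatively keep all bits
}
\endcode
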
\subsubsection step5c Propagate Local Types
The decompiler has to infer high-level type information
about the variables it analyzes, as this kind of
information is generally not present in the input
binary. Some information can be gathered about a
     variable, based on the instructions it is used in (e.g.,
     if it is used in a floating-point instruction). Other
information about type might be available from header
files or from the user. Once this is gathered, the
preliminary type information is allowed to propagate
through the syntax trees so that related types of other
variables can be determined.
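
A sketch of the propagation step, under the simplifying
assumption of a flat type lattice; the decompiler's real
data-type lattice and propagation rules are much richer:

\code{.cpp}
#include <queue>
#include <vector>

enum class TypeClass { Unknown, Integer, Float, Pointer };

struct VarnodeT {
  TypeClass type = TypeClass::Unknown;  // type inferred so far
  std::vector<VarnodeT *> related;      // data-flow neighbors (COPY, arithmetic)
};

// Spread seed types (e.g. Float, gathered from a floating-point
// instruction) along data-flow edges until a fixed point is reached.
void propagateTypes(const std::vector<VarnodeT *> &seeds)
{
  std::queue<VarnodeT *> worklist;
  for (VarnodeT *v : seeds) worklist.push(v);
  while (!worklist.empty()) {
    VarnodeT *v = worklist.front();
    worklist.pop();
    for (VarnodeT *w : v->related) {
      if (w->type == TypeClass::Unknown) {  // only strengthen unknowns
        w->type = v->type;
        worklist.push(w);
      }
    }
  }
}
\endcode
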
\subsubsection step5d Perform Term Rewriting
The bulk of the interesting simplifications happen in
     this section. Following Formal Methods style term
     rewriting, a long list of rules is applied to the
syntax tree. Each rule matches some potential
configuration in a portion of the syntax tree, and
after the rule matches, it specifies a sequence of edit
operations on the syntax tree to transform it. Each
rule can be applied repeatedly and in different parts
of the tree if necessary. So even a small set of rules
can cause a large transformation. The set of rules in
the decompiler is extensive and is tailored to specific
reverse engineering needs and compiler constructs. The
goal of these transformations is not to optimize as a
compiler would, but to simplify and normalize for
easier understanding and recognition by human analysts
     (and follow-on machine processing). Typical examples
     of transforms include: copy propagation, constant
     propagation, collecting terms, cancellation of
     operators and other algebraic simplifications, undoing
     multiplication and division optimizations, commuting
     operators, and so on.
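
In outline, the rewriting discipline looks like the loop
below. Rule and PcodeOp here are illustrative stand-ins; the
decompiler's own Rule/Action hierarchy schedules rules with
work lists rather than whole repeated passes:

\code{.cpp}
#include <vector>

struct PcodeOp;  // stand-in for one operation node in the syntax tree

struct Rule {
  virtual ~Rule() {}
  // Try to match a local configuration rooted at 'op'; if it matches,
  // edit the tree and return true.
  virtual bool applyOp(PcodeOp *op) = 0;
};

// Keep applying every rule at every operation until a full pass over the
// function makes no more changes.
void simplify(const std::vector<PcodeOp *> &ops,
              const std::vector<Rule *> &rules)
{
  bool changed = true;
  while (changed) {
    changed = false;
    for (PcodeOp *op : ops)
      for (Rule *rule : rules)
        if (rule->applyOp(op))
          changed = true;
  }
}
\endcode
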
\subsubsection step5e Adjust Control Flow Graph
The decompiler can recognize
- unreachable code
- unused branches
- empty basic blocks
- redundant predicates
- ...
It will remove branches or blocks in order to
simplify the control flow.
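
The unreachable-code portion of this reduces to a
reachability walk: any block the entry block cannot reach can
be deleted. A minimal sketch with illustrative types:

\code{.cpp}
#include <set>
#include <vector>

struct BasicBlock { std::vector<BasicBlock *> out; };

// Depth-first walk from the entry block; anything left unvisited is
// unreachable and can be removed from the control flow graph.
void markReachable(BasicBlock *bl, std::set<BasicBlock *> &live)
{
  if (!live.insert(bl).second) return;  // already visited
  for (BasicBlock *succ : bl->out)
    markReachable(succ, live);
}
\endcode
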
\subsubsection step5f Recover Control Flow Structure
The decompiler recovers higher-level control flow
objects like loops, \b if/\b else blocks, and \b switch
statements. The entire control flow of the function is
built up hierarchically with these objects, allowing it
to be expressed naturally in the final output with the
standard control flow constructs of the high-level
language. The decompiler recognizes common high-level
unstructured control flow idioms, like \e break, and can
use node-splitting in some situations to undo compiler
flow optimizations that prevent a structured
representation.
\subsection step6 Perform Final P-code Transformations
    During the main simplification loop, many p-code
    operations are normalized in specific ways for the term
    rewriting process that aren't necessarily ideal for the
    final output. This phase performs transforms designed to
    enhance readability of that output. A simple example is
    that, in the main loop, all subtractions (\b INT_SUB) are
    normalized to an addition of the two's complement. This
    phase converts any remaining additions of this form back
    into a subtraction operation.
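
For the subtraction example, the final transform only has to
recognize an addition whose constant operand "looks negative"
and undo the two's complement. A hedged sketch; the sizes and
helper name are hypothetical:

\code{.cpp}
#include <cstdint>

// If 'c' (an addend of 'size' bytes) has its sign bit set, the expression
// reads better as a subtraction: x + 0xfffffffc  ==>  x - 4.
// Returns true and sets the magnitude to subtract in that case.
bool preferSubtraction(uint64_t c, int size, uint64_t &magnitude)
{
  uint64_t mask = (size >= 8) ? ~(uint64_t)0
                              : (((uint64_t)1 << (size * 8)) - 1);
  uint64_t signbit = (uint64_t)1 << (size * 8 - 1);
  if ((c & signbit) == 0) return false;  // constant is already "positive"
  magnitude = (~c + 1) & mask;           // undo the two's complement
  return true;
}
\endcode
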
\subsection step7 Exit SSA Form and Merge Low-level Variables (phase 1)
The static variables of the SSA form need to be merged
into complete high-level variables. The first part of
this is accomplished by formally exiting SSA form. The
SSA phi-nodes and indirects are eliminated either by
merging the input and output variables or inserting extra
\b COPY operations. Merging must guard against a high-level
variable holding different values (in different memory
locations) at the same time. This is similar to register
coloring in compiler design.
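
The guard is essentially an interference test, as in register
coloring: two SSA variables may share one high-level variable
only if they are never simultaneously live. A minimal sketch
over linearized live ranges; the representation is
illustrative, as the decompiler's real covers span multiple
blocks:

\code{.cpp}
// Live range of an SSA variable in a linear ordering of p-code operations.
struct Cover { int start, stop; };

// Two variables can merge only if their live ranges do not intersect,
// i.e. the merged variable never needs to hold two values at once.
bool canMerge(const Cover &a, const Cover &b)
{
  return a.stop <= b.start || b.stop <= a.start;
}
\endcode
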
\subsection step8 Determine Expressions and Temporary Variables
A final determination is made of what the final output
expressions are going to be, by determining which
variables in the syntax tree will be explicit and which
represent temporary variables. Certain terms must
automatically be explicit, such as constants, inputs,
    etc. Other variables are forced to be explicit because
    they are read too many times or because making them
    implicit would propagate another variable too far. Any
    variables remaining are marked implicit.
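
Sketched as a predicate, with hypothetical fields; the real
decision also weighs how far an implicit variable would be
propagated:

\code{.cpp}
struct Term {
  bool isConstant;  // prints directly as a literal token
  bool isInput;     // formal parameter of the function
  int numReads;     // how many operations read this value
};

// Decide whether a term appears as an explicit variable in the output or
// is folded silently into the expression that uses it.
bool isExplicit(const Term &t)
{
  if (t.isConstant || t.isInput) return true;  // automatically explicit
  return t.numReads > 1;  // reused values get their own assignment statement
}
\endcode
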
\subsection step9 Merge Low-level Variables (phase 2)
Even after the initial merging of variables in phase 1,
there are generally still too many for normal C code. So
the decompiler does additional, more speculative merging.
It first tries to merge the inputs and outputs of copy
operations, and then the inputs and outputs of more
general operations. And finally, merging is attempted on
variables of the same type. Each potential merge is
subject to register coloring restrictions.
\subsection step10 Add Type Casts
Type casts are added to the code so that the final output
will be syntactically legal.
\subsection step11 Establish Function's Prototype
The register/stack locations being used to pass parameters
into the function are analyzed in terms of the parameter
passing convention being used so that appropriate names
can be selected and the prototype can be printed with the
input variables in the correct order.
\subsection step12 Select Variable Names
The high-level variables, which are now in their final
form, have names assigned based on any information
gathered from their low-level elements and the symbol
table. If no name can be identified from the database, an
appropriate name is generated automatically.
\subsection step13 Do Final Control Flow Structuring
-# Order separate components
-# Order switch cases
-# Determine which unstructured jumps are breaks
    -# Insert labels for remaining unstructured jumps
\subsection step14 Emit Final C Tokens
Following the recovered function prototype, the recovered
control flow structure, and the recovered expressions, the
final C tokens are generated. Each token is annotated
with its syntactic meaning, for later syntax
highlighting. And most tokens are also annotated with the
address of the machine instruction with which they are
most closely associated. This is the basis for the
machine/C code cross highlighting capability. The tokens
are passed through a standard Oppen pretty-printing
algorithm to determine the final line breaks and
indenting.
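
A sketch of the kind of annotation each emitted token
carries; the field names are illustrative, not the emitter's
actual classes:

\code{.cpp}
#include <cstdint>
#include <string>

// One token of final C output, annotated for syntax highlighting and for
// cross-highlighting against the machine code it was derived from.
struct AnnotatedToken {
  std::string text;      // the token itself, e.g. "if" or a variable name
  int syntaxClass;       // keyword, variable, constant, type, comment, ...
  uint64_t machineAddr;  // address of the most closely associated instruction
};
\endcode
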
*/