[Practical Binary Analysis] INTRODUCTION

INTRODUCTION

=================

  • The vast majority of computer programs are written in high-level languages like C or C++, which computers cannot run directly.

  • Before these programs can be used, they must first be compiled into “binary executables” containing machine code that the computer can run.

  • There is a big semantic gap between the compiled program (binary) and the high-level source.

  • As a result, many compiler bugs, subtle implementation errors, binary-level backdoors, and malicious parasites can go unnoticed.

1.1 What is Binary Analysis, and why do we need it?

Binary analysis:
  • Is the science and art of analysing the properties of binary computer programs, called binaries, and the machine code and the data they contain.
  • Tries to figure out (and possibly modify) the true properties of binary programs: what they really do, as opposed to what we think they should do.

Broadly, binary analysis techniques fall into two classes, which can also be combined:

1.1.1 Static analysis

Static analysis techniques reason about a binary program without running it.

ADVANTAGES
  • You can potentially analyse the whole binary in one go, and you do not need a CPU that can run the binary. For instance, you can statically analyse an ARM binary on an x86 machine.
DOWNSIDES
  • Static analysis has no knowledge of the binary’s runtime state, which can make the analysis really challenging.
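
As a small illustration of reasoning about a binary without running it, the sketch below reads only the first 20 bytes of an ELF header to find the target architecture. It works on any host CPU because nothing is executed. The function name `identify_elf`, the `ELF_MACHINES` table (a small subset of the real values), and the hand-built `SAMPLE_HEADER` are assumptions made for this example.

```python
import struct

# A few e_machine values from the ELF specification.
ELF_MACHINES = {0x03: "x86", 0x28: "ARM", 0x3E: "x86-64", 0xB7: "AArch64"}


def identify_elf(header: bytes) -> dict:
    """Inspect the first 20 bytes of an ELF file without executing anything."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    bits = {1: 32, 2: 64}.get(header[4])          # EI_CLASS
    byte_order = {1: "<", 2: ">"}.get(header[5])  # EI_DATA
    if bits is None or byte_order is None:
        raise ValueError("invalid ELF identification bytes")
    # e_type and e_machine are 16-bit fields at offsets 16 and 18.
    _e_type, e_machine = struct.unpack_from(byte_order + "HH", header, 16)
    return {
        "bits": bits,
        "little_endian": byte_order == "<",
        "machine": ELF_MACHINES.get(e_machine, hex(e_machine)),
    }


# Hand-built sample: 32-bit, little-endian, e_type=2 (executable), e_machine=0x28.
SAMPLE_HEADER = (
    b"\x7fELF" + bytes([1, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 0x28)
)
print(identify_elf(SAMPLE_HEADER))
# prints {'bits': 32, 'little_endian': True, 'machine': 'ARM'}
```

For a real file, the same function could be called on the first 20 bytes read in binary mode. Note what it cannot tell you: anything about register or memory contents at runtime.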

1.1.2 Dynamic analysis

Dynamic analysis runs the binary and analyses it as it executes.

ADVANTAGES
  • This approach is often simpler because you have full knowledge of the entire runtime state, including the values of variables and the outcomes of conditional branches.
DOWNSIDES
  • You can only see the executed code, so the analysis may miss interesting parts of the program.
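
The coverage limitation can be sketched with an analogy at the Python level (this traces Python lines, not machine code). The function `check`, its hidden branch, and the helper `executed_lines` are all hypothetical; the tracing itself uses the standard `sys.settrace` hook.

```python
import sys


def check(password):
    if password == "secret":
        return "backdoor"  # only reached for one specific input
    return "denied"


def executed_lines(func, *args):
    """Run func and return the line offsets (relative to its def) that executed."""
    lines = set()
    first = func.__code__.co_firstlineno

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            lines.add(frame.f_lineno - first)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return lines


ordinary_run = executed_lines(check, "guess")
lucky_run = executed_lines(check, "secret")
# Offset 2 is the "backdoor" line; the ordinary run never observes it.
print(2 in ordinary_run, 2 in lucky_run)
```

An analysis that only ever sees the ordinary run has no evidence that the hidden branch exists, which is the same blind spot a dynamic binary analysis has for code it never executes.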

1.1.3 Other techniques

  • Passive binary analysis

  • Binary instrumentation (can be used to modify binary programs without needing the source code)

1.2 What makes Binary Analysis Challenging?

Binary analysis is challenging, and much more difficult than the equivalent analysis at the source code level.

In fact, many binary analysis tasks are fundamentally undecidable, meaning that:

  • It is impossible to build an analysis engine for these problems that always returns a correct result!

An important part of binary analysis is to come up with creative ways to build usable tools despite analysis errors!

What makes binary analysis difficult?

Here is a list of some of the things that make binary analysis difficult:

NO SYMBOLIC INFORMATION
  • In high-level languages like C or C++, we give meaningful names to constructs such as variables, functions, and classes. These names are called “symbolic information”, or simply “symbols”. Good naming conventions make source code much easier to understand, but at the binary level they have no real relevance.
  • As a result, binaries are often stripped of symbols, making the code much harder to understand.
NO TYPE INFORMATION
  • Another feature of high-level programs is that they revolve around variables with well-defined types, such as int, float, or string, as well as more complex data structures like struct types.
  • In contrast, at the binary level, types are never explicitly stated, making the purpose and structure of data hard to infer.
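
To make the missing type information concrete, the sketch below takes four raw bytes, as they might sit in a binary's data section, and decodes them three ways with Python's `struct` module. The bytes and the helper name `interpretations` are chosen only for illustration; nothing in the bytes themselves says which reading is the intended one.

```python
import struct


def interpretations(raw: bytes) -> dict:
    """Decode the same four little-endian bytes under three different types."""
    return {
        "int32": struct.unpack("<i", raw)[0],
        "float32": struct.unpack("<f", raw)[0],
        "int16_pair": struct.unpack("<hh", raw),
    }


RAW = bytes.fromhex("0000803f")
views = interpretations(RAW)
print(views)
# prints {'int32': 1065353216, 'float32': 1.0, 'int16_pair': (0, 16256)}
```

An analyst has to infer the right interpretation from how the surrounding instructions use the value, for example whether it is loaded into a floating-point register.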
NO HIGH-LEVEL ABSTRACTIONS
  • Modern programs are compartmentalized into classes and functions, but compilers throw away these high-level constructs.
  • That means that binaries appear as huge blobs of code and data, rather than well-structured programs, and restoring the high-level structure is complex and error-prone.
MIXED CODE AND DATA
  • Binaries can (and DO) contain data fragments mixed in with the executable code. Visual Studio, for example, is especially notorious for mixing code and data.
  • This makes it easy to accidentally interpret data as code, or vice versa, leading to incorrect results.
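
The effect of inline data on a disassembler can be sketched with a made-up instruction set. Everything here is hypothetical: the four opcodes in `TOY_ISA`, the byte sequence `BLOB`, and both helper functions. `BLOB` encodes a jump over one data byte, followed by `push 0x2a` and `ret`; the data byte happens to look like a `push` opcode.

```python
# Hypothetical toy instruction set: opcode -> (mnemonic, total length in bytes).
TOY_ISA = {0x01: ("nop", 1), 0x02: ("push", 2), 0x03: ("jmp", 2), 0xFF: ("ret", 1)}


def decode(blob, offset):
    """Decode one toy instruction at offset; return (text, length)."""
    name, length = TOY_ISA.get(blob[offset], ("(bad)", 1))
    if offset + length > len(blob):
        return "(bad)", 1
    if length == 2:
        return f"{name} {blob[offset + 1]:#04x}", length
    return name, length


def linear_sweep(blob):
    """Decode every byte in order, assuming the whole blob is code."""
    out, offset = [], 0
    while offset < len(blob):
        text, length = decode(blob, offset)
        out.append((offset, text))
        offset += length
    return out


def follow_control_flow(blob):
    """Decode only what is reachable from offset 0, following toy jumps."""
    out, offset, seen = [], 0, set()
    while offset < len(blob) and offset not in seen:
        seen.add(offset)
        text, length = decode(blob, offset)
        out.append((offset, text))
        if text in ("ret", "(bad)"):
            break
        if blob[offset] == 0x03:
            # Toy jmp: the operand is relative to the next instruction.
            offset += length + blob[offset + 1]
        else:
            offset += length
    return out


# jmp +1 | data byte 0x02 | push 0x2a | ret
BLOB = bytes([0x03, 0x01, 0x02, 0x02, 0x2A, 0xFF])
print(linear_sweep(BLOB))
print(follow_control_flow(BLOB))
```

The linear sweep decodes the data byte as `push 0x02`, swallowing the first byte of the real `push` and then reporting a bad opcode at offset 4. Following the jump recovers `push 0x2a` at offset 3. Real disassemblers face the same trade-off, with the added problem that jump targets are not always known statically.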
LOCATION-DEPENDENT CODE AND DATA
  • Because binaries are not designed to be modified, even adding a single machine instruction can cause problems: it shifts other code around, invalidating memory addresses and references from elsewhere in the code.
  • As a result, any kind of code or data modification is extremely challenging and prone to breaking the binary.
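
A minimal sketch of why insertion breaks references, again using a made-up encoding (0x03 is a jump to an absolute offset, 0x01 is `nop`, 0xFF is `ret`). The helper names and the assumption that the only reference is the jump operand at index 1 are purely illustrative.

```python
NOP, JMP_ABS, RET = 0x01, 0x03, 0xFF

# jmp 3 | nop | ret   (the jump lands on ret at offset 3)
ORIGINAL = bytes([JMP_ABS, 0x03, NOP, RET])


def jump_lands_on(blob):
    """Return the byte that the jump at offset 0 transfers control to."""
    return blob[blob[1]]


def insert_naively(blob, offset, new_bytes):
    """Insert bytes without touching any stored addresses."""
    return blob[:offset] + new_bytes + blob[offset:]


def insert_and_fix(blob, offset, new_bytes):
    """Insert bytes and adjust the one known reference (the operand at index 1)."""
    patched = bytearray(insert_naively(blob, offset, new_bytes))
    if patched[1] >= offset:
        patched[1] += len(new_bytes)
    return bytes(patched)


broken = insert_naively(ORIGINAL, 2, bytes([NOP]))
fixed = insert_and_fix(ORIGINAL, 2, bytes([NOP]))
print(jump_lands_on(ORIGINAL) == RET)  # True
print(jump_lands_on(broken) == RET)    # False: the jump now lands on a nop
print(jump_lands_on(fixed) == RET)     # True
```

The fix is easy here only because the single reference is known in advance. In a real binary, finding every address that points past the insertion point is itself an error-prone analysis problem.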