dc.description.abstracteng | With our increasing reliance on the correct functioning of computer systems, identifying
and eliminating vulnerabilities in program code is gaining in importance. To date, the
vast majority of these flaws are found by tedious manual auditing of code conducted
by experienced security analysts. Unfortunately, a single missed flaw can suffice for an
attacker to fully compromise a system, and thus, the sheer amount of code plays into the
attacker’s hands. On the defender’s side, this creates a persistent demand for methods
that assist in the discovery of vulnerabilities at scale.
This thesis introduces pattern-based vulnerability discovery, a novel approach for identifying vulnerabilities which combines techniques from static analysis, machine learning,
and graph mining to augment the analyst’s abilities rather than trying to replace her.
The main idea of this approach is to leverage patterns in the code to narrow in on potential vulnerabilities, where these patterns may be formulated manually, derived from
the security history, or inferred from the code directly. We base our approach on a novel
architecture for robust analysis of source code that enables large amounts of code to be
mined for vulnerabilities via traversals in a code property graph, a joint representation
of a program’s syntax, control flow, and data flow. While this platform is useful in its own
right for identifying occurrences of manually defined patterns, we proceed to show that it
also offers a rich data source for automatically discovering and exposing patterns in code. To
this end, we develop different vectorial representations of source code based on symbols,
trees, and graphs, allowing it to be processed with machine learning algorithms. Ultimately, this enables us to devise three unique pattern-based techniques for vulnerability
discovery, each of which addresses a different task encountered in day-to-day auditing by
exploiting a different one of the three main capabilities of unsupervised learning methods.
In particular, we present a method to identify vulnerabilities similar to a known vulnerability, a method to uncover missing checks linked to security-critical objects, and
finally, a method that closes the loop by automatically generating traversals for our code
analysis platform to explicitly express and store vulnerable programming patterns.
We empirically evaluate our methods on the source code of popular and widely used
open-source projects, both in controlled settings and in real-world code audits. In controlled
settings, we find that all methods considerably reduce the amount of code that needs
to be inspected. In real-world audits, our methods allow us to expose many previously
unknown and often critical vulnerabilities, including vulnerabilities in the VLC media
player, the instant messenger Pidgin, and the Linux kernel. | de |