Another look at two Linux KASLR patches

A fast pseudorandom generator for KASLR

A recent patchset proposed for Linux KASLR not only randomizes the kernel base address, but also reorders every function at boot time. As such, it no longer suffices to leak an arbitrary kernel function pointer, or so the logic goes.

Along with this patchset came a custom random number generator intended to be as fast as possible, so as to keep the boot time overhead at a minimum:

/*
 * 64bit variant of Bob Jenkins' public domain PRNG
 * 256 bits of internal state
 */
struct prng_state {
	u64 a, b, c, d;
};

static struct prng_state state;
static bool initialized;

#define rot(x, k) (((x)<<(k))|((x)>>(64-(k))))
static u64 prng_u64(struct prng_state *x)
{
	u64 e;

	e = x->a - rot(x->b, 7);
	x->a = x->b ^ rot(x->c, 13);
	x->b = x->c + rot(x->d, 37);
	x->c = x->d + e;
	x->d = e + x->a;

	return x->d;
}

static void prng_init(struct prng_state *state)
{
	int i;

	state->a = kaslr_get_random_seed(NULL);
	state->b = kaslr_get_random_seed(NULL);
	state->c = kaslr_get_random_seed(NULL);
	state->d = kaslr_get_random_seed(NULL);

	for (i = 0; i < 30; ++i)
		(void)prng_u64(state);

	initialized = true;
}

unsigned long kaslr_get_prandom_long(void)
{
	if (!initialized)
		prng_init(&state);

	return prng_u64(&state);
}

This was quickly decried as dangerous; as Andy Lutomirski put it:

> Ugh, don’t do this. Use a real DRBG.  Someone is going to break the 
> construction in your patch just to prove they can. 
> 
> ChaCha20 is a good bet.

In the end, this random number generator was quickly removed, and that was that.

But one can still wonder—is this generator secure but unanalyzed, or would it have been broken just to prove a point?

Bob Jenkins’s Small PRNG

The above generator was, as per the comment, derived from one of Bob Jenkins’s small-state generators[1]. It is, in particular, the following “three rotation 64-bit variant”:

typedef unsigned long long u8;
typedef struct ranctx { u8 a; u8 b; u8 c; u8 d; } ranctx;

#define rot(x,k) (((x)<<(k))|((x)>>(64-(k))))

u8 ranval( ranctx *x ) {
    u8 e = x->a - rot(x->b, 7);
    x->a = x->b ^ rot(x->c, 13);
    x->b = x->c + rot(x->d, 37);
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}

void raninit( ranctx *x, u8 seed ) {
    u8 i;
    x->a = 0xf1ea5eed, x->b = x->c = x->d = seed;
    for (i=0; i<20; ++i) {
        (void)ranval(x);
    }
}

The core consists of the iteration of a permutation; we can easily compute its inverse iteration as

u8 ranval_inverse( ranctx *x ) {
    u8 e = x->d - x->a;
    x->d = x->c - e;
    x->c = x->b - rot(x->d, 37);
    x->b = x->a ^ rot(x->c, 13);
    x->a = e + rot(x->b, 7);
    return x->d;
}
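
As a quick sanity check of the inverse, the round and its inverse can be ported to Python and run back to back. This is only a sketch: Python integers are unbounded, so every addition and subtraction is masked to 64 bits by hand, and the state is held in a plain list `[a, b, c, d]` (a representation chosen here for brevity, not taken from the original code).

```python
MASK = (1 << 64) - 1

def rot(x, k):
    """Rotate a 64-bit word left by k bits."""
    return ((x << k) | (x >> (64 - k))) & MASK

def ranval(s):
    """One forward step on the state list s = [a, b, c, d]; returns the new d."""
    a, b, c, d = s
    e = (a - rot(b, 7)) & MASK
    a = b ^ rot(c, 13)
    b = (c + rot(d, 37)) & MASK
    c = (d + e) & MASK
    d = (e + a) & MASK
    s[:] = [a, b, c, d]
    return d

def ranval_inverse(s):
    """Undo one forward step; returns the restored d."""
    a, b, c, d = s
    e = (d - a) & MASK
    d = (c - e) & MASK
    c = (b - rot(d, 37)) & MASK
    b = a ^ rot(c, 13)
    a = (e + rot(b, 7)) & MASK
    s[:] = [a, b, c, d]
    return d

def raninit(seed):
    """Seed the state as in raninit() and discard 20 outputs."""
    s = [0xf1ea5eed, seed, seed, seed]
    for _ in range(20):
        ranval(s)
    return s
```

Stepping forward some number of times and then stepping back the same number of times should return the state to where it started.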

The core permutation present in ranval is depicted below.

[Figure: The core permutation.]

This resembles a Type-3 Feistel network[2], with some added operations for extra diffusion. Nevertheless, the resemblance still means that there are relatively few changes from one state to the next.

The mode of operation, in modern terms, looks pretty much like a sponge pseudorandom generator with a capacity of 192 bits and a rate of 64 bits. As such, an ideal permutation in this mode of operation should be indistinguishable from a random stream until approximately $2^{96}$ captured 64-bit words.

Analysis

There are several ways to try and attack a pseudorandom generator:

  • We can try and find a bias in its output stream;
  • We can try to find a weakness in its initialization;
  • We can try to recover an intermediate state from its output;
  • Many more…

Our approach here will be the third one. The initialization, with its 20 rounds (or 30 in the KASLR version), is unlikely to have easily exploitable properties. Finding a bias in the output stream seems feasible, but in practical terms it has rather limited applicability.

Because the permutation is rather simple, we will try to model the problem algebraically. This means representing the problem as a multivariate system of equations over $\mathbb{F}_2$, where $a \cdot b$ means bitwise and, and $a + b$ means bitwise xor. Since the permutation above consists only of a combination of additions, subtractions, xors, and rotations, every operation is trivial to represent except addition (and subtraction).

Let $x, y$ and $z$ be 64-bit variables, and $x_i$ (resp. $y_i, z_i$) indicate the $i$th bit of $x$ (resp. $y, z$). One can represent 64-bit addition $z = x \boxplus_{64} y$ as a recursive system[3]:

$$ \begin{align} z_0 &= x_0 + y_0 \newline c_0 &= x_0 \cdot y_0 \newline z_i &= x_i + y_i + c_{i-1} \newline c_i &= x_i \cdot y_i + c_{i-1} \cdot (x_i + y_i) \newline &= x_i \cdot y_i + c_{i-1} \cdot x_i + c_{i-1} \cdot y_i \end{align} $$
While this representation is quite simple, and can be written purely as a function of the input bits, it is not good for analysis. This is because of its algebraic degree: once the carries are substituted away, the largest monomial $x_i x_j \dots y_k y_l \dots$ in the top output bit contains up to 64 variables. Working with polynomials of such high degree is not practical, due to memory and computational requirements, and therefore we do the most common trick in the business: if the system is too complex, add new variables to make it simpler.
$$ \begin{align} z_0 &= x_0 + y_0 \newline z_i &= x_i + y_i + x_{i-1}\cdot y_{i-1} + (z_{i-1} + x_{i-1} + y_{i-1})\cdot(x_{i-1} + y_{i-1}) \newline &= x_i + y_i + x_{i-1}\cdot y_{i-1} + z_{i-1}\cdot x_{i-1} + z_{i-1} \cdot y_{i-1} + x_{i-1} + y_{i-1} \end{align} $$
It is clear that this is equivalent to the above by checking that $c_{i-2} = z_{i-1} + x_{i-1} + y_{i-1}$, which follows directly from $z_{i-1} = x_{i-1} + y_{i-1} + c_{i-2}$. We now carry 64 extra variables for each addition, that is, the $z_i$ are actual variables in our equation system, but the algebraic degree remains 2.
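
The quadratic system for addition is easy to check by brute force at a reduced word size. The sketch below, which uses a hypothetical helper name `satisfies_add_system` and a 4- or 5-bit word instead of 64 bits, enumerates every pair $(x, y)$ and confirms that the only $z$ satisfying all the equations is $x + y$ modulo $2^{\text{bits}}$.

```python
from itertools import product

def bit(v, i):
    """Return bit i of the integer v."""
    return (v >> i) & 1

def satisfies_add_system(x, y, z, bits):
    """True if z satisfies the degree-2 equations for z = x + y (mod 2**bits)."""
    if bit(z, 0) != bit(x, 0) ^ bit(y, 0):
        return False
    for i in range(1, bits):
        xp, yp, zp = bit(x, i - 1), bit(y, i - 1), bit(z, i - 1)
        rhs = (bit(x, i) ^ bit(y, i) ^ (xp & yp)
               ^ (zp & xp) ^ (zp & yp) ^ xp ^ yp)
        if bit(z, i) != rhs:
            return False
    return True

def check_addition(bits=4):
    """Exhaustively confirm the system has exactly one solution: the sum."""
    mask = (1 << bits) - 1
    for x, y in product(range(1 << bits), repeat=2):
        solutions = [z for z in range(1 << bits)
                     if satisfies_add_system(x, y, z, bits)]
        if solutions != [(x + y) & mask]:
            return False
    return True
```

Because each $z_i$ depends only on bits at positions $i$ and $i-1$, the system is triangular, which is why exactly one $z$ survives for each input pair.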

The equation system for subtraction is the same as with addition, with a simple reordering of the variables. Alternatively, we can explicitly write it as

$$ \begin{align} z_0 &= x_0 + y_0 \newline z_i &= x_i + y_i + (x_{i-1} + 1)\cdot y_{i-1} + (z_{i-1} + x_{i-1} + y_{i-1})\cdot((x_{i-1} + 1) + y_{i-1}) \newline &= x_i + y_i + x_{i-1}\cdot y_{i-1} + z_{i-1}\cdot x_{i-1} + z_{i-1}\cdot y_{i-1} + z_{i-1} + y_{i-1} \end{align} $$
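
Both ways of handling subtraction can be checked against each other at a small word size. In the sketch below (helper names are illustrative), `sub_via_equations` evaluates the explicit subtraction system bit by bit, and `add_via_equations` evaluates the addition system; the "reordering" view says that $z = x - y$ exactly when $x = z + y$.

```python
def bit(v, i):
    """Return bit i of the integer v."""
    return (v >> i) & 1

def add_via_equations(x, y, bits):
    """Compute x + y (mod 2**bits) from the degree-2 addition system."""
    z = bit(x, 0) ^ bit(y, 0)
    for i in range(1, bits):
        xp, yp, zp = bit(x, i - 1), bit(y, i - 1), bit(z, i - 1)
        zi = (bit(x, i) ^ bit(y, i) ^ (xp & yp)
              ^ (zp & xp) ^ (zp & yp) ^ xp ^ yp)
        z |= zi << i
    return z

def sub_via_equations(x, y, bits):
    """Compute x - y (mod 2**bits) from the explicit subtraction system."""
    z = bit(x, 0) ^ bit(y, 0)
    for i in range(1, bits):
        xp, yp, zp = bit(x, i - 1), bit(y, i - 1), bit(z, i - 1)
        zi = (bit(x, i) ^ bit(y, i) ^ (xp & yp)
              ^ (zp & xp) ^ (zp & yp) ^ zp ^ yp)
        z |= zi << i
    return z

def check_subtraction(bits=5):
    """Exhaustively compare both views of subtraction with integer arithmetic."""
    mask = (1 << bits) - 1
    for x in range(1 << bits):
        for y in range(1 << bits):
            z = sub_via_equations(x, y, bits)
            if z != (x - y) & mask:
                return False
            if add_via_equations(z, y, bits) != x:
                return False
    return True
```

The only difference between the two systems is the pair of linear terms: $x_{i-1} + y_{i-1}$ for addition versus $z_{i-1} + y_{i-1}$ for subtraction.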

Now it becomes quite straightforward to model the entire round as an equation system like above, reordering the equations such that it becomes a system of the form $$ \begin{align} p_1(x_0,\dots) &= 0, \newline p_2(x_0,\dots) &= 0, \newline \dots & \newline p_l(x_0,\dots) &= 0, \newline \end{align} $$ which we call the algebraic normal form, or ANF, of the system.

Below we present a Python script that does exactly this, receiving a number of output leaks as arguments:

import sys

BITS = 64

def VAR(n=BITS):
  if not hasattr(VAR, "counter"):
    VAR.counter = 0 
  t = [ VAR.counter + i for i in range(n) ]
  VAR.counter += n
  return t

def ROTL(x, c):
  z = x[:]
  for i in range(c):
    z = z[-1:] + z[0:-1]
  return z

# Model c = a ^ b
def XOR(c, a, b):
  for i in range(BITS):
    L.append('x{} + x{} + x{}'.format(c[i], a[i], b[i]))

# Model c = a + b
def ADD(c, a, b):
  L.append('x{} + x{} + x{}'.format(c[0], a[0], b[0]))
  for i in range(1,BITS):
    L.append('x{0} + x{1} + x{2} + x{3}*x{4} + x{3} + x{4} + x{3}*x{5} + x{4}*x{5}'.format(c[i], a[i], b[i], a[i-1], b[i-1], c[i-1]))

# Model c = a - b
def SUB(c, a, b):
  L.append('x{} + x{} + x{}'.format(c[0], a[0], b[0]))
  for i in range(1,BITS):
    L.append('x{0} + x{1} + x{2} + x{3}*x{4} + x{4} + x{5} + x{3}*x{5} + x{4}*x{5}'.format(c[i], a[i], b[i], a[i-1], b[i-1], c[i-1]))

def EQ(a, b):
  for i in range(BITS):
    L.append('x{} + {}'.format(a[i], (b >> i)&1))

L = []

a = VAR()
b = VAR()
c = VAR()
d = VAR()

D = int(sys.argv[1], 0)
EQ(d, D)
for i in range(2, len(sys.argv)):
  e = VAR()
  # e = a - ROTL(b,  7)
  SUB(e, a, ROTL(b, 7))
  # a = b ^ ROTL(c, 13)
  a_ = VAR()
  XOR(a_, b, ROTL(c, 13))
  # b = c + ROTL(d, 37)
  b_ = VAR()
  ADD(b_, c, ROTL(d, 37))
  # c = d + e
  c_ = VAR()
  ADD(c_, d, e)
  # d = e + a
  d_ = VAR()
  ADD(d_, e, a_)
  a, b, c, d = a_, b_, c_, d_
  D = int(sys.argv[i], 0)
  EQ(d, D)

print('\n'.join(L))
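
Before handing such a system to a solver, it can be worth confirming that a known-good assignment actually satisfies every equation. The following sketch is not part of the script above; it assumes only the textual format the script prints (monomials joined by ` + `, factors joined by `*`, variables named `x<index>`, and the constants `0` and `1`), and the helper names are made up for illustration.

```python
def eval_anf_line(line, assignment):
    """Evaluate one polynomial over GF(2).

    `assignment` maps a variable index to 0 or 1. A satisfied
    equation evaluates to 0.
    """
    total = 0
    for monomial in line.split('+'):
        value = 1
        for factor in monomial.split('*'):
            factor = factor.strip()
            if factor.startswith('x'):
                value &= assignment[int(factor[1:])]
            else:
                value &= int(factor)  # constant term, 0 or 1
        total ^= value
    return total

def violated_equations(lines, assignment):
    """Return the equations that the assignment fails to satisfy."""
    return [line for line in lines if eval_anf_line(line, assignment)]

# Tiny example in the same format as the generated system.
example = ['x0 + x1 + x2', 'x3 + x0*x1 + 1', 'x4 + 0']
good = {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}
print(violated_equations(example, good))
```

An empty list means every polynomial evaluates to zero under the given assignment; any modelling mistake in an `ADD` or `SUB` template would show up as a non-empty list when fed a genuine state trace.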

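Having this system, we can solve it in two main ways:

  • Computing a Gröbner basis of the system;
  • Converting it to a boolean satisfiability (SAT) instance and running a SAT solver.

We note that both solving polynomial systems over $\mathbb{F}_2$, which is what the Gröbner basis is used for here, and boolean satisfiability are NP-hard in general. However, for small enough and simple enough systems, the heuristics used by good modern solvers make many of these problems tractable.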

Although we tinkered with the first approach, the latter is both simpler to implement and more efficient. We also made use of the recent and quite convenient tool Bosphorus, which makes it straightforward to export a simplified CNF given an ANF equation system exported by our script above:

./bob 8 | xargs python bob.py > /tmp/test.anf && bosphorus -v 0 --simplify=1 --sat=0 --xldeg=3 --anfread /tmp/test.anf --cnfwrite /tmp/test.cnf && ./cadical --unsat -q /tmp/test.cnf | python recover.py
Initial state: 0x512E276FCD97EE94 0xE5326BC5D9053F7F 0x4746014B33BEBC20 0x5012637EA2980D1E
0x512E276FCD97EE94 0xE5326BC5D9053F7F 0x4746014B33BEBC20 0x5012637EA2980D1E

In the above snippet, we use ./bob to generate a random state and leak 8 outputs, bob.py (the script above) to create the ANF from these leaks, bosphorus to convert the system to CNF, CaDiCaL[4] to solve the system, and recover.py to convert the output of cadical back to readable integer values.
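
The post-processing step is not shown here, so the following is only a guess at what such a script could look like, not the actual recover.py. It assumes the solver prints its model in the usual SAT-competition style (lines starting with `v`, literals terminated by `0`) and, as a further assumption about the CNF export, that ANF variable `x<i>` becomes CNF variable `i + 1`. The first four 64-bit blocks of variables are the initial `a, b, c, d` allocated by the script above, least significant bit first.

```python
def parse_model(lines):
    """Collect the CNF variables set to true in DIMACS-style 'v' lines."""
    true_vars = set()
    for line in lines:
        if not line.startswith('v'):
            continue  # skip comments and the 's SATISFIABLE' status line
        for token in line.split()[1:]:
            literal = int(token)
            if literal > 0:
                true_vars.add(literal)
    return true_vars

def words_from_model(true_vars, n_words=4, bits=64, offset=1):
    """Reassemble the first n_words variable blocks into integers.

    `offset` is the assumed shift between ANF index i and CNF variable i + 1.
    """
    words = []
    for w in range(n_words):
        value = 0
        for i in range(bits):
            if w * bits + i + offset in true_vars:
                value |= 1 << i
        words.append(value)
    return words

def format_state(words):
    """Format words as zero-padded uppercase hex, as in the output above."""
    return ' '.join('0x{:016X}'.format(w) for w in words)
```

In a pipeline this would be driven by something like `print(format_state(words_from_model(parse_model(sys.stdin))))`. If the variable numbering in the exported CNF differs from the assumption, only `offset` (or the index arithmetic) needs to change.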

The number of leaked values is significant to the recovery speed. The minimum number of consecutive leaks to have a unique solution is 4—the initial value of d plus 3 other leaks to constrain the $2^{192}$ possible initial state variables $a, b, c$ to a single value.

However, 4 leaks seem to make the problem quite hard for SAT solvers. If, instead, we use 5 leaks the problem becomes tractable. The more leaks we have, the faster it will be, until a certain point. We found, experimentally, that 8 leaks are the sweet spot for recovery time, with more leaks failing to speed things up.

The following table contains the solving speeds, on an Intel Core i7-4770, for various numbers of leaks, averaged over 100 runs:

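| Leaked words | Average state recovery time (seconds) |
|--------------|----------------------------------------|
| 5            | 95                                     |
| 6            | 43                                     |
| 7            | 31                                     |
| 8            | 26                                     |
| 9            | 27                                     |
| 10           | 28                                     |
| 11           | 29                                     |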

Thus, it is safe to say that this generator is not suitable for cryptographic purposes.

We also note that SMT solvers could have been used to make the instantiation of the problem simpler. However, this results in poorer solving performance, and the performance across SMT solvers fluctuates even more wildly than with our approach.

Carried Away

And now for something completely different.

While looking through the KASLR code, we find a peculiar piece of code in kaslr_get_random_long, the function that is used to get random values for KASLR:

unsigned long kaslr_get_random_long(const char *purpose)
{
#ifdef CONFIG_X86_64
	const unsigned long mix_const = 0x5d6008cbf3848dd3UL;
#else
	const unsigned long mix_const = 0x3f39e593UL;
#endif
	unsigned long raw, random = get_boot_seed();
	bool use_i8254 = true;

	debug_putstr(purpose);
	debug_putstr(" KASLR using");

	if (has_cpuflag(X86_FEATURE_RDRAND)) {
		debug_putstr(" RDRAND");
		if (rdrand_long(&raw)) {
			random ^= raw;
			use_i8254 = false;
		}
	}

	if (has_cpuflag(X86_FEATURE_TSC)) {
		debug_putstr(" RDTSC");
		raw = rdtsc();

		random ^= raw;
		use_i8254 = false;
	}

	if (use_i8254) {
		debug_putstr(" i8254");
		random ^= i8254();
	}

	/* Circular multiply for better bit diffusion */
	asm(_ASM_MUL "%3"
	    : "=a" (random), "=d" (raw)
	    : "a" (random), "rm" (mix_const));
	random += raw;

	debug_putstr("...\n");

	return random;
}

The random 32- or 64-bit word that is returned begins with a simple hash of the kernel build and boot information for the present kernel:

/* Attempt to create a simple but unpredictable starting entropy. */
static unsigned long get_boot_seed(void)
{
	unsigned long hash = 0;

	hash = rotate_xor(hash, build_str, sizeof(build_str));
	hash = rotate_xor(hash, boot_params, sizeof(*boot_params));

	return hash;
}

After that, it depends on which CPU features are enabled:

  • If rdrand is available, random is xored with its value. Under the assumption that rdrand works as advertised, this should result in a perfectly distributed value.
  • If rdtsc is available, random is once again mixed in with the timestamp counter value. This is not as good an entropy source as rdrand, particularly since rdtsc is usually available system-wide.
  • If all else fails, use the i8254 lower-resolution timer.

After doing all this mixing, and in particular if only timers are used, random values are likely to be highly biased—the most significant bits are likely to remain relatively static over time.

To convert this lopsided entropy into a uniformly distributed value, since 2013 the function ends with a “cyclic multiplication” to smooth things over:

/* Circular multiply for better bit diffusion */
asm(_ASM_MUL "%3"
    : "=a" (random), "=d" (raw)
    : "a" (random), "rm" (mix_const));
random += raw;

In short, it computes the full product of random times 0x3f39e593 or 0x5d6008cbf3848dd3, and adds the upper bits (in raw) to the lower bits (in random). This ensures that all the bits are more or less equitably mixed.
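
The same computation can be modelled in a few lines of Python, which makes it easier to experiment with. This is a sketch rather than a faithful emulation of the instruction sequence: Python integers do not wrap, so the register width is imposed with an explicit mask, and the function name and `bits` parameter are introduced here for illustration.

```python
def circular_multiply_nocarry(x, mix_const, bits=64):
    """Model of the mixing step as currently written.

    The full double-width product is split into its low and high
    halves (what MUL leaves in the a and d registers), and the two
    halves are added with the carry out of that addition discarded.
    """
    mask = (1 << bits) - 1
    product = (x & mask) * mix_const
    low = product & mask
    high = product >> bits
    return (low + high) & mask

MIX64 = 0x5d6008cbf3848dd3
MIX32 = 0x3f39e593
print(hex(circular_multiply_nocarry(1, MIX64)))
print(hex(circular_multiply_nocarry(0xdeadbeef, MIX32, bits=32)))
```

Passing a small `bits` value, such as 8 or 16, keeps the input space small enough to enumerate exhaustively.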

But there’s a problem. Two, in fact: one theoretical and one practical.

In theory, what is being attempted here is randomness extraction. There are two usual ways to accomplish this: using a strong hash function modeled as a random oracle, or using a universal hash function and the leftover hash lemma. Here we have neither, and it’s clear that the output only looks unbiased for a naive attacker who cannot simply (approximately) invert the transformation.

The practical issue is different: if the entropy we have is actually well-distributed (say, by using rdrand), then the cyclic multiplication makes it worse by creating many values that are simply unreachable. Why? Because the multiplication—as implemented here—is not bijective.

Cyclic multiplication

Cyclic multiplication is best interpreted as multiplication modulo $2^n-1$, with lazy reduction. In other words,

$$ a \otimes b = \begin{cases} 2^n-1 & \text{ if } a = 2^n-1 \newline a \times b \bmod (2^n-1) & \text{ otherwise. } \end{cases} $$

If $b$ is relatively prime to $2^n-1$, this operation is clearly invertible. Its implementation is simple, as well:

$$ a \otimes b = \left(a \times b \bmod 2^n\right) + \left\lfloor{\frac{a\times b}{2^n}}\right\rfloor \pmod{2^n - 1}\,. $$

This is exactly what was implemented above. But there is one problem: the sum may overflow. To keep correctness, the overflowing bit (the carry) must be added back to the result.
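
The lazy-reduction identity is easy to confirm exhaustively for a small $n$. The sketch below uses $n = 8$ and the multiplier 83, both picked arbitrarily for illustration (83 is coprime to $2^8 - 1 = 255$), and folds the carry back in as described.

```python
from math import gcd

def cyclic_multiply(a, b, bits):
    """Multiply modulo 2**bits - 1 with lazy reduction and end-around carry."""
    mask = (1 << bits) - 1
    product = a * b
    s = (product & mask) + (product >> bits)  # low half + high half
    s = (s & mask) + (s >> bits)              # add the carry back in
    return s

BITS, B = 8, 83
assert gcd(B, (1 << BITS) - 1) == 1

# For a < 2^n - 1 the result is the ordinary modular product...
matches = all(cyclic_multiply(a, B, BITS) == (a * B) % 255 for a in range(255))
# ...and the all-ones input maps to itself, so the map is a bijection.
outputs = {cyclic_multiply(a, B, BITS) for a in range(256)}
print(matches, len(outputs))
```

A single fold is enough because the sum of the two halves is below $2^{n+1}$, so the carry is at most one bit and adding it back cannot overflow again.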

If the carry is not added, there are a number of values proportional to $b$ that will never be reached. In the case of $b = \text{0x3f39e593}$, there are around $2^{28}$ unreachable values—1 out of every 16. While this is not particularly concerning here, it is an unnecessary flaw and easily corrected.
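
Enumerating all $2^{32}$ inputs is slow in Python, but the effect is already visible at 16 bits. The sketch below counts distinct outputs with and without the carry; the 16-bit multiplier 0xe593 is simply the low half of the 32-bit constant, chosen here for illustration and not used by the kernel.

```python
from math import gcd

def image_size(b, bits, add_carry):
    """Count distinct outputs of the mixing step over all inputs of `bits` bits."""
    mask = (1 << bits) - 1
    seen = set()
    for a in range(1 << bits):
        product = a * b
        s = (product & mask) + (product >> bits)
        if add_carry:
            s += s >> bits  # end-around carry
        seen.add(s & mask)
    return len(seen)

BITS, B = 16, 0xe593
assert gcd(B, (1 << BITS) - 1) == 1

total = 1 << BITS
with_carry = image_size(B, BITS, add_carry=True)
without_carry = image_size(B, BITS, add_carry=False)
print("unreachable with carry:   ", total - with_carry)
print("unreachable without carry:", total - without_carry)
```

With the carry the map is a bijection, so nothing is unreachable. Without it, every input whose half-sum overflows lands one below where it should, colliding with some other input's output.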

Fixing it

The fix, now, becomes obvious: simply add the missing carry. This way the final transformation cannot harm a good uniform distribution, unlike before.

/* Circular multiply for better bit diffusion */
asm("mul %2\n"
    "add %1, %0\n"
    "adc $0, %0"
    : "+a" (random), "=d" (raw)
    : "rm" (mix_const));

return random;

/* Alternatively, a more portable version... */
/* Codegen is equivalent to above in recent gcc/clang versions */

/* Circular multiply for better bit diffusion */
asm(_ASM_MUL "%3"
	: "=a" (random), "=d" (raw)
	: "a" (random), "rm" (mix_const));
random += raw;
random += random < raw;
return random;
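
The portable variant relies on a standard idiom: after an unsigned addition wraps, the result is smaller than either operand exactly when a carry occurred. A small Python model (with the wrap-around emulated by masking, and an 8-bit width chosen so the check can be exhaustive) confirms that `random += random < raw` adds back precisely the lost carry.

```python
def add_with_end_around_carry(low, high, bits):
    """Model of `random += raw; random += random < raw;` on `bits`-bit words."""
    mask = (1 << bits) - 1
    s = (low + high) & mask       # random += raw, wrapping like unsigned C
    s = (s + (s < high)) & mask   # random += random < raw
    return s

def idiom_matches_true_carry(bits=8):
    """Compare the comparison-based carry with the real carry for all inputs."""
    mask = (1 << bits) - 1
    for low in range(1 << bits):
        for high in range(1 << bits):
            carry = (low + high) >> bits
            expected = ((low + high) & mask) + carry
            if add_with_end_around_carry(low, high, bits) != expected:
                return False
    return True

print(idiom_matches_true_carry())
```

The second addition can never wrap: when a carry occurs, the truncated sum is at most $2^n - 2$, leaving room for the extra one.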

  1. To be clear, Bob Jenkins never claimed these generators were cryptographically secure.

  2. See page 2 of On Generalized Feistel Networks for an idea of what the various Feistel network variants look like.

  3. Remember, again, that we are working with individual bits and that $\cdot$ means and and $+$ means xor.

  4. Obviously, any SAT solver could have been used here; we used the one that maximized single-thread solving speed for our problem.