Bernoulli Process

Random Process of Binary (Boolean) Random Variables

In the grim theatre of probability and statistics, a Bernoulli process (named after Jacob Bernoulli) is a sequence, finite or infinite, of binary random variables. Think of it as a discrete-time stochastic process stripped down to its bare essentials, capable of only two states: 0 or 1. These individual variables, call them Xᵢ, are not only identically distributed but also stubbornly independent. In simpler terms, it is like repeatedly flipping a coin, possibly a rigged one, but with a consistent bias. Each flip, each Xᵢ, is a Bernoulli trial, a singular moment of chance governed by the same Bernoulli distribution. For those who crave more complexity, there is the Bernoulli scheme, which allows for more than two outcomes, like rolling a die.

And then there's the eternal question: how do we know if our coin is truly fair? That’s the problem of checking whether a coin is fair—trying to discern the truth from a limited sample of outcomes. A futile endeavor, perhaps, but a common one.

Definition

A Bernoulli process is, at its core, a finite or infinite sequence of independent random variables X₁, X₂, X₃, ... . Each variable Xᵢ is confined to the binary realm, its value being either 0 or 1. Crucially, the probability that Xᵢ equals 1, denoted by p, remains constant for every i. It is a chain of independent, identically distributed Bernoulli trials, each with the same underlying bias.

This independence is key. It means the process is memoryless. The past, no matter how dismal or promising, has no bearing on the future. Each event stands alone, a fresh start. Of course, in the messy reality of things, the true value of p is rarely known. So, we’re left to infer, to guess, to use past frequencies to predict future occurrences. It's an indirect path, a constant struggle against the unknown.
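
As a concrete (and entirely optional) illustration, here is a minimal Python sketch of such a process using only the standard library; the bias p = 0.3, the sample size and the function name are arbitrary choices, not part of the definition:

import random

def bernoulli_process(p, n, seed=0):
    """Generate n independent Bernoulli(p) trials as a list of 0s and 1s."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

trials = bernoulli_process(p=0.3, n=100_000)
print(sum(trials) / len(trials))   # empirical frequency of 1s, close to p = 0.3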

If the process is infinite, a peculiar property emerges: from any point forward, the remaining sequence is an identical Bernoulli process. A perpetual cycle, a fresh start that never truly ends.

Interpretation

The two possible values, 0 and 1, are often labeled "success" and "failure." So, Xᵢ can be seen as the number of successes in the i-th "trial." It’s a stark, binary view of the world.

Other common interpretations include true/false or yes/no. Regardless of the label, each Xᵢ is a Bernoulli trial with parameter p.

Often, time is the silent observer, marking the passage between trials. X₁, X₂, ..., Xᵢ, ... occur at distinct moments: 1, 2, ..., i, ... . But this notion of time, of past and future, is not strictly necessary. At its most general, any Xᵢ and Xⱼ are simply members of a set of random variables indexed by {1, 2, ..., n} (for finite cases) or {1, 2, 3, ...} (for infinite cases).

A single experiment with two outcomes—let's call them "success" and "failure," coded as 1 and 0—is modeled by a Bernoulli distribution. [1] From the Bernoulli process, several other distributions can be derived:

  • The number of successes in the first n trials follows a binomial distribution B(n, p). It’s the cumulative effect of multiple trials.
  • The number of failures before reaching r successes is governed by the negative binomial distribution NB(r, p). This speaks to the waiting time for a specific level of success.
  • The number of failures before the first success is a geometric distribution NB(1, p), a special case of the negative binomial. This is the pure waiting game.

These negative binomial variables can be seen as representing random waiting times.
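
To make the connection concrete, here is a small Python sketch (standard library only; the choices p = 0.3, n = 20 and r = 3 are arbitrary) realising each of the counts listed above from a single stream of trials:

import random

rng = random.Random(1)
p = 0.3

def trial():
    """One Bernoulli(p) trial: 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

# Binomial B(n, p): number of successes in the first n trials.
n = 20
successes_in_n = sum(trial() for _ in range(n))

# Geometric NB(1, p): number of failures before the first success.
failures_before_first = 0
while trial() == 0:
    failures_before_first += 1

# Negative binomial NB(r, p): number of failures before the r-th success.
r, successes, failures = 3, 0, 0
while successes < r:
    if trial() == 1:
        successes += 1
    else:
        failures += 1

print(successes_in_n, failures_before_first, failures)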

Formal Definition

Let's get precise. The Bernoulli process can be described using the language of probability spaces. Imagine a sequence of independent realizations of a random variable that can only yield "heads" (H) or "tails" (T). The state space for a single outcome is {H, T}.

Borel Algebra

Consider the countably infinite direct product of copies of {H, T}. We often focus on the one-sided sequence space \Omega = \{H, T\}^{\mathbb{N}} or the two-sided space \Omega = \{H, T\}^{\mathbb{Z}}. A natural topology, the product topology, can be imposed on this space. The basic open sets of this topology are the cylinder sets: sets of sequences that agree with a fixed finite string of H's and T's, with the rest of the sequence left unconstrained. The collection of cylinder sets generates a sigma algebra, specifically a Borel algebra. This algebra, \mathcal{B}, is what we will work with: (\Omega, \mathcal{B}).

Bernoulli Measure

If the probability of flipping heads is p, and tails is 1-p, we can define a measure on this product space, P = \{p, 1-p\}^{\mathbb{N}} (or its two-sided counterpart). In essence, a discrete random variable X has a Bernoulli distribution with parameter p (where 0 \le p \le 1) if its probability mass function is P(X=1) = p and P(X=0) = 1-p. We denote this as Ber(p). [1]

For a specific sequence of coin flips, say [\omega_1, \omega_2, \dots, \omega_n], the probability of observing it is p^k (1-p)^{n-k}, where k is the number of heads (H) and n-k is the number of tails (T). This is written as P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = p^k (1-p)^{n-k}, where x_i = [\omega_i = H] in Iverson bracket notation (1 if heads, 0 if tails). This probability P is the Bernoulli measure. [2]
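
A one-function Python sketch of this measure on finite prefixes; the function name and the example strings are purely illustrative:

def cylinder_probability(flips, p):
    """Bernoulli measure of the cylinder set fixed by a finite string of flips:
    p**k * (1 - p)**(n - k), with k heads ('H') out of n flips."""
    k = flips.count('H')
    return p ** k * (1 - p) ** (len(flips) - k)

print(cylinder_probability("HHTH", 0.5))   # 0.0625 = (1/2)**4
print(cylinder_probability("HHTH", 0.7))   # 0.7**3 * 0.3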

It’s important to note that the probability of any specific, infinitely long sequence of coin flips is precisely zero; this is because \lim_{n \to \infty} p^n = 0 for any 0 \le p < 1. A total probability of 1 over the whole space is perfectly consistent with every individual infinite sequence having measure zero. Yet, we can still say that some classes of infinite sequences are far more probable than others. This is the domain of the asymptotic equipartition property.

So, formally, a Bernoulli process is the probability triple (\Omega, \mathcal{B}, P).

Law of Large Numbers, Binomial Distribution, and Central Limit Theorem

Let's assume our coin is encoded in the standard way: 1 for heads (H), 0 for tails (T). The law of large numbers tells us that the average of the sequence, \bar{X}_n := \frac{1}{n}\sum_{i=1}^{n}X_{i}, will converge almost surely to the expected value. The expected value of flipping heads (1) is simply p. This holds for any X_i in the Bernoulli process: \mathbb{E}[X_i] = \mathbb{P}([X_i=1]) = p.

We often want to know the frequency of heads in n flips. This is where the binomial coefficient comes in. The number of strings of length n with k heads is \binom{n}{k} = \frac{n!}{k!(n-k)!}. If the probability of heads is p, the probability of observing a string of length n with k heads is \mathbb{P}([S_n=k]) = \binom{n}{k}p^k(1-p)^{n-k}, where S_n = \sum_{i=1}^{n}X_{i}. This is the binomial distribution.

Notice that if n=1, the Binomial distribution collapses into the Bernoulli distribution. The Bernoulli is, in essence, the binomial distribution for a single trial.
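
A short Python check of both observations, using math.comb from the standard library; the values n = 10 and p = 0.3 are arbitrary:

from math import comb

def binomial_pmf(n, k, p):
    """P(S_n = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# n = 1 collapses to the Bernoulli distribution.
print(binomial_pmf(1, 1, 0.3), binomial_pmf(1, 0, 0.3))   # 0.3, 0.7

# The probabilities over k = 0..n sum to 1.
n, p = 10, 0.3
print(sum(binomial_pmf(n, k, p) for k in range(n + 1)))   # 1.0 (up to rounding)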

The behavior of S_n for large n is where things get interesting. Using Stirling's approximation for the factorial, n! \approx \sqrt{2\pi n}\; n^n e^{-n}\left(1 + \mathcal{O}(1/n)\right), we find that the probability distribution approaches the Normal distribution. This is the core of the central limit theorem, the simplest manifestation of this powerful result.
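
A rough numerical comparison in Python of the binomial probabilities against a Gaussian with matching mean np and variance np(1-p); the choice n = 500, p = 0.3 is illustrative, not canonical:

from math import comb, exp, pi, sqrt

def binomial_pmf(n, k, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def gaussian_density(n, k, p):
    """Normal density with mean n*p and variance n*p*(1-p), evaluated at k."""
    mu, var = n * p, n * p * (1 - p)
    return exp(-(k - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

n, p = 500, 0.3
for k in (130, 150, 170):            # values around the mean n*p = 150
    print(k, binomial_pmf(n, k, p), gaussian_density(n, k, p))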

The combination of the law of large numbers and the central limit theorem leads to the asymptotic equipartition property. In simple terms: yes, over many flips, you'll get heads p fraction of the time, and this aligns with the peak of the Gaussian. But this peak is infinitely sharp, with an infinitely steep drop-off. The set of all possible infinite sequences is partitioned into two: those that occur with probability 1, and those that occur with probability 0. This is the Kolmogorov 0-1 law.

The "size" of the set of likely sequences is also significant. Its logarithm is the entropy of the Bernoulli process. Consider all 2n2^n strings of length n. The number of likely strings is 2nH2^{nH} for H1H \le 1. Using Stirling's approximation, we find this H value: H=plog2p(1p)log2(1p)H = -p\log_2 p - (1-p)\log_2(1-p). This is the Bernoulli entropy. Note: H here stands for entropy, not heads.

John von Neumann pondered whether different Bernoulli processes could be isomorphic in the sense of isomorphism of dynamical systems. The Ornstein isomorphism theorem eventually answered this: the Bernoulli process is unique and universal. It's the platonic ideal of randomness. Though, one must be cautious; systems that are mixing might be considered "stronger" in some ways, but they often lack the independence of the Bernoulli process.

Dynamical Systems

The Bernoulli process can also be viewed as a dynamical system, specifically an ergodic system and a measure-preserving dynamical system. Two common perspectives are as a shift space or an odometer.

Bernoulli Shift

One way to construct a dynamical system is through the shift operator T on the product space \Omega = \{H, T\}^{\mathbb{N}}, defined as T(X_0, X_1, X_2, \dots) = (X_1, X_2, \dots). The Bernoulli measure is translation-invariant, meaning P(T^{-1}(\sigma)) = P(\sigma) for any cylinder set \sigma. This makes it a Haar measure, an invariant measure.
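
The invariance can be checked by brute force on finite prefixes. A Python sketch, with an arbitrary bias p = 0.3 and an arbitrary three-flip cylinder; the preimage under the shift fixes flips 1 through 3 and leaves flip 0 free:

def prefix_prob(bits, p):
    """Bernoulli measure of the cylinder fixing the first len(bits) flips (1 = H)."""
    k = sum(bits)
    return p ** k * (1 - p) ** (len(bits) - k)

p = 0.3
pattern = (1, 0, 1)                       # cylinder: the first three flips are H, T, H

direct = prefix_prob(pattern, p)
# Preimage under the shift: flips 1..3 equal the pattern, flip 0 unconstrained.
preimage = sum(prefix_prob((b,) + pattern, p) for b in (0, 1))

print(direct, preimage)                   # equal: the Bernoulli measure is shift-invariant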

The shift operator T induces a linear operator, the transfer operator or Ruelle–Frobenius–Perron operator, \mathcal{L}_T, on functions from \mathcal{B} to \mathbb{R}. Its largest eigenvalue is 1, and its corresponding eigenvector is the Bernoulli measure itself: \mathcal{L}_T(P) = P. Curiously, when restricted to polynomials, the eigenfunctions of \mathcal{L}_T are the Bernoulli polynomials! [3] [4] This seems to be a naming coincidence, as Bernoulli himself likely wouldn't have foreseen it.

The 2x mod 1 Map

The map T: [0,1) \to [0,1), defined as x \mapsto 2x \bmod 1, preserves the Lebesgue measure.

To elaborate: an infinite binary sequence b_0, b_1, b_2, \dots can be represented as a real number y = \sum_{n=0}^{\infty} \frac{b_n}{2^{n+1}}, where 0 \le y \le 1. The shift operator T on sequences induces a map on the interval, also denoted T, such that T(y) = 2y \bmod 1. This is the dyadic transformation. For the two-sided sequence space \Omega = \{H, T\}^{\mathbb{Z}}, the induced map is the baker's map.

When considering functions f(y), the action of the transfer operator \mathcal{L}_T is given by [\mathcal{L}_T f](y) = \frac{1}{2}f\left(\frac{y}{2}\right) + \frac{1}{2}f\left(\frac{y+1}{2}\right). When \mathcal{L}_T acts on polynomials, its eigenvalues are 2^{-n}, and the eigenfunctions are the Bernoulli polynomials B_n(y), which satisfy the identity \frac{1}{2}B_n\left(\frac{y}{2}\right) + \frac{1}{2}B_n\left(\frac{y+1}{2}\right) = 2^{-n}B_n(y).
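
The identity can be checked symbolically. A sketch using SymPy's Bernoulli polynomials, sympy.bernoulli(n, y); the range of degrees tested is arbitrary:

import sympy as sp

y = sp.symbols('y')
for n in range(1, 6):
    B = sp.bernoulli(n, y)                                   # Bernoulli polynomial B_n(y)
    lhs = sp.Rational(1, 2) * (B.subs(y, y / 2) + B.subs(y, (y + 1) / 2))
    rhs = sp.Rational(1, 2 ** n) * B
    print(n, sp.expand(lhs - rhs) == 0)                      # True: eigenvalue 2**(-n)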

The Cantor Set

The sum y = \sum_{n=0}^{\infty} \frac{b_n}{3^{n+1}} generates the Cantor function, a key reason why the space \{H, T\}^{\mathbb{N}} is sometimes referred to as the Cantor set.

Odometer

Another way to construct a dynamical system is via an odometer. This is essentially base-two addition on the space of infinite strings, complete with carry bits. Since addition forms a group, and the Bernoulli process has a topology, this yields a simple example of a topological group.

In this context, the transformation T adds one at the lowest-order position and propagates the carry: a leading run of 1s is flipped to 0s, and the first 0 becomes a 1, so that T(1, \dots, 1, 0, X_{k+1}, X_{k+2}, \dots) = (0, \dots, 0, 1, X_{k+1}, X_{k+2}, \dots). This transformation preserves the Bernoulli measure only when p = 1/2 (a fair coin); otherwise, it does not. Thus, T is a measure-preserving dynamical system only in the fair-coin case; otherwise it is merely a conservative system.
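
A minimal Python sketch of the odometer acting on a finite prefix of a sequence; the truncation to a finite list is purely for illustration:

def odometer(bits):
    """Add 1 at the lowest-order position and propagate the carry:
    (1, ..., 1, 0, x, ...) -> (0, ..., 0, 1, x, ...)."""
    bits = list(bits)
    for i, b in enumerate(bits):
        if b == 0:
            bits[i] = 1
            return bits
        bits[i] = 0            # 1 + 1 = 0, carry moves to the next position
    return bits                # an all-ones prefix: the carry falls off the truncated end

print(odometer([1, 1, 0, 1, 0]))   # [0, 0, 1, 1, 0]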

Bernoulli Sequence

The term "Bernoulli sequence" is often used informally to mean a specific realization of a Bernoulli process. However, it has a distinct formal definition.

Given a Bernoulli process formally defined as a single random variable (as described earlier), for each infinite sequence x of coin flips, we can define a sequence of integers \mathbb{Z}^x = \{n \in \mathbb{Z} : X_n(x) = 1\}. This is the Bernoulli sequence associated with the process. [Verification needed] In simpler terms, if x is a sequence of coin flips, the Bernoulli sequence is the list of times (natural numbers or time-points) at which the outcome was heads.

As defined, a Bernoulli sequence \mathbb{Z}^x is a random subset of the index set, the natural numbers \mathbb{N}.
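
In code the construction is a one-liner; a Python sketch with an arbitrary fair-coin sample of length 20:

import random

rng = random.Random(0)
x = [1 if rng.random() < 0.5 else 0 for _ in range(20)]     # a finite stretch of flips

# The Bernoulli sequence of x: the times at which the outcome was heads (1).
Z_x = [n for n, bit in enumerate(x) if bit == 1]
print(x)
print(Z_x)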

Almost all Bernoulli sequences \mathbb{Z}^x are ergodic sequences. [Verification needed]

Randomness Extraction

From any Bernoulli process, we can derive a Bernoulli process with p = 1/2 using the von Neumann extractor. This is one of the earliest methods for extracting uniform randomness.

Basic von Neumann Extractor

Represent the input as a sequence of bits (0s and 1s). Group the input stream into non-overlapping pairs: (11)(00)(10)... . Then, for each pair:

  • If the bits are the same, discard.
  • If the bits are different, output the first bit.
Input | Output
00    | discard
01    | 0
10    | 1
11    | discard

For example, an input stream of eight bits "10011011" is grouped as (10)(01)(10)(11). Applying the rule, we get the output "101".

The output stream has equally likely 0s and 1s because the pairs "10" and "01" are equally likely in the original process, each with probability p(1-p) = (1-p)p. This extraction of uniform randomness doesn't require the input trials to be independent, only uncorrelated. It even works for exchangeable sequences.

The von Neumann extractor uses two input bits to produce at most one output bit, so the output is at most half the length of the input. On average, it discards a proportion p^2 + (1-p)^2 of the input pairs (the 00s and 11s). This proportion approaches 1 when p is near 0 or 1, and is minimized at 1/2 when p = 1/2, in which case the output is, on average, 1/4 the length of the input.

Von Neumann (classical) main operation pseudocode:

if (Bit1 ≠ Bit2) {
  output(Bit1)
}
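
A runnable Python version of the same operation, applied to a whole list of bits rather than a single pair; the function name is ours, not standard:

def von_neumann_extract(bits):
    """Basic von Neumann extractor: scan non-overlapping pairs,
    emit the first bit of each unequal pair, discard equal pairs."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)
    return out

print(von_neumann_extract([1, 0, 0, 1, 1, 0, 1, 1]))   # [1, 0, 1], as in the example above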

Iterated von Neumann Extractor

The inefficiency of the basic von Neumann extractor—the "wasted randomness"—can be improved by iterating the algorithm. This process, known as the iterated von Neumann algorithm or advanced multi-level strategy (AMLS), [6] was introduced by Yuval Peres in 1992. [5] It works recursively, recycling "wasted randomness" from two sources: the sequence of discarded pairs (0 for 00, 1 for 11) and the sequence of non-discarded pairs. The key is that these recycled sequences remain exchangeable, making them suitable for further extraction. While this can be iterated infinitely to extract all available entropy, infinite computational resources are required. Therefore, the number of iterations is usually limited.

More precisely, the algorithm processes input bits in pairs, generating output and two new sequences. Using the AMLS paper's notation:

Input | Output | New Sequence 1 (A) | New Sequence 2 (1)
00    | none   | 0                  | 0
01    | 0      | 1                  | none
10    | 1      | 1                  | none
11    | none   | 0                  | 1

(If the input length is odd, the last bit is discarded.) The algorithm is then applied recursively to the two new sequences until the input is exhausted.

Example: Consider the input stream "11001011101110" (using 1 for H, 0 for T):

Step Number | Input                        | Output            | New Sequence 1 (A)    | New Sequence 2 (1)
0           | (11)(00)(10)(11)(10)(11)(10) | ()()(1)()(1)()(1) | (1)(1)(0)(1)(0)(1)(0) | (1)(0)()(1)()(1)
1           | (10)(11)(11)(01)(01)         | (1)()()(0)(0)     | (0)(1)(1)(0)(0)       | ()(1)(1)
2           | (11)(01)(10)                 | ()(0)(1)          | (0)(1)(1)             | (1)
3           | (10)(11)                     | (1)               | (1)(0)                | (1)
4           | (11)                         | ()                | (0)                   | (1)
5           | (10)                         | (1)               | (1)                   |
6           |                              |                   |                       |

Starting from step 1, the input for each step is a concatenation of Sequence 2 and Sequence 1 from the previous step. The final output is "1111000111". From 14 input bits, 10 output bits were generated, compared to only 3 bits using the basic von Neumann algorithm. This constant output rate and predictable processing also allow for constant-time implementations resistant to timing attacks.

Von Neumann–Peres (iterated) main operation pseudocode:

if (Bit1 ≠ Bit2) {
  output(1, Sequence1)
  output(Bit1)
} else {
  output(0, Sequence1)
  output(Bit1, Sequence2)
}
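
For comparison, a runnable Python sketch of the iteration. It follows Peres's recursive formulation, recursing separately on the two derived sequences (the XOR of each pair, and the shared bit of each equal pair) rather than concatenating them level by level as in the worked example above, so the exact output bits can differ even though the pair rules are the same; the default recursion depth is arbitrary:

def iterated_von_neumann(bits, depth=8):
    """Peres-style iterated von Neumann extraction.
    Level 0 is the basic extractor; deeper levels recycle the two derived sequences."""
    if depth < 0 or len(bits) < 2:
        return []
    out, seq1, seq2 = [], [], []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)       # von Neumann output bit
            seq1.append(1)      # the pair disagreed (XOR = 1)
        else:
            seq1.append(0)      # the pair agreed (XOR = 0)
            seq2.append(a)      # shared bit of the equal pair
    return out + iterated_von_neumann(seq1, depth - 1) + iterated_von_neumann(seq2, depth - 1)

bits = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]      # 11001011101110, as above
print(iterated_von_neumann(bits))                      # more bits than the basic extractor's 3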

A refinement in 2016 proposed discarding Sequence 2 earlier to process more levels of Sequence 1, improving efficiency in hardware implementations with limited levels. [7]