Probability and Randomness

1. Why study randomness and probability?

The lecture first addressed a conceptual question: why do computer scientists, and especially cryptographers, need probability at all?

The instructor explained that probability theory is the mathematical framework for dealing with systems that behave in uncertain or random ways. Historically, probability emerged from attempts to analyze games of chance such as card games. Over time, it became an essential tool in many sciences, especially for systems too complex to describe exactly, such as physical systems with enormous numbers of particles. In physics, probability appears both as a practical modeling tool and, in quantum mechanics, as an apparently intrinsic feature of nature.

The lecturer then contrasted this with computer science, where randomness is often not merely a nuisance but a resource. In theoretical computing, there are problems for which efficient randomized algorithms are known, while deterministic efficient solutions are unknown. Thus, randomness can make computation easier or more efficient. Cryptography, in particular, will later be shown to be impossible without randomness.

2. Monte Carlo approximation of \(\pi\)

To motivate probabilistic computation, the lecturer discussed a classic randomized method for approximating \(\pi\). One imagines throwing points uniformly at random into a square and checking whether they fall inside an inscribed circle. Since the probability that a random point falls inside the circle equals the ratio of the circle’s area to the square’s area, repeated sampling allows one to estimate \(\pi\).

More concretely:

  • the square has side length \(2\), hence area \(4\),
  • the circle has radius \(1\), hence area \(\pi\),
  • so the probability of landing inside the circle is \(\pi/4\).

If one samples many points uniformly at random and counts the fraction that fall inside the circle, then multiplying this fraction by 4 gives an approximation of \(\pi\). The lecturer emphasized that the estimate concentrates around the true value as the number of trials grows, and mentioned concentration bounds such as the Chernoff bound: the probability that the estimate is off by more than any fixed error decreases exponentially in the number of samples.
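The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name `estimate_pi`, the `seed` parameter, and the choice of the square \([-1,1] \times [-1,1]\) are assumptions made here.

```python
import random

def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling points uniformly from the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # the point landed inside the unit circle
            inside += 1
    # The fraction inside approximates pi/4, so scale by 4.
    return 4.0 * inside / num_samples

if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(n, estimate_pi(n, seed=0))
```

Running it with increasing sample counts should show the estimate settling near \(\pi\), although the exact digits depend on the seed.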

The point of the example was that randomization can solve numerical problems in a surprisingly simple way, even if deterministic methods are better for this particular task.

3. Randomness as an algorithmic resource

The instructor then gave examples showing how randomization can help in algorithm design. One example discussed was primality testing: for a long time, very efficient randomized primality tests were known before deterministic polynomial-time primality testing was discovered. The lecturer clarified that today primality is known to be in P, but historically it served as a major example of the power of randomization.

A more current and theoretically important example was polynomial identity testing. Here, one is given a multivariate polynomial over a finite field and wants to decide whether it computes the zero function. Randomized algorithms can do this efficiently by evaluating the polynomial at randomly chosen points: if a nonzero value is ever found, the polynomial is not identically zero; if repeated random evaluations all yield zero, then with very high probability the polynomial is the zero function. According to the lecture, this remains a famous problem for which efficient randomized algorithms are known but efficient deterministic ones are not.
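The random-evaluation idea can be sketched as follows. This is an illustrative sketch under several assumptions not taken from the lecture: the polynomial is given as a Python function, arithmetic is done modulo the prime \(2^{61}-1\), and the names `probably_zero`, `identity`, and `non_identity` are hypothetical. The standard justification (the Schwartz–Zippel lemma, not named in the notes) is that a nonzero polynomial of total degree \(d\) vanishes at a uniformly random point with probability at most \(d/p\).

```python
import random

P = 2**61 - 1  # a prime modulus, chosen here only for illustration

def probably_zero(poly, num_vars, trials=20, rng=None):
    """Randomized identity test: evaluate poly at random points of (Z_P)^num_vars."""
    rng = rng or random.Random()
    for _ in range(trials):
        point = [rng.randrange(P) for _ in range(num_vars)]
        if poly(*point) % P != 0:
            return False  # a nonzero value proves the polynomial is not identically zero
    return True  # every evaluation vanished: identically zero with high probability

def identity(x, y):
    # (x + y)^2 - (x^2 + 2xy + y^2), which is identically zero
    return (x + y) ** 2 - (x * x + 2 * x * y + y * y)

def non_identity(x, y):
    # (x + y)^2 - (x^2 + y^2) = 2xy, which is not identically zero
    return (x + y) ** 2 - (x * x + y * y)
```

Note the one-sided error: an answer of `False` is always correct, while an answer of `True` can be wrong only with small probability.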

4. Election forecasting as a real-world sampling example

Another intuitive example of randomization was election prediction. The lecturer described how one can predict the winner of an election by taking a random sample of votes rather than counting every vote immediately. If the sample is chosen uniformly and is large enough, then the sample proportions provide a very accurate estimate of the true vote proportions.

The lecturer referred to a hypergeometric Hoeffding-type bound, which shows that the probability of a significant deviation between the sample result and the true result decreases exponentially fast with the sample size. A numerical example was given: with a large population and a sample of 50,000 votes, the probability that the sample proportion is off by 1% or more is already very small, below \(10^{-4}\).
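The numerical example can be checked against the standard two-sided Hoeffding bound \(2e^{-2n\varepsilon^2}\). Whether the lecture used exactly this form is an assumption made here; the bound is stated for sampling with replacement and is known to remain valid for sampling without replacement. The function name `hoeffding_bound` is hypothetical.

```python
import math

def hoeffding_bound(sample_size, epsilon):
    """Upper bound 2*exp(-2*n*eps^2) on Pr[|sample mean - true mean| >= eps]."""
    return 2.0 * math.exp(-2.0 * sample_size * epsilon ** 2)

print(hoeffding_bound(50_000, 0.01))  # about 9.1e-05, i.e. below 1e-4
```

Under this assumption, a sample of 50,000 votes and a deviation of 0.01 give a bound of \(2e^{-10}\), which is consistent with the figure quoted above.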

This example illustrated the general principle that sampling can provide reliable information about a large system without observing everything, provided the sample is truly random.

5. Formal notion of a probability space

After the motivation, the lecture introduced the formal mathematical model: the probability space. In the finite setting used in the course, a probability space consists of:

  1. a finite sample space \(\Omega\), whose elements are the possible elementary outcomes, and
  2. a probability function assigning each elementary outcome \(\omega \in \Omega\) a number \(\Pr[\omega]\) between 0 and 1, such that the probabilities of all elementary outcomes sum to 1.

A subset \(E \subseteq \Omega\) is called an event. The probability of an event is defined as the sum of the probabilities of all elementary outcomes contained in it. The empty event has probability 0.
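As a minimal sketch of this definition, a finite probability space can be represented as a dictionary from outcomes to probabilities, and an event as a set of outcomes. The example space (a coin with \(\Pr[1] = 1/3\)) and the names `is_probability_space` and `prob` are chosen here for illustration only.

```python
from fractions import Fraction

# Hypothetical example space: a biased coin with Pr[1] = 1/3.
pr = {0: Fraction(2, 3), 1: Fraction(1, 3)}

def is_probability_space(pr):
    """Check that all probabilities lie in [0, 1] and sum to exactly 1."""
    return all(0 <= p <= 1 for p in pr.values()) and sum(pr.values()) == 1

def prob(event, pr):
    """Probability of an event (a set of outcomes): sum of the elementary probabilities."""
    return sum((pr[w] for w in event), Fraction(0))

print(is_probability_space(pr))  # True
print(prob({1}, pr))             # 1/3
print(prob(set(), pr))           # 0
```

Exact fractions are used instead of floats so that the condition that the probabilities sum to 1 can be checked without rounding error.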

The lecturer stressed that this is the formal framework underlying all later probabilistic reasoning, although in practice one often works with higher-level rules instead of constantly returning to the definition.

6. Basic examples of probability spaces

Several standard examples were introduced:

6.1. Fair coin

The sample space is \(\{0,1\}\), and each outcome has probability \(1/2\). This models an ideal fair coin toss.

6.2. Biased coin

The sample space is again \(\{0,1\}\), but now one side has probability \(p\) and the other has probability \(1-p\), where \(0 \le p \le 1\).

6.3. Fair die

The sample space is \(\{1,2,3,4,5,6\}\), with each outcome having equal probability.

6.4. Uniform distribution on \(\{1,\dots,n\}\)

More generally, a uniform probability space on a finite set gives each element the same probability \(1/n\). The lecturer noted that in such spaces, computing probabilities is just counting: the probability of an event is the number of favorable outcomes divided by the total number of outcomes.

7. Events and set operations

The lecture then reviewed the operations on events, treating events explicitly as sets. This allowed logical statements about events to be expressed using set language.

  • Intersection \(E_1 \cap E_2\): corresponds to logical “and.” It is the event that both \(E_1\) and \(E_2\) happen.
  • Union \(E_1 \cup E_2\): corresponds to logical “or.” It is the event that at least one of the two happens.
  • Complement \(\Omega \setminus E\): corresponds to logical negation, meaning that \(E\) does not happen.
  • Subset relation \(A \subseteq B\): means that event \(A\) implies event \(B\). If \(A\) happens, then \(B\) must also happen.

These simple set operations were important because many probability identities follow directly from them.

8. Basic properties of probabilities

The lecturer then listed and explained several fundamental rules.

8.1. Inclusion–exclusion principle

For any two events \(A\) and \(B\), \[ \Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B]. \] This corrects for the double counting of the overlap \(A \cap B\).

8.2. Disjoint union

If \(A\) and \(B\) are disjoint, then \[ \Pr[A \cup B] = \Pr[A] + \Pr[B]. \] This is a special case of inclusion–exclusion.

8.3. Union bound

Since probabilities are nonnegative, \[ \Pr[A \cup B] \le \Pr[A] + \Pr[B]. \] This inequality is widely used because it provides a simple upper bound even when the overlap is unknown.

8.4. Complement rule

\[ \Pr[\overline A] = 1 - \Pr[A]. \] This follows because \(A\) and its complement partition the whole sample space.

8.5. Monotonicity

If \(A \subseteq B\), then \[ \Pr[A] \le \Pr[B]. \] This reflects the obvious fact that a smaller event cannot have greater probability than a larger one containing it.
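The rules in this section can be sanity-checked on a small uniform space. The sketch below uses a fair die and two events picked arbitrarily for illustration; it checks the identities on one example and is of course not a proof.

```python
from fractions import Fraction

omega = set(range(1, 7))  # fair die

def prob(event):
    # Uniform space: probability is favorable outcomes over total outcomes.
    return Fraction(len(event), len(omega))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is at least 4
C = {1}        # disjoint from A

assert prob(A | B) == prob(A) + prob(B) - prob(A & B)  # inclusion-exclusion
assert prob(A | C) == prob(A) + prob(C)                # disjoint union
assert prob(A | B) <= prob(A) + prob(B)                # union bound
assert prob(omega - A) == 1 - prob(A)                  # complement rule
assert prob(A & B) <= prob(B)                          # monotonicity: A & B is a subset of B
```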

9. Conditional probability

A major new concept introduced in the lecture was conditional probability. For events \(A\) and \(B\), with \(\Pr[B] > 0\), the conditional probability of \(A\) given \(B\) is defined as \[ \Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}. \]

The lecturer gave an intuitive interpretation: conditioning on \(B\) means that one now treats \(B\) as the new effective sample space. In other words, once \(B\) is assumed to have happened, probabilities are re-normalized inside \(B\).

This interpretation is important: after conditioning on \(B\), the event \(B\) is no longer uncertain; it is treated as a fact. Then \(\Pr[A \mid B]\) describes the remaining uncertainty about \(A\).
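The definition translates directly into code. The following sketch again uses a fair die; the events and the helper name `cond_prob` are illustrative assumptions.

```python
from fractions import Fraction

omega = set(range(1, 7))  # fair die

def prob(event):
    return Fraction(len(event & omega), len(omega))

def cond_prob(a, b):
    """Pr[a | b] = Pr[a and b] / Pr[b], defined only when Pr[b] > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning on an event of probability 0")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
at_least_4 = {4, 5, 6}
print(cond_prob(even, at_least_4))  # 2/3
```

Inside the new effective sample space \(\{4,5,6\}\), two of the three outcomes are even, which matches the re-normalization view described above.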

10. Independence of events

The lecture next defined independence. Two events \(A\) and \(B\) are independent if \[ \Pr[A \cap B] = \Pr[A]\Pr[B]. \]

The lecturer explained that this definition becomes natural when viewed through conditional probability. If \(A\) and \(B\) are independent and \(\Pr[B] > 0\), then \[ \Pr[A \mid B] = \Pr[A], \] meaning that learning that \(B\) happened gives no new information about whether \(A\) happened. The same is true symmetrically in the other direction.

So independence means that the occurrence of one event has no influence on the probability of the other.

11. Example: two dice rolls

To illustrate these concepts, the lecturer considered two independent rolls of a fair die. The sample space consists of all ordered pairs \((i,j)\) with \(i,j \in \{1,\dots,6\}\), for a total of 36 outcomes, each equally likely.

From this model:

  • the probability that the first roll is 1 is \(1/6\),
  • the probability that the second roll is 2 is \(1/6\),
  • the conditional probability that the second roll is 2 given that the first roll is 1 is also \(1/6\).

This shows concretely that the two die rolls are independent: knowing the first outcome does not affect the probability distribution of the second.
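The two-dice example is small enough to verify by enumerating all 36 outcomes. The sketch below is an illustration with hypothetical helper names, not code from the lecture.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # all 36 ordered pairs (i, j)

def prob(pred):
    """Probability of the event {w : pred(w)} under the uniform distribution."""
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))

def first_is_1(w):
    return w[0] == 1

def second_is_2(w):
    return w[1] == 2

def both(w):
    return first_is_1(w) and second_is_2(w)

p_cond = prob(both) / prob(first_is_1)  # Pr[second = 2 | first = 1]
print(prob(second_is_2), p_cond)        # 1/6 1/6
```

The conditional and unconditional probabilities agree, and equivalently \(\Pr[\text{both}] = 1/36 = (1/6)(1/6)\).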

12. Bayes’ rule

The lecture then introduced Bayes’ rule, which relates two conditional probabilities: \[ \Pr[A \mid B] = \frac{\Pr[B \mid A]\Pr[A]}{\Pr[B]}. \]

The proof was presented as essentially a one-line algebraic manipulation from the definition of conditional probability. Although mathematically simple, the lecturer emphasized that Bayes’ rule is extremely useful in practice, especially when one wants to infer a hidden cause from an observed event.

The important message was that Bayes’ rule allows one to reverse conditioning:

  • from the probability of an observation given a hypothesis,
  • to the probability of the hypothesis given the observation.

This reversal is central in many applications, including diagnosis, inference, and cryptographic reasoning.

13. Law of total probability

The second major tool introduced together with Bayes’ rule was the law of total probability. Suppose events \(B_1, \dots, B_n\) are pairwise disjoint, each with \(\Pr[B_i] > 0\), and their union is the entire sample space, so they form a partition. Then for any event \(A\), \[ \Pr[A] = \sum_{i=1}^{n} \Pr[A \mid B_i]\Pr[B_i]. \]

The lecturer explained this as a kind of probabilistic case distinction. If the sample space is partitioned into mutually exclusive cases, then the probability of \(A\) can be obtained by summing over all cases:

  • probability of \(A\) in that case,
  • multiplied by the probability that the case occurs.

The proof followed directly from expressing \(A\) as the union of the disjoint pieces \(A \cap B_i\).

14. Medical test example: Bayes’ rule in action

The lecture ended with a detailed example showing why Bayes’ rule matters. Consider a disease that affects 2 in 1000 people, so the prior probability of disease is \(0.002\). A diagnostic test has the following properties:

  • if a person is sick, the test is positive with probability \(0.95\),
  • if a person is healthy, the test is still positive with probability \(0.03\).

The question was: if a person tests positive, what is the probability that they actually have the disease?

The lecturer defined:

  • \(A\): the event that the person has the disease,
  • \(B\): the event that the test result is positive.

What is known:

  • \(\Pr[A] = 0.002\),
  • \(\Pr[B \mid A] = 0.95\),
  • \(\Pr[B \mid \overline A] = 0.03\).

The desired quantity is \(\Pr[A \mid B]\). Using Bayes’ rule: \[ \Pr[A \mid B] = \frac{\Pr[B \mid A]\Pr[A]}{\Pr[B]}. \]

So one still needs \(\Pr[B]\), which is obtained via the law of total probability: \[ \Pr[B] = \Pr[B \mid A]\Pr[A] + \Pr[B \mid \overline A]\Pr[\overline A]. \]

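Substituting the numbers gives \[ \Pr[B] = 0.95 \cdot 0.002 + 0.03 \cdot 0.998 = 0.03184, \qquad \Pr[A \mid B] = \frac{0.95 \cdot 0.002}{0.03184} \approx 0.0597, \] that is, a final posterior probability of roughly 6%.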

The key lesson is that even a test that is highly accurate for a known sick person may still have a surprisingly low predictive value when the disease itself is rare. This is because false positives among the many healthy people can dominate the true positives among the few sick people.
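The calculation combines Bayes’ rule with the law of total probability and is easy to reproduce. In the sketch below, the function name `posterior` and its parameter names are chosen here for illustration; the numbers are the ones from the example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Pr[sick | positive] via Bayes' rule and the law of total probability."""
    # Pr[B] = Pr[B | A] Pr[A] + Pr[B | not A] Pr[not A]
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # Pr[A | B] = Pr[B | A] Pr[A] / Pr[B]
    return sensitivity * prior / p_positive

print(posterior(0.002, 0.95, 0.03))  # about 0.0597
```

Trying a larger prior, for example `posterior(0.2, 0.95, 0.03)`, shows how strongly the result depends on how rare the disease is.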

15. Probability Background Continued

15.1. Definition of Random Variables

A random variable is simply a function \[ X : \Omega \to D \] from the sample space \(\Omega\) to some domain \(D\).

Here:

  • \(\Omega\) is the sample space, i.e. the set of all elementary outcomes.
  • \(D\) is the set of values that the random variable can take.

So a random variable does not directly mean “something random” in the everyday sense. Formally, it is just a function that assigns to each outcome \(\omega \in \Omega\) a value \(X(\omega) \in D\).

15.2. Events Defined by a Random Variable

Once a random variable \(X\) is given, one can define events by asking whether \(X\) takes a certain value.

For any \(x \in D\), define \[ E_x = \{\omega \in \Omega : X(\omega) = x\} \subseteq \Omega. \]

This is the event that the random variable \(X\) assumes the value \(x\).

Equivalently, \(E_x\) is the preimage of \(\{x\}\) under \(X\): \[ E_x = X^{-1}(\{x\}). \]

Very often, instead of writing \(E_x\), one simply writes \[ X = x \] as shorthand for the event \(E_x\).

So the notation \[ \Pr[X = x] \] really means \[ \Pr[E_x] = \Pr[\{\omega \in \Omega : X(\omega)=x\}]. \]

15.3. Probability Distribution of a Random Variable

The probability distribution of \(X\) records the probability that \(X\) takes each possible value.

For \(x \in D\), define \[ F(x) = \Pr[X=x]. \]

Thus \(F\) is a function on the domain \(D\), and for each value \(x\), it tells how likely it is that the random variable outputs \(x\).

In this discrete setting, \(F\) is the probability mass function of \(X\).
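As a small sketch of these definitions, the distribution \(F\) can be computed by adding up the probability of every outcome that \(X\) maps to each value. The example (two fair coin flips, with \(X\) counting the ones) and the helper name `distribution` are assumptions made here for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical example: two fair coin flips; X counts the ones.
pr = {w: Fraction(1, 4) for w in product((0, 1), repeat=2)}

def X(w):
    return w[0] + w[1]

def distribution(X, pr):
    """Return F with F[x] = Pr[X = x] = Pr[{w : X(w) = x}]."""
    F = defaultdict(Fraction)
    for w, p in pr.items():
        F[X(w)] += p  # w contributes to exactly one event E_x, namely x = X(w)
    return dict(F)

print(distribution(X, pr))
```

Here \(F(0) = 1/4\), \(F(1) = 1/2\), and \(F(2) = 1/4\), and the values of \(F\) sum to 1 because every outcome is counted exactly once.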

15.4. Partition View

The events \[ E_x = X^{-1}(\{x\}), \qquad x \in D, \] partition the sample space \(\Omega\) into disjoint pieces.

That is:

  1. every \(\omega \in \Omega\) belongs to exactly one set \(E_x\),
  2. the sets \(E_x\) are pairwise disjoint,
  3. their union is the whole sample space: \[ \Omega = \bigcup_{x \in D} E_x. \]

This is because each outcome \(\omega\) has exactly one image \(X(\omega)\).

So one can think of a random variable as cutting the sample space into regions according to which value of \(D\) the outcome is mapped to.

15.5. Example Picture Intuition

Suppose \[ D=\{1,2,3,4\}. \]

Then the sample space \(\Omega\) can be split into four disjoint events: \[ X^{-1}(\{1\}), \quad X^{-1}(\{2\}), \quad X^{-1}(\{3\}), \quad X^{-1}(\{4\}). \]

Each region contains exactly those outcomes that are mapped by \(X\) to the corresponding value.

Then, for example, \[ \Pr[X=2] = \Pr[X^{-1}(\{2\})]. \]

So the probability distribution of \(X\) is obtained by measuring the probability of each of these regions.

15.6. Main Takeaway

A random variable is a function \[ X:\Omega \to D. \]

From it, one gets events of the form \[ X=x \] which really mean \[ \{\omega \in \Omega : X(\omega)=x\}. \]

The distribution of \(X\) is then defined by \[ F(x)=\Pr[X=x]. \] Hence, a random variable can be understood as a way of organizing the sample space \(\Omega\) into disjoint events indexed by the possible values in \(D\).

15.7. Independence of Random Variables

The lecture next defines independence for random variables \(X\) and \(Y\). They are independent if, for every possible value \(x\) of \(X\) and every possible value \(y\) of \(Y\), the events \(X=x\) and \(Y=y\) are independent. Equivalently, for all \(x\) and \(y\), \[ \Pr[X=x \land Y=y] = \Pr[X=x]\Pr[Y=y]. \]

The key point is that random variables can be understood as structured collections of events, so independence of random variables is reduced to independence of all corresponding value-events.

15.8. Constructing Independent Random Variables

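A natural way to construct independent random variables is to take the Cartesian product \(\Omega_1 \times \Omega_2\) of two sample spaces, assign each pair the product of the individual probabilities, \(\Pr[(\omega_1,\omega_2)] = \Pr[\omega_1]\Pr[\omega_2]\), and define two projection maps. One projection returns the first coordinate, and the other returns the second coordinate. This gives a canonical construction of independent random variables.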
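The product construction can be sketched and checked directly, assuming that each pair of outcomes receives the product of the two individual probabilities. The two component spaces (a biased coin and a fair die) are hypothetical examples chosen for illustration.

```python
from fractions import Fraction
from itertools import product

pr1 = {0: Fraction(1, 3), 1: Fraction(2, 3)}    # hypothetical biased coin
pr2 = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die

# Product space: outcomes are pairs, and the probabilities multiply.
pr = {(a, b): pr1[a] * pr2[b] for a, b in product(pr1, pr2)}

def X(w):  # projection onto the first coordinate
    return w[0]

def Y(w):  # projection onto the second coordinate
    return w[1]

def prob(pred):
    return sum((p for w, p in pr.items() if pred(w)), Fraction(0))

# Check Pr[X = x and Y = y] == Pr[X = x] * Pr[Y = y] for all pairs (x, y).
independent = all(
    prob(lambda w: X(w) == x and Y(w) == y)
    == prob(lambda w: X(w) == x) * prob(lambda w: Y(w) == y)
    for x in pr1
    for y in pr2
)
print(independent)  # True
```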

15.9. Uniform Random Variables

A random variable \(X\) on a finite domain \(D\) is uniform if every value in \(D\) occurs with probability \(1/|D|\). Examples discussed in the lecture include:

  • \(D=\{0,1\}\): a fair coin toss
  • \(D=\{1,2,3,4,5,6\}\): a fair die roll

15.10. Derived Random Variables

From existing random variables, one can define new ones by applying deterministic functions. For example, if \(X\) and \(Y\) are fair dice rolls, then \(X+Y\) is a new random variable. Unlike a fair die, this derived variable is not uniform, because some sums can arise in more ways than others.
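The non-uniformity of \(X+Y\) can be seen by counting how many of the 36 equally likely pairs produce each sum. This is a short illustrative sketch; the variable names are arbitrary.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count the pairs (x, y) of die rolls that produce each sum.
counts = Counter(x + y for x, y in product(range(1, 7), repeat=2))
dist = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

for s, p in dist.items():
    print(s, p)
```

The sum 7 arises from six pairs and so has probability \(1/6\), while the sums 2 and 12 each arise from a single pair and have probability \(1/36\).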

15.11. Indicator Random Variables

For an event \(E\), one can define an indicator random variable that takes value \(1\) if \(E\) happens and \(0\) otherwise. The lecture stresses that events and indicator random variables are often interchangeable viewpoints.

16. How Computer Science Thinks About Probability

16.1. Sample Space Usually Stays in the Background

The lecturer explains that in practice, especially in computer science and cryptography, one usually does not keep writing down the full sample space explicitly. Instead, one specifies a set of independent base random variables, and all other random variables are deterministic functions of them. So the probability space is effectively generated “on the fly” from these base variables.

16.2. Notation for Uniform Sampling

The notation \[ x \leftarrow\$ D \] means that \(x\) is chosen uniformly at random from the domain \(D\). This becomes standard notation throughout the course.

16.3. IID Bernoulli Variables and the Binomial Distribution

If \(x_1,\dots,x_n\) are independent and each is uniform on \(\{0,1\}\), then they are iid Bernoulli trials with parameter \(1/2\). Their sum \[ Y = x_1+\cdots+x_n \] follows the binomial distribution with parameters \(n\) and \(1/2\), that is, \(\Pr[Y=k] = \binom{n}{k} 2^{-n}\) for \(k = 0,\dots,n\). This serves as a typical example of building a more complicated random variable from simple independent base variables.
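For small \(n\), the distribution of the sum can be computed by brute-force enumeration of all \(2^n\) bit strings and compared with the binomial formula \(\binom{n}{k}/2^n\). The function names below are hypothetical and the sketch is meant only as a sanity check.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

def sum_distribution(n):
    """Distribution of x_1 + ... + x_n for n independent uniform bits, by enumeration."""
    counts = Counter(sum(bits) for bits in product((0, 1), repeat=n))
    return {k: Fraction(c, 2 ** n) for k, c in counts.items()}

def binomial_pmf(n, k):
    """Binomial probability with parameters n and 1/2."""
    return Fraction(comb(n, k), 2 ** n)

n = 5
print(all(sum_distribution(n)[k] == binomial_pmf(n, k) for k in range(n + 1)))  # True
```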

16.4. Sampleable Random Variables

A random variable is called sampleable if there exists an efficient algorithm that takes iid random bits as input and whose output is distributed according to that random variable. The random bits are called the algorithm’s random coins. The lecture emphasizes that in computer science, distributions are typically described through such sampling algorithms.
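As one possible illustration, a fair die roll can be sampled from fair coin flips by rejection sampling. This particular technique is an assumption made here and not something the notes attribute to the lecture; the names `sample_die` and `coin` are hypothetical.

```python
import random

def sample_die(coin):
    """Return a uniform value in {1,...,6} using only calls to coin(), which yields one fair bit."""
    while True:
        v = 4 * coin() + 2 * coin() + coin()  # three coins give a uniform value in {0,...,7}
        if v < 6:
            return v + 1  # accept; the values 6 and 7 are rejected and the loop retries

rng = random.Random(0)

def coin():
    return rng.getrandbits(1)  # stands in for one of the algorithm's random coins

print([sample_die(coin) for _ in range(10)])
```

Each attempt succeeds with probability \(3/4\), so the sampler uses four coins on average, although there is no fixed upper bound on the number of coins it may consume.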

17. Philosophical Point About Probability

The lecturer gives the example of asking for the probability that some unimaginably distant digit of \(\pi\) is even. Under the strict computational view used here, that digit is determined by a deterministic algorithm, so the correct answer is not \(1/2\), but rather “either 0 or 1, but we do not know which.” This illustrates the course’s stance: probability is tied to actual randomness, not merely to human ignorance.

Author: Lowtroo

Created on: 2026-04-16 Thu 19:00
