Expectation and Variance

Why It Matters

Expectation and variance are the cornerstones of probability theory and statistical inference. The expected value tells you the long-run average outcome of a random variable, while variance quantifies uncertainty around that average. In machine learning, these quantities underpin everything from loss function design (expected risk minimization) to model evaluation (mean squared error, cross-entropy). Understanding them is essential for building robust, well-calibrated models. In reinforcement learning, agents maximize expected cumulative reward. In finance, portfolio optimization balances expected return against variance (risk). In deep learning, batch normalization and dropout regularization both exploit properties of expectation and variance to stabilize training. Without a firm grasp of these concepts, you cannot reason properly about uncertainty, make optimal decisions under risk, or debug statistical pipelines.


Expected Value

Definition: Expected Value (Discrete Case)

For a discrete random variable $X$ taking values $x_1, x_2, \dots$ with probabilities $p(x_1), p(x_2), \dots$, the expected value (or mean) is defined as:

$$E[X] = \sum_{i} x_i \cdot p(x_i)$$

provided the sum converges absolutely, i.e., $\sum_i |x_i| \cdot p(x_i) < \infty$. The expected value represents the "center of mass" of the probability distribution: the value you would obtain on average if you repeated the random experiment infinitely many times.

Expected Value: Discrete

$$E[X] = \sum_{i=1}^{\infty} x_i \cdot P(X = x_i)$$

Here,

  • $E[X]$ = Expected value of random variable $X$
  • $x_i$ = Possible values $X$ can take
  • $P(X = x_i)$ = Probability mass function (PMF) at $x_i$

Definition: Expected Value (Continuous Case)

For a continuous random variable $X$ with probability density function $f(x)$, the expected value is:

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$

provided the integral converges absolutely. For distributions with heavy tails (e.g., the Cauchy distribution), the expected value may not exist.

Expected Value: Continuous

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

Here,

  • $E[X]$ = Expected value of continuous random variable $X$
  • $f(x)$ = Probability density function (PDF)
  • $dx$ = Differential of the integration variable $x$

Expected Value Examples

Example 1 (Discrete): Fair six-sided die.

$$E[X] = \frac{1}{6}(1+2+3+4+5+6) = \frac{21}{6} = 3.5$$

Example 2 (Continuous): Uniform distribution $X \sim U(0,1)$.

$$E[X] = \int_0^1 x \, dx = \frac{1}{2}$$

Example 3 (Bernoulli): $X \sim \text{Bernoulli}(p)$.

$$E[X] = 0 \cdot (1-p) + 1 \cdot p = p$$

Properties of Expectation

Theorem: Properties of Expectation (Linearity)

The expectation operator is linear. For any random variables $X, Y$ with finite expectations and constants $a, b, c$:

  1. Constant rule: $E[c] = c$ for any constant $c$.
  2. Linearity: $E[aX + bY] = aE[X] + bE[Y]$.
  3. Iterated expectation: $E[X] = E[E[X \mid Y]]$ (law of total expectation).
  4. Monotonicity: If $X \leq Y$ with probability 1, then $E[X] \leq E[Y]$.
  5. Non-negativity: If $X \geq 0$ with probability 1, then $E[X] \geq 0$.

Important: Linearity holds even when $X$ and $Y$ are dependent. This is what makes expectation so powerful: no independence assumption is needed for $E[X+Y] = E[X] + E[Y]$.

Linearity vs Independence

Linearity of expectation ($E[X+Y] = E[X]+E[Y]$) always holds. However, $E[XY] = E[X] \cdot E[Y]$ holds exactly when $X$ and $Y$ are uncorrelated; independence is sufficient for this but not necessary. Confusing these two facts is a common source of errors.
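An exact check can make the distinction concrete. The sketch below assumes a hypothetical pair that is not from the lesson: $X$ is a fair die roll and $Y = 7 - X$, so the two are fully dependent. The helper name `expect` is illustrative only.

```python
from fractions import Fraction

# Hypothetical example: X is a fair die roll and Y = 7 - X (fully dependent on X).
outcomes = range(1, 7)
p = Fraction(1, 6)  # probability of each face

def expect(g):
    """Exact expectation of g(X) under the fair-die PMF."""
    return sum(p * g(x) for x in outcomes)

e_x = expect(lambda x: x)               # E[X]
e_y = expect(lambda x: 7 - x)           # E[Y]
e_sum = expect(lambda x: x + (7 - x))   # E[X + Y]
e_prod = expect(lambda x: x * (7 - x))  # E[XY]

print("E[X] + E[Y] =", e_x + e_y, "| E[X+Y] =", e_sum)  # equal: linearity holds
print("E[X]E[Y] =", e_x * e_y, "| E[XY] =", e_prod)     # differ: X and Y are correlated
```

In this example $E[XY] - E[X]E[Y] = 28/3 - 49/4 = -35/12$, which is the covariance of the pair; the product rule fails precisely because that covariance is nonzero.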


Variance

Definition: Variance

The variance of a random variable $X$ measures the spread of its distribution around the mean $\mu = E[X]$. It is defined as:

$$\text{Var}(X) = E[(X - \mu)^2]$$

Variance is always non-negative. $\text{Var}(X) = 0$ if and only if $X$ is a constant with probability 1. The units of variance are the square of the units of $X$, which is why the standard deviation is often more interpretable.

Variance: Computational Formula

$$\text{Var}(X) = E[X^2] - (E[X])^2$$

Here,

  • $\text{Var}(X)$ = Variance of random variable $X$
  • $E[X^2]$ = Second moment of $X$
  • $(E[X])^2$ = Square of the first moment (mean)

Variance: Definition Form

$$\text{Var}(X) = E[(X - E[X])^2]$$

Here,

  • $E[X]$ = Mean (expected value) of $X$
  • $(X - E[X])^2$ = Squared deviation from the mean

Why the Computational Formula?

The formula $\text{Var}(X) = E[X^2] - (E[X])^2$ is called the computational formula because it often simplifies calculations. It follows from expanding the square and applying linearity: $E[(X-\mu)^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2$. To find the variance, you compute $E[X^2]$ and $(E[X])^2$ separately, then subtract. This avoids computing deviations from the mean for each outcome. However, the definition form $E[(X-\mu)^2]$ is conceptually clearer and, in floating-point arithmetic, more numerically stable: the computational formula subtracts two large, nearly equal numbers, which can cause catastrophic cancellation.
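The stability difference can be seen with a few numbers. This is a minimal sketch on hypothetical data (a small spread on top of a large offset, stored as 64-bit floats); the variable names are illustrative.

```python
# Hypothetical data: small spread on top of a large offset (stored as floats).
xs = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
n = len(xs)
mean = sum(xs) / n

# Computational formula: E[X^2] - (E[X])^2
var_computational = sum(x * x for x in xs) / n - mean * mean

# Definition form: E[(X - mu)^2]
var_definition = sum((x - mean) ** 2 for x in xs) / n

print("computational formula:", var_computational)
print("definition form:      ", var_definition)
```

The deviations from the mean are $-6, -3, 3, 6$, so the true population variance is $22.5$. The definition form recovers it, while the computational formula loses the answer to rounding because the squares are near $10^{18}$, where adjacent floats are 128 apart.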

Variance of a Fair Die

For a fair six-sided die, $E[X] = 3.5$.

$$E[X^2] = \frac{1}{6}(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = \frac{91}{6}$$

$$\text{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.917$$

Properties of Variance

Theorem: Properties of Variance

For any random variables $X, Y, X_1, \dots, X_n$ with finite variances and constants $a, b, a_1, \dots, a_n$:

  1. Scaling: $\text{Var}(aX + b) = a^2 \text{Var}(X)$.

    • Adding a constant shifts the mean but does not change the spread.
    • Multiplying by $a$ scales the variance by $a^2$.
  2. Sum of variances: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$.

  3. Independent case: If $X$ and $Y$ are independent, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.

  4. Generalized sum: $\text{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n \sum_{j=1}^n a_i a_j \text{Cov}(X_i, X_j)$.

  5. Non-negativity: $\text{Var}(X) \geq 0$ for all $X$, with equality if and only if $X$ is constant almost surely.

Variance of a Sum Is NOT the Sum of Variances

A very common mistake is assuming $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ always holds. This is true only when $X$ and $Y$ are uncorrelated ($\text{Cov}(X,Y) = 0$). For correlated variables, you must include the covariance term. In deep learning, this matters when analyzing gradient noise across correlated mini-batches.
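The scaling and sum rules can be checked on simulated data. The sketch below assumes a hypothetical correlated pair (the coefficient 0.6 and the sample size are arbitrary choices) and uses population-style moments that divide by $n$.

```python
import random

random.seed(0)

def mean(v):
    return sum(v) / len(v)

def var(v):
    """Population variance (divide by n)."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(u, v):
    """Population covariance (divide by n)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Hypothetical correlated pair: Y shares a component with X.
x = [random.gauss(0, 1) for _ in range(10_000)]
y = [0.6 * a + random.gauss(0, 1) for a in x]
s = [a + b for a, b in zip(x, y)]
scaled = [3 * a + 5 for a in x]

lhs = var(s)
with_cov = var(x) + var(y) + 2 * cov(x, y)
without_cov = var(x) + var(y)

print(f"Var(X+Y)                     = {lhs:.4f}")
print(f"Var(X)+Var(Y)+2Cov(X,Y)      = {with_cov:.4f}")
print(f"Var(X)+Var(Y) (missing term) = {without_cov:.4f}")
print(f"Var(3X+5) / Var(X)           = {var(scaled) / var(x):.4f}")
```

The first two lines agree up to floating-point error because the identity also holds for sample moments; the third line falls short by roughly $2\,\text{Cov}(X,Y)$.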


Standard Deviation

$$\sigma_X = \sqrt{\text{Var}(X)}$$

Here,

  • $\sigma_X$ = Standard deviation of $X$
  • $\text{Var}(X)$ = Variance of $X$

Definition: Why Standard Deviation?

The standard deviation $\sigma$ is the square root of the variance. Its key advantage is that it shares the same units as the original random variable $X$, making it directly interpretable. For a normal distribution, approximately 68% of observations fall within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule). In machine learning, we often report mean $\pm$ standard deviation to convey both the average performance and its variability.

Standard Deviation in Practice

If a model's test accuracy across runs has $\mu = 0.92$ and $\sigma = 0.03$, and the run-to-run accuracy is approximately normally distributed, then roughly:

  • 68% of runs achieve accuracy in $[0.89, 0.95]$
  • 95% of runs achieve accuracy in $[0.86, 0.98]$

This tells you the model is reasonably stable ($\sigma$ is small relative to $\mu$).
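The 68-95-99.7 percentages come from the normal CDF. As a short sketch, they can be reproduced with the error function from Python's standard library; `prob_within` is an illustrative helper name.

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normal X, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

These figures apply only under normality; for other distributions the coverage of $\mu \pm k\sigma$ can be quite different.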


Moments

Definition: Raw Moments

The $n$-th raw moment of a random variable $X$ is:

$$\mu_n' = E[X^n]$$

The first raw moment is the mean: $\mu_1' = E[X]$. The second raw moment is $E[X^2]$, which appears in the computational variance formula. Higher raw moments capture increasingly detailed information about the shape of the distribution.

Definition: Central Moments

The $n$-th central moment of a random variable $X$ with mean $\mu$ is:

$$\mu_n = E[(X - \mu)^n]$$

Central moments are invariant to shifts in the distribution (adding a constant to $X$ does not change central moments). The second central moment is the variance: $\mu_2 = \text{Var}(X)$.

Moments: Summary Table

$$\begin{aligned} \mu_1' &= E[X] = \text{mean} \\ \mu_2 &= E[(X-\mu)^2] = \text{variance} \\ \mu_3 &= E[(X-\mu)^3] \text{ (skewness numerator)} \\ \mu_4 &= E[(X-\mu)^4] \text{ (kurtosis numerator)} \end{aligned}$$

Here,

  • $\mu_n'$ = $n$-th raw moment
  • $\mu_n$ = $n$-th central moment
  • skewness = $\gamma_1 = \mu_3 / \sigma^3$ (measures asymmetry)
  • kurtosis = $\kappa = \mu_4 / \sigma^4$ (measures tail heaviness)

Skewness and Kurtosis

Skewness ($\mu_3 / \sigma^3$) measures the asymmetry of a distribution. Positive skew means a longer right tail. Kurtosis ($\mu_4 / \sigma^4$) measures the heaviness of tails. The normal distribution has kurtosis = 3 (excess kurtosis = 0). Leptokurtic distributions (kurtosis > 3) have heavier tails, which matters in risk management: extreme events are more likely than the normal model predicts.
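The two ratios can be computed directly from a PMF using the central-moment definition. The sketch below is illustrative: the function names are not from any library, and the right-skewed PMF is a made-up example.

```python
def central_moment(values, probs, n):
    """n-th central moment E[(X - mu)^n] of a discrete random variable."""
    mu = sum(v * p for v, p in zip(values, probs))
    return sum((v - mu) ** n * p for v, p in zip(values, probs))

def skewness(values, probs):
    """gamma_1 = mu_3 / sigma^3."""
    return central_moment(values, probs, 3) / central_moment(values, probs, 2) ** 1.5

def kurtosis(values, probs):
    """kappa = mu_4 / sigma^4 (not the excess kurtosis)."""
    return central_moment(values, probs, 4) / central_moment(values, probs, 2) ** 2

# Fair die (symmetric) versus a hypothetical right-skewed PMF.
die_vals, die_probs = [1, 2, 3, 4, 5, 6], [1 / 6] * 6
skew_vals, skew_probs = [0, 1, 10], [0.5, 0.4, 0.1]

print(f"die:    skewness = {skewness(die_vals, die_probs):.4f}, "
      f"kurtosis = {kurtosis(die_vals, die_probs):.4f}")
print(f"skewed: skewness = {skewness(skew_vals, skew_probs):.4f}, "
      f"kurtosis = {kurtosis(skew_vals, skew_probs):.4f}")
```

The fair die has skewness 0 by symmetry and kurtosis $303/175 \approx 1.73$, below the normal value of 3, so its tails are lighter than a normal distribution's.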


Moment Generating Function

Moment Generating Function (MGF)

$$M_X(t) = E[e^{tX}]$$

Here,

  • $M_X(t)$ = Moment generating function of $X$ evaluated at $t$
  • $t$ = Real parameter (near 0)
  • $e^{tX}$ = Exponential transform of $X$

Definition: Why the MGF Matters

The moment generating function $M_X(t) = E[e^{tX}]$ uniquely determines the distribution of $X$ (when it exists in a neighborhood of $t=0$). Its key properties:

  1. Moment extraction: $M_X^{(n)}(0) = E[X^n]$, i.e., the $n$-th derivative at $t=0$ gives the $n$-th raw moment.
  2. Sum of independent variables: If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$. This makes it easy to find the distribution of sums.
  3. Uniqueness: If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution.

The MGF is related to the Laplace transform. The characteristic function $\phi_X(t) = E[e^{itX}]$ always exists (even when the MGF does not) and serves a similar role using complex exponentials.

MGF of the Normal Distribution

If $X \sim N(\mu, \sigma^2)$, then:

$$M_X(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

Taking derivatives and evaluating at $t=0$ recovers all moments. For instance, $M_X'(0) = \mu$ and $M_X''(0) = \mu^2 + \sigma^2$, confirming $E[X^2] = \mu^2 + \sigma^2$.
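Moment extraction can be sanity-checked numerically by differentiating the normal MGF with finite differences. This is a rough sketch: the step size `h` and the parameter values are assumptions chosen for illustration.

```python
import math

def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2): exp(mu * t + sigma^2 * t^2 / 2)."""
    return math.exp(mu * t + 0.5 * sigma**2 * t**2)

mu, sigma, h = 5.0, 2.0, 1e-3  # h is an assumed finite-difference step

# Central differences approximate M'(0) and M''(0).
m1 = (mgf_normal(h, mu, sigma) - mgf_normal(-h, mu, sigma)) / (2 * h)
m2 = (mgf_normal(h, mu, sigma) - 2 * mgf_normal(0.0, mu, sigma)
      + mgf_normal(-h, mu, sigma)) / h**2

print(f"M'(0)    ~ {m1:.4f}  (mu = {mu})")
print(f"M''(0)   ~ {m2:.4f}  (mu^2 + sigma^2 = {mu**2 + sigma**2})")
print(f"variance ~ {m2 - m1**2:.4f}  (sigma^2 = {sigma**2})")
```

The approximations carry a small truncation error of order $h^2$, so they match $\mu = 5$ and $\mu^2 + \sigma^2 = 29$ only to a few decimal places.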


Python Implementation

Python for Expectation and Variance

Python's numpy and scipy libraries provide efficient tools for computing and verifying theoretical expectations, variances, and moments.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# === Theoretical Values ===
# Normal distribution: mu=5, sigma=2
mu, sigma = 5, 2
print(f"E[X] = {mu}")
print(f"Var(X) = {sigma**2}")

# === Empirical Verification via Sampling ===
np.random.seed(42)
samples = np.random.normal(mu, sigma, size=100_000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample variance: {samples.var():.4f}")
print(f"Sample std dev: {samples.std():.4f}")

# === Discrete Distribution Moments ===
# Fair die
die = np.arange(1, 7)
prob = np.ones(6) / 6
E_die = np.sum(die * prob)
E_die2 = np.sum(die**2 * prob)
Var_die = E_die2 - E_die**2
print(f"E[die] = {E_die:.4f}, Var(die) = {Var_die:.4f}")

# === Custom Random Variable ===
def compute_expectation(values, probs):
    """Compute E[X] for a discrete random variable."""
    return np.sum(values * probs)

def compute_variance(values, probs):
    """Compute Var(X) for a discrete random variable."""
    mean = compute_expectation(values, probs)
    return compute_expectation((values - mean)**2, probs)

values = np.array([0, 1, 2, 3, 4])
probs = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
print(f"E[X] = {compute_expectation(values, probs):.4f}")
print(f"Var(X) = {compute_variance(values, probs):.4f}")

# === Moment Generating Function ===
def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2)."""
    return np.exp(mu * t + 0.5 * sigma**2 * t**2)

t = 0.1
print(f"M_X({t}) = {mgf_normal(t, mu, sigma):.6f}")

# === Empirical MGF ===
empirical_mgf = np.mean(np.exp(samples * t))
print(f"Empirical M_X({t}) = {empirical_mgf:.6f}")

# === Higher Moments with Scipy ===
# scipy's kurtosis() returns excess kurtosis by default (normal -> 0)
print(f"Skewness: {skew(samples):.4f}")
print(f"Excess Kurtosis: {kurtosis(samples):.4f}")
```

Applications in AI/ML

Why ML Engineers Care About Moments

Expectation and variance are not abstract math: they directly inform how we design, train, and evaluate machine learning systems.

Definition: Expected Risk Minimization

In supervised learning, the population risk (expected loss) is:

$$R(f) = E_{(x,y) \sim P}[L(f(x), y)]$$

We approximate this with the empirical risk (average loss over training data):

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$$

By the law of large numbers, for a fixed $f$ and i.i.d. samples, $\hat{R}(f) \to R(f)$ as $n \to \infty$. The variance of the loss estimator tells us how much our risk estimate fluctuates across different samples; high variance indicates the estimate is unreliable.
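The convergence can be illustrated with a small simulation. The setup below is entirely hypothetical: a fixed predictor $f(x) = 2x$, data generated as $y = 2x + \varepsilon$ with Gaussian noise, and squared loss, so the population risk equals the noise variance.

```python
import random

random.seed(0)
NOISE_SD = 0.5  # assumed noise level; population risk of f is NOISE_SD**2

def f(x):
    """Fixed predictor under evaluation (matches the true slope here)."""
    return 2.0 * x

def empirical_risk(n):
    """Average squared loss of f over n freshly sampled (x, y) pairs."""
    total = 0.0
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = 2.0 * x + random.gauss(0, NOISE_SD)
        total += (f(x) - y) ** 2
    return total / n

population_risk = NOISE_SD ** 2
risks = {n: empirical_risk(n) for n in (10, 1_000, 100_000)}
for n, r in risks.items():
    print(f"n={n:>7}: empirical risk = {r:.4f} (population risk = {population_risk})")
```

Small samples can land far from the population risk, while the estimate for the largest $n$ should sit close to it.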

Definition: Gradient Variance in SGD

Stochastic gradient descent (SGD) approximates the true gradient with a mini-batch estimate. The variance of this estimate directly affects convergence speed:

$$\text{Var}(\hat{g}) = \frac{\text{Var}(\nabla L_i)}{B}$$

where $B$ is the batch size and the per-example gradients $\nabla L_i$ within a batch are sampled independently. Doubling the batch size halves the gradient variance. This is why larger batches produce smoother training curves, though not always better generalization.
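The $1/B$ scaling can be checked by simulation. The sketch below stands in for real gradients with i.i.d. scalar draws from an assumed $N(1, 2^2)$ distribution; the function names and trial count are illustrative.

```python
import random

random.seed(0)

def minibatch_gradient(batch_size):
    """Mean of batch_size i.i.d. scalar 'gradients' drawn from N(1, 2^2)."""
    return sum(random.gauss(1.0, 2.0) for _ in range(batch_size)) / batch_size

def estimate_variance(batch_size, trials=20_000):
    """Empirical variance of the mini-batch gradient estimate."""
    g = [minibatch_gradient(batch_size) for _ in range(trials)]
    m = sum(g) / trials
    return sum((v - m) ** 2 for v in g) / trials

var_8 = estimate_variance(8)
var_16 = estimate_variance(16)

print(f"B=8:  variance ~ {var_8:.4f}  (theory: 4/8  = 0.5)")
print(f"B=16: variance ~ {var_16:.4f}  (theory: 4/16 = 0.25)")
print(f"ratio ~ {var_8 / var_16:.3f}  (theory: 2)")
```

With correlated examples inside a batch, the covariance terms would no longer vanish and the variance would shrink more slowly than $1/B$.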

Definition: Bias-Variance Tradeoff

The expected prediction error under squared loss decomposes as:

$$E[(y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2$$

  • Bias ($\text{Bias}^2$): error from wrong assumptions (underfitting).
  • Variance: error from sensitivity to training data (overfitting).
  • Irreducible noise ($\sigma^2$): inherent data noise.

Complex models (deep neural networks) tend to have low bias but high variance. Regularization techniques (dropout, weight decay, early stopping) reduce variance at the cost of slightly increased bias.
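The decomposition can be verified on a deliberately simple model. The sketch below assumes a hypothetical setting with no inputs: the target is $y = \theta + \varepsilon$, and the predictor is a shrunken sample mean, which trades a little bias for lower variance. All constants are illustrative.

```python
import random

random.seed(0)
THETA, SIGMA, N_TRAIN, SHRINK = 2.0, 1.0, 10, 0.8  # hypothetical settings

def one_trial():
    """Fit the shrunken-mean predictor on a fresh training set; test on a fresh y."""
    train = [random.gauss(THETA, SIGMA) for _ in range(N_TRAIN)]
    prediction = SHRINK * sum(train) / N_TRAIN
    y_new = random.gauss(THETA, SIGMA)
    return prediction, (y_new - prediction) ** 2

results = [one_trial() for _ in range(50_000)]
preds = [p for p, _ in results]
mean_pred = sum(preds) / len(preds)

bias_sq = (mean_pred - THETA) ** 2                                # Bias^2
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)  # Var of predictor
noise = SIGMA ** 2                                                # irreducible noise
simulated_error = sum(e for _, e in results) / len(results)

print(f"Bias^2 ~ {bias_sq:.4f}, Var ~ {variance:.4f}, noise = {noise:.4f}")
print(f"sum of terms ~ {bias_sq + variance + noise:.4f}")
print(f"simulated prediction error ~ {simulated_error:.4f}")
```

In theory the three terms here are $0.16$, $0.064$, and $1$, so both printed totals should be close to $1.224$.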

Mean-Variance Utility (Finance ML)

In portfolio optimization, a financial ML model estimates the expected return $\mu$ and risk $\sigma$ of a portfolio. A risk-averse investor maximizes:

$$U = \mu - \frac{\lambda}{2} \sigma^2$$

where $\lambda > 0$ is the risk aversion parameter. This mean-variance framework directly uses expectation and variance as the two fundamental quantities.
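As a minimal sketch of this framework, the utility can be maximized over the weight of a two-asset portfolio, with the portfolio variance given by the variance-of-a-sum rule. All input numbers are hypothetical, and the grid search is just one simple way to find the maximizer.

```python
# Hypothetical two-asset inputs (not from the text).
MU = (0.08, 0.12)   # expected returns
VAR = (0.04, 0.09)  # return variances
COV = 0.01          # covariance between the two returns
LAM = 3.0           # risk-aversion parameter lambda

def utility(w):
    """Mean-variance utility of holding weight w in asset 0 and 1 - w in asset 1."""
    mean = w * MU[0] + (1 - w) * MU[1]
    var = w**2 * VAR[0] + (1 - w) ** 2 * VAR[1] + 2 * w * (1 - w) * COV
    return mean - 0.5 * LAM * var

weights = [i / 100 for i in range(101)]  # grid over w in [0, 1]
best_w = max(weights, key=utility)

print(f"best weight in asset 0 ~ {best_w:.2f}, utility = {utility(best_w):.4f}")
print(f"all in asset 0: {utility(1.0):.4f}, all in asset 1: {utility(0.0):.4f}")
```

Setting the derivative of $U$ to zero gives a closed-form optimum of about $0.606$ for these inputs, so the grid result should land on the nearest grid point.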


Common Mistakes

| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ always | Only true for uncorrelated variables (independence is sufficient) | $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$ |
| $E[XY] = E[X] \cdot E[Y]$ always | Only true for uncorrelated variables (independence is sufficient) | Compute $E[XY]$ directly or use covariance: $E[XY] = \text{Cov}(X,Y) + E[X]E[Y]$ |
| Variance is in the same units as $X$ | Variance has units of $X^2$ | Use standard deviation $\sigma = \sqrt{\text{Var}(X)}$ for interpretable units |
| Expected value always exists | Some distributions (e.g., Cauchy) have no finite mean | Check convergence before using expectation-based results |
| "Variance = standard deviation" | They are different quantities; $\sigma \neq \sigma^2$ in general | $\text{Var}(X) = \sigma^2$, $\sigma = \sqrt{\text{Var}(X)}$ |
| Forgetting $a^2$ in $\text{Var}(aX+b)$ | $\text{Var}(aX+b) = a^2 \text{Var}(X)$, not $a \cdot \text{Var}(X)$ | The factor is $a^2$ because variance involves squaring deviations |
| Confusing moments with central moments | Raw moments $E[X^n]$ and central moments $E[(X-\mu)^n]$ are different | Use the correct definition for skewness, kurtosis, etc. |

Interview Questions

Question 1: Expectation of a Function

Q: If $X \sim \text{Poisson}(\lambda)$, what is $E[X(X-1)]$?

A: $E[X(X-1)] = \lambda^2$. By linearity, $E[X(X-1)] = E[X^2] - E[X]$. For a Poisson distribution, $E[X] = \lambda$ and $E[X^2] = \lambda + \lambda^2$, so $E[X(X-1)] = \lambda + \lambda^2 - \lambda = \lambda^2$. This technique is useful for computing higher moments.
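The result can be checked numerically by summing $k(k-1)\,P(X = k)$ over the Poisson PMF. This is a small sketch: the truncation point `terms` and the value $\lambda = 3$ are assumptions for illustration.

```python
import math

def poisson_factorial_moment(lam, terms=200):
    """Approximate E[X(X-1)] for X ~ Poisson(lam) by summing the PMF up to `terms`."""
    pmf = math.exp(-lam)  # P(X = 0)
    total = 0.0
    for k in range(1, terms):
        pmf *= lam / k    # P(X = k) from P(X = k - 1)
        total += k * (k - 1) * pmf
    return total

lam = 3.0
value = poisson_factorial_moment(lam)
print(f"E[X(X-1)] ~ {value:.6f}, lambda^2 = {lam**2}")
```

The truncated tail is negligible for moderate $\lambda$, so the sum should agree with $\lambda^2 = 9$ to many decimal places.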

Question 2: Variance of a Sum

Q: If $\text{Var}(X) = 4$, $\text{Var}(Y) = 9$, and $\text{Cov}(X,Y) = 2$, what is $\text{Var}(2X - 3Y + 5)$?

A: $\text{Var}(2X - 3Y + 5) = 4\text{Var}(X) + 9\text{Var}(Y) - 12\text{Cov}(X,Y) = 16 + 81 - 24 = 73$. The constant 5 drops out, and the coefficient of the covariance term is $2 \cdot (-3) \cdot 2 = -12$ (that is, $2ab$ with $a = 2$ and $b = -3$).
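The same answer follows from the generalized sum formula $\sum_i \sum_j a_i a_j \text{Cov}(X_i, X_j)$. The helper below is an illustrative sketch, not a library function.

```python
def var_linear_combination(coeffs, cov):
    """Var(sum_i a_i X_i) = sum_i sum_j a_i a_j Cov(X_i, X_j); constant shifts drop out."""
    n = len(coeffs)
    return sum(coeffs[i] * coeffs[j] * cov[i][j] for i in range(n) for j in range(n))

# Covariance matrix for (X, Y): Var(X) = 4, Var(Y) = 9, Cov(X, Y) = 2.
cov_xy = [[4, 2],
          [2, 9]]

result = var_linear_combination([2, -3], cov_xy)
print(result)  # 73
```

Writing the inputs as a covariance matrix makes the two cross terms explicit: each contributes $2 \cdot (-3) \cdot 2 = -12$, for $-24$ in total.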

Question 3: Linearity of Expectation

Q: A class of 30 students each flip a fair coin. What is the expected number of heads?

A: Let $X_i$ be the indicator for student $i$ getting heads, so $E[X_i] = 0.5$. By linearity, $E[\sum X_i] = \sum E[X_i] = 30 \cdot 0.5 = 15$. The flips happen to be independent here, but linearity does not rely on that: the same answer would hold even if the flips were dependent.

Question 4: Conditional Expectation

Q: What is $E[X \mid X > 0]$ when $X \sim N(0, 1)$?

A: $E[X \mid X > 0] = \sqrt{2/\pi} \approx 0.7979$. This is the mean of a truncated normal distribution. By symmetry of the standard normal, $1 - \Phi(0) = 0.5$, so $E[X \mid X > 0] = \frac{\phi(0)}{1 - \Phi(0)} = \frac{1/\sqrt{2\pi}}{0.5} = \sqrt{2/\pi}$, where $\phi$ and $\Phi$ are the standard normal PDF and CDF.
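A quick Monte Carlo check of this value: draw standard normal samples, keep the positive ones, and average them. The seed and sample size below are arbitrary choices.

```python
import math
import random

random.seed(0)

samples = [random.gauss(0, 1) for _ in range(200_000)]
positive = [s for s in samples if s > 0]  # condition on the event X > 0

estimate = sum(positive) / len(positive)
exact = math.sqrt(2 / math.pi)

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"sqrt(2/pi):           {exact:.4f}")
```

Conditioning by filtering works here because the event $X > 0$ has probability $0.5$; for rare events, most samples would be discarded and the estimate would be much noisier.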

Question 5: Moment Generating Functions

Q: If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, what is the distribution of $X+Y$?

A: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{(\mu_1+\mu_2)t + (\sigma_1^2+\sigma_2^2)t^2/2}$, which is the MGF of $N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$. By uniqueness of MGFs, $X+Y \sim N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$.

Question 6: Jensen's Inequality

Q: State Jensen's inequality and give an example.

A: For a convex function $\phi$ and random variable $X$: $\phi(E[X]) \leq E[\phi(X)]$. Example: By convexity of $x^2$, $(E[X])^2 \leq E[X^2]$, which implies $\text{Var}(X) = E[X^2] - (E[X])^2 \geq 0$. This is a fundamental inequality used in variational inference (ELBO derivation).
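Jensen's inequality also holds for the empirical distribution of any sample, which makes it easy to check numerically. The sketch below uses a hypothetical uniform sample and two convex functions, $x^2$ and $e^x$.

```python
import math
import random

random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(1_000)]  # hypothetical sample

def mean(v):
    return sum(v) / len(v)

# phi(x) = x^2 (convex): phi(E[X]) vs E[phi(X)]
lhs_square, rhs_square = mean(xs) ** 2, mean([x * x for x in xs])

# phi(x) = exp(x) (convex): phi(E[X]) vs E[phi(X)]
lhs_exp, rhs_exp = math.exp(mean(xs)), mean([math.exp(x) for x in xs])

print(f"(E[X])^2  = {lhs_square:.4f} <= E[X^2]    = {rhs_square:.4f}")
print(f"exp(E[X]) = {lhs_exp:.4f} <= E[exp(X)] = {rhs_exp:.4f}")
```

The gap in the first line is exactly the sample variance, which is the variance identity from the answer above in numerical form.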


Practice Problems

Problem 1: Expected Value of a Function

Let $X$ have PMF $P(X=0)=0.2$, $P(X=1)=0.5$, $P(X=2)=0.3$. Find $E[X]$, $E[X^2]$, and $\text{Var}(X)$.

Solution

$$E[X] = 0(0.2) + 1(0.5) + 2(0.3) = 1.1$$

$$E[X^2] = 0^2(0.2) + 1^2(0.5) + 2^2(0.3) = 1.7$$

$$\text{Var}(X) = 1.7 - 1.1^2 = 1.7 - 1.21 = 0.49$$

Problem 2: Linear Transformation

If $Y = 3X + 5$ and $\text{Var}(X) = 4$, find $\text{Var}(Y)$ and $\text{SD}(Y)$.

Solution

$$\text{Var}(Y) = \text{Var}(3X + 5) = 9 \cdot \text{Var}(X) = 9 \cdot 4 = 36$$

$$\text{SD}(Y) = \sqrt{36} = 6$$

Note: Adding 5 shifts the mean but does not affect variance.

Problem 3: Linearity with Dependent Variables

Let $X$ and $Y$ be random variables with $E[X] = 2$, $E[Y] = 3$, $\text{Var}(X) = 1$, $\text{Var}(Y) = 4$, and $\text{Cov}(X,Y) = -1$. Find $E[2X - Y + 3]$ and $\text{Var}(2X - Y + 3)$.

Solution

$$E[2X - Y + 3] = 2E[X] - E[Y] + 3 = 4 - 3 + 3 = 4$$

$$\text{Var}(2X - Y + 3) = 4\text{Var}(X) + \text{Var}(Y) - 4\text{Cov}(X,Y) = 4(1) + 4 - 4(-1) = 12$$

The covariance term has a $-4$ coefficient because the formula gives $2(2)(-1)\text{Cov}(X,Y)$.

Problem 4: MGF Application

If $M_X(t) = \frac{1}{1-t}$ for $t < 1$, identify the distribution of $X$ and find $E[X]$ and $\text{Var}(X)$.

Solution

The MGF $M_X(t) = (1-t)^{-1}$ is the MGF of an Exponential(1) distribution. Differentiating gives $M_X'(t) = (1-t)^{-2}$ and $M_X''(t) = 2(1-t)^{-3}$, so:

$$E[X] = M_X'(0) = \frac{1}{(1-0)^2} = 1$$

$$E[X^2] = M_X''(0) = \frac{2}{(1-0)^3} = 2$$

$$\text{Var}(X) = E[X^2] - (E[X])^2 = 2 - 1 = 1$$

These values match the mean $1/\lambda$ and variance $1/\lambda^2$ of an Exponential distribution with rate $\lambda = 1$, consistent with $X \sim \text{Exponential}(1)$.

Problem 5: Chebyshev's Inequality

A random variable $X$ has $\mu = 10$ and $\sigma^2 = 4$. Use Chebyshev's inequality to bound $P(|X - 10| \geq 5)$.

Solution

Chebyshev's inequality states: $P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$.

Here $\sigma = 2$ and $k\sigma = 5$, so $k = 5/2 = 2.5$.

$$P(|X - 10| \geq 5) \leq \frac{1}{(2.5)^2} = \frac{1}{6.25} = 0.16$$

This holds for any distribution with the given mean and variance; no normality assumption is needed.
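To see how conservative the bound can be, compare it with the exact tail probability under one specific distribution. The sketch below assumes, purely for comparison, that $X$ is normal with the same mean and variance.

```python
import math

mu, var, threshold = 10.0, 4.0, 5.0
sigma = math.sqrt(var)
k = threshold / sigma  # number of standard deviations: 2.5

chebyshev_bound = 1 / k**2

# Exact two-sided tail probability IF X were normal (an assumption for comparison only).
normal_tail = 1 - math.erf(k / math.sqrt(2))

print(f"Chebyshev bound:      {chebyshev_bound:.4f}")
print(f"Normal tail (exact):  {normal_tail:.4f}")
```

For a normal variable the actual probability is roughly 0.012, more than ten times smaller than the Chebyshev bound of 0.16. The bound is loose because it must cover every distribution with this mean and variance.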


Quick Reference

| Quantity | Formula | Python |
|---|---|---|
| Expectation (discrete) | $E[X] = \sum x \cdot p(x)$ | `np.sum(x * p)` |
| Expectation (continuous) | $E[X] = \int x f(x) \, dx$ | `np.trapz(x * pdf, x)` (`np.trapezoid` in NumPy 2.0+) |
| Variance | $\text{Var}(X) = E[X^2] - (E[X])^2$ | `np.var(x)` |
| Standard Deviation | $\sigma = \sqrt{\text{Var}(X)}$ | `np.std(x)` |
| $n$-th raw moment | $E[X^n]$ | `np.mean(x**n)` |
| $n$-th central moment | $E[(X-\mu)^n]$ | `np.mean((x - mu)**n)` |
| Skewness | $\gamma_1 = \mu_3 / \sigma^3$ | `scipy.stats.skew(x)` |
| Kurtosis | $\kappa = \mu_4 / \sigma^4$ | `scipy.stats.kurtosis(x, fisher=False)` (the default returns excess kurtosis, $\kappa - 3$) |
| MGF | $M_X(t) = E[e^{tX}]$ | `np.mean(np.exp(x * t))` |
| Linearity of $E$ | $E[aX+b] = aE[X]+b$ | N/A |
| Variance scaling | $\text{Var}(aX+b) = a^2\text{Var}(X)$ | N/A |
| Covariance rule | $\text{Var}(X+Y) = \text{Var}(X)+\text{Var}(Y)+2\text{Cov}(X,Y)$ | `np.cov(x, y)` |
