Data 100 SP26 — Finals Study Guide

01

Pandas

comfortable

Key concepts

Series vs. DataFrame: Series = 1D labeled array; DataFrame = 2D table with labeled rows & columns.
.loc vs. .iloc: .loc[row_label, col_label] — label-based; .iloc[row_int, col_int] — integer position-based.
Boolean filtering: df[df['col'] > 5] creates a mask; chain with & | ~ (not Python's and/or/not).
GroupBy: df.groupby('col').agg({'other': 'mean'}) — split → apply → combine pattern.
Merge/Join: pd.merge(df1, df2, on='key', how='inner'). Know inner / left / outer differences.
Missing values: df.isna(), dropna(), fillna(val). NaN propagates through arithmetic.
Apply: df['col'].apply(func) applies element-wise; df.apply(func, axis=1) applies row-wise.

⚡ Key insight

Never use Python's and/or in boolean masks — always use & and | and wrap each condition in parentheses. This is a classic exam trap.
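A minimal sketch of this rule, assuming a small made-up DataFrame (the rows and values are illustrative, not course data):

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "score": [91, 67, 78, 85],
    "grade": ["A", "C", "B", "A"],
})

# Each condition is wrapped in parentheses and combined with & / | / ~.
mask = (df["score"] > 70) & ~(df["grade"] == "B")
high_non_b = df[mask]

# Python's `and` asks for one truth value for a whole Series, which pandas refuses.
try:
    df[(df["score"] > 70) and (df["grade"] == "A")]
    used_and_ok = True
except ValueError:
    used_and_ok = False
```

Here `high_non_b` keeps only the rows that pass both conditions, while the `and` version raises a `ValueError` before any filtering happens.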

✎ Practice questions

Given df with columns name, score, grade, write code to get the mean score for each grade group, sorted descending.
Show answer
df.groupby('grade')['score'].mean().sort_values(ascending=False)
What is the difference between df.loc[0] and df.iloc[0] when the DataFrame index is [10, 20, 30]?
Show answer
df.loc[0] raises a KeyError (0 is not a label in the index). df.iloc[0] returns the first row (index label 10).
Write a merge that returns only rows where a student appears in both scores_df and roster_df, matching on student_id.
Show answer
pd.merge(scores_df, roster_df, on='student_id', how='inner')
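The three answers above can be checked on toy data. This is a sketch under assumed inputs: the rows, the `section` column, and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cy"], "score": [90, 70, 80], "grade": ["A", "B", "A"]},
    index=[10, 20, 30],
)

# Mean score per grade, highest first.
mean_by_grade = df.groupby("grade")["score"].mean().sort_values(ascending=False)

# .iloc is positional: position 0 is the row labeled 10.
first_row_name = df.iloc[0]["name"]

# .loc is label-based: there is no label 0, so this raises KeyError.
try:
    df.loc[0]
    loc_zero_found = True
except KeyError:
    loc_zero_found = False

# Inner merge keeps only student_ids present in both tables.
scores_df = pd.DataFrame({"student_id": [1, 2, 3], "score": [88, 92, 75]})
roster_df = pd.DataFrame({"student_id": [2, 3, 4], "section": ["X", "Y", "X"]})
both = pd.merge(scores_df, roster_df, on="student_id", how="inner")
```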

02

EDA — Exploratory Data Analysis

comfortable

↗ Course Note 5

Key concepts

EDA checklist: shape, dtypes, null counts, summary stats, distributions, outliers.
Data faithfulness: Does the data actually measure what we think? Check for outliers, inconsistencies, placeholder values (e.g. -999, "N/A").
Granularity: What does each row represent? (a person, a transaction, a day?)
Skewness: Right-skewed → mean > median. Log-transform often fixes right skew.
KDE (Kernel Density Estimate): Smooth version of a histogram; sensitive to bandwidth choice.
Transformations: Log, square root, reciprocal — use to make distributions more symmetric before modeling.
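To see the skewness and log-transform points numerically, here is a small sketch on a hypothetical income list (the values are invented to be right-skewed):

```python
import math
import statistics

# Hypothetical right-skewed incomes: most values are modest, a few are very large.
incomes = [22_000, 28_000, 31_000, 35_000, 38_000, 45_000, 60_000, 250_000, 900_000]

mean_raw = statistics.mean(incomes)
median_raw = statistics.median(incomes)

# The log transform compresses the right tail.
log_incomes = [math.log10(v) for v in incomes]
mean_log = statistics.mean(log_incomes)
median_log = statistics.median(log_incomes)

# Relative gap between mean and median, before and after the transform.
gap_raw = (mean_raw - median_raw) / median_raw
gap_log = (mean_log - median_log) / median_log
```

The mean sits far above the median on the raw scale, and the gap shrinks sharply after the transform.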

⚡ Key insight

EDA is never done just once. After transformations or cleaning, re-examine distributions. The goal is to understand the data, not just compute stats.

✎ Practice questions

A column has mean = 50,000 and median = 35,000. Is it likely right-skewed or left-skewed? What transformation might help?
Show answer
Right-skewed (mean > median, pulled by large values). A log transform would compress the right tail and make the distribution more symmetric.
Name two ways to identify potential data quality issues during EDA.
Show answer
Any two of: checking for NaN/null counts; looking at min/max for impossible values; value_counts for unexpected categories; verifying data types match expected; checking for duplicates.

03

Regular Expressions (Regex)

comfortable

↗ Course Note 6

Cheat sheet

. — any character (except newline) | \d digit | \w word char | \s whitespace
* zero or more | + one or more | ? zero or one
{m,n} between m and n times | ^ start of string | $ end of string
[abc] character class | [^abc] negated class | (a|b) alternation
() capturing group | (?:) non-capturing group
Python: re.search(), re.findall(), re.sub(), re.fullmatch()
Pandas: df['col'].str.extract(r'pattern'), str.contains(), str.replace()

⚡ Key insight

Quantifiers are greedy by default (match as much as possible). Add ? after to make lazy: .*?. On exams, trace through the string character-by-character.
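A short sketch of greedy vs. lazy matching with Python's `re` module (the sample string is made up):

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so there is one match from the first '<' to the last '>'.
greedy = re.findall(r"<.*>", text)

# Lazy: .*? stops at the earliest possible '>', giving one match per tag.
lazy = re.findall(r"<.*?>", text)
```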

✎ Practice questions

Write a regex pattern that matches a US phone number in the format XXX-XXX-XXXX.
Show answer
\d{3}-\d{3}-\d{4}
What does re.findall(r'\b\w{4}\b', 'the quick brown fox') return?
Show answer
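[] (an empty list). \b\w{4}\b matches only whole words of exactly 4 characters, and none qualify: "the" and "fox" have 3, "quick" and "brown" have 5. The pattern \b\w{5}\b would return ['quick', 'brown'].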

04

Visualization

comfortable

↗ Course Notes 7–8

Key concepts

Choose the right plot: Distributions → histogram / KDE; Relationships → scatter; Categories → bar; Time → line.
Histogram vs. KDE: Histograms bin data (bin width matters!); KDE smooths with a kernel (bandwidth matters!).
Overplotting solutions: reduce alpha (alpha=0.3), jitter, hexbin, or contour plots.
Transformations: Log scale on an axis ≠ log-transforming data. Know when each is appropriate.
Color: Use perceptually uniform colormaps for continuous data (viridis). Avoid rainbow. Categorical → qualitative palettes.
Misleading charts: Truncated y-axis, inconsistent bin widths, dual axes — know how to spot and critique these.

⚡ Key insight

The choice of plot type should follow from the question you're asking, not the data shape. Always label axes and include units.

✎ Practice questions

You have salary data for 10,000 employees. Which plot would best show the distribution shape and why?
Show answer
A histogram or KDE — both show the distribution's shape (modality, skew, spread). With 10,000 points, a box plot would hide the shape. A bar chart would be inappropriate for continuous data.
A scatter plot of two variables looks like all points cluster at the lower-left with a few extreme outliers. What transformation might help reveal the relationship?
Show answer
Apply a log transform to one or both axes (or the data itself). This compresses the right tail and spreads out the clustered low values, often revealing a linear or power-law relationship.

05

Sampling

comfortable

↗ Course Note 9

Key concepts

Population vs. sample: Population = everyone we care about; Sample = who we actually measure.
SRS (Simple Random Sampling): Every possible sample of a given size is equally likely, so every individual has an equal chance of being selected. With vs. without replacement matters for small populations.
Stratified sampling: Divide into subgroups (strata), then SRS within each. Ensures representation.
Cluster sampling: Randomly select clusters (e.g. classrooms), then survey everyone in selected clusters.
Convenience sampling: Not random — major source of bias. (e.g. surveying only people who walk by)
Bias types: Selection bias, response bias, non-response bias, survivorship bias.
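The three random designs above can be sketched with the standard library. The population, stratum labels, and sample sizes below are assumptions chosen for illustration:

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 90 undergraduates and 10 graduate students.
population = [("ugrad", i) for i in range(90)] + [("grad", i) for i in range(10)]

# SRS of size 10, drawn without replacement.
srs = rng.sample(population, 10)

# Stratified sample: an SRS of fixed size within each stratum.
strata = {}
for person in population:
    strata.setdefault(person[0], []).append(person)
stratified = [p for group in strata.values() for p in rng.sample(group, 5)]

# Cluster sample: split into 10 clusters of 10, pick 2 clusters, keep everyone in them.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [p for c in rng.sample(clusters, 2) for p in c]
```

The stratified draw always contains 5 graduate students, whereas the SRS may contain none.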

⚡ Key insight

A large sample size does not fix bias — the 1936 Literary Digest poll had 2.4 million respondents but was wildly wrong due to selection bias. Randomization is what removes bias.

✎ Practice questions

A hospital surveys patients who were discharged — not those who died — to measure treatment effectiveness. What bias is present?
Show answer
Survivorship bias. Only patients who survived and were discharged are in the sample, so the sample systematically excludes the worst outcomes.
Why is stratified sampling preferred over SRS when the population has important subgroups of very different sizes?
Show answer
SRS might under-sample small but important subgroups by chance. Stratified sampling guarantees each subgroup is represented proportionally (or at a chosen rate), reducing variance in estimates.

06

Modeling & SLR

comfortable

↗ Course Note 10

Key concepts

Model: A simplified mathematical representation of a process. All models are wrong, some are useful.
Loss function: Measures how wrong predictions are. MSE = mean squared error; MAE = mean absolute error.
Optimal constant model: Under MSE → mean; under MAE → median. This is a critical fact.
SLR: $\hat{y} = \theta_0 + \theta_1 x$. Minimize MSE to get $\theta_1 = r \cdot (\sigma_y / \sigma_x)$, $\theta_0 = \bar{y} - \theta_1 \bar{x}$.
Residuals: $e_i = y_i - \hat{y}_i$. OLS residuals always sum to zero and are uncorrelated with $x$.
$R^2$: Proportion of variance in $y$ explained by the model. $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$. Higher is better (max 1).

SLR formulas

$$\begin{aligned} \theta_1 &= r \cdot \frac{\sigma_y}{\sigma_x} \quad \text{(slope)} \\ \theta_0 &= \bar{y} - \theta_1 \bar{x} \quad \text{(intercept)} \end{aligned}$$

where $r$ is the Pearson correlation coefficient
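As a sketch, the slope and intercept formulas can be computed directly from data. The helper name `slr_fit` and the data points are hypothetical:

```python
import math

def slr_fit(x, y):
    """Return (theta0, theta1) minimizing MSE for y_hat = theta0 + theta1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    # Pearson correlation coefficient r.
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n * sd_x * sd_y)
    theta1 = r * sd_y / sd_x          # slope
    theta0 = y_bar - theta1 * x_bar   # intercept
    return theta0, theta1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
theta0, theta1 = slr_fit(x, y)
residuals = [yi - (theta0 + theta1 * xi) for xi, yi in zip(x, y)]
```

The residuals sum to zero (up to floating-point error), and the fitted line passes through $(\bar{x}, \bar{y})$.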

⚡ Key insight

The optimal prediction under MSE is always the mean of the conditional distribution. This is why the constant model minimizer is the mean, and why SLR predictions pass through $(\bar{x}, \bar{y})$.

✎ Practice questions

If $r = 0.8$, $\sigma_y = 10$, $\sigma_x = 4$, $\bar{x} = 5$, $\bar{y} = 20$, find the SLR slope and intercept.
Show answer
$\theta_1 = 0.8 \times (10/4) = 2.0$. $\theta_0 = 20 - 2.0 \times 5 = 10$. Model: $\hat{y} = 10 + 2x$.
You fit a constant model (predicting the same value for all inputs). What value minimizes MSE? What minimizes MAE?
Show answer
MSE → minimized by the mean of $y$. MAE → minimized by the median of $y$.
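A brute-force check of this fact on a made-up dataset with one outlier. The candidate grid is an assumption of this sketch, not part of the course material:

```python
import statistics

y = [1, 2, 3, 4, 100]  # hypothetical data with one outlier

def mse(c):
    return sum((yi - c) ** 2 for yi in y) / len(y)

def mae(c):
    return sum(abs(yi - c) for yi in y) / len(y)

# Search over candidate constants 0.0, 0.1, ..., 100.0.
candidates = [k / 10 for k in range(0, 1001)]
best_mse = min(candidates, key=mse)
best_mae = min(candidates, key=mae)
```

The MSE minimizer lands on the mean (pulled toward the outlier), while the MAE minimizer lands on the median.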

07

Ordinary Least Squares (OLS)

comfortable

↗ Course Note 12

Key concepts

Matrix form: Model is $\hat{y} = X\theta$. Loss is $\mathrm{MSE} = \tfrac{1}{n}\lVert y - X\theta\rVert^2$. Minimize to get normal equations.
Normal equations: $\hat{\theta} = (X^\top X)^{-1} X^\top y$ — unique solution when $X^\top X$ is invertible (columns are linearly independent).
Geometric view: $\hat{y} = X\hat{\theta}$ is the orthogonal projection of $y$ onto the column space of $X$. Residuals $e = y - \hat{y} \perp \mathrm{col}(X)$.
Design matrix $X$: First column is all 1s (for the intercept). Each subsequent column is a feature.
Full rank: If $X$ does not have full column rank (e.g., duplicate or perfectly correlated features), $(X^\top X)^{-1}$ does not exist → infinitely many solutions.

OLS Normal Equations

$$\begin{aligned} X^\top X\,\hat{\theta} &= X^\top y \\ \hat{\theta} &= (X^\top X)^{-1} X^\top y \quad \text{(when } X^\top X \text{ invertible)} \end{aligned}$$

Geometric form (with hat / projection matrix $H$):

$$\hat{y} = X(X^\top X)^{-1} X^\top y = H y$$
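A minimal NumPy sketch of the normal equations on synthetic data (the coefficients and noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: intercept column of 1s, then one column per feature.
X = np.column_stack([np.ones(n), x1, x2])

# Solve X^T X theta = X^T y rather than forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ theta_hat      # projection of y onto col(X)
residual = y - y_hat       # orthogonal to every column of X
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical-stability choice; both give the same $\hat{\theta}$ when $X^\top X$ is invertible.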

⚡ Key insight

The residual vector $e = y - \hat{y}$ is always perpendicular to every column of $X$. This means $X^\top e = 0$, which is the key OLS property used to derive the normal equations.

✎ Practice questions

You add a third feature to your design matrix that equals $2\times$ the first feature. What happens to the OLS solution?
Show answer
$X^\top X$ becomes singular (not invertible) because the columns are linearly dependent. The OLS solution is not unique — infinitely many $\hat{\theta}$ achieve the same minimum MSE.
Why do we include a column of 1s in the design matrix?
Show answer
It allows the model to estimate an intercept term $\theta_0$. Without it, the regression line is forced through the origin.
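The duplicate-feature case can be illustrated with a tiny made-up design matrix whose last column is twice another column:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, -1.0, 2.0, 0.0])
X = np.column_stack([np.ones(4), x1, x2, 2 * x1])  # last column = 2 * x1
y = np.array([1.0, 3.0, 2.0, 5.0])

rank = np.linalg.matrix_rank(X)  # fewer than 4 independent columns

# One least-squares solution, plus a second one shifted along the null space of X.
theta_a = np.linalg.lstsq(X, y, rcond=None)[0]
theta_b = theta_a + np.array([0.0, 2.0, 0.0, -1.0])  # X @ [0, 2, 0, -1] = 0
```

Both parameter vectors give identical predictions, so they achieve the same MSE.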

08

Gradient Descent

comfortable

↗ Course Note 13

Key concepts

Idea: Iteratively update parameters by moving opposite to the gradient of the loss.
Update rule: $\theta \leftarrow \theta - \alpha \cdot \nabla L(\theta)$, where $\alpha$ is the learning rate.
Learning rate $\alpha$: Too large → overshoots, may diverge. Too small → converges very slowly.
Batch GD: Gradient computed on all $n$ data points per step — exact but slow for large $n$.
Stochastic GD (SGD): Gradient on 1 random point per step — noisy but fast.
Mini-batch GD: Gradient on a batch of $k$ points — the practical standard.
Convexity: If the loss is convex and the learning rate is suitably small, GD converges to a global minimum. MSE is convex; log-loss is convex.

Gradient Descent Update

$$\theta \;\leftarrow\; \theta - \alpha \cdot \nabla_\theta L(\theta)$$

For MSE loss with model $\hat{y} = X\theta$:

$$\nabla_\theta L = -\tfrac{2}{n}\, X^\top (y - X\theta)$$
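A sketch of batch gradient descent using this gradient, on synthetic data. The learning rate and iteration count are hand-picked assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=n)

def mse_gradient(theta):
    # Gradient of (1/n) * ||y - X theta||^2 with respect to theta.
    return -(2 / n) * X.T @ (y - X @ theta)

theta = np.zeros(2)
alpha = 0.1  # learning rate (hand-picked for this sketch)
for _ in range(500):
    theta = theta - alpha * mse_gradient(theta)

# Closed-form OLS solution for comparison.
theta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)
```

With this step size the iterates end up matching the closed-form solution to within numerical tolerance.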

⚡ Key insight

GD is needed when there is no closed-form solution (e.g. logistic regression). For OLS (linear + MSE), the closed-form $\hat{\theta} = (X^\top X)^{-1} X^\top y$ is exact, so GD is not required (it is mainly useful there when the data are too large to solve the normal equations directly).

✎ Practice questions

Your loss curve oscillates wildly between iterations instead of decreasing. What is the likely cause and fix?
Show answer
Learning rate $\alpha$ is too large — the steps overshoot the minimum. Fix: reduce $\alpha$.
Compare batch GD and SGD: which converges more smoothly and which is faster per epoch for $n = 1{,}000{,}000$?
Show answer
Batch GD converges smoothly (uses all data to compute exact gradient) but is slow — each step requires $n = 10^6$ operations. SGD updates parameters much more frequently (after each point) and can converge much faster in wall-clock time, but the path is noisy/jagged.

09

Feature Engineering

comfortable

↗ Course Notes 14–15

Key concepts

Polynomial features: Add $x^2$, $x^3$, … to allow a linear model to fit curves. Still a linear model in $\theta$.
One-hot encoding: Convert a categorical column with $k$ levels into $k - 1$ binary columns (drop one to avoid dummy variable trap / multicollinearity).
Interaction terms: $x_1 \cdot x_2$ — captures joint effects of two features.
Standardization: $z = (x - \mu)/\sigma$. Makes features zero-mean, unit-variance. Important for regularization and GD convergence.
Training error vs. test error: Adding more features never increases training MSE (it usually ↓), but can ↑ test MSE (overfitting).
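The feature transformations listed above can be sketched with NumPy. The toy values and the choice of "Spring" as the dropped reference level are assumptions for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
season = np.array(["Spring", "Summer", "Fall", "Winter"])

# Polynomial features: columns x, x^2, x^3.
poly = np.column_stack([x, x**2, x**3])

# One-hot encode with k - 1 columns; "Spring" is the dropped reference level.
levels = ["Summer", "Fall", "Winter"]
one_hot = np.column_stack([(season == lv).astype(float) for lv in levels])

# Standardize each polynomial column to zero mean and unit variance.
poly_std = (poly - poly.mean(axis=0)) / poly.std(axis=0)

# Final design matrix: intercept + standardized polynomial terms + dummies.
X = np.column_stack([np.ones(len(x)), poly_std, one_hot])
```

The model fit on `X` is still linear in $\theta$; only the design matrix has grown.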

⚡ Key insight

Feature engineering keeps the model linear in the parameters $\theta$ even when it's nonlinear in the original features. A model with $x$, $x^2$, $x^3$ is still OLS — you just have a bigger design matrix.

✎ Practice questions

A categorical variable "season" has 4 values: Spring, Summer, Fall, Winter. How many one-hot encoded columns should you add to the design matrix, and why?
Show answer
3 columns ($k - 1 = 4 - 1$). Dropping one avoids perfect multicollinearity — the dropped category is the reference level implicitly captured by the intercept.
You add polynomial features up to degree 10 and your training MSE drops to near zero, but test MSE is very high. What happened?
Show answer
Overfitting. The high-degree polynomial model memorized the training data (near-zero training error) but fails to generalize. Regularization or reducing polynomial degree would help.

10

Cross-Validation

comfortable

↗ Course Note 17

Key concepts

Why CV? Estimate how well a model generalizes to unseen data, without using the test set.
Train / Val / Test split: Train on train set; tune hyperparameters on validation; report final performance on test set (once!).
k-Fold CV: Split data into k folds; train on k−1, validate on 1, rotate k times; average the k validation errors.
Leave-One-Out CV (LOOCV): k = n. Nearly unbiased estimate of generalization error, but computationally expensive ($n$ model fits).
Test set contamination: If you peek at the test set to make any decision, it's no longer a valid test set.
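A minimal k-fold CV sketch for an OLS model. The helper name `k_fold_mse` and the synthetic data are hypothetical:

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Return (average validation MSE, list of per-fold MSEs) for OLS."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # shuffle, then split into k folds
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
        errors.append(float(np.mean((y[val_idx] - X[val_idx] @ theta) ** 2)))
    return float(np.mean(errors)), errors

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones(100), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
cv_mse, fold_errors = k_fold_mse(X, y, k=5)
```

Each point is used for validation exactly once, and the test set (not shown) is never touched during this loop.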

⚡ Key insight

The test set is sacred — you use it exactly once, at the very end, to report final performance. Using it to select hyperparameters inflates reported accuracy.

✎ Practice questions

You perform 5-fold CV and get validation MSEs of $[12, 15, 11, 14, 13]$. What is your estimated generalization MSE?
Show answer
$\text{Mean} = (12+15+11+14+13)/5 = 65/5 = $ 13.0.
A classmate says "I tried 20 different models and picked the one with the best test set performance." What's wrong?
Show answer
Test set contamination. By using the test set to select a model, they've effectively trained on it. The reported test performance is optimistically biased and won't reflect true generalization.

11

Regularization

comfortable

↗ Course Note 17

Key concepts

Purpose: Penalize large coefficients to reduce overfitting. Adds a penalty term to the loss function.
Ridge (L2): Loss $+ \lambda \lVert\theta\rVert_2^2$. Shrinks all coefficients toward zero; never sets them exactly to zero. Has a closed form.
LASSO (L1): Loss $+ \lambda \lVert\theta\rVert_1$. Can set coefficients exactly to zero → performs feature selection.
$\lambda$ (lambda): Regularization strength. $\lambda = 0$ → plain OLS. $\lambda \to \infty$ → all coefficients → 0.
Tuning $\lambda$: Use cross-validation to select the best $\lambda$.
Intercept: The intercept $\theta_0$ is typically not penalized; only the feature coefficients are shrunk.
Standardize features before regularizing; otherwise the penalty depends on each feature's units (a feature measured on a small scale needs a large coefficient and is penalized more heavily).

Ridge regression (L2)

$$\hat{\theta}_{\text{ridge}} = \arg\min_\theta \left[\, \tfrac{1}{n}\lVert y - X\theta\rVert^2 + \lambda \lVert\theta\rVert_2^2 \,\right] = (X^\top X + n\lambda I)^{-1} X^\top y$$

Invertible for any $\lambda > 0$ thanks to the $n\lambda I$ term.
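A sketch of the ridge closed form, applied to a design matrix with a duplicated column so that plain OLS would have no unique solution. The helper `ridge_fit` and the data are made up, and for simplicity this version penalizes every coefficient (there is no separate unpenalized intercept):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimizer of (1/n) * ||y - X theta||^2 + lam * ||theta||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X = np.column_stack([X, X[:, 0]])  # duplicate column: X^T X is singular
y = rng.normal(size=30)

theta_small = ridge_fit(X, y, lam=0.01)
theta_large = ridge_fit(X, y, lam=100.0)
```

Even though `X` is rank-deficient, both fits succeed, and the larger $\lambda$ gives a coefficient vector with a smaller norm.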

LASSO (L1)

$$\hat{\theta}_{\text{lasso}} = \arg\min_\theta \left[\, \tfrac{1}{n}\lVert y - X\theta\rVert^2 + \lambda \lVert\theta\rVert_1 \,\right]$$

No closed form — requires a numerical solver.

⚡ Key insight

Ridge never exactly zeros out a coefficient; LASSO can. This is because the $L_1$ penalty has a "corner" at zero in its geometry, which allows the optimum to sit exactly on zero.

✎ Practice questions

You have 50 features and suspect only 5 are truly relevant. Which regularization method should you prefer and why?
Show answer
LASSO ($L_1$) — it can set irrelevant feature coefficients exactly to zero, effectively selecting the 5 relevant features. Ridge would shrink all 50 but keep all nonzero.
As $\lambda$ increases in Ridge regression, what happens to the bias and variance of the model?
Show answer
Bias increases (the model is constrained away from the true OLS solution) and variance decreases (coefficients are more stable). This is the bias-variance tradeoff.

12

Random Variables

★ harder

📌 Spend extra time here — probability is foundational to inference, bias-variance, and logistic regression.

↗ Course Note 18

Key concepts

Random Variable (RV): A variable whose value is determined by a random process. Not a specific number — a function from outcomes to numbers.
PMF (discrete): $P(X = x)$ for each possible value $x$. Must sum to 1.
PDF (continuous): $f(x) \ge 0$; $P(a \le X \le b) = \int_a^b f(x)\,dx$. Must integrate to 1.
CDF: $F(x) = P(X \le x)$. Always non-decreasing; $F(-\infty) = 0$, $F(+\infty) = 1$.
Expectation: $E[X] = \sum x \cdot P(X = x)$ (discrete) or $\int x f(x)\,dx$ (continuous). The "center of mass."
Variance: $\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$. Always $\ge 0$.
Standard deviation: $\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$. Same units as $X$.
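Linearity of expectation: $E[aX + b] = a \cdot E[X] + b$ and $E[X + Y] = E[X] + E[Y]$. The sum rule works even for dependent RVs.
Variance of linear combo: $\mathrm{Var}(aX + b) = a^2 \cdot \mathrm{Var}(X)$. In general $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$, so $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ when $X, Y$ are independent (or, more generally, uncorrelated).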
Common distributions:
- Bernoulli$(p)$: $E = p$, $\mathrm{Var} = p(1-p)$.
- Binomial$(n, p)$: $E = np$, $\mathrm{Var} = np(1-p)$.
- Normal$(\mu, \sigma^2)$: $E = \mu$, $\mathrm{Var} = \sigma^2$.

Key Identities

$$\begin{aligned} E[aX + bY] &= a \cdot E[X] + b \cdot E[Y] && \text{(always)} \\ \mathrm{Var}(X) &= E[X^2] - (E[X])^2 && \text{(computational form)} \\ \mathrm{Var}(aX + b) &= a^2 \cdot \mathrm{Var}(X) && \text{(scaling)} \\ \mathrm{Var}(X + Y) &= \mathrm{Var}(X) + \mathrm{Var}(Y) && \text{(if } X \perp Y \text{)} \\ \mathrm{Cov}(X, Y) &= E[XY] - E[X] \cdot E[Y] \\ \mathrm{Corr}(X, Y) &= \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X) \cdot \mathrm{SD}(Y)} \end{aligned}$$
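These identities can be checked exactly with fractions. The sketch below uses a fair die and two independent dice; the helper names are hypothetical:

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}  # PMF of a fair die

def expectation(pmf, g=lambda v: v):
    return sum(g(v) * p for v, p in pmf.items())

def variance(pmf):
    # Computational form: E[X^2] - (E[X])^2.
    return expectation(pmf, lambda v: v * v) - expectation(pmf) ** 2

e_die = expectation(die)
var_die = variance(die)

# Two independent dice X, Y: the joint PMF is the product of the marginals.
lin = {}
for (x, px), (y, py) in product(die.items(), die.items()):
    value = 2 * x - 3 * y + 5
    lin[value] = lin.get(value, 0) + px * py
var_lin = variance(lin)  # Var(2X - 3Y + 5)
```

The variance of $2X - 3Y + 5$ comes out to $4\,\mathrm{Var}(X) + 9\,\mathrm{Var}(Y)$, with the constant contributing nothing.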

⚡ Key insight

Expectation is always linear: $E[X + Y] = E[X] + E[Y]$ regardless of dependence. Variance is not: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ holds when $X$ and $Y$ are independent (zero covariance is what is actually required); otherwise a $2\,\mathrm{Cov}(X, Y)$ term appears.

✎ Practice questions

$X$ is Bernoulli with $P(X = 1) = 0.3$. Find $E[X]$, $E[X^2]$, and $\mathrm{Var}(X)$.
Show answer
$E[X] = 0.3$. $E[X^2] = 1^2 \cdot 0.3 + 0^2 \cdot 0.7 = 0.3$. $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = 0.3 - 0.09 = $ 0.21. (Or use $p(1-p) = 0.3 \cdot 0.7 = 0.21$.)
If $X$ and $Y$ are independent with $\mathrm{Var}(X) = 4$ and $\mathrm{Var}(Y) = 9$, find $\mathrm{Var}(2X - 3Y + 5)$.
Show answer
$\mathrm{Var}(2X - 3Y + 5) = 2^2 \cdot \mathrm{Var}(X) + (-3)^2 \cdot \mathrm{Var}(Y) = 4 \cdot 4 + 9 \cdot 9 = 16 + 81 = $ 97. (Constants add 0 variance.)
A fair die is rolled. Let $X$ be the outcome. Find $E[X]$ and $\mathrm{Var}(X)$.
Show answer
$E[X] = (1+2+3+4+5+6)/6 = 3.5$. $E[X^2] = (1+4+9+16+25+36)/6 = 91/6 \approx 15.17$. $\mathrm{Var}(X) = 91/6 - (3.5)^2 = 91/6 - 49/4 = 182/12 - 147/12 = 35/12 \approx $ 2.917.

13

Bias & Variance

★ harder

📌 This is one of the most tested conceptual topics on the final. Know the decomposition cold.

↗ Course Note 19

Key concepts

Context: We imagine drawing many different training sets from the same population and fitting a model each time.
Bias: How far is the average prediction from the truth? $E[\hat{\theta}] - \theta$. Systematic error — from wrong model assumptions.
Variance: How much do predictions vary across different training sets? $\mathrm{Var}(\hat{\theta})$. Sensitivity to training data.
MSE decomposition: $E[(\hat{\theta} - \theta)^2] = \text{Bias}^2 + \text{Variance}$. (Plus irreducible noise for prediction tasks.)
Bias-Variance tradeoff: Complex models → low bias, high variance. Simple models → high bias, low variance.
Underfitting: Model too simple; high bias. Both training and test error are high.
Overfitting: Model too complex; high variance. Training error low, test error high.
Bootstrap: Resample with replacement from your data to estimate the sampling distribution of a statistic.

MSE Decomposition (estimator $\hat{\theta}$ of parameter $\theta$)

$$E\!\left[(\hat{\theta} - \theta)^2\right] = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta}) \quad\text{where } \mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

For prediction at a point $x$:

$$\text{Expected prediction error} = \mathrm{Bias}^2 + \mathrm{Var} + \sigma_{\text{noise}}^2$$

⚡ Key insight

An unbiased estimator is not necessarily better — it might have very high variance. A slightly biased estimator with much lower variance can have lower total MSE. This is exactly why regularization helps: it introduces a little bias to greatly reduce variance.
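A simulation sketch of the decomposition, comparing the sample mean with a deliberately shrunken version of it. The population parameters, the shrinkage factor 0.9, and the number of trials are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 1.0, 2.0, 10, 20_000

# Each row is one simulated sample of size n from the same population.
samples = rng.normal(mu, sigma, size=(trials, n))
sample_means = samples.mean(axis=1)

def decompose(estimates, truth):
    bias = estimates.mean() - truth
    var = estimates.var()  # ddof=0, matching the identity
    mse = np.mean((estimates - truth) ** 2)
    return bias, var, mse

bias_u, var_u, mse_u = decompose(sample_means, mu)        # unbiased estimator
bias_s, var_s, mse_s = decompose(0.9 * sample_means, mu)  # biased, lower variance
```

In this particular setup the shrunken estimator has nonzero bias but lower MSE; with a different $\mu$ or shrinkage factor the comparison can go the other way.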

✎ Practice questions

You fit a degree-1 model to data generated from a true degree-3 relationship. Is this an example of high bias or high variance? What about fitting a degree-10 model to 15 data points?
Show answer
Degree-1 model → high bias (underfitting — too simple for the true function). Degree-10 on 15 points → high variance (overfitting — model is too complex relative to data size, will change drastically with different samples).
Explain in one sentence why the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$.
Show answer
$E[\bar{x}] = E\!\left[\tfrac{1}{n}\sum_i x_i\right] = \tfrac{1}{n} \cdot n\mu = \mu$, so the expected value of the sample mean equals the population mean — bias is zero.
Describe the bootstrap procedure for estimating the standard error of the median.
Show answer
(1) Resample n points with replacement from your data, B times. (2) Compute the median of each resample → get B medians. (3) The standard deviation of those B medians is your bootstrap estimate of the standard error.
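The three steps above can be sketched with the standard library. The data values, `B = 2000`, and the helper name are assumptions:

```python
import random
import statistics

def bootstrap_se_median(data, B=2000, seed=0):
    rng = random.Random(seed)
    n = len(data)
    medians = []
    for _ in range(B):
        resample = rng.choices(data, k=n)  # n draws with replacement
        medians.append(statistics.median(resample))
    # The SD of the bootstrap medians estimates the standard error.
    return statistics.stdev(medians), medians

data = [12, 15, 11, 14, 13, 19, 22, 10, 16, 18, 13, 17]  # hypothetical sample
se_median, boot_medians = bootstrap_se_median(data)
```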

14

Inference & Hypothesis Testing

★ harder

📌 Commonly misunderstood. Know what a p-value and confidence interval actually mean — exams test the precise definition.

↗ Course Note 20

Key concepts

Null hypothesis $H_0$: The "no effect / no difference" claim. Hypothesis tests evaluate evidence against $H_0$.
p-value: Probability of observing data this extreme (or more) if $H_0$ were true. Small $p$ → evidence against $H_0$. NOT the probability $H_0$ is true.
Significance level $\alpha$: Threshold for rejecting $H_0$ (commonly $0.05$). If $p < \alpha$, reject $H_0$.
Type I error: False positive — reject $H_0$ when it's true. Rate $= \alpha$.
Type II error: False negative — fail to reject $H_0$ when it's false. Rate $= \beta$. Power $= 1 - \beta$.
Confidence interval (CI): A range of plausible values for the parameter. A 95% CI means: if we repeated the procedure many times, 95% of such intervals would contain the true parameter.
Bootstrap CI: Compute statistic on many bootstrap samples; use the 2.5th–97.5th percentiles for a 95% CI.
Causation vs. correlation: Observational data cannot establish causation. Only randomized experiments (RCTs) can establish causal effects by controlling confounders.
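A sketch of the percentile bootstrap CI described above, applied to the mean of a small made-up sample (the helper name and `B = 5000` are assumptions):

```python
import numpy as np

def bootstrap_ci_mean(data, B=5000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Each row of idx is one resample of indices, drawn with replacement.
    idx = rng.integers(0, len(data), size=(B, len(data)))
    boot_means = data[idx].mean(axis=1)
    tail = (1 - level) / 2 * 100
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return float(low), float(high)

data = [48, 52, 39, 61, 55, 47, 50, 58, 44, 53]  # hypothetical sample
ci_low, ci_high = bootstrap_ci_mean(data)
```

The interval endpoints are random: a different sample (or seed) gives a different interval, which is why the 95% refers to the procedure rather than to any single interval.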

⚡ Key insight

A 95% confidence interval does NOT mean "there is a 95% probability the parameter is in this interval." The parameter is fixed; the interval is random. 95% refers to the long-run coverage probability of the procedure.

✎ Practice questions

You run a test and get $p = 0.03$ with $\alpha = 0.05$. You conclude there is "a 97% chance the null hypothesis is false." What's wrong with this statement?
Show answer
The p-value is not the probability that $H_0$ is false. $p = 0.03$ means: if $H_0$ were true, you'd see data this extreme 3% of the time. We can reject $H_0$ at $\alpha = 0.05$, but we cannot assign a probability to $H_0$ being true/false (that's a Bayesian concept).
A study finds that people who eat breakfast have higher GPAs. Can we conclude breakfast causes higher GPAs?
Show answer
No. This is observational data. Confounders (e.g. socioeconomic status, sleep habits) might explain both breakfast-eating and GPA. Only a randomized experiment could isolate the effect of breakfast.
You compute a 95% bootstrap CI for the mean: $[42.1,\, 58.3]$. Interpret this interval.
Show answer
We are 95% confident that the true population mean lies between $42.1$ and $58.3$. More precisely: this interval was constructed using a procedure that captures the true mean in 95% of repetitions.

15

SQL

★ harder

📌 SQL questions are very common on finals. Practice tracing through queries by hand.

↗ Course Notes 21–22

Query structure (order matters!)

SELECT   [columns / aggregations]    -- what to return
FROM     [table]                      -- source table
JOIN     [other_table] ON [condition] -- combine tables (optional)
WHERE    [row filter]                 -- filter BEFORE grouping
GROUP BY [columns]                    -- aggregate into groups
HAVING   [group filter]               -- filter AFTER grouping
ORDER BY [columns] ASC/DESC          -- sort result
LIMIT    [n]                          -- cap rows returned

Key concepts

Aggregation functions: COUNT(*), SUM(col), AVG(col), MAX(col), MIN(col).
WHERE vs. HAVING: WHERE filters individual rows before grouping; HAVING filters groups after aggregation.
JOIN types: INNER JOIN = intersection (both tables must match); LEFT JOIN = all left rows + matching right (NULLs for no match); FULL OUTER JOIN = all rows from both tables (NULLs wherever there is no match).
Subqueries: A SELECT inside another SELECT. Can appear in WHERE, FROM, or HAVING clauses.
NULL: NULL = NULL is not true; any comparison with NULL evaluates to NULL (unknown). Use IS NULL / IS NOT NULL, not = NULL.
DISTINCT: SELECT DISTINCT col removes duplicate values.
Pandas ↔ SQL: df.groupby('col').agg() → GROUP BY + aggregation; df[df['col'] > 5] → WHERE; merge → JOIN.

⚡ Key insight

SQL executes in this order: FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. Understanding this order explains why you can't use a SELECT alias in a WHERE clause (SELECT hasn't happened yet).

✎ Practice questions

Write a query to find the average salary per department, but only for departments with more than 5 employees, sorted by average salary descending.
Show answer
```
SELECT department, AVG(salary) AS avg_sal
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_sal DESC;
```
What is the difference between a LEFT JOIN and an INNER JOIN? Give an example of when each is appropriate.
Show answer
INNER JOIN returns only rows with matches in both tables. LEFT JOIN returns all rows from the left table plus matched rows from the right (NULL where no match). Use INNER when you only want complete records; use LEFT when you want to keep all left rows even without matches (e.g., listing all students and their grades — students with no grades should still appear).
What does HAVING COUNT(*) > 1 do, and why can't you replace it with WHERE COUNT(*) > 1?
Show answer
HAVING filters groups after GROUP BY is applied. WHERE runs before GROUP BY and cannot use aggregate functions — COUNT(*) is not defined at the WHERE stage. So WHERE COUNT(*) > 1 is a SQL error.
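To trace the first practice query by hand and then check it, one option is Python's built-in `sqlite3` module with an in-memory table. The rows below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")

# Hypothetical rows: 6 people in Eng, 7 in Sales, 2 in Legal.
rows = (
    [(f"e{i}", "Eng", 100 + i) for i in range(6)]
    + [(f"s{i}", "Sales", 60 + i) for i in range(7)]
    + [(f"l{i}", "Legal", 200) for i in range(2)]
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

query = """
SELECT department, AVG(salary) AS avg_sal
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_sal DESC;
"""
result = conn.execute(query).fetchall()
conn.close()
```

Legal has the highest average salary but only 2 employees, so HAVING removes it after grouping.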

16

Logistic Regression

★ harder

📌 Know the sigmoid function, log-loss derivation, and classification metrics — all commonly tested.

↗ Course Notes 23–24

Key concepts

Goal: Binary classification. Output $P(Y = 1 \mid x) \in (0, 1)$.
Sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$. Maps any real number to $(0, 1)$. $\sigma(0) = 0.5$.
Model: $P(Y = 1 \mid x) = \sigma(x^\top \theta) = \dfrac{1}{1 + \exp(-x^\top \theta)}$.
Decision boundary: $x^\top \theta = 0 \Rightarrow P = 0.5$. Predict $Y = 1$ if $P > $ threshold (default $0.5$).
Loss — Log-loss (cross-entropy): $-\bigl[y \log p + (1 - y)\log(1 - p)\bigr]$. Convex in $\theta$, no closed form → use GD.
Classification metrics:
- Accuracy $= \dfrac{TP + TN}{TP + TN + FP + FN}$
- Precision $= \dfrac{TP}{TP + FP}$ — of predicted positives, how many are correct?
- Recall (Sensitivity) $= \dfrac{TP}{TP + FN}$ — of actual positives, how many did we catch?
- $F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ — harmonic mean
Threshold choice: Lowering threshold → more positives predicted → higher recall, lower precision. Raising → opposite.
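The metric formulas above translate directly into code. The confusion-matrix counts here are hypothetical:

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion-matrix counts.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=130)
```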

Logistic Regression model

$$P(Y = 1 \mid x) = \sigma(x^\top \theta) = \frac{1}{1 + \exp(-x^\top \theta)}$$

Log-loss for one sample

$$L(\theta) = -\Bigl[y \log\sigma(x^\top \theta) + (1 - y) \log\bigl(1 - \sigma(x^\top \theta)\bigr)\Bigr]$$

Gradient of log-loss (averaged over $n$ samples)

$$\nabla_\theta L = \tfrac{1}{n} X^\top (\hat{p} - y) \quad\text{where } \hat{p} = \sigma(X\theta)$$
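A sketch that puts the sigmoid, the mean log-loss, and its gradient together and fits the model by gradient descent. The synthetic data, true coefficients, learning rate, and iteration count are all assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_gradient(theta, X, y):
    # (1/n) * X^T (p_hat - y)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Labels drawn from a Bernoulli model with hypothetical true parameters.
y = (rng.uniform(size=n) < sigmoid(X @ np.array([-0.5, 2.0]))).astype(float)

theta = np.zeros(2)
initial_loss = log_loss(theta, X, y)   # equals log(2) when theta = 0
for _ in range(2000):
    theta = theta - 0.5 * log_loss_gradient(theta, X, y)
final_loss = log_loss(theta, X, y)
```

Because the loss is convex in $\theta$, the point where the gradient vanishes is the global minimum.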

⚡ Key insight

Logistic regression outputs probabilities, not labels. The decision threshold of $0.5$ is a choice — in imbalanced datasets or when false negatives are costly (e.g. disease detection), you might lower the threshold to increase recall at the cost of precision.

✎ Practice questions

A dataset contains 60 actual cancer patients. Your model predicts 50 patients as positive, and 40 of those 50 actually have cancer. Compute Precision and Recall.
Show answer
$TP = 40$ (correctly flagged cancer patients). $FP = 50 - 40 = 10$. $FN = 60 - 40 = 20$ (cancer patients the model missed). Precision $= 40/50 = $ 0.80. Recall $= 40/60 \approx $ 0.667.
Why can't we use MSE as the loss function for logistic regression instead of log-loss?
Show answer
MSE with the sigmoid is non-convex for logistic regression — gradient descent may get stuck in local minima. Log-loss is convex in $\theta$, guaranteeing a global minimum. Additionally, log-loss has a nice probabilistic interpretation as maximum likelihood under a Bernoulli model.
What is $\sigma(-3)$? Is the model predicting class 0 or class 1 at this point (with threshold $0.5$)?
Show answer
$\sigma(-3) = \dfrac{1}{1 + e^{3}} \approx 0.047$. Since $0.047 < 0.5$, predict class 0.

17

Clustering

★ harder

📌 Focus on K-Means mechanics, inertia, and choosing k — these are most exam-tested.

↗ Course Note 25

Key concepts

Unsupervised learning: No labels. Goal: find structure/groupings in the data.
K-Means algorithm:
1. Initialize k centroids (randomly or with k-means++).
2. Assign each point to the nearest centroid.
3. Update each centroid to the mean of its assigned points.
4. Repeat 2–3 until centroids stop moving.
Inertia (within-cluster sum of squares): $\displaystyle \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, summed over all points in each cluster. K-Means minimizes this.
Choosing $k$ — Elbow method: Plot inertia vs. $k$; look for the "elbow" where adding more clusters gives diminishing returns.
Limitations of K-Means: Assumes spherical/equally-sized clusters; sensitive to initialization; must specify $k$ in advance; can get stuck in local optima.
Hierarchical clustering: Agglomerative (bottom-up): start with each point as its own cluster, merge closest pairs. Results in a dendrogram.
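The four K-Means steps listed above can be sketched in NumPy. This version uses plain random initialization (not k-means++), and the function name and toy data are hypothetical:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random init
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # step 4: stop when nothing moves
            break
        centroids = new_centroids
    # Final assignment and inertia for the returned centroids.
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, inertia

rng = np.random.default_rng(1)
# Two well-separated hypothetical blobs in 2D.
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)), rng.normal(5, 0.3, size=(50, 2))])
centroids, labels, inertia = k_means(X, k=2)
```

Running this with several seeds and keeping the lowest-inertia run is the usual guard against a poor initialization.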

⚡ Key insight

Inertia always decreases as $k$ increases — at $k = n$, every point is its own cluster with inertia $= 0$. So lower inertia alone doesn't mean a better model. The elbow heuristic looks for meaningful structure, not just minimizing inertia.

✎ Practice questions

You run K-Means with $k = 3$ on 2D data. After convergence, describe what the centroids represent.
Show answer
Each centroid is the mean (average position) of all points assigned to that cluster. It represents the "center" of its cluster in the 2D feature space.
K-Means gives different results on different runs. Why? How do you mitigate this?
Show answer
K-Means is sensitive to initial centroid placement and can converge to different local optima. Mitigate by: running K-Means multiple times with different random initializations and picking the run with the lowest inertia; or use K-Means++ initialization, which spreads out initial centroids.
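
A small sketch of why restarts help, assuming four made-up points at the corners of a wide rectangle. To keep the outcome reproducible, the two "random" initializations are written out explicitly; `lloyd` is a hypothetical helper, not a library function.

```python
import numpy as np

def lloyd(X, init_centroids, n_iters=100):
    """Run K-Means from the given starting centroids; returns (centroids, labels, inertia)."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its old centroid.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = float(np.sum((X - centroids[labels]) ** 2))
    return centroids, labels, inertia

X = np.array([[0, 0], [0, 1], [4, 0], [4, 1]], dtype=float)

inits = {
    "both starts on the left edge": X[[0, 1]],
    "one start on each side": X[[0, 2]],
}
results = {name: lloyd(X, init)[2] for name, init in inits.items()}
for name, value in results.items():
    print(name, value)
# both starts on the left edge 16.0  (stuck splitting top vs. bottom)
# one start on each side 1.0         (splits left vs. right)

best = min(results.values())  # keep the run with the lowest inertia
```

Both runs converge, but only one reaches the lower-inertia solution, which is why the standard remedy is to keep the best of several runs.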

18

PCA — Principal Component Analysis

★ harder

📌 PCA is math-heavy. Focus on the intuition, SVD connection, and explained variance — most exam Qs test these.

↗ Course Note 26

Key concepts

Goal: Reduce dimensionality — project high-dimensional data onto fewer dimensions while preserving as much variance as possible.
Principal Components (PCs): New orthogonal axes. PC1 captures the most variance; PC2 the second most (orthogonal to PC1); etc.
PCs are eigenvectors of the covariance matrix $\tfrac{1}{n} X^\top X$ (with $X$ centered). Each eigenvalue is the variance of the data along its PC.
SVD connection: $X = U \Sigma V^\top$. Columns of $V$ are the principal components (directions). $\Sigma$ contains the singular values $\sigma_i$; the covariance eigenvalues are $\lambda_i = \sigma_i^2 / n$.
Explained variance ratio: Fraction of total variance captured by $k$ PCs $= \dfrac{\sum_{i=1}^{k} \lambda_i}{\sum_i \lambda_i}$.
Scree plot: Plot explained variance vs. component number. Elbow suggests how many PCs to keep.
Preprocessing: Always center (subtract mean) before PCA. Often scale (standardize) too, especially if features have different units.
Use cases: Visualization (project to 2D), noise reduction, removing multicollinearity before regression.
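
The link between the covariance eigen-decomposition and the SVD can be checked numerically. This is a sketch on a small made-up dataset, using the $1/n$ convention for the covariance; all variable names are illustrative.

```python
import numpy as np

X_raw = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0],
                  [0.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
X = X_raw - X_raw.mean(axis=0)   # center each feature
n = X.shape[0]

# Route 1: eigen-decomposition of the covariance matrix.
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(eigvals)     # variance along each PC
print(s**2 / n)    # same numbers, from the singular values
```

The eigenvectors and the rows of `Vt` describe the same directions, but either route may flip the sign of a direction, so compare them up to sign.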

SVD Decomposition

$$X = U \Sigma V^\top$$

$U$: $n \times n$ orthogonal — left singular vectors
$\Sigma$: $n \times p$ diagonal with $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$
$V$: $p \times p$ orthogonal — right singular vectors $=$ PCs

Variance explained by $\text{PC}_k$

$$\frac{\sigma_k^2}{\sum_i \sigma_i^2}$$

Projection onto first $k$ PCs

$$Z = X \cdot V_k \quad (n \times k \text{ matrix of scores})$$
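
The three formulas above, sketched end to end in NumPy on synthetic data (the data, seed, and variable names are assumptions for illustration). `full_matrices=False` asks for the compact SVD, so `U` is $n \times p$ rather than the $n \times n$ shape listed above; the singular values and $V$ are unchanged.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: 50 rows, 3 features with very different spreads.
X_raw = rng.normal(size=(50, 3)) * np.array([5.0, 2.0, 0.5])
X = X_raw - X_raw.mean(axis=0)            # center first

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Variance explained by each PC: sigma_k^2 / sum_i sigma_i^2
explained_ratio = s**2 / np.sum(s**2)

# Projection onto the first k PCs: Z = X V_k
k = 2
V_k = Vt[:k].T                            # p x k, columns are PC directions
Z = X @ V_k                               # n x k matrix of scores

print(explained_ratio)
print(Z.shape)                            # (50, 2)
```

Since $XV = U\Sigma$, the same scores can also be read off as the first $k$ columns of $U$ scaled by their singular values.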

⚡ Key insight

PCA finds the directions of maximum variance. PC1 is the direction along which the data varies the most. Projecting onto PC1 gives a 1D summary that preserves the most information (in the variance sense). PCs are always orthogonal to each other.

✎ Practice questions

You run PCA on a 100-feature dataset and find the first 3 PCs explain 85% of the variance. What does this mean, and why is it useful?
Show answer
It means 85% of the variability in the 100-dimensional data can be captured in just 3 dimensions. This is useful for: (1) visualization in 2D/3D, (2) reducing computation for downstream models, (3) removing noise (the remaining 15% may be noise), and (4) avoiding the curse of dimensionality.
Why must you center (subtract the mean) from each feature before applying PCA?
Show answer
PCA finds directions of maximum variance. If you don't center, the first PC would point toward the mean of the data (the direction of largest magnitude, not largest variance). Centering ensures PCA captures variance structure, not just the data's offset from the origin.
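
A small made-up example of this effect: the points sit near $(10, 10)$ but vary only along the direction $(1, -1)$. The helper `first_direction` is illustrative.

```python
import numpy as np

t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
X_raw = np.column_stack([10 + t, 10 - t])   # mean is (10, 10)

def first_direction(M):
    """First right singular vector of M (sign is arbitrary)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

v_uncentered = first_direction(X_raw)
v_centered = first_direction(X_raw - X_raw.mean(axis=0))

mean_dir = np.array([1.0, 1.0]) / np.sqrt(2)     # direction toward the mean
spread_dir = np.array([1.0, -1.0]) / np.sqrt(2)  # direction the data varies in

# Absolute values, because singular vectors are only defined up to sign.
print(abs(v_uncentered @ mean_dir))   # close to 1: points at the mean
print(abs(v_centered @ spread_dir))   # close to 1: points along the spread
```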
Two features $x_1$ and $x_2$ are perfectly correlated ($x_2 = 3 x_1$). How many non-trivial PCs will PCA find?
Show answer
Only 1. The data lies on a 1D line in 2D space. The covariance matrix has rank 1, so only one non-zero eigenvalue — PC1 points along the line $x_2 = 3 x_1$. The second PC has eigenvalue 0 (no variance).
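
This can be confirmed numerically with made-up values for $x_1$ (a sketch; the variable names are illustrative):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_raw = np.column_stack([x1, 3 * x1])     # x2 = 3 * x1 exactly
X = X_raw - X_raw.mean(axis=0)

_, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.linalg.matrix_rank(X))   # 1
print(s)                          # second singular value is ~0 (floating-point noise)
print(Vt[0])                      # parallel to (1, 3) / sqrt(10), up to sign
```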

Loss functions

$$\begin{aligned} \mathrm{MSE} &= \tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2 && \text{optimal constant} \to \text{mean} \\ \mathrm{MAE} &= \tfrac{1}{n}\sum_i \lvert y_i - \hat{y}_i\rvert && \text{optimal constant} \to \text{median} \\ \text{Log-loss} &= -\tfrac{1}{n}\sum_i \bigl[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\bigr] \end{aligned}$$
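
A quick numeric check of the "optimal constant" notes, using a made-up sample with one outlier and a simple grid search over candidate constants (a sketch, not an efficient method):

```python
from statistics import mean, median

y = [1, 2, 3, 4, 100]   # made-up data with one outlier

def mse(c):
    return sum((yi - c) ** 2 for yi in y) / len(y)

def mae(c):
    return sum(abs(yi - c) for yi in y) / len(y)

# Candidate constants 0.0, 0.1, ..., 100.0
candidates = [i / 10 for i in range(0, 1001)]
best_mse = min(candidates, key=mse)
best_mae = min(candidates, key=mae)

print(best_mse)   # 22.0, the mean
print(best_mae)   # 3.0, the median
```

The outlier drags the MSE-optimal constant up to 22 while the MAE-optimal constant stays at 3, which is why MAE is described as more robust to outliers.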

Linear models

$$\begin{aligned} \text{SLR:} &\quad \theta_1 = r \cdot \tfrac{\sigma_y}{\sigma_x}, \quad \theta_0 = \bar{y} - \theta_1 \bar{x} \\ \text{OLS:} &\quad \hat{\theta} = (X^\top X)^{-1} X^\top y \\ \text{Ridge:} &\quad \hat{\theta} = (X^\top X + n\lambda I)^{-1} X^\top y \\ \text{GD:} &\quad \theta \leftarrow \theta - \alpha \cdot \nabla L(\theta) \end{aligned}$$
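
The four formulas can be cross-checked on a small made-up dataset. This sketch assumes the MSE objective $\tfrac{1}{n}\lVert y - X\theta\rVert^2$ and, following the Ridge formula exactly as written, penalizes the intercept too. The data, `lam`, `alpha`, and the iteration count are illustrative choices.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# SLR: slope from the correlation, intercept from the means.
r = np.corrcoef(x, y)[0, 1]
theta1 = r * y.std() / x.std()
theta0 = y.mean() - theta1 * x.mean()

# OLS: normal equations with an intercept column.
X = np.column_stack([np.ones(n), x])
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: same system with n * lambda added to the diagonal.
lam = 0.1
theta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(2), X.T @ y)

# GD on the MSE loss: gradient is (2/n) X^T (X theta - y).
theta_gd = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ theta_gd - y)
    theta_gd = theta_gd - alpha * grad

print(theta0, theta1)
print(theta_ols)
print(theta_ridge)
print(theta_gd)
```

SLR, OLS, and gradient descent agree on the same coefficients, while Ridge returns a coefficient vector with a smaller norm.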

Probability

$$\begin{aligned} E[aX + bY] &= aE[X] + bE[Y] && \text{(always)} \\ \mathrm{Var}(aX + b) &= a^2\,\mathrm{Var}(X) \\ \mathrm{Var}(X + Y) &= \mathrm{Var}(X) + \mathrm{Var}(Y) && \text{(if independent)} \\ \mathrm{Var}(X) &= E[X^2] - (E[X])^2 \\ \text{Bernoulli:} &\quad E = p,\; \mathrm{Var} = p(1 - p) \\ \text{Binomial:} &\quad E = np,\; \mathrm{Var} = np(1 - p) \end{aligned}$$
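
These identities can be verified exactly on small distributions using rational arithmetic. A sketch with assumed values $p = 3/10$, $n = 4$, $a = 2$, $b = 5$; the helpers `expectation` and `variance` are illustrative.

```python
from fractions import Fraction
from math import comb

def expectation(dist):
    """E[X] for a distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E[X^2] - (E[X])^2."""
    mu = expectation(dist)
    return sum(x * x * p for x, p in dist.items()) - mu ** 2

p = Fraction(3, 10)
bernoulli = {0: 1 - p, 1: p}

n = 4
binomial = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

# Distribution of aX + b when X is the binomial above.
a, b = 2, 5
shifted = {a * x + b: prob for x, prob in binomial.items()}

print(expectation(bernoulli), variance(bernoulli))   # 3/10 21/100
print(expectation(binomial), variance(binomial))     # 6/5 21/25
print(variance(shifted))                             # 84/25
```

The binomial variance is $n$ times the Bernoulli variance because a binomial is a sum of $n$ independent Bernoulli trials.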

Bias-Variance

$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta})$$
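
For reference, a short derivation of this identity, assuming $\theta$ is a fixed (non-random) parameter, $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$, and $\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$:

$$\begin{aligned} \mathrm{MSE}(\hat{\theta}) &= E\bigl[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2\bigr] \\ &= E\bigl[(\hat{\theta} - E[\hat{\theta}])^2\bigr] + 2\,(E[\hat{\theta}] - \theta)\,E\bigl[\hat{\theta} - E[\hat{\theta}]\bigr] + (E[\hat{\theta}] - \theta)^2 \\ &= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \end{aligned}$$

The cross term vanishes because $E\bigl[\hat{\theta} - E[\hat{\theta}]\bigr] = 0$.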

Classification

$$\begin{aligned} \sigma(z) &= \tfrac{1}{1 + e^{-z}}, \quad \sigma(0) = 0.5 \\ \text{Precision} &= \tfrac{TP}{TP + FP} \\ \text{Recall} &= \tfrac{TP}{TP + FN} \\ F_1 &= \tfrac{2 \cdot P \cdot R}{P + R} \end{aligned}$$
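
A worked example of the three metrics on made-up labels (the label lists and variable names are illustrative):

```python
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)          # 3 1 2
print(precision, recall)   # 0.75 0.6
print(round(f1, 3))        # 0.667
```

Precision asks "of the predicted positives, how many were right?" and recall asks "of the actual positives, how many were found?".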

PCA / SVD

$$X = U \Sigma V^\top, \qquad \text{variance explained by } \text{PC}_k = \tfrac{\sigma_k^2}{\sum_i \sigma_i^2}$$

Contents

Pandas

EDA — Exploratory Data Analysis

Regular Expressions (Regex)

Visualization

Sampling

Modeling & SLR

Ordinary Least Squares (OLS)

Gradient Descent

Feature Engineering

Cross-Validation

Regularization

Random Variables

Bias & Variance

Inference & Hypothesis Testing

SQL

Logistic Regression

Clustering

PCA — Principal Component Analysis