possibly-wrong math lessons

on Some Fun 'Paradoxes':

Sand Heap Paradox/Tragedy of the Commons
Unexpected Hanging/Bottle Imp Paradoxes
"Guess the Second-Highest Number" Party Game: everyone chooses a natural number up to 100, and the winner is the one who ends up choosing the second-largest... but, assuming optimal and self-consistent gameplay, the only rational outcome is for everyone to choose 0. Instead, nobody will.
Horse Color Paradox
Euler's identity
Friendship Paradox
Curses of Dimensionality (e.g. suppose you coat the interior walls of an n-dimensional cubic room with an arbitrarily thin layer of paint... as n goes to infinity, the livable volume of the cube (that is, the room left over after you finish painting it) vanishes)
A drunk ant will eventually find his way home, but a drunk bird will eventually fly away forever (a random walk in 2 dimensions is recurrent, but in 3 dimensions is not)
The series of naturals sums to -1/12 (this is not just an artifact of arithmetic... it shows up in real-world physics!)
Monty Hall vs Two-Envelope Paradox
Simpson's Paradox
St Petersburg Paradox (this can be 'resolved' by being a median- or minimaximizer i.e. appraising the 'fair value' of a game to be its median (rsp minimum, or even any other quantile/percentile) oucome, but this isn't consistent with the von-Neumann-Morgenstern utility axioms... i like quantile-based (vs z-score-based) FV metrics because they don't assert that any moments must exist e.g. a 75th-percentile-maximizer would pay exactly $1 to play a game where they 'win' a Cauchy-distributed random amount of money, whereas a +1-STD-maximizer wouldn't even know where to begin... but quantile-based approaches have the weakness of by-definition ignoring moments like variance, skew, kurtosis, etc e.g. a median-maximizer is indifferent between (a) winning $0 surely, vs (b) winning $0 with 51% probability and $1 otherwise... on a pseudorelated note, i don't think that loss aversion/lottery preference is totally irrational, (a) once you get used to something there's a tremendous friction/switching cost associated with downgrading, (b) if your utility is sigmoid w.r.t. wealth and you're on the low flat part then it might be worth risking an approximately-flat move down if the potential payoff is a spike upward to the high flat part)
The Gaussian shows up everywhere, from statistics (Central Limit Theorem and its derivatives like asymptotic dist'n of MLE) to quantum mechanics (minimum-uncertainty wavepacket) and beyond
The evil Cauchy distribution (simulate a billion random draws... the sequence of partial means won't converge)
Coupling from the Past... not really a paradox, other than that in its simplest form (as treated by e.g. Introduction to Stochastic Processes with R), by the time you finish every step has been so obvious that the profundity of the result sorta blindsides you
Neyman-Scott Problem
Inconsistency of the MLE in Gaussian mixture models and beyond
Stein's Paradox (but note the paradox disappears if we set variance to zero).. and while we're on the subject, both Stein's Paradox and the non-recurrence of a random walk (from above) happen at the jump from 2 dimensions to 3. What's so special about that jump in statistics? Sure, it's only two examples so far. But one of you can find a third and then we've got a conspiracy theory on our hands.
Voltmeter Story
Miller JB & Sanjurjo A, 2019 Feb: Surprised by the Hot Hand Fallacy? A Truth in the Law of Small Numbers (link)... authors' footnotes 6/7 try to reconcile it... 2 random trivia points for their illustration with 3 flips (Table 1 from the paper): (a) the distribution of recorded "% heads" is terribly non-Normal, (b) half the underlined letters are 'H'... i think this one's reconciliation will have flavor of length-biased sampling and/or friendship-paradox sampling... compare also Prof Blizstein's Mixed Practice example: if we play a game where we just flip a coin repeatedly and the moment we see 'HH' I immediately win whereas the moment we see 'TH' you immediately win, then you're more likely to win... i think the intuition there was, if we see 'H' on the first flip then I'm halfway progressed to winning, but then if we see 'T' on the next flip not only do I not win but also my progress is destroyed and now you've made halfway progress to winning... whereas if we see 'T' on the first flip then you're halfway to winning, and then if we see 'H' next you've won but even if we see 'T' then you've again made halfway progress to winning
#!TODO This one is less a paradox than just something I need to reconcile for myself. Usually, we say that levering an investment by e.g. 10x simply multiplies all your (excess) returns by 10x. Obviously, this can drive you completely out of business and even lead you to lose more than you started with. But suppose the underlying returns -20% then +20%... your returns aren't -200% then +200%. There's some asymmetry/ "regime switch" the first time you dip to zero or below (in fact, even if the underlying returns -10% then +10%, your returns aren't -100% then +100%).
#!TODO Another puzzle. Consider the simple case where the asset's return and your return-predicting signal are i.i.d. bivariate Normal with some fixed (positive) correlation (the return and the signal obviously aren't i.i.d. on the same day, the random vector consisting of both of them is i.i.d. over time). To clarify the setup: every day god draws r,s from a BVN distribution; you don't get to observe r but you do get to observe s, and because they're correlated, you can get information about r from s, which you can use to size your bet. At the end of the day, you observe r, and hence realize some profit (or loss). Notice that in this case risk weight is always just some fixed scalar multiple of notional weight, since the vol is constant over time. Suppose your objective is to maximize ex-post i.e. realized Sharpe (that is, at the end of the year you will calculate the empirical mean and standard deviation of your daily PNL, and you want to maximize that... this is different from ex-ante Sharpe, for example even if your investment asset's return was constant every day i.e. your daily ex-ante Sharpe was infinite at each point in time, you could easily choose to vary your leverage according to your mood, and then you would not realize an ex-post Sharpe of infinity). Then, the optimal (or at least, a very good) strategy is to set your daily view/position equal to the value of the signal (obviously in real life you'd set a long-term volatility or leverage target, but this just scales the strategy up or down without affecting the Sharpe). UNLESS... the correlation is perfect, in which case the optimal strategy is to set your position to be the RECIPROCAL of the signal value each day... exactly the opposite recommendation! Because if you set your position equal to the signal value each day your daily PNL is basically an i.i.d. draw from a standard Chi-square distribution i.e. has Sharpe 1/sqrt(2), but if you set your position equal to the reciprocal of the signal value each day your PNL is always just +100%, +100%, +100%, and so on. Two hints that might help resolve this: First, if the correlation is zero, then no strategy exists with a Sharpe better than zero. The second hint has two parts: (a) the Normal reciprocal distribution has no moments, which means that the distribution of your daily position has no mean, variance, etc... but this seems a little irrelevant, because what you're really interested in is the distribution of your daily PNL, which is just a point mass of unity at +100%; (b) I fibbed a bit just now, because there's technically a pathological outcome where the signal value is zero, in which case your PNL is undefined because you set your position to be 1/0 then hit that with a return of +0% which gives 0/0... which seems philosophically similar to the blowup at zero of the Normal product distribution. # Update: Resolved! If we twist our minds for a moment and consider this as not a multi-period single-asset investment problem, but a single-period multi-asset one. Each "day" is an independent constant-vol asset, and we want to allocate to each "asset". Markowitz tells us the optimal strategy is to allocate risk to each proportional to its ex-ante Sharpe. HOWEVER -- although all the assets have the same vol, we have to realize that given this correlation structure, that vol is NOT simply the raw asset's marginal vol: It is the raw asset's CONDITIONAL vol GIVEN the signal. In particular, if each raw asset is unit-vol, then its CONDITIONAL vol GIVEN the signal is sqrt(1-rho**2). This is fine, because its ER is simply the signal itself, so given that every asset has the same vol, allocating proportional to ER is exactly what Markowitz says to do. BUT... when we hit rho=1, the conditional vol vanishes! Thus, each asset becomes infinite-Sharpe, and so in a weird way, the consistent way to take Markowitz's advice... is to allocate equally to them all! The one wrinkle here is that I haven't proven that the solution to the T-period 1-asset problem is the same as the solution to the 1-period T-asset problem. In the former, you actually get to observe information as time passes, hence can adjust your future strategy depending on how things are going. But that seems like an interesting but separate problem of its own (albeit one whose solution may shed even further light on this insight).

on Some Biohazards (a la Joe Blitzstein) in Simple Research, Which I Learned the Hard Way:

Confusing definitions/identities (e.g. MC = (∂/∂Q)TC) with optimality conditions (e.g. P = MR = MC, which is true exactly at PCLRE)
Not enforcing strict train/tune/test dataset splits
Setting distractingly unrealistic simulation conditions (e.g. 10 Sharpe LOL)
Not seeding your RNG
Not saving raw intermediate results (e.g. simulated values, model parameters, training error)

On the base-1 representation of natural numbers:

We often define "a sequence of `1`s whose length is `n`" as the base-1 representation of `n`. This is a bit inconsistent, as for every other base B, the allowable digits are `0,...,B-1`. In the base-1 case, this should mean that the only allowable digit is `0`. Furthermore, in every other base B, the length (up to the highest nonzero digit) of the sequence required to represent `n` is of the order `log_B_n = log(n) / log(B)`. In the base-1 case, this should mean that the length of the sequence required to represent `n` is of the order `log(n) / log(1)`... but `log(1)=0`, so the length of every "number" should be infinite! And indeed, it is. In "true" base-1, you are allowed only the digit `0`, and hence can never represent any other number. So, we "cheat" and say: Well, in the base-1 number system, the `b`th digit is worth `1^b = 1`: That is, every digit is worth unity. So, we might as well just fill up the sequence with ones, up until we have `n` ones. And it's most convenient to group them together (e.g. 2 in base 10 is more conveniently represented as 11 in base 1 rather than 101 or 1001 or 110), so we should disallow "separated" ones. In fact, why not do away with the digit `0` altogether, and just let the empty sequence represent 0? Hence our usual definition of the base-1 representation.

on Volume vs Surface Area in Calculus:

I never really understood why we could use cylinders to approximate the volume of a solid of revolution, but had to use frustums to approximate the surface area. Watching VCubingX's video on Gabriel's Horn, I finally decided to look into it. Turns out, it's because volumes are well-behaved under small deformations of n-dimensional regions in d-dimensional space, but surface areas are not. (Note that in 2D, volume is enclosed area, and surface area is perimeter.) What that means wasn't clear to me until somebody gave the following example: Consider a 2D right triangle with legs 1 and hypotenuse sqrt(2), and approximate it with successively granular square pixels. The area covered by the pixels will be essentially indistinguishable from the area covered by the triangle once you get fine enough, but the perimeter of the region covered by pixels will always be 4: exactly 1 along the legs, but always 2 along the "staircase" or "taxicab path" approximating the hypotenuse. Fractals are another good example, mentioned in the video but not explicitly treated in the foregoing way.

on Fourier Decompositions:

The Fourier decomposition is possible because sines and cosines (or, more elegantly, complex exponentials) form an orthonormal basis of the space of periodic functions: That is: The inner product of each element with itself is unity (they are "normal", as in "normalized"); The inner product of each element with a different element is zero (they are "ortho", as in "orthogonal", generalizing the definition of "orthogonal" from perpendicular vectors in Real space); and the members together are "complete" AKA "span the space" (they are a "basis"). The Fourier decomposition is useful for a variety of reasons, none of which I am qualified to discuss.

I've seen many cute visualizations of Fourier series as "wrapping a function around a circle", or "tracing out a function using a marker on a bicycle wheel", but that doesn't really help me understand why it works. Indeed that's one of my recurring gripes with many "cute" mathematical proofs (as used here, "cute" is a technical term): They're refreshing and delightful, but ultimately almost by-definition useless for gaining intuition. A proof is "cute" exactly when it combines a couple well-known results as building blocks to quickly and elegantly prove something new, which means that by definition it is unintuitive: Everybody knows the building blocks, so if the proof were intuitive, they wouldn't need to see it. They'd just immediately accept the final result as true. Maybe I'm just whining here.

Clearly, Fourier decompositions are related to Taylor decompositions (more commonly known as series expansions) but the intuition behind the coefficient formulas are not the same. (By the way, I still have to think about Taylor coefficients. Consider x^4 - x^2.. how can the derivatives at x=0, where this looks exactly like -x^2, predict that the curve is going to suddenly reverse course as you go to x = +/-infinity, and shoot up toward y=infinity? These links might help: one (archive), two (archive).)

In any case, here's a way I learned to think of Fourier decompositions that I think is truly intuitive. For simplicity, I will use the Real cos+sin series form, even though it's clearly less elegant than the complex exponential series form. I will also assume that the original function is periodic with period 2pi and consider the interval [-pi, pi]. Finally, I will use the "normalized" form: For example, instead of "cos(nx)" terms, I will use "cos(nx)/sqrt(pi)" terms. This is because the latter are normal, so even though they themselves are a bit uglier than the former, they make the formula for the Fourier coefficients prettier and more intuitive. In the usual form, the Fourier coefficients have to do double-work: They have to decide the strength of the agreement between each Fourier function and the original function, and then they also have to normalize that measurement. (Convince yourself that cos(nx)/sqrt(pi) is normalized over [-pi, pi]: To solve the integral, do a u-sub then use the double-angle trig identity.)

Now, let's ignore the constant term (AKA cos(0) term) and sine terms (if it makes you feel better, assume that f is levelled to have average value zero so that the constant term drops out, and further that f is an even function so that all the sine terms drop out) and just consider trying to write f(x) as an infinite sum of the form "a_1 cos(x)/sqrt(pi) + a_2 cos(2x)/sqrt(pi) + a_3 cos(3x)/sqrt(pi) + ..". What we need to do is figure out how much of each cosine term we need: a_1 tells how much cos(x)/sqrt(pi) we need, a_2 tells how much cos(2x)/sqrt(pi) we need, etc.

That's exactly where the formula comes from: a_n is the definite integral of cos(nx)/sqrt(pi) multiplied by f(x). This formula "zips them up". Multiplying the functions by each other then taking the integral tells us how much they agree with each other: If they overlap a lot (they have peaks in the same places), the definite integral will be big, but if they're "on the wrong phase" (one tends to peak when the other is very close to zero, and the other tends to peak when the one is very close to zero) they're gonna cancel out ("destructively interfere!") and the coefficient will be very small. Notice that if f(x) is cos(x)/sqrt(pi) then a_1 will be unity, and every other coefficient will be zero (orthogonality of the basis!).. we recover exactly the same function! Because in this case f(x) exactly agrees with cos(x)/sqrt(pi). If f(x) is 2cos(x)/sqrt(pi), then a_1 will be 2, and again every other coefficient will be zero.. in this case, the function we're analyzing "doubly agrees" with cos(x)/sqrt(pi), that's how strong the agreement is. And if we try to analyze f(x) = -cos(x)/sqrt(pi), a_1 will be -1 and every other coefficient will be zero, since this function "negatively agrees" with cos(x)/sqrt(pi).

In fact, stat majors will recognize that this is exactly the same idea as estimating the correlation (really, the covariance) between two (long-term zero-mean) stochastic processes! In this setup, each stochastic process is run from time t=0 to time T=2pi. For example, the first cosine basis function can be seen as the realization of an Ornstein–Uhlenbeck process. (There is also a calculation called "correlation" (a cousin of "convolution") in signal processing, but that's not directly related to this. You can use correlation and convolution to to find the PDF of sums of random variables.)

This can almost certainly also be framed in terms of cosine-similarity between the target function and each basis function in Hilbert space, but I need to work out the math.

on What to Expect:

Suppose a random variable "X" crystallizes to some value "x" that is not it's expected value "mu". This might not be unexpected. For example, suppose the random variable is the roll of a fair die. Then "mu = 3.5", which is of course impossible, and usually you won't even be close. On the other hand, suppose the random variable is the number of faces on a randomly-selected fair die. Then "mu = 6" is the only possibility. For this reason, it's often more relevant to consider interquartile range or some other "reasonable" predictive interval to figure out what to expect, or p-values to figure out what NOT to expect. Loosely related is the phenonemon from economics that maximizing the expectation of a concative utility function "U(W)" (where the maximization is done over a choice of probability distribution for W, which formally represents some "investment option" or "gamble") isn't always the same as maximizing the expectation of the wealth itself, which is a consequence of Jensen's inequality.

on the Variance of the Product of Random Variables:

Suppose $X$, $Y$ are random variables (whose means, variances, and covariances exist). Then, \[ \Var(XY) = (\Var(X) + \E^2(X))(\Var(Y) + \E^2(Y)) + \Cov(X^2, Y^2) - (\Cov(X, Y) + \E(X)\E(Y))^2 .\] Pf: \[ \Var(XY) := \E((XY - \E(XY))^2) = \E((XY)^2) - \E^2(XY) = \E(X^2Y^2) - \E^2(XY) \] \[ = \E(X^2Y^2) - (\E(XY))^2 = \E(X^2Y^2) - (\E(XY) - \E(X)\E(Y) + \E(X)\E(Y))^2 = \E(X^2Y^2) - (\Cov(X, Y) + \E(X)\E(Y))^2 \] \[ = \E(X^2Y^2) - \E(X^2)\E(Y^2) + \E(X^2)\E(Y^2) - (\Cov(X, Y) + \E(X)\E(Y))^2 \] \[ = \E(X^2)\E(Y^2) + \E(X^2Y^2) - \E(X^2)\E(Y^2) - (\Cov(X, Y) + \E(X)\E(Y))^2 \] \[ = \E(X^2)\E(Y^2) + \Cov(X^2, Y^2) - (\Cov(X, Y) + \E(X)\E(Y))^2 \] \[ = (\E(X^2) - \E^2(X) + \E^2(X))(\E(Y^2) - \E^2(Y) + \E^2(Y)) + \Cov(X^2, Y^2) - (\Cov(X, Y) + \E(X)\E(Y))^2 \] \[ = (\Var(X) + \E^2(X))(\Var(Y) + \E^2(Y)) + \Cov(X^2, Y^2) - (\Cov(X, Y) + \E(X)\E(Y))^2 ,\] QED.

on Expectations of Products of Squares of Weakly-Correlated Bivariate-Normal Random Variables:

Suppose $X,Y$ are jointly bivariate-Normal with $\rho \approx 0$. Then, $\E(X^2Y^2) - \E^2(XY) \approx \E(X^2)\E(Y^2)$. Pf: We use a property of the BVN distribution (a special case of a more general result from Gaussian linear regression) to write \[ Y = cX + \varepsilon ,\] where $c = \frac{\sigma_Y}{\sigma_X}\rho =: k\rho$ is a constant and $\varepsilon \perp X$ is an independent (of $X$) zero-mean noise term. Then, \[ \E(X^2Y^2) = \E(X^2(cX+\varepsilon)^2) = \E(X^2(c^2X^2 + 2cX\varepsilon + \varepsilon^2)) = \E(c^2X^4 + 2cX^3\varepsilon + X^2\varepsilon^2) \] whence by Linearity of Expectation and independence of $\varepsilon \perp X$, \[ = c^2\E(X^4) + \E(X^2)\E(\varepsilon^2) = (k\rho)^2\E(X^4) + \E(X^2)\E(\varepsilon^2) \approx 0 + \E(X^2)\E(\varepsilon^2) = \E(X^2)\E(\varepsilon^2) .\] Furthermore, \[ \E^2(XY) = \E^2(X(cX+\varepsilon)) = \E^2(cX^2 + X\varepsilon) = c^2\E^2(X^2) = (k\rho)^2\E^2(X^2) \approx 0 .\] And lastly, \[ \E(X^2)\E(Y^2) = \E(X^2)\E((cX+\varepsilon)^2) = \E(X^2)\E(c^2X^2 + 2cX\varepsilon + \varepsilon^2) \] \[ = \E(X^2)(c^2\E(X^2) + \E(\varepsilon^2)) = c^2\E^2(X^2) + \E(X^2)\E(\varepsilon^2) = (k\rho)^2\E^2(X^2) + \E(X^2)\E(\varepsilon^2) \] \[ \approx 0 + \E(X^2)\E(\varepsilon^2) = \E(X^2)\E(\varepsilon^2) .\] Immediately, QED.

P.S. I've often wanted to show that in general $|\rho| \ll 1 \implies \Cov(X^2, Y^2) \approx \Cov^2(X, Y)$, but it's not easy to determine how well this holds. For example, consider two random variables $X,Y$ where in one state of the world $X=0$ and $Y \sim \textrm{Unif}(0, 1)$ (with probability $9/10$ for this state), and otherwise $X=2=Y$ (with of course probability $1/10$ for this state). Then, the correlation between $X^2,Y^2$ is actually much stronger than the correlation between $X,Y$.

on a Weird Correlation Regime:

At least, it's weird to me. Consider a stream of iid standard Normals, $(Z_t)$ for $t$ from $0$ to $\infty$. Then consider a stream of random variables $(X_t)$ for $t$ from $1$ to $\infty$, defined as $X_t := Z_{t-1} + Z_t$. Then, obviously $Z_t$ is uncorrelated with $Z_{t+1}$. But, $X_t$ is positively correlated with $X_{t+1}$, since both include $Z_t$... and $X_{t+1}$ is positively correlated with $X_{t+2}$... yet $X_t$ is (unconditionally, i.e. marginally) uncorrelated with $X_{t+2}$! Because $X_t$ is the sum of $Z_{t-1}$ and $Z_{t}$, and $X_{t+2}$ is the sum of $Z_{t+1}$ and $Z_{t+2}$. (I think if you condition on $X_{t+1}$, then $X_t$ and $X_{t+2}$ are negatively correlated, because e.g. if you know $X_{t+1}$ is very large then you can guess that $X_{t+2}$ might also be very large because they share the $Z_{t+1}$ component, but then if you also know that $X_t$ was very large, then you can guess that $X_{t+1}$'s largeness was probably driven by $X_t$'s shared $Z_t$ component, and the $Z_{t+1}$ was probably just average.) /* EDIT: why was this so mind-blowing to me? it's just saying that if X=A+B and Y=B+C and Z=C+D then X is correlated with Y and Y is correlated with Z but X is not correlated with Z... */

Also, for completeness, observe that the conditional distribution $X_{t+1} | X_t$ is, if my intuition serves me correctly, Normal with mean $X_t / 2$ and variance unity.

on Estimation:

This is simple but profound. The "usual" finite-sample standard deviation estimate is based the Bessel-corrected finite-sample variance estimate, which is unbiased; it has "n-1" in the denominator. Thanks to Jensen's inequality, this standard deviation estimate is NOT unbiased for the standard deviation (its square, i.e. the variance estimate, is unbiased for the variance). In particular, the standard deviation estimate is downward-biased. The MLE (of both the std and the var, by MLE's parametrization-equivariance property) actually has "n" in the denominator, we just don't use it (possibly because we can't use our nice beautiful Student-t distribution as the reference under the null in a hypothesis test if we use the MLE standard deviation estimate). And the MMSE estimate (of the variance) has "n+1" in the denominator (or is it "n+2"? I can't remember). The point is, an "estimate" is nothing but a "guess"... and an "estimator" is some principled way of arriving at the guess. Technically, "3" is an estimator: no matter what your estimand is, or what data you observe, you just close your eyes and guess "3".

on Conjugate Priors (copy-pasted email from my professor):

"Thinking about conjugate priors. Isn't it kind of arbitrary? For instance, I could define a 'Sparsh distribution family' that contains every nonnegative function that integrates to unity.. Now suddenly the Sparsh distribution family is a conjugate prior for every parameter? (E.g. If I want to estimate the number of degrees of freedom for a t-distribution, no matter how many data points I observe, each successive posterior distribution will be in the Sparsh distribution family.)"

"The Sparsh family proves that conjugate priors are not unique. Instead of Universality of the Uniform we can have Universality of the Sparsh family. :) So that's a good point. Still, there is usually a natural choice of conjugate prior, and there is a nice intuition if you think of 'mimicking the likelihood' (take the likelihood and change data to hyperparameters, obtaining something that could have been a likelihood function with prior data, and use that to give a prior density)."

on OLS Assumptions, and "Spurious Regression":

In math, I think it's very important to distinguish between and treat different as appropriate (1) hard assumptions and (2) things that make some mathematical object well-behaved/convenient, but aren't actually assumptions. OLS (the entire apparatus, where "(X-transpose X)-inverse X-tranpose y" is consistent/MLE/BLUE and the usual SEs can be used for Wald/t hypothesis testing, etc) has a very simple set of assumptions, encapsulated cleanly as follows: y_i | x_i ~indep N(x_i'B, \sigma^2). (Prof Rader used to call this "conditional independence, Normality, linearity, homoskedasticity". Notice I'm assuming the intercept/constant/ones is subsumed into the design matrix.) I hope you don't think that "y, x are jointly MVN" is an assumption, because if so then you are misunderstanding the entire premise of frequentist inference: the design matrix (X) is a given, we treat it as a constant and therefore can't even attempt to model its (nonexistent) randomness, forget posit that it has any particular distribution. And even if being philosophically wrong doesn't bother you and so you insist on modeling X as random (which is a legitimate approach in ML/generative models), consider that X can't generally even be modelled as drawn from some MVN: x's are often chosen to be evenly spaced along some interval (e.g. "drug dosage"), a discrete count (e.g. "number of children"), or some binary indicator (e.g. "yes/no was a recession year"). And finally, even if the x's are not directly chosen by the investigator, you can't just assume that the x's you observe are representative of the marginal distribution of x's: even if y, x are jointly MVN with mean zero, you could observe points from only the upper-right quadrant (i'm imagining plotting the joint BVN distribution on the Cartesian plane), and yet your estimation is still consistent. Recall that on the other hand, if the x_i DO have a distribution, then what "region" of that distribution you observe values from will affect your inference, because the regression line confidence band will be narrowest around the observed average (x-bar, y-bar) and widen from there. On the other hand, do notice that "no (perfect or imperfect) multicollinearity" is also not an assumption: in the first case you'll run into identification problems (some people call this problem "you can't invert the X'X matrix/the X'X matrix is singular" but I call it "there's no unique MLE"), and in the second case you'll have huge SEs (uncertainty... SE is just the vol of your coefficient point estimate, or in practice, a point estimate of the vol of your coefficient point estimate); but neither situation violates an assumption of the estimation procedure.

P.S. What about the classic case of spurious regression between two random walks? The key here is to observe that the random walk's local "trends" aren't deterministically related to any "time" or related variable, so you can't recover conditional independence across observations by including anything on the RHS. No matter how many variables you condition on (except the random walk itself, or the random walk lagged by one timestep), you'll never have that y_{t+1} is independent of y_t. This is where the breakdown arises.

on A Weird Way of Thinking about Regression:

This is something I'm playing with, that builds on the finance stuff I discuss later. (1) With multicollinearity, one diagnosable 'problem' is that your coefficient estimates become terribly unstable... basically, you start overfitting to the coincidental idiosyncratic "noise" between your highly correlated predictors... so, in a way, you just become extremely unstable/sensitive to the "spreads" between them, which is exactly the same as the problem that MVO encounters when you feed in a Cov with high entries (tightly-correlated assets)... it just starts WAY overfitting to the spread bets (and in particular, because the level of spread between two tightly correlated similar-vol assets is going to be very small relative to that vol, MVO will blow the magnitude of the bets on either side long/short way the fuck out to the point that it dominates whatever "original" bet it was making on their first principal component), which is why in practice we often shrink Covs toward zero. In fact, it's the same underlying problem: it's the inversion of the Covariance matrix (reminder: the X'X term from OLS coefficients!) that becomes extremely unstable/sensitive. (2) In finance, you "take risk proportional to Sharpe", and a Sharpe ratio is just a t-stat (scaled by some annualization factor). So, in prediction problems, maybe the way to justify dropping insignificant (i.e. low t-stat) coefficients is to say that you're "taking belief/credibility bets proportional to regressor Sharpe". Like, the other coefficients are just noising up your prediction (they're like 'low-bias, high-variance' estimates, in the language of the BV tradeoff... although notice that the other coefficients aren't necessarily biased either, in the most well-behaved cases, OLS is totally unbiased for every coefficient, and only the standard error of each coefficient varies). Right? Kinda?

on The Prosecutor's Fallacy in the Context of Frequentist Inference:

(Inspired by Joe Blitzstein's Introduction to Probability (2015), Biohazard 2.8.1 (p66). As Prof Blitz remarked in class, examples in statistics tend to be quite morbid for some reason. Sorry.) Consider the Sally Clark trial of 1998. In simplified terms, the lawyers arguing that case committed the Prosecutor's Fallacy: they confused a tiny value of Pr(infant dies | mother innocent) as implying a tiny value of Pr(mother innocent | infant dies). (I recommend reading the textbook or Google to learn more about why this is a terrible error in reasoning.)

First of all, I note that this is a good, natural application of Bayesian hypothesis-testing: we're comfortable interpreting the "probability" that the mother is innocent as the strength of our belief that she is innocent, even though in reality the infant has already died and at this point the mother either certainly committed the crime or certainly didn't (we just don't know which).

However, we can also apply frequentist ideas here: we can conduct a hypothesis test of the alternative (H1, that the mother is guilty, i.e. that she murdered the infant) against the null (H0, that she is innocent). A natural "test statistic" is the binary outcome ('yes'/'no') of whether the infant dies, in this case 'yes'. The p-value is Pr('yes' | H0) = Pr(infant dies | mother innocent) = Pr(SIDS) (since the only way for the infant to die if the mother is innocent is SIDS), which we already know is tiny, so tiny that it would be smaller than any reasonable Type I error level. Therefore, we reject the null, and conclude that we have statistically significant evidence that the mother is guilty. To me, there seems to be a paradox here: we apparently followed sound frequentist principles, but our conclusion is badly wrong.

One might choose to 'resolve' this paradox by arguing that in a court of law, the Type I error level should be chosen to be even tinier than tiny, almost zero in fact, according to Blackstone's principle that "it is better that ten guilty persons escape than that one innocent suffer". But this doesn't actually resolve anything: pretend that the 'punishment' for a guilty verdict is mild, maybe a $100 fine, then we're back to tolerating a small but reasonable Type I error level, say 1%. (For simplicity, ignore the issue that achieving exactly 1% isn't always possible with a discrete test statistic.) In theory, then, if we were presented with 100 random mothers who we knew to be innocent, we'd desire and expect our test to accidentally flag at most only about one of them as guilty. This is at odds with our intuition from the discussion of the Prosecutor's Fallacy: there, we saw that in most cases like Sally Clark's, the mother is actually innocent; but this test always flags them as guilty, because the p-value is so tiny. Apparently, then, this procedure's actual Type I error rate is literally 100%.

What's going on here is that we've misspecified the reference distribution of the test statistic under the null, because we failed to condition on relevant information. What it means for our procedure to have a Type I error level of 1% is that, if we fed in 100 random mothers who we knew to be innocent, the test would accidentally flag only about one of them as guilty. But this is satisfied: if we went out into the general populace and collected 100 random mothers who we knew to be innocent, almost none of them would have infant who had died at all, and therefore our test would set them free. But we're not interested in the general populace: we're interested in a very specific subpopulation, which is mothers whose infants have died, because those mothers are the ones who will face trial in a court of law. So, in considering cases like Sally Clark's, we were actually feeding into our procedure not 100 random mothers who we knew to be innocent, but 100 random mothers who we knew to be innocent and also whose infants had died. (For clarity, I'm restricting my attention to only mothers who we know to be innocent; of course, in real life, we couldn't know this, otherwise we wouldn't need to put them on trial in the first place.)

So the right thing to do is to compare the alternative H1, that the mother is guilty and the infant died, against the null H0, that the mother is innocent and the infant died. Then, the p-value is Pr('yes' | H0) = Pr(infant dies | mother innocent, infant dies), which is of course trivially 100%. So, the p-value here will always be larger than any reasonable Type I error level, and this test is powerless to detect any deviations from the null. What we'd have to do in this case, if we insisted on using frequentist inference, is find a different test statistic, that could actually give us a nontrivial test.

on Operators in Physics & Higher-Order Functions in CS:

Not really a lesson, more of just an observation. Physicists, especially in classical & quantum mechanics, often (deliberately) conflate an operator with the underlying physical quantity it tries to measure. For example, the momentum operator "p", which has units of "mass × length / time", is "-iℏ∇". The energy operator "E" (assume zero potential, I'm just considering at translational kinetic energy), which has units of "mass × length² / time²", is "(1/2m)(-ℏ²)∇²", often written as simply "(1/2m)p²". At first glance this might make sense: It's just a relationship between kinetic energy and momentum. But if you take a closer look, even just "p" itself isn't a number, or even a formula: It's an operator, an expression that's arguably meaningless without an argument. If you apply the "p" operator to a wavefunction, you get the corresponding quantum state's momentum, but without a wavefunction, "p" is sort of just a rule for calculating momentum. But this turns out to be a very elegant way of writing the relationship: It turns out that applying the "p" operator to a wavefunction returns another function (just as applying the "squared" operator to a function returns another function: for instance, passing the function "f", defined as "f := x -> 3x" or in more familiar notation "f(x) := 3x", into the "squared" operator yields "(squared·f)(x) = 9x²"), and then evaluating that function at a particular point gives momentum (at that point). More importantly, applying the "p" operator to that function then dividing by 2m gives yet another function, which can be evaluated at a particular point to give energy (at that point). In fact, thinking this way is very natural to functional programmers, as I alluded in the title. Clearly deriving the functional forms of each operator and proving the relationships between them is very difficult; but, at least, the idea that you can state relationships between meaningless expressions that have no value yet, is an idea that (effective) computer scientists use all the time. If I'd understood it this way in school, I might have struggled much less in my early physics classes.

on the Evolution of Quantum Wavefunctions:

When I first learned how to calculate the time-evolution of wavefunctions, I naively expected that all wavefunctions should eventually decay into Gaussians, because of a CLT-like argument. It helps that even glossing over the probabilistic interpretation of wavefunctions (hand-wavily, a wavefunction encodes a probability amplitude, so that its squared modulus is a probability density), Gaussian wavefunctions have nice properties (minimum-uncertainty wavefunction, so that the Heisenberg Uncertainty Inequality binds as an Equality).

But this is obviously silly. The evolution of a wavefunction is not analagous to repeated observation of where it collapses. My naive initial reasoning might as well say that every stochastic process eventually decays into a Gaussian---clearly untrue, just take the most basic example of a well-behaved Markov chain with a Uniform stationary distribution.

on Some Random Finance/Investing Shit:

Some rudimentary risk-management concepts: Risk contributions (an asset's "total" risk contribution can be written as its portfolio weight times its "marginal" risk contribution, which can itself be written in terms of derivative of portfolio vol w.r.t. the asset's weight (as here), or in terms of covariance between the asset and the portfolio (as here)), Idiosyncratic volatility, "Compensated" risk vs "uncompensated" risk

Some rudimentary portfolio opt concepts: MVO + the geometric interpretation (PCA, think of simple 2-asset cases), Spread bets, Black-Litterman

on an Intuitive Interpretation of IQF-11.5.1(b):

(Refers to Stephen Blyth's An Introduction to Quantitative Finance (2014), Exercise 11.5.1(b).) We can understand this result as expressing the limit of our graphical illustration of how to approximate an arbitrary payoff using only call options. As the increment between successive strikes becomes more granular (vanishing in the limit), we actually converge to a replicating portfolio. The constant in front simply tares our level; but the real insight is that the integral is a sum over call options with different strikes, and the second derivative factor in the integrand (the instantaneous change in slope of the payout) tells us how many calls we must buy/sell at the current strike to adjust our trajectory up or down.

on IQF-2.10.7 and -9.6.4 (Resolving Stat123's Stat110 FX forward brainteaser, Parts I and II):

Ia/b: By solving for the respective $\mu$'s that satisfy the constraint that the rsp FX forward rates are the rsp expected future FX spot rates, you are implicitly calculating rsp risk-neutral/no-arb probability distributions (subject to following the basic Lognormal form). By the way, the reason you have to e.g. discount the dollar-per-euro FX forward rate by the euro interest rate is same reason you have to discount a dollar-per-barrel commodity forward price by its convenience yield (or, as the case may be, "markup" the forward price by the carry cost).
Ic: The condition is that $\sigma = 0$ i.e. there is no diffusion/indeterminacy over time, and the practical implication is that it will never be true that both FX forward rates are equal to the corresponding expected future FX spot rates. Or is it??? Really, the practical implication is that it must always be true that both FX forward rates are equal to the corresponding expected future FX spot rates, BUT you have to use the rsp appropriate no-arb measure. This is a bit definitional/circular: you have to make sure that the FX forward rate (which you can calculate directly from the formula using no-arb arguments alone, without even knowing the concept of a risk-neutral measure) is the same as the expected future FX spot rate, and that's how you DEFINE the no-arb measure, which you can then use for other calculations on other assets. So really, verifying that the FX forward rate is equal to the expected future FX spot rate is more like a way of sanity-checking that you calculated the no-arb measure properly. But there are often many different ways to calculate the no-arb measure w.r.t. a given numeraire, and the fact that all the other sanity checks tie out no matter which way you choose can feel like a magical feat of self-consistency. #!TODO: Applying this idea to IIb/c below is straightforward, but is there really nothing we can say in e.g. IIc about $E_{p^*}(Y_1)$, the expected future pound-per-dollar FX spot rate under the risk-neutral measure w.r.t. the $ZCB numeraire?
Id: #!TODO: Why is this, after all?
IIb: I find it easier to think about calculating the risk-neutral/no-arb probabilities using FTAP with the $ZCB numeraire, but remember that in this case "a pound" at time t isn't still "a pound" at time T... it is "bigger" by the risk-free interest rate.
IIb/c: The "expectations" here are being set equal to the corresponding FX forward rates.
IId: The point here is that even though the no-arb measure was mechanically different based on which numeraire you chose, the actual current spot price of an asset calculated under the first measure is the same as the spot price of that asset (or in this case, a different asset with identical payout profile) calculated under the second measure.

on Incorporating the Cost of Carry into the Forward Price:

Consider an asset with current spot price $X_t$, let the cost of carry be $c$ (where $c$ could be a negative number, as if the cost of carry is a storage cost; or a positive number, as if the cost of carry is some convenience yield... or some number netting out the effect of both, as if the asset is gold bullion and there is some cash outflow associated with storing a large shipment for a long time but also some cash inflow associated with being able to rent out your physically owned gold to rappers for use in their music videos), and let the (fixed) interest rate be $r$. Then, one way to express the no-arbitrage forward price for a contract with settlement/delivery date $T$ (struck at time $t$) is $F_{T, t} = X_t \exp{(r+c)(T-t)}$.

This often confuses people in the case where the asset is a bond, because in that case $c$ is used to represent the coupon payments of the bond (another way to write this is $F_{T, t} = (S_t - PV_t(C)) \exp{r(T-t)}$), where $PV_t(C)$ represents the present value at time $t$ of the coupon payment stream between now and the settlement date). Because they feel like "well, isn't it unfair that the seller/short side of the forward contract has to lose the value of the coupons and therefore ends up effectively on the hook for coupon payments?". But this isn't true. First of all, consider that if $F_{T, t}$ didn't deduct the value of the coupons, there's an immediate arbitrage beginning with an empty portfolio: at time $t$, go short one forward contract and borrow $X_t$ of cash to go long one physical bond; at time $T$, deliver the bond and receive $X_t \exp{r(T-t)}$ of cash, which you use to repay the loan; which seems fair because you still end up empty, until you realize that you've actually been pocketing the coupons between $t$ and $T$.

And that's the key here: the asset you deliver at time $T$ isn't actually the same as the asset you bought at time $t$... it's as if you promised to sell a truck full of cash, but between the date of the sale and the date of delivery you took out some of the cash. Really, the bond can be thought of as two separate assets (ignore the effect of default risk here): one stream of guaranteed cashflows from $T$ onward, and a separate stream of guaranteed cashflows from $t$ to $T$. The contract is struck on the former, and so it doesn't really make sense to consider the latter in the terms of the sale... but that means you have to discount the value of the latter from the base of the sale price, because today's price of that bond in the spot market implicitly bundles up both streams. To wit, even if the seller of the forward didn't actually own a physical bond yet and instead just chose to buy in the spot market at $T$ to cover the short, he would be buying a version of the bond whose price no longer included the effects of the coupons before $T$ (because those were already settled and gone), and so it's only fair to set the forward price accordingly.

TL;DR: You can't just blindly base the forward price on the current spot price, because the current spot market prices in any cashflows received between now and the settlement date, which the buyer of the forward neither expects nor is entitled to receive, and therefore which the seller of the forward neither expects nor is required to pay.

on How Futures Returns Are Already Excess of ZCB Returns:

How Futures Returns Are Already Excess of ZCB Returns

on Yield-Curve Rolldown (Bond/Fixed-Income Rolldown Return):

The 'term premium' describes the observation that the yield to maturity is higher for long-dated loans than short-dated. But, what is rolldown yield? Well, it's the 'unfair' or 'extra' yield you get as the loan you made 'rolls down' the yield curve toward maturity (i.e. zero-dated). For example, suppose YTM(3yr) = 1.75%, YTM(2yr) = 1.5%, YTM(1yr) = 1%. Then, a 3-year zero-coupon bond has price $0.949, a 2-year ZCB has price $0.971, and a 1-year ZCB has price $0.990. Suppose the yield curve stays fixed, so that the term structure doesn't change. If you hold each of these bonds to their respective maturities, the annualized rate you earn is exactly what's advertised. But let's look at how it breaks down: With the 1-year ZCB, your bond appreciates in price to par, and you earn 1% this year. With the 2-yr, your bond MUST appreciate in price to the 1-yr ZCB price this year, since at the end of this year it becomes effectively a 1-yr ZCB... so you earn ($0.990 - $0.971) / $0.971 = 1.957% this year! And then 1% the next year as your bond appreciates to par. And with the 4-yr, analogously, your bond earns ($0.971 - $0.949) / $0.949 = 2.318% this year! And then 1.957% the next, and finally 1% in the third year. So, if you continuously rebalance your portfolio to long-dated debt, you can pick up (ceteris paribus) well over 2% a year, even though the 'term structure' might suggest that the max you can earn is 1.75%/yr! One argument for why this is 'fair' (IRL, not in our simple fixed-YC example) is that longer-dated bonds bear more default/duration/inflation risk/uncertainty. In math (P is price at time 't', fixing T, the maturity):

P[t,T] = exp[-r[T-t] * (T-t)], e.g. r[t=0, T=3] = 1.75%/yr (obviously this needs to be converted to the logarithmic/continuous rate)

(dP/dt)/P = d/dt[r[T-t] * (T-t)] = r[T-t] - (T-t)*(d/dt)[r[T-t]]

In the final expression above, r[T-t] is positive (it's the current YTM/'term premium'), but (d/dt)[r[T-t]] is NEGATIVE... since it represents the change in the term premium of this bond, which is dropping as time progresses and this bond becomes a shorter- and shorter-dated security (2% for the 2y, 1% for the 1y). So the -Negative part at the end is the rolldown. Notice that the rolldown part IS sensitive to the duration of your bond: in this case, for a ZCB, the T-t is its duration. I can sort of see why it's called "rolldown": in our example above, at the far end of the curve (3y->2y) where it was kinda flat, we picked up only extra (2.318%-1.75%)/3=189bps of duration-adjusted return, whereas at the nearer end (2y->1y) where it was steeper, we picked up extra (1.957%-1.5%)/2=229bps of duration-adjusted return. (P.S. One extension to this analysis is to consider some suitably well-behaved stochastic yield-curve model.)

Note: d/dt[dP/dt / P] = d/dt [ r[T-t] - (T-t)*(d/dt)[r[T-t]] ] = d/dt[r[T-t]] - d/dt[ (T-t)*(d/dt)[r[T-t]] ] = 2r° - (T-t)r°° (Newton's notation)

Upshot: in our example and in practice, both the first and second (convexity) derivative of r w.r.t. little-t tend to be negative (remember, we fixed the yield curve and maturity T, so r[] is a function of (T-t), which gets SMALLER as little-t gets BIGGER... i say this to clarify in case you're looking at the traditional picture of the YC which shows r getting larger as big-T gets bigger). So, if I did my arithmetic right, there's a contention between whether your instantaneous return will rise or fall at any given time, depending on (1) how steep and (2) how concave the YC/term structure is (that is, how fast its steepness is changing). To see a concrete example, consider adding a 4th point (4y) to the YC above, but say that the term structure is just almost flat there, so the YTM is 1.76%. Then, your return on a 4yr ZCB this year is 1.790% (it appreciates from $0.933 to the price of a 3y ZCB $0.949... be careful with rounding errors!), but clearly we saw that the next year it will be ~2.318%. So yeah, the term premium was lower at 3y than 4y (steepness), but this effect was overwhelmed by how concave the YC was as a whole (how quickly it got steeper and steeper).

Be careful, I got confused here! At first I thought this was an effect of "concavity at 3y" vs "concavity at 4y", and therefore the YC must have been more concave at 3y than 4y for this example to work... but that's not true! "concavity at 3y vs 4y" is a question answered by the THIRD derivative, not the 2nd. The YC could have constant concavity everywhere (as if, for example, it was described by the left half of an upside-down parabola), and this would STILL work! I think my confusion stemmed from the fact that i was trying to juggle my discrete-time example above with the intrinsically continuous-time AKA "instantaneous" concept of differentiation. The key is just that yes, the slope is negative at 4y (TP lower at 3y), so that hurts your return going from 3y->2y vs 4y->3y... but the slope is even MORE negative at 3y (TP even more dramatically lower at 2y), so that helps your return going from 3y->2y vs 4y->3y (because of rolldown). And in this example, it helped so much that our return 3y->2y ended up being actually higher than our return 4y->3y, despite the lower term premium at 3y. I wanted to come up with a more well-behaved, arbitrarily differentiable (even if perhaps also less visually intuitive) concrete example showing this, so I thought of the following model: ZCB price at time t with maturity T is B[t,T] := exp(-y[T-t](T-t)), where YTM/term premium is y[s] := -1/1600(s-4)^2 + 1% (for s from 0 to 4), so that the second derivative (rate of change of slope, or convexity) w.r.t. little-t is constant at -1/8. If I did my arithmetic correctly, (I used WolframAlpha with the query 'exp(-(-1/1600(4-(4-(t+1)))^2 + 0.01)*(4-(t+1))) / exp(-(-1/1600(4-(4-t))^2 + 0.01)*(4-t)) - 1 where t=0,1,2,3,4' ... notice I could have gone to 5, but that doesn't make sense as it's the yield you get according to this YC if you hold this bond from time littlet=5y to time bigT=4y, going backward in time), then the sequence of annual returns if you start holding a bigT=4y bond at time littlet=0, is 1.19%, 1.32%, 1.07%, 0.44%... so indeed, even though the YTM is smaller at the 3y point than at the 4y point, we earned a larger annual return going from 3y->2y than 4y->3y, because the concavity of the YC meant that the rolldown benefit in the 3y->2y case overwhelmed the higher term premium. Notice that in fact the duration-adjusted rolldown benefit in the 3y->2y vs 4y->3y case was even greater than the unadjusted benefit, since in the 3y->2y case the adjusted benefit scaled up by the ~2.5year avg duration over that period was enough to overwhelm not only the higher term premium in the 4y->3y period but also to overwhelm in addition that 4y->3y period's own adjusted rolldown benefit scaled by the ~3.5yr avg duration over that period.

There's definitely a lot of rounding/hand-waving/guesswork in my writeup here, but I think the core is solid. Many thanks to a couple people who sat with me and helped flesh this out.

On Curing Negative Long-Futures Contango Rolldown Yield:

Deposit your goddam margin!

on Why Options Always Have Positive Vega (in Particular, Why Puts are Long Vol):

(I apologize in advance for my reliance here on words rather than drawings. HTML is not very graphics-friendly. However, if you yourself sketch out the diagrams I describe, you'll find the argument much easier to follow. Note that I assume throughout that we are long the option. Make the obvious changes if you want to understand what would happen if instead we had written it. I caution that in this lesson, rather than teaching, I will merely propose a simple example that I think is evidence that puts are indeed always long vol... I was struggling to reconcile this received wisdom in my own head, and I'm just writing down a thought I had about it. Somebody please also check my arithmetic. Sorry about the rambling/shitty formatting.)

I learned in an arbitrage-free asset pricing class that option contracts always have positive vega (a derivative's vega is the first derivative of its value with respect to volatility, i.e. the derivative's 'exposure' to vol) (as always, keep straight the distinction between a derivative---which is a contract whose payoff depends on some simpler underlying asset's observed price---and a derivative---which is a rate of change, often imagined as the slope of a tangent line). This is in fact a necessary consequence of put-call parity: the forward-minus-strike side of the parity is vol-neutral, hence so must be the call-minus-put side.

If you picture the traditional lognormal stock-price diagram (which arises from Geometric Brownian Motion), this makes sense for a call option. Traditionally, the price of a stock has a floor at zero (a stake in a company can be worthless, but not less than worthless), but as vol increases, the juicy right tail gets fatter off toward infinity, bulking up the probability of a crazy up-move. (Recall that a call option's payoff at exercise is zero for all stock prices below the strike, but rises without bound as prices climb above the strike.)

But this seems less obvious for a put. For example, say a stock's current price were literally a penny, as close to zero as makes no difference. Who cares about that penny I'm missing out on? Surely, as the holder of a put, I would want the price to stay put (no pun intended), just sitting there until exercise, rather than risk an up-move of potentially unlimited magnitude? From my perspective, as vol increases, maybe the probability of earning an extra penny increases, but so too must the probability of getting wiped out entirely? How can I possibly be long vol?

In this case, I think what's happening is that the intricacy of lognormals and Black-Scholes, which are usually well-behaved and elegant models, actually ends up obscuring insight. Take a much simpler model. Say that, rather than lognormal, the (risk-neutral) probability distribution of stock price at exercise is just two boxes, one from zero to the strike and one from the strike to some specific upper bound. Furthermore, assume that interest rates are zero, so that (a) there's no discounting necessary and (b) there's no drift, i.e. the current spot S=1 is the same as the expected future spot, which for further simplicity is the same as the strike K (i.e. the call is written at-the-money).

Now let's make the idea of vol more concrete here. Imagine three states of the world, a low-vol state A, a mid-vol state B, and a high-vol state C. In state A, the PDF of the stock price at exercise is uniform from 0.5 to 1.5 (that is, unity between 0.5 and 1.5 and zero everywhere else). In state B, the PDF is uniform from 0 to 2 (1/2 between 0 and 2 and zero everywhere else). Finally, in state C, the PDF is 2/3 from 0 to 1 and 1/6 from 1 to 3. Notice that in every case, the PDF integrates to unity and its mean is S=1=K.

What's the value of the put in each state? Well, we can simply take the expected value of the payoff, [1-S]⁺. In state A, this is the integral from 0.5 to 1 of (1-s)(1), that is 1/8. In state B, this is the integral from 0 to 1 of (1-s)(1/2), that is 1/4. In state C, this is the integral from 0 to 1 of (1-s)(2/3), that is 1/3. Voila! As the vol increased, the value of the put increased. This was true even as we moved from mid-vol (state B) to high-vol (state C), which entailed leaving the lower bound of the PDF's support at the floor of 0 but pushing out the upper bound from 2 to 3.

What happened here is that, because the risk-free rate is zero and we must analyze the problem ceteris paribus, we had to keep the PDF's mean at the current spot, 1. So, as the vol increased, the PDF's right tail got fatter (going from zero above 1.5 to zero only above 2 to zero only above 3) which increased the length/size of the sample (sub)space corresponding to us getting wiped out, but this was more than compensated for by the leftward shift in overall probability mass, which actually ended up decreasing the measure of that undesirable sample space. Said another way: Even though it wasn't true that the total number of good outcomes kept increasing, it was true that the probability of winding up in one of those good outcomes kept increasing.

As an exercise, verify put-call parity by calculating the value of the corresponding call in each of states A, B, C, keeping in mind that the payoff of that call would be [S-1]⁺.

So, at least in this example, we can amend our understanding of why options always have positive vega: for a call, as vol increases, the probability of a crazy up-move way out into the right tail also increases, which has a lot of leverage on the expected payoff; and for a put, as vol increases, initially the same effect applies, but even after we hit the zero lower bound and it stops applying, the total probability mass flows leftward and so the probability of winding up in a desirable outcome still keeps increasing.

on The Tracking Error of a Subset of an Index's Stocks:

The market is n i.i.d. unit-vol assets. Suppose you invest in an equal-weight portfolio consisting of m <= n of them (note that for this analysis, leverage doesn't matter... so I just put passive long portfolio weight of unity on each asset). What will be the correlation of your PNL to the market? Well, first notice that the market portfolio, x_1 + ... + x_m + ... + x_n, has vol sqrt(n), whereas yours, x_1 + ... + x_m, has vol sqrt(m). Now, Cov(x_1 + ... + x_m, x_1 + ... + x_m + ... + x_n) = Cov(x_1 + ... + x_m, x_1 + ... + x_m) + Cov(x_1 + ... + x_m, x_{m+1} + ... + x_n), where the first part is the variance of your portfolio and the second part is zero by assumption. So, the covariance is m. Hence, the correlation is sqrt(m/n), which is a nice result: if you have only very few assets (i.e. m << n), then adding more greatly reduces your tracking error... but if you already have most assets (m ~ n), then you'll already have a reasonably low TE (holding portfolio vol constant). I think this is basically the CLT.

on How Prices Can Both Trend and Mean-Revert:

It initially seemed paradoxical to me that asset managers can trade profitably on both short-term price trends (momentum/continuation) and short-term reversals. The key is that the first is a timing/directional strategy whereas the second is cross-sectional. Imagine that a stock's return r^{i}_t is given as m_t + d^{i}_t + \varepsilon^{i}_t where m_t is the market return (which is random but tends to trend, think of it as a first principal component), d^{i}_t is the idiosyncratic return (which is also random but tends to mean-revert), and \varepsilon^{i}_t is white noise. I find this easiest to intuit when the magnitude of the idiosyncratic term is much smaller than the magnitude of the market term. Then, betting that any given stock's return on the next day will match the sign of its return today is a good bet, but betting (long/short) that the stocks that outperformed today will underperform tomorrow (and vice versa) is also a good bet.

on XS (Cross-Sectional AKA Relative-Value) vs TS (Timeseries AKA Timing AKA Directional) Trading Strategies:

A TS strategy is a strategy that looks at individual securities (a security is just a tradable asset), asks "is this security's price going to rise or fall?", and then accordingly goes long or short that security. They get their name because academic finance papers on this topic are often titled something like "Factors that Predict the Timeseries of Returns" or "Factors that Explain the Timeseries of Expected Returns" (the titles are equivalent: in the regression/R-squared framework, using time-t information to forecast time-t+1 returns is the same as using time-t information to estimate the conditional mean i.e. compress the conditional variance of time-t+1 returns; because if you can pin down an accurate predicted value for time-t+1 returns, then you reduce your uncertainty about them... this is related to the idea of an "unpredicted" or "unexplained" error/innovation term in a regression model specification, if some RHS variable helps predict the LHS variable but you don't include it in your regression, then it "lives in the innovation term", so that to you, its effect on the LHS variable looks like unpredictable noise, decreasing your R-squared). To run a TS regression, you essentially stack each timeseries of security-specific returns (assuming each timeseries has similar and constant volatility, or else the most volatile securities/time periods will dominate the results thanks to the OLS loss specification... an alternative is to use WLS or GLS) into a single long vector on the LHS, then timeseries-zscore each security's associated signal value (focus on a single security at a time) and stack those timeseries into a single long vector on the RHS (no comment on the lookahead bias here), then regress on that (plus desired binary indicators AKA "dummies" for entity-/time-fixed effects).

A (simple) XS strategy (named based on papers like "Factors that Explain the Cross-Section of Expected Returns") is a strategy that looks at a pair of securities, asks "is B going to outperform or underperform A?", and then accordingly goes long B - short A or long A - short B. This is often referred to as going long the spread between B-A, or going short the spread between B-A. I like this notation because if you think of "short" as "negative sign", then long the spread is B-A while short the spread is -(B-A)=A-B... and indeed, shorting the spread between B-A is the same as longing the spread between A-B. (In reality, there are two practical considerations: First of all, we usually risk-target both sides, so that e.g. if B is twice as volatile as A, then the "spread" might look something like long 0.5 notional units of B for every short notional unit of A, i.e. you apply half as much gross leverage to B as you do to A, because we usually make predictions about risk-adjusted returns not raw returns. Second of all, because A and B tend to be similar securities (because it's hard to make predictions about the relative performance of very different securities) and therefore highly correlated, their spread tends to have very low volatility, so we have to lever it up to get meaningful risk exposure. For example, suppose B is an emerging-market country-level equity index with volatility 20% and A is a developed-market country-level equity index with volatility 10%, and that their correlation is 0.9 because they both load heavily on the global risk-on/risk-off factor i.e. they both tend to rise in price when investor "animal spirits" have high risk appetite, and fall when "animal spirits" have low risk appetite. Then if your NAV is $100 and you go long $50 of B (portfolio weight +0.50) by itself, you get 10% vol, same as if you go short $100 of A (portfolio weight -1.00) by itself. But if you combine both transactions, then you're only getting 4.47% = 0.0447 = sqrt(0.5**2 * 0.20**2 + (-1)**2 * 0.10**2 + 2 * 0.5 * (-1) * 0.9 * 0.20 * 0.10) vol... so you might want to "lever up" and go long $100 of B and short $200 of A, to a meaningful risk exposure of 8.94% vol.) To run an XS regression, you essentially take each "row vector" representing a single timestep's security returns and cross-sectionally zscore each one (focus on a single timestep at a time here, although clearly this is not a valid way to ex-ante risk target in practice, since you don't observe the cross-section's realized volatility beforehand... btw, notice that the de-meaning as part of the z-scoring removes each timestep's time-fixed effects!) and stack them into a single long vector on the LHS, then similarly cross-sectionally zscore (or rank) each timestep's cross-section of security-specific signal values and stack those "row vectors" into a single long vector on the RHS (there's no lookahead bias here, since we do observe signals beforehand), then regress on that (plus, optionally, binary indicators entity-fixed effects).

So for instance if a TS strategy is long A and short B, it's because it expects A to rise and B to fall, whereas if an XS strategy is long A and short B, it's because it expects A to outperform B, i.e. maybe A will rise and B will fall, or maybe both will rise but A will rise more, or maybe both will fall but A will fall less. Often TS strategies are assumed to be "trend-following", betting that prices will continue to rise or continue to fall, but this doesn't have to be the case: TS strategies can just as easily bet that the direction of prices will reverse what they've been doing recently. Similarly, often XS strategies are assumed to be "statistical arbitrage" or "pairs trading", connoting strategies that bet on a spread's mean-reversion toward zero: that the spread between B-A is usually small, but it's grown large (either positive or negative), and therefore that, going forward, the prices should come back into line with one another. But again this doesn't have to be the case: XS strategies can just as easily bet that the spread between B-A will actually continue to widen, perhaps because B has outperformed A recently thanks to some good news but that markets haven't fully reacted to (AKA "priced in") the full impact of the news yet, and so B will continue to outperform. (Note that from this perspective, an XS strategy is like a TS strategy where the "asset" we're timing is the spread itself.)

Stylistically, we often think that XS strategies have a better Sharpe than TS strategies, which is the same as saying that the R-squared of the XS panel regressions is better than the R-squared of the TS panel regressions: it's easier to predict which assets will outperform which assets, than it is to predict which asset prices will rise or fall, because predicting the latter requires you to forecast not only good or bad news but also investors' notoriously fickle animal spirits, whereas to predict the former it might be enough to ask, "regardless of whether news comes out good or bad, or whether investors get bolder or more spooked, is asset B or asset A in a better position?".

on Spurious Correlation from Joint Holdings:

If you try to estimate the correlation of two strategies (e.g. VAL, MOM) applied to a common asset universe, you might run into "spurious" correlation (or anticorrelation). For example, suppose two strategies are -0.5 correlated, and you apply them to the same asset universe. And suppose that over your backtest period, the two strategies HAPPENED to have the same directional view on some asset on some day (take for example, both long gold the day before the GFC), and on that day there was a GIANT exogenous positive shock to the gold price. Both strategies would realize outsize positive PNL that day, and if it's extreme enough (let's say usual daily PNL is between -1% and +3%, and the PNL on the day of the gold shock was +60%), the usual correlation estimate (which is really an encapsulation of linear relationship in a nonlinear world) might "pick up" on that too much, saying, "wow, these guys usually do what they do around the same -1% to +3% level, but then every now and then they both have HUGE positive days together... very correlated!". One way I was suggested to solve this is to randomly partition the asset universe into halves, backtest each strategy on each separate half, calculate the correlation estimate, then repeat over and over, and take your final estimate to be the equal-weighted average of all the rounds. (This isn't foolproof: for example, if you only do this once, and gold front-month/gold second-month happen to get split into different halves, you might still "effectively" have this same problem, since gold front-month/gold second-month are arguably the "same" asset, so you haven't changed anything. What you need to do is split the universe into two halves such that the idiosyncratic noise of any two assets that 'cross' the partition is independent.) Another way I can think of is to calculate the correlation of views: e.g. every day estimate the correlation of strategy A's views vector on strategy B's views vector, then take the average over time. But again this isn't foolproof: in an extreme example, pretend that strategy A was always just passively long the front month in every commodity (at unit leverage), and strategy B was always just passively long the second month. These strategies are "actually" highly correlated, and their return streams will be in spirit indistinguishable over the long term (maybe A is slightly more volatile, I don't know), but this "average daily views correlation" will be very negative, since the slots in A's views vector with +1 portfolio weights every day will be the slots in B's views vector with 0 portfolio weights every day, and vice versa. A third way is to try to combine the best of both worlds: estimate an ex-ante asset-level Cov every day, let u, w, be A's and B's views respectively, calculate u'\Sigma w / \sqrt(u'\Sigma u w'\Sigma w), (this is ex-ante correlation each day) and take an average over all days in the backtest. This way is much less suspectible to quirks of ex-post full-sample PNL level differences (like that extreme +60% outlier).

on Why You Need to be Careful About Interpreting Alpha:

Everyone knows that the beta of asset Y to asset X isn't just the reciprocal of the beta of asset X to asset Y, which is a little counterintuitive from a geometric standpoint but very natural from a financial standpoint: suppose Y and X have the same vol, then their beta is just their correlation, and obviously the correlation of Y and X is the same as the correlation of X and Y (and in particular, is not in general its reciprocal). But not everybody is as aware that the alpha of asset Y over asset X isn't just the negative of the alpha of asset X over asset Y. For example, consider two completely uncorrelated positive-mean assets... then, their ex-ante unconditional joint distribution looks like a cloud of points in the upper-right quadrant, and obviously each has zero beta and positive alpha to the other. More generally, any two ex-ante symmetric assets Y and X will necessarily have the same beta and alpha to each other. But the caveat doesn't stop there... for example, suppose Y is equally likely to realize total return $1%, 2%, 4.5%, 3%, or 5%$ and X's corresponding total returns are $1%, 2%, 3%, 4%, or 5%$, then we have two clearly asymmetric assets, but (do the calculation) both have both positive beta to and positive alpha over each other!

on Why You Need to be Careful About Interpreting P-Values:

Everyone knows that a p-value isn't the probability that the null is true, but the probability of observing an outcome (i.e. test statistic value) at least this extreme IF (i.e. GIVEN that, AKA conditioning on the fact that) the null is true. (Everyone, that is, except apparently Nobel laureate Gene Fame who in Fama-French 2015 "International Tests" wrote that "the GRS (1989) test says the probability the true intercepts are zero is zero to at least five decimal places", despite meaning that "the GRS (1989) test says the p-value under the null hypothesis that the intercepts are zero is zero to at least five decimal places", which means that the probability of observing a result this extreme if the true intercepts are zero would be basically zero, which means that the data is highly inconsistent with the null hypothesis and thus we can reject the null... which sounds a lot less punchy but more accurate than Fama and French's actual statement.)

But some people still say silly things like "oh, I'm using a 5% significance level (Type I error = false positive rate), so I should expect about 1 in 20 null signals to be 'accidentally' flagged as significant in my research". This is not true. Frequentist statisitics neither does nor wants to impose probability distributions on regressors (i.e. potential signals). Frequentist tests would be just as valid whether you tested 20 completely unrelated signals, or 20 very arbitrarily related signals. For example, suppose your 20 potential signals were just identical copies of each other... or suppose they were just rescaled versions of each other... or suppose they were just 20 slightly perturbed (via addition of a vanishing amount of random noise) versions of one "core" signal. You would be very stupid to expect "about one" (p.s. I hate this "about one" garbage... the expected value is 1 and exactly 1, stop adding extra shit that obscures the math and makes it seem magical or hard to understand) out of those 20 signals to be flagged: a better, more principled guess would be that either none of them will flag, or they all will (I can't intelligently assign probabilities to either of those two individual states of the world without knowing more about the market and the signals themselves, but I can pretty confidently assert that the SUM of their probabilites is nearly unity i.e. anything ELSE happening has probability almost zero). The correct statement is "oh, I'm using a 5% significance level so, given some arbitrary but fixed null signal, if god ran 20 random simulations of the market data-generation process, I should expect 1 in 20 of those return stream realizations to---by pure dumb luck---appear correlated enough with my signal to flag it as significant in a backtest". Again, we trade off some syntactic punchiness for a lot of semantic accuracy.

on Why T-Stats Suck for Identifying Investment Signals (from Least to Most Damning):

They give no sense of the economic magnitude of the coefficient estimate. On a tightly related note, people will often take two systematic/algorithmic strategies, one of which has been running live for longer than the other and therefore has a longer history, and just truncate the first so that its timeseries length matches the other's. "Because," they say, "I don't want to punish the second strategy simply for being new and innovative---let's put these guys on even footing." This is a simple and extreme example, but the idea is that people will often try to normalize things so that when comparing two things via t-stat, one doesn't get an 'artificial edge' or an 'inflated advantage' over the other merely because it has a larger effective sample size. But this isn't an 'artificial edge', it's an intrinsic property of t-stats... the way they are designed and the way they function, they MUST take into account sample size!
People just "assume" that the asymptotic Wald test (AKA asymptotic z-test) is valid, without ever validating the assumptions of the finite-sample Wald test (i.e. i.i.d. Normals)... the test is robust, but not to everything, and not in very small samples, and saying "it works for infinity data points" isn't a particularly satisfying justification when the best you can say about your own dataset is "it contains somewhere between zero and infinity points"
People rarely properly control for the other signals. Ideally, you'd throw everything into one giant regression model and fit some regularized coefficients, but of course this isn't practical (we have thousands of possible signals!). Still, if your live production model has 100 signals and you're considering adding the 101st, you should analyze it in the context of the ones you already have... otherwise, you miss things like cross-signal correlations. But of course isolating the statistical significance of the 101st coefficient would be hard if you also included the first 100... which is just a giant scientific mess. Building on the first point, if the "right" way to do something (in this case, controlling for your other signals) is not very useful (in this case, because it's messy and might not flag a signal as individually significant even if it actually IS good and additive at the margin), you should probably step back and reconsider your approach.
They're misleading. First of all, they are full-sample statistics, so they have look-ahead bias. Second of all, they're basically Sharpes (empirical mean over empirical vol), but we tend to be much less suspicious of them than of Sharpes (which everyone by second nature heavily discounts because they're probably overfitted).
At the end of the day, it's an indirect way to answer the simple question, "will this make me money out of sample?". Something like train/test splits would be better, and running a backtest with expanding/rolling regressions (no look-ahead bias!) then reporting the (implementable, realized) Sharpe would be better still. And for extra credit, you could not only run a rolling regression and form backtest views based solely on contemporaneously-available information, but also split the investable universe in half (a total of N choose N/2 ways, like in 2-fold cross-validation) and form views based on information from one half but then actually "invest" in the other half to see how well the sign and relative strength of relationships hold up over time and across geographies.

on Types of T-Costs:

Broadly, there are 3 major types of t-cost: (1) a fixed market access fee / liquidity-provision premium, (2) bid-ask spread width (i.e. the fact that you don't get to transact at midmarket against a dealer so that e.g. even if you can predict a security's midmarket price in the next second, you might not be able to trade profitably on that information), (3) market impact (i.e. the fact that each marginal unit you transact can actually move the midmarket itself a little bit against you). Sometimes people attribute the last thing to the fact that the dealer has to "lay off" the inventory you offload onto him, and the more you offload the riskier it is for him to find someone to take it off his hands quickly. I have to think about that interpretation.

on Convex T-Cost Models:

Consider $TC(\Delta x) := \lambda \cdot |\Delta x|^\ell$, where $\Delta x$ is reported as a percentage of AUM. Under a convex t-cost model ($\ell > 1$, as opposed to a linear t-cost model $\ell = 1$ or concave $0 < \ell < 1$ or constant $\ell = 0$), an increase in AUM should lead to an increase in t-cost level $\lambda$, since the t-costs and trade sizes are represented as percentages/fractions of the portfolio's total AUM, but we usually model t-costs as based on trade sizes in dollars. For example, under a quadratic ($\ell=2$) t-cost model, if AUM increases from $A$ to $A'$, t-cost level should increase from $\lambda$ to $\lambda' := \lambda\cdot(A'/A)$. In general, for convexity coefficient $\ell$, the t-cost level $\lambda$ should be multiplied by $(A'/A)^{\ell-1}$. To see this, consider the ``explicit'' model $TC^\$(\Delta x^\$) := \lambda^\$(\Delta x^\$)^\ell$, where the superscript-\$'s indicate that the values are reported in dollar-space (as opposed to percentage-return-space). Our ``implicit'' model is then $TC(\Delta x) := \frac{TC^\$(\Delta x \cdot A)}{A} = \frac{\lambda^\$(\Delta x \cdot A)^\ell}{A}$, so that implicitly $\lambda := \lambda^\$ A^{\ell-1}$.

On Utility and Returns:

Suppose that today's USD-per-local FX spot rate is X0 = $1.00/L1, and that you personally believe next year's spot is equally probable to be either $1.08/L1 or $0.93/L1. For simplicity, let risk-free interest rates be zero everywhere. Then, if you want to maximize your expected wealth (or log-wealth) in USD, you should hold local: the expected value of local next year is (roughly) $1.005, i.e. arithmetic ER is $0.005, geometric ER is 50bps, logarithmic ER is 0.002 (note that whereas in each individual state the log return is approximately equal to the geometric return, on average this isn't true). If you want to maximize your expected wealth in local, you should hold USD: the expected value of USD next year is (roughly) L1.0006, i.e. arithmetic ER is L0.0006, geometric ER is 6bps, logarithmic ER is -0.002... hold on, this is consistent with the previous scenario's figure which is good, but in this scenario maximizing expected log-wealth isn't the same as maximizing expected absolute/"raw" wealth (although of course maximing log of expected wealth is the same) which is annoying! And now that we mention it, holding local has positive geometric ER in USD while simultaneously doing exactly the opposite has positive geometric ER in local, which is also weird! Does this mean that I can guarantee unbounded growth of my wealth just by converting USD and local back and forth over and over and over? Do risk-neutral/no-arb measures help... and if so w.r.t. what numeraire?

on Portfolio Optimization:

This lesson is one point and a question. First of all, I've heard people suggest that maximizing mean-variance utility isn't "rational" because it doesn't obey the von Neumann-Morgenstern axioms. However, I think any objection can be resolved by thinking instead of maximizing Sharpe (or the Information ratio... I can never tell which one is right, although I believe that Sortino-based decisions reduce to Sharpe/IR-based decisions if any strategy can be arbitrarily levered up to hit your desired ER), and then levering up to hit your desired ER (or down to hit your maximum tolerable vol). Second, the question: what's the difference between Kelly betting and MVO/Markowitz betting? If I'm skimming their definitions correctly, the Kelly model (beside maximizing a different utility function, which does not really seem like a spiritually meaningful difference) is more useful if you think the stochastic process you're betting on is stationary and well-described by your past experience, whereas MVO is more useful if ERs/correlations are time-varying and you think you can accurately predict them in advance. I'm surprised some persnickety PhD somewhere hasn't already delved into this and written up a blog post about it.

on The Power of Diversification Benefit:

Diversification benefit is like compounding: it is almost unintuitively powerful. For example, consider two assets, each at standard 10% vol, with a -0.50 correlation (moderately anticorrelated). Asset A has 10% ER (1.0 Sharpe), B has -4% ER (-0.4 Sharpe). Obviously, the optimal portfolio is very long A and slightly short B, but in reality this is not true. In fact, the optimal portfolio is long both A and B, at a ratio of about 89% A and 11% B (then just lever up or down to hit your ER target or vol cap). Because B is anticorrelated to A, even though B has negative ER, it "smooths out" or "dampens" A's vol enough to overcome this: that is, on average B loses money, but it tends to do this when A is winning, and on the other hand it tends to switch up and WIN when A is losing---right when you need it most. Said another way, taking money out of A and putting it into B makes your portfolio less "risky", so that you can safely make up for the lost ER by levering up.

Confirm this in a simple case: you have 2 standardized-vol `rho`-correlated assets, the first with ex-ante Sharpe S and the second with Sharpe 0. The MV optimal combination (confirm this: \Omega^{-1}\mu, where \Omega is the correlation matrix and \mu is Sharpe vector, the output being risk weights) is (proportional to) S worth of the first and -\rho*S---not 0, unless either \rho or S is 0---worth of the second. This has some nice intuitive result: if the second asset is just white noise (\rho = 0), then you should ignore it, but if it's positively correlated, you should short it, because it's an "uncompensated" risk factor: all risk, no reward, and therefore by hedging out (AKA "orthogonalizing") your exposure to it (where did this exposure come from? well, from the fact that you hold some of the first asset, which is positively correlated with it!) you can "distill" down the risk you take in the first asset to just the pure compensated (i.e. positive ER) part.

I think it's the case that a PNL stream is additive (Sharpe-improving) to another if it has alpha over the other. In this case, B's beta to A is -0.5, so its alpha must be 1%. (No t-stats... we're not estimating anything, just doing some probability math!) Which actually brings me to a sort of weird example of regression to the mean... suppose you have 2 assets, each at standard 10% vol, with a +0.50 correlation, each with 10% ER (1.0 Sharpe). To me, the natural "regression line" to draw (if I just sketch the joint PDF, which you can visualize as a scatterplot or "cloud" of random points) is just "y = x" (or "x = y"), which has 1.00 slope and 0% intercept and is symmetric. But in reality, the line will have 0.50 slope and 5% intercept.

P.S. One way to measure this benefit is by reporting the "effective number" of uncorrelated bets. To motivate this, suppose you have $N$ uncorrelated assets with vol $\sigma$ and ER $\mu$. Your "average" asset Sharpe is $A = \mu / \sigma$, but your "aggregate" Sharpe (for the MVO-optimal equal-weight portfolio) will have Sharpe $A*D = A * \sqrt{N} = \mu / (\sigma / \sqrt{N})$, where $D = \sqrt{N}$ is the "diversification factor" and $D^2 = N$ is the "number of uncorrelated bets". Now, suppose you have a set of $N$ possibly correlated assets with vols $\{ \sigma_n \}$ (and let $V = \sum_{n=1}^N \sigma_n$) and Sharpes $\{ S_n \}$ (it's more convenient to think in terms of Sharpe than ER here). Define a vol-weighted "average" asset Sharpe to be $A = \sum_{n=1}^N (\sigma_n / V) S_n$. Consider again an equal-weight portfolio (for simplicity, just consider holding each asset at unit leverage, so that its portfolio weight is unity), the vol will be some $\sigma_P$, and the ER will be $\sum_{n=1}^N \mu_n = \sum_{n=1}^N \sigma_n S_n$. Your portfolio Sharpe will be $ER_P / \sigma_P = \sum_n \sigma_n S_n / \sigma_P = A * V/\sigma_P$, so that now $D = V/\sigma_P$. This was just a motivating example calculation. There are other ways to define the "average" Sharpe too (for example, equal-weighting them), but the idea is similar: you follow the exercise through, and determine what your $D$ is.

One aside to the above discussion about "effective number of uncorrelated assets": I've often heard people argue that "estimating means (ER's) is noisy" so we should just shrink all the Sharpe estimates to our prior (e.g. unity) and focus on the Cov. For example, we could do PCA on the correlation matrix, and report the total number of PC's needed to explain 95% of the total variance, as an "effective number of uncorrelated bets". But this is exactly wrong if you're MV-optimizing! For under MVO, the PC's that explain the most variance are exactly the PC's that will tend to get downweighted, because MVO takes vol-weighted bets on PC's proportional to Sharpe, and Sharpe is inversely proportional to vol (i.e. sqrt variance). That's the whole idea of spread bets: there's only a tiny ER difference between two highly correlated assets, but because the volatility of the "spread" PC is so dramatically smaller than the volatility of the "common" PC, you don't need a ton of ER to get a good Sharpe.

on MVO:

A useful way I've learned to think about MVO is as decomposing return streams into their uncorrelated principal components, and then making vol-weighted bets on each proportional to its Sharpe (of course, the remaining degree of freedom is simply how much you want to lever up/down). Notice that making a vol-weighted bet proportional to Sharpe is the same as making a notional bet proportional to the ratio of ER and variance.

on Some (Not-So-)Simple Regression/Beta Calculations:

$R: TxN$ observed returns over time (assume ground-truth underlying mean is zero always); $W: TxN$ benchmark/market-cap weights over time (hypothetically, if given $W[0,:]$ the market-cap weights at time $0$ then you could use $R[:T-1,:]$ to construct $W[1:,:]$ the market-cap weights at all subsequent times). $z: Tx1 = W*R$, where I'm using "$*$" to represent a row-by-row matrix dot product, is the benchmark/market returns over time.

Recall that for $r ~ z$ (no ones/constant/intercept, assume that ground-truth underlying means of both $r, z$ are zero), $\hat{b}_{OLS} = \hat{corr}(r, z) / \hat{std}(z) * \hat{std}(r) = \hat{cov}(r, z) / \hat{var}(z)$ (don't use stupid estimates for the hats, in particular understand why your MLE changes once you know that the mean is zero). For $R$ multi-dimensional, this can be more succinctly written as $\hat{B}: 1xN = inv(Z'Z)Z'R$. Of course, my point is that $\hat{\beta}_n = \hat{B}[0,n]$ is the estimated beta (to the benchmark/market) of asset $n$. But, I've been given to understand that this well-known financial calculation is a special case of a more general statistical truth (I might be getting some of the dimensions/transposes/operations wrong here): if you know tomorrow's returns $r: Nx1$ will be drawn from some distribution with mean zero and variance-covariance matrix $\Sigma: NxN$, and you have some benchmark weights $w: Nx1$, and you'd like to know the beta (to the benchmark) of your view $v: Nx1$, then you can calculate $(v∘r)'(w∘r) / (w∘r)'(w∘r) = \beta = cov(v'r, w'r) / var(w'r) = v' \Sigma w / w' \Sigma w$ where you could have estimated/approximated the $\hat{\Sigma}$ as $R'R/T$, getting $v'R'Rw / w'R'Rw = (v'R')(Rw) / (w'R')(Rw)$, which sort of looks like that expression with the Hadamard products. Probably one of you can clarify this for me. The point, apparently, is that you can use whatever you want for the $\Sigma$, making this a much more flexible and robust calculation. At the end of the day, you estimate some beta vector b and find the estimated beta of an arbitrary personal view as u'b.

I've now confused myself so we'll move on. If you take your beta = corr / sigma(MKT) * sigma(PNL) and hit it with sigma(MKT) to get corr * sigma(PNL), your "beta*vol", you basically get the effect on your personal portfolio of a 1-STD market move.

And if you have a beta vector b, and you stipulate beta-neutrality, then "acceptable" views would satisfy v'b=0, i.e. be orthogonal to b. In e.g. 2-space, the acceptable subspace would be a line i.e. a 1-plane, in 3-space this would be a 2-plane, etc... so you're constraining your view to lie along/on/within a (N-1)-hyperplane. (A surprising but intuitive example of how this might work is if one of your assets was SPX and the others were totally uncorrelated market-neutral styles like VAL or MOM... then, you'd never be able to take a nonzero position in SPX because there's no other asset that could possibly cancel that beta out... but that's it, that's the degree of freedom/dimension you lose.) One intuitive way to accomplish this is to take your uncon view u and transform by projecting onto the con subspace, i.e. hitting u with the relevant orthogonal projection matrix P. Another way, using a "hedge", is to simply short the appropriate amount of market, that is, take the cap-weight vector w=W[T-1,:] and solve for the minimum "distortion/delta" d (a scalar) such that (u-dw)'b=0. Yet another (if your original portfolio is dollar-neutral) is to simply divide each desired position by the beta of the corresponding asset.

A different more sophisticated (in some cases) way is, given an uncon view $u$, solve for an opt-ed view $v$ that minimizes tracking error to $u$ (TE is $(v-u)' \Sigma (v-u)$, where $\Sigma$ is the asset returns cov) subject to the constraint that $v'b = 0$. Apparently, the solution $v$ is actually something like the vector of residuals from a GLS regression fit of $u$ (edit: NOT the MVO-implied ER's $\Sigma u$) on $b$ using the cov $\Sigma$, i.e. $v = u - (b'\Sigma^{-1} b)^{-1} b'\Sigma^{-1} u = (I - (b'\Sigma^{-1} b)^{-1} b'\Sigma^{-1}) u$, so that this is still a linear transformation represented by a matrix multiplication. This does make sense: think about the simple case of OLS (a simple case of GLS with a diagonal homoscedastic cov), the residuals are by construction orthogonal to the regressor vector, i.e. dot product is zero. It's easy (is it? if it's true, then it's easy.. but it might not be true) to confirm that this still holds in the GLS case where the regressor b is a 1xN vector of asset-level betas, as we can write the residual as "y - other_stuff", and confirm that b(other_stuff) is the same as by.

In fact you can easily imagine wanting to neutralize/orthogonalize your positions to any arbitrary factor (e.g. size/SMB), where a 'factor' is really just a portfolio/vector of asset weights. Assuming each additional factor is itself linearly independent from existing factors, you can keep doing this for up to $N-1$ factors (if you added an $N$th linearly independent factor---after which, notice, it would be impossible to add another---then the only orthogonal portfolio would be empty/zero).

By the way, say you don't require beta-neutrality, but you have a beta limit in your portfolio. One way to attempt to enforce it is to calculate the beta of your desired portfolio, then just short just enough of the market to get you under the limit (the beta-hedge technique i mentioned above). This is in general not as good as directly reducing your position. Why? Well, it's an imperfect hedge (said another way: it forces you to place a potentially undesirable 'basis bet' or 'basis trade' one way or the other, between the two assets). For example, say the beta in your portfolio is coming from crude oil. You want to hold 2 contracts, each contract has a beta of 1, and your limit is 1 (i used stupid units for simplicity). Well, you just short 1 contract of SPX which has by definition beta 1 (yeah yeah same dollar value per contract whatever dude). BUT you had to estimate the beta of crude ex-ante, and you underestimated... in reality, the ground-truth beta of oil in the next market session is going to be 2/contract. Now, you're sitting on a beta-3 portfolio! If instead you'd cut your position to 1 contract of oil, you'd still be over the limit, but only beta-2. Then again, one or the other might be Sharpe-optimal... depending on your objective, you might be better off one way or another. I should explore this.

on The Orthogonality of the Risk-Model noise portfolio (i.e. GLS residuals)

transfer_coefficient = correl = w'\Sigma v / (\sigma_w \sigma_v) \beta = correl * \sigma_w / \sigma_v = Cov(w, v) / Var(v) noise_pflio = resid = w - \beta v = w - Cov(w,v)/Var(v) v resid' \Sigma v = w'\Sigma v - Cov(w,v)/Var(v)v'\Sigma v = Cov(w,v) - Cov(w,v)/Var(v)*Var(v) = Cov(w,v) - Cov(w,v) = 0, QED. And if the final view is on the mean-variance frontier, then this implies that the noise portfolio's ER is zero. If it's ER wasn't zero, you could construct a higher-Sharpe view by blending your "final" view with the noise portfolio (taking risk in each proportional to its standalone Sharpe), contradicting the supposition that the final view was itself on the mean-variance frontier. So, the noise pflio is orthogonal to the final view. BUT, it is NOT in general orthogonal to "innovations" (diffs) in the view. In fact, it usually loads negatively on innovations, since t-cost minimization will tend to stick as close to the held positions as possible (since holding on to already-held positions requires no trading hence pays no cost). If innovations in the view are especially informative e.g. perhaps even more informative than the final view itself, the noise portfolio can end up having a bad negative ER. Said another way: The final view is usually positively correlated with innovations in the view (e.g. if innovations are stationary i.i.d. since you can decompose the final view into yesterday's view plus today's innovation, which is the sum of two uncorrelated random variables, one of which is the innovation) but not perfectly so. Equivalently, we can say that today's innovation is positively correlated with today's final view, but not perfectly so. This implies that there is a component of today's innovations that is orthogonal to today's view. Suppose that component predicts returns. Because it is orthogonal to the uncon view, which the optimizer is assuming is on the mean-variance frontier, the optimizer doesn't know that it predicts returns. Therefore, it sees it simply as uncompensated risk, random noise. It will not be any more averse to making the opted portfolio load on it than it would be to making the opted portfolio load on random perturbations. However, whereas it has no incentive to load on random perturbations, it DOES have an incentive to load (negatively) on innovations in the view (conditional on its loading on today's uncon view): t-costs. By making the opted portfolio load negatively on innovations in the view, (conditional on its loading on today's view) it can make the opted portfolio closer to yesterday's opted portfolio, thereby making the trade smaller, hence cheaper. This negative residual loading on the innovations is---assuming that the entire innovation is uniformly predictive of returns---what would give the noise pflio a negative ER in this case. The solution, of course, is to account for this in the uncon view itself, perhaps by constructing it as the exponentially-weighted moving average of innovations instead of the expanding sum of the same. But one perspective is that the part of the innovation not accepted by the optimizer, is the part that is "unimplementable" because of t-costs or constraints. So you could add it into the uncon view, but this would just make the uncon view doubly unimplementable, forcing the optimizer to doubly penalize you, ending up in the same place as before, just with a better uncon view and an even-worse noise portfolio. I don't really agree with this, I say the optimizer is literally constructed to weigh these tradeoffs optimally. But I can concede that in some cases "nudging" the opt by feeding it a "constrained" uncon e.g. liquidity-tiering etc, might help regularize undesirable instability or oversensitivity i.e. "overfitting" to random noise (taking a massively-levered spread bet to satisfy constraints, let's suppose).

on Additive Sharpe:

Suppose you have two strategies, with return streams r_t, s_t. Suppose you combine them in the mean-variance optimal way. Then, if you run the regression $s_t = \alpha + \beta r_t + u_t$, and let $\sigma$ be the standard deviation of the $u_t$, then define the IR of s to r as \alpha / \sigma, you'll apparently have the following: SR_combined^2 = SR_r^2 + IR^2. Also apparently, if you have `n` 1-Sharpe strategies where each is `rho`-correlated with the others (pairwise), the max Sharpe you can get (as `n` goes to infinity) is sqrt(1/rho). So, for example, the max Sharpe you can get with an infinite number of 1-Sharpe 0.9-correlated strategies is 1.05. Also, something like 5% difference in returns for 2 0.9-correlated 10% vol strategies is counterintuitively only a 1-STD event.

on Scoring Investment Convexity:

In finance, we like scores like Sharpe that are scale-invariant, i.e. don't depend on leverage. Such a score for convexity (`y`-vs-`x`): Split the data sample into tritiles of `x`; Compute the equal-weighted mean of the LHS and RHS tritile; Subtract the mean of the middle tritile; Divide by the equal-weighted standard deviation of the data points against their within-tritile mean.

on Taking Risk Proportional to Sharpe:

I'll add another point. Making huge outsize bets when you think you're in an unattractive/low-Sharpe climate sounds pretty silly, but it's sort of an unavoidable maneuver if you're Cliff Asness's prototypical "5% Solution" investor. One way I've found to reconcile the facts that on the one hand (1) it makes sense to take more risk when Sharpe is high and less when Sharpe is low, because the pain should be offset by some promise of gain (that's the whole tradeoff, after all), and on the other hand (2) at the end of the day, you can't eat a Sharpe ratio and you have to feed yourself somehow, so nobody really cares how attractive/unattractive a strategy is: Suppose that rather than naively targeting some static level of return at every timestep, you want to somehow hedge your losses. A reasonable tactic is to say, e.g. "I never want my estimated/modeled -1STD daily portfolio return/PNL to be worse than -1%". Then, if today Strategy A's ex-ante (daily) Sharpe is 0.5, you should tolerate taking 2% ex-ante (daily) vol (accomplish this by backing out the required amount of leverage).

Aside: Very smart, experienced people will frequently abuse the term "expected Sharpe". This is a stupid term, especially in light of the much better term "ex-ante Sharpe".
     First of all, the formula "expected return" over "expected vol" does NOT in general give you expected "return over vol". For analogy, take two independent standard Uniforms. The ratio of their expectations is obviously unity. However, their ratio distribution/PDF is 1/2 for z from 0 to 1, 1/(2z^2) for z greater than 1, and zero everywhere else. (Don't let it bother you that densities at "symmetrical points" aren't the same, e.g. f(z = 1/2) doesn't equal f(z = 1/(1/2) = 2). Think instead about "symmetrical subsets". For example, you can verify that the probability that the ratio falls between 0 and 1/2 is the same as the probability that the ratio falls between 2 and infinity.) And the mean of this guy is clearly greater than unity. (The distribution is severely right-skewed hence mean is greater than median, and its median is unity.)
     Second of all, you don't even know the expected return or vol! Expectations (i.e. "means") are ground truths. Unless you're god, you are going to have to estimate or model these things.
     There is one saving grace: If you think about ER and vol as estimands (for example, fixed ground-truth parameters), then the ratio of their MLEs will indeed give you the MLE of their ratio. (MLE is parametrization-equivariant.) However, this isn't necessarily true under other popular and reasonable frameworks like minimum-MSE. And in any case, you should then say "estimated" (not "expected") Sharpe.
     Similar but less egregious is the abuse of the word "percent" when you really mean "percentage points". E.g. the difference between 1% and 5% isn't 4 "percent", it's 4 "percentage points". This one's harder though, and it's not always clear to me whether I should be using "percentage points" instead of "percent". I usually write '%' so there's plausible deniability... sure I wrote '%', but you don't know how I meant to pronounce it!

Returning to the point, your modeled mean PNL will be 1%, hence your modeled -1STD PNL will be 1% - 2% = -1%, as desired. On the other hand, if tomorrow Strategy A's ex-ante Sharpe is 0.333, you should tolerate taking only 1.5% vol: your modeled mean PNL will be 0.5%, hence your modeled -1STD PNL will be 0.5% - 1.5% = -1%, again as desired. If A's ex-ante Sharpe is 0.25, you should tolerate only 1.333% vol: your modeled -1STD PNL will be 0.333% - 1.333% = -1%. If ex-ante Sharpe is 0.001, your tolerable vol will be 1.001%. As ex-ante Sharpe goes to zero (from above), your tolerable vol goes to unity (1%). (As ex-ante Sharpe goes to -infinity, your tolerable vol goes to zero.) Going the other way: if ex-ante Sharpe is 0.666, your tolerable vol will be 3%... if ex-ante Sharpe is 0.75, your tolerable vol will be 4%... if ex-ante Sharpe is 0.999, your tolerable vol will be 1,000%... as ex-ante Sharpe goes to unity (from below), your tolerable vol goes to infinity. (Consider: exactly at ex-ante Sharpe unity, no matter how much vol you take, your modeled -1STD PNL will be exactly zero, which is obviously greater than your tolerable level of -1%.)

on Who is Long or Short What:

(Inspired by Cliff Asness's 2014 "Top 10 Peeves -- No 6" + Antti Ilmanen's 2016 "Who Is on the Other Side?".) In aggregate, investors must have net exposure that is exactly long the entire market cap. For example, suppose the entire investable universe comprises a single share of a single stock, and you short it. Just before you borrow the share, the lender is long that one share and you are flat (but presumably hold some cash). Just after you borrow it, the lender is long that one share (because it's still his... plus some rent/cash from you), you are short that one share (because you owe it back) and also long that one share (because you're currently holding on to it) (and again also holding your cash, less the rent you paid). And after you manage to sell it, the lender is long one (and the rent), you are short one (and hold some cash), and your buyer is long one. At every step, in aggregate, investors were net long one share. (And obviously, once you close the short, everything's back to the start as far as exposure to that stock is concerned.)

This means that, in real life, if you deviate even a little from market-cap weighting, even if that deviation is just to overweight Stock A and underweight B, you are throwing the aggregate exposure out of whack and somebody else must be taking the opposite bet to compensate. Really, what you're holding in this case is a market-cap-weighted portfolio (levered as you please... notice that "financial" leverage reduces the number of shares the rest of the equilibrated system can be net long by increasing the number of shares you are net long; but neither short-based nor derivatives-based leverage does so, in the first case because your negative position is exactly balanced by your buyer's positive position, and in the second case because no physical shares actually change hands at all until settlement) PLUS a custom portfolio long A and short B, and somebody else (or some combination of somebody-elses) is holding an exactly opposite custom portfolio short A and long B. (One way to decompose your portfolios, distilling out your "active" bets, is to simply regress your vector of portfolio weights on the market, using OLS without an intercept... the coefficient is your loading/beta on the market, and the residuals represent your orthogonal positions).

The upshot is that, for every active manager who wins on a bet, there is an equal but opposite active manager losing on the other side of that very same bet. Cf William F Sharpe's 1991 "Arithmetic of Active Management" (he implicitly assumes long-only portfolios and asserts that "before costs, the return on the average actively managed dollar will equal the return on the average passively managed dollar"---also note that it's redundant to specify "average" passively managed dollar, by construction there's no dispersion here---but the real point is that the return of the average 'active overlay'---that is, the average custom long/short portfolio---will equal zero), and Lasse H Pedersen's 2018 "Sharpening the Arithmetic of Active Management". So every time you make a trade, you should keep in mind "who is on the other side". Either, you are capitalizing on your risk tolerance (willingness to buy riskier assets or finance/lever less-risky ones in exchange for higher expected returns... traditional asset class premia/betas); or, you are capitalizing on your information edge (seeking out behavioral biases or mispricings... alternative premia/alphas); or, you are capitalizing on your leverage/liquidity edge (charging a liquidity-provision premium to leverage/liquidity-constrained investors... market-making). To paraphrase Antti Ilmanen, VAL + MOM + QUALITY are three commonly-followed long-term factors, which together look for good deals on outperforming companies with strong fundamentals... so you've got to convince yourself that there's still someone out there who, for some discretionary or structural reason, is and will continue to be perpetually selling you those shares so they can allocate their own capital to bad deals on underperforming companies with weak fundamentals.

on Some interesting grammatical parallels between Latin and English:

In Latin, the genitive commonly relates possession and hence can be translated with either a possessive noun or "of" followed by a 'normal' noun. For example, "California's capital city" / "the capital city of California". But consider the genitive of part, which shows a "part of the whole" relationship. Wow! Even in this instance, we use "of" in English. Notice this means that translating back to Latin could make it hard to distinguish between "I can give you but my little all" (all I possess to give you is this little bit) and "I give you all of me" (the entirety of my being belongs to you).

Similarly, we can use the Ablative of Means: "seco gladio meo", "I slice with my sword". We can also use the Ablative Absolute in apposition: "illo praefecto, possumus", "with him in charge, we can do it". Notice that in both instances, the ablative noun translates to English with an implied "with" (no pun intended!).

Finally, my favorite: The gerundive (future passive participle) in the Passive Periphrastic. Famously, we have "QED", "quod erat demonstrandum". In English, we say "that which was to be shown (proved / demonstrated)". In both Latin and English, the past-tense copula plus future passive participle denote not necessarily something that definitely WAS going to be shown but rather something that SHOULD or OUGHT TO or DESERVED to be shown. When we start a proof, we state a claim. This claim needs to be proven, but we might not be able to do it. So, when we complete the proof, we say "We have shown X, QED", meaning "We have shown X, the very thing that we needed to show"!