n1c0.net

Author Nicola F.

PREDICTING PORTFOLIO PERFORMANCE

What 101,767 random portfolios taught me — and why optimizing your portfolio on past data is mostly a self‑flattering exercise.

Not gonna lie, I've tried to automate everything in my life, and with AI I was trying to replace all the writing I do to save time. So I fed my whole thesis and the “about page” from my website to Claude and asked it to write this page for me. The result? Pure crap. Call it a skill issue with prompting, but I feel like AGI is still far away from understanding my style. One could call it inefficiency, one could call it preservation of art.

So here goes nothing: me explaining my master's thesis the old-fashioned way, pressing every single letter you will read here. You will find what I have done and what I've found: what was nice, what was less nice, but most of all what was interesting and what could be done next.

That's a lot of What, so let me back up and start from a Why.


Why.

Theory and practice are rarely the same. They are pretty much always different, sometimes more and sometimes less.

Same in finance. Beautiful concepts: maximize this and you will get this, minimize that and you will get that. Assume a normal distribution and expect a normal distribution. Is reality like that? Almost never. Build your beautiful efficient frontier from the assets available to your portfolio and you will see that the portfolios sitting on it are made of the best assets of the recent period. But in most cases what comes next doesn't depend on what happened before. So the key point is that many metrics are good indicators of past performance but give little or no information about future results.

This is the question behind the thesis: which metrics can actually tell you something about the future?

There are two ways to analyze this. You can study the stickiness of metrics in mutual funds, but that introduces biases (loss aversion, funds that went bankrupt, survivorship), or you can generate random portfolios. I went for the latter.

Random portfolios don't lie. They don't have a marketing department. They don't get shut down when they perform badly. You don't selectively publish the ones that looked good. You just generate them, compute the metrics, and see what happens.

Can ex‑ante portfolio characteristics actually predict ex‑post performance?

What have I done.

To start my analysis I had to get data. Unfortunately I cannot afford a Bloomberg license, yet. Even though we have licensed terminals at the university, the amount of data you can pull per day is limited. And my ambition was far too humble for that: I just wanted the full daily prices of all the stocks on earth. I never did the math on how many days it would have taken to get all that data.

After paying for a few cheap services I found out that the quality of their datasets was quite low, so I switched my approach to: let's just scrape everything directly from the web, from Yahoo Finance to Stock Analysis. In the end, scraping is a skill I really enjoy, and something that AI is still not good at (along with matching my style).

So in a few days I built a full stock scraper, which then took a few more days to scrape every daily stock price available on the internet, plus other information such as sector, industry, currency, market cap and other details. In the end I had a database that took up half the disk space on my Mac. Of course the database needed some cleaning: not all of the data is actually useful.

The funnel: 77,204 raw → 41,615 after ISIN dedup → 36,546 with 4+ years of history → 32,838 after density → 32,287 after liquidity.
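The funnel is easier to read as retention rates. A small sketch that uses only the counts above (the stage labels are shorthand, not names from the thesis code): the ISIN dedup is by far the biggest cut, and roughly 42% of the raw scrape survives to the final universe.

```python
# Stage counts copied from the funnel above; labels are shorthand.
funnel = [
    ("raw scrape", 77_204),
    ("ISIN dedup", 41_615),
    ("4+ years of history", 36_546),
    ("density filter", 32_838),
    ("liquidity filter", 32_287),
]

def retention(stages):
    """Share of securities kept at each step, relative to the previous stage."""
    return [
        (name, count / previous)
        for (_, previous), (name, count) in zip(stages, stages[1:])
    ]

step_retention = retention(funnel)
overall = funnel[-1][1] / funnel[0][1]

for name, share in step_retention:
    print(f"{name:<22}{share:6.1%}")
print(f"{'overall':<22}{overall:6.1%}")
```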

32,287 securities · 78 countries · 143.4M daily price records · April 1964 → November 2025

Data funnel from raw scrape to final universe of 32,287 securities
Figure :: data funnel

Then I generated the portfolios. 101,767 of them. Random size between 1 and 500 stocks, random weights, random formation dates spread across 1969 to 2023. For each one I computed around 50 metrics on the 12 months before the formation date — the ex‑ante, what an investor would have seen — and the same 50 metrics on the 12 months after — the ex‑post, what actually happened. Then I checked how much one predicts the other.
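A minimal sketch of that setup, not the thesis code. It assumes weights are drawn uniformly and then normalized to sum to 1, and that formation dates fall on the first of a month; neither detail is stated above. The names `random_portfolio`, `add_months`, `windows` and the `STOCK_i` tickers are hypothetical.

```python
import random
from datetime import date

def random_portfolio(universe, rng, min_size=1, max_size=500):
    """Draw a random-size, random-weight, long-only portfolio from `universe`."""
    size = rng.randint(min_size, min(max_size, len(universe)))
    tickers = rng.sample(universe, size)
    # 1 - random() lies in (0, 1], so no weight is exactly zero (assumed scheme).
    raw = [1.0 - rng.random() for _ in tickers]
    total = sum(raw)
    return {t: w / total for t, w in zip(tickers, raw)}

def add_months(d, months):
    """Shift a first-of-month date by a whole number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)

def windows(formation, months=12):
    """Return the (ex-ante, ex-post) date ranges around a formation date."""
    ex_ante = (add_months(formation, -months), formation)
    ex_post = (formation, add_months(formation, months))
    return ex_ante, ex_post

rng = random.Random(42)
universe = [f"STOCK_{i}" for i in range(1000)]  # hypothetical tickers
weights = random_portfolio(universe, rng)
formation = date(rng.randint(1969, 2023), rng.randint(1, 12), 1)
ex_ante, ex_post = windows(formation)
```

The same metric function would then be run twice per portfolio, once on the `ex_ante` window and once on the `ex_post` window, so that the two sides are computed identically and only the dates differ.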

Distribution of portfolio sizes
Figure :: portfolio size distribution
Distribution of formation dates
Figure :: formation date distribution

What I found.

The short version: risk is predictable, returns are not. The slightly longer version is in the table below.

Metric                  R²      Correlation
Beta                    0.666   +0.82
Diversification Ratio   0.383   +0.62
VaR 95%                 0.185   +0.43
Avg. Correlation        0.157   +0.40
Volatility (annual)     0.098   +0.31
Sharpe ratio            0.024   −0.16
Alpha                   0.003   −0.06

Beta has an R² of 0.67. Across 54 years of data, across 78 countries, across bull markets and crashes. If a portfolio is sensitive to the market today, it will be sensitive to the market tomorrow. That's useful.

Sharpe ratio has an R² of 0.02. And the correlation is negative. Past Sharpe doesn't fail to predict future Sharpe — it anti‑predicts it. The portfolios that looked the best last year systematically delivered the worst the year after.
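The R² and correlation figures are two views of the same number: when one ex-ante metric predicts its own ex-post value with a straight line, R² is just the correlation squared (0.82² ≈ 0.67 for beta, (−0.16)² ≈ 0.02 for Sharpe). A minimal sketch of that calculation, with made-up beta values standing in for real portfolios and hypothetical function names:

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def persistence(ex_ante, ex_post):
    """Correlation and R² of a metric with its own future value."""
    r = pearson(ex_ante, ex_post)
    return {"correlation": r, "r2": r * r}

# Made-up betas for five portfolios, before and after formation.
ex_ante_beta = [0.6, 0.8, 1.0, 1.2, 1.4]
ex_post_beta = [0.7, 0.8, 1.1, 1.1, 1.3]
result = persistence(ex_ante_beta, ex_post_beta)
```

Because R² throws away the sign, it has to be read together with the correlation: a small R² with a negative correlation means weak mean reversion, not weak persistence.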

Sharpe mean‑reversion.

If you ranked every portfolio by how good it looked, and then bet on the top ones, you would have been systematically wrong.

Quintile (by ex-ante Sharpe)   Ex-ante   Ex-post   Δ
Q1 (Low)                       −0.93     +1.48     +2.41
Q2                             +0.10     +1.05     +0.95
Q3                             +0.96     +1.06     +0.10
Q4                             +1.73     +0.96     −0.77
Q5 (High)                      +3.15     +0.79     −2.36

The bottom quintile gained 2.41 Sharpe points on average. The top quintile lost 2.36. The hot fund of last year is the cold fund of this year, almost mechanically.
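The quintile sort itself is simple to sketch. The version below is an illustration, not the thesis code: it sorts by the ex-ante value, cuts the list into five equal buckets and averages both sides. The demo data is simulated under a deliberately extreme assumption (every portfolio has the same true Sharpe and each year only adds independent noise), which is enough on its own to produce the same low-goes-up, high-goes-down pattern.

```python
import random

def quintile_table(ex_ante, ex_post):
    """Sort by ex-ante value, split into 5 buckets, average ex-ante and ex-post."""
    pairs = sorted(zip(ex_ante, ex_post))
    n = len(pairs)
    rows = []
    for q in range(5):
        bucket = pairs[q * n // 5:(q + 1) * n // 5]
        before = sum(a for a, _ in bucket) / len(bucket)
        after = sum(b for _, b in bucket) / len(bucket)
        rows.append((f"Q{q + 1}", before, after, after - before))
    return rows

# Hypothetical parameters: identical skill everywhere, luck redrawn every year.
TRUE_SHARPE, NOISE = 1.0, 1.5
rng = random.Random(1)
ex_ante = [TRUE_SHARPE + rng.gauss(0, NOISE) for _ in range(5000)]
ex_post = [TRUE_SHARPE + rng.gauss(0, NOISE) for _ in range(5000)]

rows = quintile_table(ex_ante, ex_post)
for label, before, after, delta in rows:
    print(f"{label}  ex-ante {before:+.2f}  ex-post {after:+.2f}  delta {delta:+.2f}")
```

If luck dominates the ranking variable, sorting on it mostly sorts on luck, and luck does not carry over to the next window.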

Sharpe quintile mean reversion chart
Figure :: Sharpe quintile mean reversion

Beta in a crisis gets more predictable, not less.

In normal times, R² for beta is 0.65. During crises, it jumps to 0.79. The noise that usually obscures a portfolio's sensitivity to the market gets crushed when everything is falling. Everyone moves together, and the underlying sensitivity shows through cleanly.

R-squared comparison in normal vs crisis regimes
Figure :: R² in normal vs crisis regimes

Volatility is somewhere in the middle.

It's persistent but decays. Half‑life of about 6.6 months. Which means yesterday's volatile portfolio is probably still volatile today, but in a year the signal is mostly gone.

R²(h) = 0.43 × exp(−h / 9.6) · half‑life ≈ 6.6 months

At h = 1 month the fit predicts R² ≈ 0.39, which still counts as strong persistence.
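The half-life follows directly from the fitted curve: R² halves when exp(−h / 9.6) = 0.5, so h = 9.6 × ln 2. A short sketch that evaluates the formula above (the constant and function names are hypothetical; the two fitted numbers are the ones quoted):

```python
import math

R2_AT_ZERO = 0.43  # fitted level at h = 0, from the formula above
TAU_MONTHS = 9.6   # fitted decay constant, from the formula above

def predicted_r2(h_months):
    """Exponential-decay fit of volatility R² at a horizon of h months."""
    return R2_AT_ZERO * math.exp(-h_months / TAU_MONTHS)

half_life = TAU_MONTHS * math.log(2)  # horizon at which R² has halved

for h in (1, 3, 6, 12, 24):
    print(f"h = {h:>2} months  ->  R² = {predicted_r2(h):.3f}")
print(f"half-life = {half_life:.1f} months")
```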

Term structure of R-squared
Figure :: term structure of R² across horizons

The ML chapter (or: why looking smart is not the same as being right).

Before closing anything I wanted to make sure I wasn't missing non‑linear patterns that a smarter model would catch. So I ran random forests and gradient boosting on the same dataset. Two ways.

Random split — rows shuffled, train and test mixed across years — looked great. R² of 0.57–0.60 for Sharpe. Looked like ML had cracked it.

Target         Linear R²   RF R²   XGB R²
Sharpe         0.09        0.57    0.60
Volatility     0.20        0.53    0.57
Beta           0.71        0.85    0.85
Max Drawdown   0.08        0.53    0.59

Then I did a temporal split. Train on pre‑2010, test on 2010+. Every single target went negative. Not just worse — actively worse than guessing the mean. And the non‑linear models failed harder than the linear ones.

Target         Linear R²   RF R²
Sharpe         −0.05       −0.24
Volatility     −0.26       −1.18
Max Drawdown   −0.48       −3.65
Alpha          −0.08       −0.10

The shuffled split lets the model quietly peek at the future through the training set. The temporal split forbids it. Once you forbid it, the gains vanish. A model trained on 1970–2010 markets has no idea what 2010+ markets are going to do, and confidently being wrong is worse than humbly guessing the average.
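The leak can be reproduced on toy data. The sketch below is an illustration under loud assumptions, not the thesis pipeline: each year gets one random market outcome shared by all its portfolios, the only feature is a hypothetical `trend` variable that drifts with calendar time (so it fingerprints the year), and a 1-nearest-neighbour regressor stands in for the random forest and gradient boosting models.

```python
import bisect
import random

def r2_score(y_true, y_pred):
    """Out-of-sample R²: negative means worse than predicting the test mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def one_nn_predict(train, xs):
    """1-nearest-neighbour regression on one feature (stand-in for a flexible model)."""
    train = sorted(train)
    keys = [x for x, _ in train]
    preds = []
    for x in xs:
        i = bisect.bisect_left(keys, x)
        candidates = train[max(i - 1, 0):i + 1]
        preds.append(min(candidates, key=lambda p: abs(p[0] - x))[1])
    return preds

def evaluate(train_rows, test_rows):
    """Fit on train rows, score on test rows; rows are (year, feature, target)."""
    train = [(x, y) for _, x, y in train_rows]
    preds = one_nn_predict(train, [x for _, x, _ in test_rows])
    return r2_score([y for _, _, y in test_rows], preds)

rng = random.Random(0)
rows = []
for year in range(1970, 2024):
    shock = rng.gauss(0, 1)  # that year's outcome, shared by all its portfolios
    for _ in range(50):
        trend = (year - 1970) + rng.gauss(0, 0.01)  # identifies the year almost exactly
        rows.append((year, trend, shock + rng.gauss(0, 0.3)))

# Random split: rows shuffled, so every test year also appears in training.
shuffled = rows[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
random_r2 = evaluate(shuffled[:cut], shuffled[cut:])

# Temporal split: train strictly before 2010, test on 2010 onward.
temporal_r2 = evaluate([r for r in rows if r[0] < 2010],
                       [r for r in rows if r[0] >= 2010])

print(f"random split R²:   {random_r2:.2f}")
print(f"temporal split R²: {temporal_r2:.2f}")
```

In the shuffled case the model only has to look up a neighbour from the same year, which already contains that year's outcome. In the temporal case every test point falls beyond the training range, the model can only repeat the last value it saw, and the score cannot rise above zero.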

Temporal vs random split comparison
Figure :: temporal vs random split, head to head

Diversification is real. Measuring it during a crisis is not.

Everyone knows the rule: don't put all your eggs in one basket. The interesting question is how you measure how spread out the basket is. There are two ways: composition‑based (number of stocks, geographic breadth, sector balance) and correlation‑based (diversification ratio, average pairwise correlation).

The catch: correlations are mechanically biased upward during high‑volatility periods. Which means “diversification broke down in the crisis” is at least partially a measurement artifact.
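One way to see why the bias is mechanical, assuming a simple one-factor model that is not spelled out above: each stock's return is beta times the market plus independent noise. Hold the betas and the stock-specific noise fixed, raise only market volatility, and the measured correlation between two stocks goes up even though nothing about the stocks has changed. The volatility numbers below are hypothetical.

```python
def pairwise_correlation(beta_1, beta_2, market_vol, idio_vol):
    """Correlation of two stocks under r_i = beta_i * market + independent noise."""
    common = beta_1 * beta_2 * market_vol ** 2
    var_1 = beta_1 ** 2 * market_vol ** 2 + idio_vol ** 2
    var_2 = beta_2 ** 2 * market_vol ** 2 + idio_vol ** 2
    return common / (var_1 * var_2) ** 0.5

# Same betas, same stock-specific noise; only market volatility differs.
calm = pairwise_correlation(1.0, 1.0, market_vol=0.01, idio_vol=0.02)
crisis = pairwise_correlation(1.0, 1.0, market_vol=0.03, idio_vol=0.02)

print(f"calm:   {calm:.2f}")
print(f"crisis: {crisis:.2f}")
```

With these inputs the correlation moves from 0.20 to about 0.69 purely because the shared component got louder relative to the noise.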

Crisis       N        Composition R²   Correlation R²   Winner
Oil Crisis   3,461    0.20             0.36             Correlation
1987 Crash   745      0.00             0.93             Correlation
Dot-Com      5,037    0.10             0.40             Correlation
GFC          2,769    0.07             0.41             Correlation
COVID-19     507      0.05             0.78             Correlation
Pooled       13,249   0.26             0.09             Composition

Within every individual crisis, correlation‑based metrics win. Pool all crises together and composition metrics win. That's Simpson's paradox. The variable that wins inside each group loses across groups.

In plain terms: correlation metrics are like a thermometer calibrated differently in every room. If you stay in one room, trust it. If you walk between rooms, trust the one that doesn't change.
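The thermometer picture can be built as a toy dataset. Everything in the sketch is made up for illustration: three hypothetical crises, a `shifted` metric that tracks the outcome closely inside each crisis but sits at a different level in each one, and a `stable` metric that is weaker but reads the same everywhere.

```python
from itertools import product

def r_squared(x, y):
    """R² of a one-variable linear fit (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical per-crisis calibration offsets for the "shifted" metric.
offsets = {"crisis_a": -10.0, "crisis_b": 0.0, "crisis_c": 10.0}
grid = list(product([-1.0, 0.0, 1.0], repeat=2))  # (u, b) combinations

data = {}
for name, offset in offsets.items():
    data[name] = {
        "shifted": [u + offset for u, _ in grid],  # strong signal, room-specific level
        "stable": [b for _, b in grid],            # weak signal, same level everywhere
        "outcome": [2 * u + b for u, b in grid],
    }

within = {
    name: (r_squared(d["shifted"], d["outcome"]), r_squared(d["stable"], d["outcome"]))
    for name, d in data.items()
}

def pooled(key):
    return [value for d in data.values() for value in d[key]]

pooled_shifted = r_squared(pooled("shifted"), pooled("outcome"))
pooled_stable = r_squared(pooled("stable"), pooled("outcome"))

for name, (shifted_r2, stable_r2) in within.items():
    print(f"{name}: shifted {shifted_r2:.2f}  stable {stable_r2:.2f}")
print(f"pooled:   shifted {pooled_shifted:.3f}  stable {pooled_stable:.2f}")
```

Inside each crisis the shifted metric wins 0.80 to 0.20. Pooled, its level changes between crises swamp the signal and it drops below 0.01, while the stable metric keeps its 0.20.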

Composition advantage by crisis
Figure :: composition advantage by crisis
Crisis specific correlations
Figure :: crisis-specific correlations

What's next.

A few things I'd extend if I had another year.

First, the temporal generalization problem is the real bottleneck. The question is whether you can build features that are regime‑agnostic, not just powerful within a regime. Everything that worked on the random split fell apart on the temporal one, and that's the actual frontier — not finding a better model, finding features that survive a regime change.

Second, the diversification paradox deserves a proper intervention study. If you construct portfolios specifically targeting composition diversity, does the within‑crisis correlation benefit disappear, persist, or flip? Right now I can describe the paradox, I can't tell you which side to bet on when you actually have a choice.

Third, 101,767 portfolios is a lot but they're all equal‑weight or random‑weight. Factor‑tilted construction is a different animal. Would be interesting to see whether the same predictability ranking (risk yes, returns no) holds when the weights are not random.

The honest answer for what's next is: more humility in backtests, more attention to what survives a regime change, and a lot more skepticism toward any optimizer that gives you a Sharpe above 2.

Performance by portfolio size
Figure :: performance by portfolio size

Still building. Still learning. Still here.

