n1c0.net/thesis

| Field | Value |
|---|---|
| Section | article |
| Reading time | ~10 min |
| Author | Nicola F. |

What 101,767 random portfolios taught me — and why optimizing your portfolio on past data is mostly a self‑flattering exercise.
Not gonna lie, I've tried to automate everything in my life, and with AI I've been trying to replace all the writing I do to save time. So I fed my whole thesis and the “about page” from my website to Claude and asked it to write this page for me. The result? Pure crap. Call it a skill issue with prompting, but I feel like AGI is still far away from understanding my style. One could call it inefficiency, one could call it preservation of art.
So here goes nothing: me explaining my master's thesis the old-fashioned way, pressing every single letter you will read here. You will find what I did and what I found. What was nice and what was less nice, but most of all what was interesting, and what could be done next.
That's a lot of Whats, so let me back up and start from a Why.
Theory and practice are not always the same. In fact they are pretty much always different, sometimes more and sometimes less.
Same in finance. Beautiful concepts: maximize this and you will get this, minimize that and you will get that. Assume a normal distribution and expect a normal distribution. Is reality like that? Almost never. Build your beautiful efficient frontier from the assets available to your portfolio and you will see that the portfolios sitting on it are made of the best assets of the recent period. But in most cases what comes next doesn't depend on what happened before. So many metrics are good indicators of past performance, but that does not mean they tell you anything about future results.
This is the question behind the thesis: which metrics can actually indicate something about the future?
There are two ways to analyze this. You can study the stickiness of metrics in mutual funds, but this introduces biases (loss aversion, funds that went bankrupt, survivorship), or you can generate random portfolios. I went for the latter.
Random portfolios don't lie. They don't have a marketing department. They don't get shut down when they perform badly. You don't selectively publish the ones that looked good. You just generate them, compute the metrics, and see what happens.
Can ex‑ante portfolio characteristics actually predict ex‑post performance?
In order to start my analysis I had to get data. Unfortunately I cannot afford a Bloomberg license, yet. Even though we have licensed terminals at the university, the amount of data we can collect per day is limited. And my ambition was too humble for that: I just wanted the full daily prices of all the stocks on earth. I never did the calculation of how many days it would have taken to get all the data that way.
After paying for a few cheap services I found out that the quality of their datasets was quite low, so I switched approach: let's scrape everything directly from the web, in particular from Yahoo Finance to stocks analysis. In the end, scraping is something I really enjoy, and something that AI is still not good at (along with matching my style).
So in a few days I built a full stock scraper, which then took a few more days to scrape all the daily stock prices available on the internet, plus other information such as sector, industry, currency, market cap and other details. In the end I got a database that took up half the disk space on my Mac. Of course the database needed some cleaning: not all of the data is actually useful.
The funnel: 77,204 raw → 41,615 after ISIN dedup → 36,546 with 4+ years of history → 32,838 after density → 32,287 after liquidity.
32,287 securities · 78 countries · 143.4M daily price records · April 1964 → November 2025
Then I generated the portfolios. 101,767 of them. Random size between 1 and 500 stocks, random weights, random formation dates spread across 1969 to 2023. For each one I computed around 50 metrics on the 12 months before the formation date — the ex‑ante, what an investor would have seen — and the same 50 metrics on the 12 months after — the ex‑post, what actually happened. Then I checked how much one predicts the other.
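The generation step can be sketched roughly as below. This is a minimal illustration, not the thesis code: the function name `random_portfolio`, the placeholder tickers, drawing uniform weights and normalising them, and using 365 calendar days for "12 months" are all assumptions.

```python
import random
from datetime import date, timedelta

def random_portfolio(tickers, rng, max_size=500,
                     start=date(1969, 1, 1), end=date(2023, 12, 31)):
    """Draw one random portfolio: size, constituents, weights, formation date."""
    size = rng.randint(1, min(max_size, len(tickers)))      # between 1 and 500 stocks
    members = rng.sample(tickers, size)                     # no stock picked twice
    raw = [1.0 - rng.random() for _ in members]             # values in (0, 1], never zero
    total = sum(raw)
    weights = {t: w / total for t, w in zip(members, raw)}  # weights sum to 1
    formation = start + timedelta(days=rng.randrange((end - start).days + 1))
    return {
        "weights": weights,
        "formation": formation,
        # Window an investor would have seen, and window that actually followed.
        "ex_ante": (formation - timedelta(days=365), formation),
        "ex_post": (formation, formation + timedelta(days=365)),
    }

rng = random.Random(42)                             # fixed seed for reproducibility
universe = [f"STOCK{i:05d}" for i in range(2000)]   # placeholder tickers, not real data
portfolio = random_portfolio(universe, rng)
```

The important design point is that nothing in the draw looks at returns, so no portfolio is selected because it performed well.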
The short version: risk is predictable, returns are not. The slightly longer version is in the table below.
| Metric | R² | Correlation |
|---|---|---|
| Beta | 0.666 | +0.82 |
| Diversification Ratio | 0.383 | +0.62 |
| VaR 95% | 0.185 | +0.43 |
| Avg. Correlation | 0.157 | +0.40 |
| Volatility (annual) | 0.098 | +0.31 |
| Sharpe ratio | 0.024 | −0.16 |
| Alpha | 0.003 | −0.06 |
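For a one-variable linear regression with an intercept, R² is just the squared Pearson correlation, which is consistent with the table (0.82² ≈ 0.67 for beta). A minimal sketch of that calculation follows; the helper names and the toy beta values are made up for illustration and are not thesis data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def persistence(ex_ante, ex_post):
    """Correlation and R² of a metric measured before vs after formation."""
    r = pearson(ex_ante, ex_post)
    return {"correlation": r, "r2": r * r}

# Toy values: one entry per portfolio, same metric in both windows.
ex_ante_beta = [0.6, 0.9, 1.0, 1.2, 1.5]
ex_post_beta = [0.7, 0.8, 1.1, 1.1, 1.4]
result = persistence(ex_ante_beta, ex_post_beta)
```

Keeping the sign of the correlation next to R² matters here: R² alone would hide the fact that Sharpe and alpha point the wrong way.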
Beta has an R² of 0.67. Across 54 years of data, across 78 countries, across bull markets and crashes. If a portfolio is sensitive to the market today, it will be sensitive to the market tomorrow. That's useful.
Sharpe ratio has an R² of 0.02. And the correlation is negative. Past Sharpe doesn't just fail to predict future Sharpe: it anti‑predicts it. The portfolios that looked the best last year systematically delivered the worst the year after.
If you ranked every portfolio by how good it looked, and then bet on the top ones, you would have been systematically wrong.
| Quintile (by ex‑ante Sharpe) | Ex‑ante | Ex‑post | Δ |
|---|---|---|---|
| Q1 (Low) | −0.93 | +1.48 | +2.41 |
| Q2 | +0.10 | +1.05 | +0.95 |
| Q3 | +0.96 | +1.06 | +0.10 |
| Q4 | +1.73 | +0.96 | −0.77 |
| Q5 (High) | +3.15 | +0.79 | −2.36 |
The bottom quintile's Sharpe improved by 2.41 on average. The top quintile's dropped by 2.36. The hot fund of last year is the cold fund of this year, almost mechanically.
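A quintile table like the one above can be built with a simple sort-and-bucket pass. The sketch below is an assumption about the mechanics, not the thesis code, and the ten Sharpe values are invented to show the shape of the output.

```python
def quintile_table(ex_ante, ex_post):
    """Sort portfolios by ex-ante value, cut into 5 equal buckets, average each."""
    pairs = sorted(zip(ex_ante, ex_post))          # ascending by ex-ante value
    n = len(pairs)
    rows = []
    for q in range(5):
        bucket = pairs[q * n // 5:(q + 1) * n // 5]
        mean_ante = sum(a for a, _ in bucket) / len(bucket)
        mean_post = sum(p for _, p in bucket) / len(bucket)
        rows.append((f"Q{q + 1}", mean_ante, mean_post, mean_post - mean_ante))
    return rows

# Invented Sharpe ratios for ten portfolios (needs at least five).
ex_ante_sharpe = [-1.0, -0.8, 0.0, 0.2, 0.9, 1.0, 1.6, 1.8, 3.0, 3.2]
ex_post_sharpe = [1.5, 1.4, 1.0, 1.1, 1.0, 1.1, 0.9, 1.0, 0.8, 0.8]
table = quintile_table(ex_ante_sharpe, ex_post_sharpe)
for label, ante, post, delta in table:
    print(f"{label}: ex-ante {ante:+.2f}, ex-post {post:+.2f}, change {delta:+.2f}")
```

The buckets are formed using only the ex-ante column, so the ex-post averages are a fair look at what a "pick last year's winners" rule would have delivered.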
In normal times, R² for beta is 0.65. During crises, it jumps to 0.79. The noise that usually obscures a portfolio's sensitivity to the market gets crushed when everything is falling. Everyone moves together, and the underlying sensitivity shows through cleanly.
Volatility is persistent, but the persistence decays, with a half‑life of about 6.6 months. Which means yesterday's volatile portfolio is probably still volatile today, but in a year the signal is mostly gone.
R²(h) = 0.43 × exp(−h / 9.6), with h the horizon in months · half‑life = 9.6 × ln 2 ≈ 6.6 months
At h = 1 month: predicted R² ≈ 0.39 · strong persistence
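The half-life follows directly from the decay formula: the fitted curve halves when exp(−h / 9.6) = 1/2, so h = 9.6 × ln 2. A small sketch, assuming h is measured in months (the constant names `A` and `TAU` are just labels for the two fitted numbers):

```python
import math

A = 0.43      # fitted R² at horizon zero
TAU = 9.6     # fitted decay constant, assumed to be in months

def predicted_r2(h_months):
    """Exponential decay of predictability with the forecast horizon."""
    return A * math.exp(-h_months / TAU)

half_life = TAU * math.log(2)          # horizon at which R² has halved

print(f"half-life: {half_life:.2f} months")
print(f"R² at 1 month:   {predicted_r2(1):.2f}")
print(f"R² at 12 months: {predicted_r2(12):.2f}")
```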
Before closing anything I wanted to make sure I wasn't missing non‑linear patterns that a smarter model would catch. So I ran random forests and gradient boosting on the same dataset. Two ways.
Random split — rows shuffled, train and test mixed across years — looked great. R² of 0.57–0.60 for Sharpe. Looked like ML had cracked it.
| Target | Linear R² | RF R² | XGB R² |
|---|---|---|---|
| Sharpe | 0.09 | 0.57 | 0.60 |
| Volatility | 0.20 | 0.53 | 0.57 |
| Beta | 0.71 | 0.85 | 0.85 |
| Max Drawdown | 0.08 | 0.53 | 0.59 |
Then I did a temporal split. Train on pre‑2010, test on 2010+. Every single target went negative. Not just worse — actively worse than guessing the mean. And the non‑linear models failed harder than the linear ones.
| Target | Linear R² | RF R² |
|---|---|---|
| Sharpe | −0.05 | −0.24 |
| Volatility | −0.26 | −1.18 |
| Max Drawdown | −0.48 | −3.65 |
| Alpha | −0.08 | −0.10 |
The shuffled split lets the model quietly peek at the future through the training set. The temporal split forbids it. Once you forbid it, the gains vanish. A model trained on 1970–2010 markets has no idea what 2010+ markets are going to do, and confidently being wrong is worse than humbly guessing the average.
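A toy version of the two evaluation schemes shows how out-of-sample R² can go negative. Everything here is synthetic and assumed: a single feature, a plain least-squares line instead of a random forest, and a market whose level shifts in 2010. It illustrates only one mechanism (a regime change the training data never saw), not the thesis pipeline.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def r2_out_of_sample(model, xs, ys):
    """1 - SS_res / SS_tot on held-out data; negative means worse than the test mean."""
    slope, intercept = model
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def evaluate(train, test):
    """Fit on the training rows, score on the test rows. Rows are (year, x, y)."""
    model = fit_line([x for _, x, _ in train], [y for _, _, y in train])
    return r2_out_of_sample(model, [x for _, x, _ in test], [y for _, _, y in test])

# Synthetic data: same slope throughout, but the level jumps from 2010 onwards.
rng = random.Random(0)
rows = []
for year in range(1970, 2024):
    level = 0.0 if year < 2010 else 3.0
    for _ in range(100):
        x = rng.gauss(0, 1)
        rows.append((year, x, x + level + rng.gauss(0, 0.5)))

# Random split: both regimes end up in the training set.
shuffled = rows[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
r2_random = evaluate(shuffled[:cut], shuffled[cut:])

# Temporal split: train strictly before 2010, test from 2010 onwards.
r2_temporal = evaluate([r for r in rows if r[0] < 2010],
                       [r for r in rows if r[0] >= 2010])

print(f"random split R²:   {r2_random:.2f}")
print(f"temporal split R²: {r2_temporal:.2f}")
```

Because R² on the test set is measured against the test-set mean, a model that is confidently off by a constant scores below zero even when it gets the slope right.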
Everyone knows the rule: don't put all your eggs in one basket. The interesting question is how you measure how well spread the eggs are. There are two ways: composition‑based (number of stocks, geographic breadth, sector balance) and correlation‑based (diversification ratio, average pairwise correlation).
The catch: correlations are mechanically biased upward during high‑volatility periods. Which means “diversification broke down in the crisis” is at least partially a measurement artifact.
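The upward bias can be reproduced with a simulation in which the true relationship between two stocks never changes. The setup is an assumption chosen for illustration: both stocks load on one common market factor with fixed, equal idiosyncratic noise, and "turbulent" simply means a market move larger than one standard deviation.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
n = 20000
market = [rng.gauss(0, 1) for _ in range(n)]
# Both stocks have the same fixed exposure to the market in every period.
stock_a = [m + rng.gauss(0, 1) for m in market]
stock_b = [m + rng.gauss(0, 1) for m in market]

def corr_where(keep):
    """Correlation of the two stocks over the periods selected by keep(market move)."""
    idx = [i for i, m in enumerate(market) if keep(m)]
    return pearson([stock_a[i] for i in idx], [stock_b[i] for i in idx])

corr_all = corr_where(lambda m: True)
corr_calm = corr_where(lambda m: abs(m) <= 1.0)
corr_turbulent = corr_where(lambda m: abs(m) > 1.0)

print(f"all periods: {corr_all:.2f}")
print(f"calm:        {corr_calm:.2f}")
print(f"turbulent:   {corr_turbulent:.2f}")
```

Nothing about the stocks changes between the calm and turbulent samples. The measured correlation rises only because the common factor accounts for a larger share of the variance when market moves are large.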
| Crisis | N | Composition R² | Correlation R² | Winner |
|---|---|---|---|---|
| Oil Crisis | 3,461 | 0.20 | 0.36 | Correlation |
| 1987 Crash | 745 | 0.00 | 0.93 | Correlation |
| Dot‑Com | 5,037 | 0.10 | 0.40 | Correlation |
| GFC | 2,769 | 0.07 | 0.41 | Correlation |
| COVID‑19 | 507 | 0.05 | 0.78 | Correlation |
| Pooled | 13,249 | 0.26 | 0.09 | Composition |
Within every individual crisis, correlation‑based metrics win. Pool all crises together and composition metrics win. That's Simpson's paradox. The variable that wins inside each group loses across groups.
In plain terms: correlation metrics are like a thermometer calibrated differently in every room. If you stay in one room, trust it. If you walk between rooms, trust the one that doesn't change.
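The thermometer analogy can be made concrete with six made-up points. In each "room" (regime) the reading tracks the outcome perfectly, but the calibration offset differs between rooms, so the pooled fit is close to useless. The numbers are purely illustrative.

```python
def r2(xs, ys):
    """Squared Pearson correlation, i.e. R² of a one-variable linear fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# (metric reading, realised outcome) pairs in two regimes.
room_a_x, room_a_y = [0, 1, 2], [2, 3, 4]     # outcome = reading + 2
room_b_x, room_b_y = [2, 3, 4], [0, 1, 2]     # outcome = reading - 2

r2_room_a = r2(room_a_x, room_a_y)
r2_room_b = r2(room_b_x, room_b_y)
r2_pooled = r2(room_a_x + room_b_x, room_a_y + room_b_y)

print(f"room A: {r2_room_a:.2f}, room B: {r2_room_b:.2f}, pooled: {r2_pooled:.2f}")
```

A metric whose calibration is stable across regimes could have a lower R² inside each room and still win once the rooms are pooled, which is the pattern in the table above.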
A few things I'd extend if I had another year.
First, the temporal generalization problem is the real bottleneck. The question is whether you can build features that are regime‑agnostic, not just powerful within a regime. Everything that worked on the random split fell apart on the temporal one, and that's the actual frontier — not finding a better model, finding features that survive a regime change.
Second, the diversification paradox deserves a proper intervention study. If you construct portfolios specifically targeting composition diversity, does the within‑crisis correlation benefit disappear, persist, or flip? Right now I can describe the paradox, I can't tell you which side to bet on when you actually have a choice.
Third, 101,767 portfolios is a lot but they're all equal‑weight or random‑weight. Factor‑tilted construction is a different animal. Would be interesting to see whether the same predictability ranking (risk yes, returns no) holds when the weights are not random.
The honest answer for what's next is: more humility in backtests, more attention to what survives a regime change, and a lot more skepticism toward any optimizer that gives you a Sharpe above 2.
Still building. Still learning. Still here.