Why shouldn't you choose strategy parameters by grid search in a backtest?
Imagine you are choosing parameters for your trading strategy, and you use grid search over backtests to find the best parameter combination.
You happily report parameters with an annual Sharpe Ratio of 2.0, a max drawdown of 0.25, and a hit rate of 0.57. But in out-of-sample backtests or paper trading, the strategy does not work as well as promised. Your carefully chosen parameters even underperform the benchmark parameters.
“Why?” your manager asked.
“Because the parameter selection overfits, the performance measured during parameter selection is exaggerated.”
“Why is there overfitting when we are not even using ML? How can we choose strategy parameters with less overfitting? How can we get a more reliable measure of performance metrics in the first place?”
In this article, I will try to answer these questions with an example.
You are more likely to choose the parameter combinations with the greatest variance instead of the greatest expected value.
By variance, we mean the measurement error (if you regard a backtest as a measurement of the strategy), or the sensitivity of backtest metrics to a specific dataset. If history could be replayed many times, or if you ran the backtest on multiple synthetic price paths, each path would yield one Sharpe Ratio; across all paths you would have a set of Sharpe Ratios, and the variance of that set is the variance we mean here. (It is not the variance of the strategy's returns, but the variance of the strategy's Sharpe Ratio.)
Suppose there are 100 parameter combinations in total to loop over: 50 of them, Group A, have Sharpe Ratios distributed N(0.5, 1); the other 50, Group B, have Sharpe Ratios distributed N(0, 3). You would like grid search to pick a Group A parameter for your strategy, because the expected Sharpe Ratio is greater. However, in the backtest, the top 5 Sharpe Ratios are more likely to come from Group B, simply because the measurement variance in Group B is greater than in Group A.
Also, if the parameter candidates are not evenly distributed, say 10 in Group A and 90 in Group B, and we pick the parameter with the maximum Sharpe Ratio, it is even more likely that a Group B parameter will be chosen.
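The selection bias above is easy to verify with a small Monte Carlo simulation. This sketch (my own illustration, not the article's code) draws one backtest Sharpe Ratio per parameter combination and checks how often the single best one comes from the high-variance, zero-expectation Group B:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths = 10_000  # number of simulated "histories"
top_from_b = 0

for _ in range(n_paths):
    # 50 parameter combinations per group; one measured Sharpe Ratio each
    group_a = rng.normal(0.5, 1.0, size=50)  # higher expected SR, low variance
    group_b = rng.normal(0.0, 3.0, size=50)  # zero expected SR, high variance
    # grid search keeps the single best Sharpe Ratio
    if group_b.max() > group_a.max():
        top_from_b += 1

frac_b = top_from_b / n_paths
print(f"Best parameter came from Group B in {frac_b:.0%} of runs")
```

Despite Group A's higher expected Sharpe Ratio, the winner comes from Group B in the vast majority of runs, because the maximum of 50 draws from N(0, 3) is typically far above the maximum of 50 draws from N(0.5, 1).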
Why does the variance differ across parameters? One common way to predict a stock's future performance is to examine its historical behavior and assume it will behave similarly in the future. An obvious problem with this technique is that the results depend on the length and type of history used. One of the most common parameters we need to decide is the look-back period, the length of history we trace, as in a rolling-window average or rolling-window volatility. The shorter the look-back window, the higher the variance in our measurement (the backtest). In other ML fields, hyper-parameter tuning does not significantly influence the variance of accuracy rates; this is why grid search (with cross-validation) is commonly used in fields like computer vision and natural language processing. Unfortunately, in finance, these parameters dramatically influence the variance of the Sharpe Ratio.
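The look-back effect can be simulated directly. In this sketch (an assumption-laden toy, not the article's code), returns are pure noise with a true Sharpe Ratio of zero, so any spread in the measured annualized Sharpe Ratio is measurement variance; the spread shrinks roughly as one over the square root of the window length:

```python
import numpy as np

rng = np.random.default_rng(1)

def sharpe_std(window, n_trials=2000):
    """Std of the annualized Sharpe estimate measured on `window` days of returns."""
    # zero-mean returns: the true Sharpe Ratio is 0 for every window length
    rets = rng.normal(0.0, 0.01, size=(n_trials, window))
    sr = rets.mean(axis=1) / rets.std(axis=1) * np.sqrt(252)
    return sr.std()

sd_short = sharpe_std(60)    # ~ sqrt(252/60)  ≈ 2.0
sd_long = sharpe_std(1000)   # ~ sqrt(252/1000) ≈ 0.5
print(sd_short, sd_long)
```

A 60-day window yields Sharpe estimates roughly four times noisier than a 1,000-day window, which is exactly the variance gap between the short-window and long-window parameter candidates discussed above.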
Why is this phenomenon called “overfitting”? The Sharpe Ratio measured in a backtest depends heavily on the noise in the dataset. For example, one backtest shows that a parameter combination in Group A has a Sharpe Ratio of 0.7 = 0.5 (expectation) + 0.2 (noise), while one in Group B has 2.0 = 0.0 (expectation) + 2.0 (noise). The decision to choose the second parameter combination for the trading strategy depends on the noise; in other words, we “learned” to choose the second parameter by learning from noise. In my understanding, many different phenomena are called “overfitting”: although they occur in totally different situations and for different reasons, they share the same name. The same choice can even be overfitting from one angle and underfitting from another at the same time.
Now that we understand the problem, how can we solve the parameter-choosing problem?
Sensitivity tests to reduce the variance. Sensitivity analysis tests how performance changes after a perturbation of the parameters. We prefer parameters whose performance does not drop dramatically after such a perturbation. A sensitivity analysis can be regarded as multiple correlated tests whose average reduces the measurement variance. By taking the average, because the error-term correlation ρ < 1, we reduce the measurement variance and get N(μ, σ²ρ + σ²(1−ρ)/N), where ρ is the correlation of the errors between the sensitivity tests and N is the number of tests. In the previous example, if averaging reduces Group A from N(0.5, 1) and Group B from N(0, 3) to N(0.5, 0.5) and N(0, 0.5) respectively, you can choose the best parameters: in a Monte Carlo test, the Group A parameters now come out on top. Unfortunately, in real problems the correlation is often so high that you cannot reduce the variance significantly, so we need more advanced techniques.
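The variance formula for averaged correlated tests can be checked numerically. This sketch (my own toy, with equicorrelated errors built from a shared component plus independent components) compares the simulated variance of the mean of N tests against σ²ρ + σ²(1−ρ)/N:

```python
import numpy as np

rng = np.random.default_rng(2)

def avg_variance(sigma2, rho, n, n_trials=20000):
    """Simulated variance of the mean of n equicorrelated measurements."""
    # shared component gives correlation rho; individual component is independent
    shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(n_trials, 1))
    indiv = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(n_trials, n))
    measurements = shared + indiv  # each column ~ N(0, sigma2), pairwise corr = rho
    return measurements.mean(axis=1).var()

sim = avg_variance(sigma2=3.0, rho=0.1, n=10)
theory = 3.0 * 0.1 + 3.0 * 0.9 / 10  # sigma^2 * rho + sigma^2 * (1 - rho) / N
print(sim, theory)
```

Note the floor in the formula: as N grows, the variance converges to σ²ρ, not zero, which is why highly correlated sensitivity tests cannot reduce the variance much.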
The Deflated Sharpe Ratio (DSR) provides an estimate of SR adjusted for the “false positives” produced by multiple trials¹. The basic idea: if you try N parameter combinations to find the best performance, imagine someone trying N random strategies and keeping the best result of that random experiment; then check whether your carefully chosen strategy significantly beats the best random one. Following Bailey and López de Prado¹, the expected maximum Sharpe Ratio of N unskilled trials can be estimated as

SR₀ = √V × [ (1 − γ) · Z⁻¹(1 − 1/N) + γ · Z⁻¹(1 − 1/(N·e)) ]
where V is the variance of the trials' estimated SRs, N is the number of independent trials, Z⁻¹ is the inverse CDF (quantile function) of the standard Normal distribution, and γ is the Euler–Mascheroni constant. Details can be found in López de Prado². As he writes in Advances in Financial Machine Learning³,
“Every backtest result must be reported in conjunction with all the trials involved in its production. Absent that information, it is impossible to assess the backtest's ‘false discovery’ probability.”
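The expected-maximum formula above is straightforward to implement; here is a minimal sketch using only the standard library. The variance value 0.0535 in the second call is my own illustrative number, chosen to show a cross-trial variance that puts the N = 7,000 benchmark near the 0.873 reported later in the article:

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max_sharpe(var_trials, n_trials):
    """Expected maximum Sharpe Ratio among n_trials unskilled (zero-mean) trials,
    per Bailey and Lopez de Prado's approximation."""
    z = NormalDist().inv_cdf  # inverse CDF of the standard Normal
    return math.sqrt(var_trials) * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )

print(expected_max_sharpe(1.0, 7000))     # unit-variance trials
print(expected_max_sharpe(0.0535, 7000))  # illustrative variance, assumed
```

The benchmark grows with both the number of trials and the variance across trials, which is exactly why high-variance parameter grids manufacture impressive-looking maxima.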
Now I will show, with an example, how easy it is to get a perfect but “false positive” backtest, and how we can use a sensitivity test and the DSR to detect that it is a false positive. I will use the most common momentum strategy on daily-level S&P 500 data. We calculate a shorter exponential moving average (decay factor α1) and a longer exponential moving average (decay factor α2 < α1). Because we are using a momentum strategy, we want to reduce the chance of getting caught in sideways fluctuation, so we use historical volatility and volume to predict whether a trend might be forming. If the shorter simple moving average (window1) of realized volatility and trading volume is greater than the longer simple moving average (window2) times a scale factor, we enter the position. Strategies of this kind have been used for more than 50 years, so I do not expect this one to perform well today. But there are 5 parameters, EMA decay factors α1 and α2, filter window1 and window2, and the scale factor, which gives us plenty of room for a “false positive”.
Then we build a mini backtest system. To keep everything simple, we assume there are no transaction costs and we rebalance every day.
Then we combine the mini backtest framework with our EMA and filter modules into a pipeline that takes parameters as input and returns the strategy's performance.
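A compressed sketch of such a pipeline is below. This is my own reconstruction of the described logic, not the author's code (which is in the linked repo); all function names, the volatility proxy, and the toy random-walk data are assumptions:

```python
import numpy as np

def ema(x, alpha):
    """Exponential moving average with decay factor alpha (larger alpha = shorter memory)."""
    out = np.empty(len(x))
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out

def sma(x, window):
    """Trailing simple moving average, using whatever history is available early on."""
    c = np.cumsum(np.insert(np.asarray(x, dtype=float), 0, 0.0))
    idx = np.arange(1, len(x) + 1)
    lo = np.maximum(idx - window, 0)
    return (c[idx] - c[lo]) / (idx - lo)

def backtest(prices, volume, a1, a2, w1, w2, scale):
    """Daily-rebalanced momentum strategy with a volatility/volume trend filter.
    No transaction costs; returns the annualized Sharpe Ratio."""
    rets = np.diff(np.log(prices))             # daily log returns
    vol = np.abs(rets)                         # crude realized-volatility proxy
    trend = ema(prices, a1) > ema(prices, a2)  # short EMA above long EMA
    filt = (sma(vol, w1) > scale * sma(vol, w2)) & \
           (sma(volume[1:], w1) > scale * sma(volume[1:], w2))
    pos = (trend[1:] & filt).astype(float)     # 1 = long, 0 = flat
    strat = pos[:-1] * rets[1:]                # today's signal earns tomorrow's return
    return strat.mean() / strat.std() * np.sqrt(252)

# toy data: a random-walk price series and lognormal volumes (illustrative only)
rng = np.random.default_rng(3)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 2000)))
volume = rng.lognormal(10.0, 1.0, size=2000)
sr = backtest(prices, volume, a1=0.5, a2=0.05, w1=5, w2=20, scale=1.0)
print(f"annualized Sharpe on random-walk data: {sr:.2f}")
```

Shifting the position by one day before multiplying by returns avoids look-ahead bias, the one backtest bug that would make every parameter look brilliant.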
Then we grid-search all the parameters. I searched about 7,000 parameter combinations, which took an hour on my four-year-old Surface Pro 4.
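The grid-search loop itself is a few lines. In this sketch the candidate lists are hypothetical (the real ones are in the linked repo), and a random number stands in for the backtest call so the structure is clear:

```python
import itertools
import numpy as np

# hypothetical candidate grid; the article's actual candidate lists are in the linked repo
grid = {
    "a1": [0.4, 0.5, 0.8],
    "a2": [0.001, 0.01, 0.05],
    "w1": [5, 10, 20],
    "w2": [60, 120, 250],
    "scale": [1.0, 1.1, 1.2],
}

rng = np.random.default_rng(4)
results = []
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    sharpe = rng.normal()  # stand-in score; the real pipeline calls the backtest here
    results.append((sharpe, params))

# sort by Sharpe Ratio and keep the top candidates
top5 = sorted(results, key=lambda r: r[0], reverse=True)[:5]
print(len(results), top5[0][0])
```

Every `(sharpe, params)` pair should be kept, not just the winners: the DSR calculation below needs the full set of trials to estimate V and count N.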
Sorting the results by Sharpe Ratio to get the top performers, we find that we are most likely to choose a very long moving average for the longer EMA (α = 0.001 is the longest in the candidate set, where α is the decay factor) but a very short moving average for the shorter EMA (0.4, 0.5, 0.8 are almost the shortest in the candidate set). Their variance is very large, and the short window gives the grid search multiple shots at hitting a maximum by luck. If we traded these parameters we would very likely lose money. Then we can calculate the Deflated Sharpe Ratio:
The DSR benchmark is 0.873: if we tried random strategies N = 7,000 times, the expected maximum Sharpe Ratio would be 0.873, which is very close to (in fact above) our strategy's Sharpe Ratio of 0.827. That indicates there is no significant difference between our momentum strategy and a random strategy, which is reasonable given that the market is efficient and adaptive.
In conclusion, as Marcos López de Prado argues, a backtest by itself is not a reliable research tool because of “false discovery”. I applied his idea to strategy parameter choosing and reached the conclusion that we are more likely to choose the parameters with the greatest variance, rather than those with the greatest expectation.
1. Bailey, David H., and Marcos López de Prado. “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” The Journal of Portfolio Management 40.5 (2014): 94–107.
2. López de Prado, Marcos, and Michael J. Lewis. “Detection of False Investment Strategies Using Unsupervised Learning Methods.” Quantitative Finance 19.9 (2019): 1555–1565.
3. López de Prado, Marcos. Advances in Financial Machine Learning. John Wiley & Sons, 2018.
Thanks for reading. All the code can be found on my GitHub: https://github.com/XiaotongAndyDing/GridSearchBacktestMedium.