y-Randomization
y-Randomization (also called y-scrambling) is a tool used to validate QSPR/QSAR models. The performance of the original model in describing the data (r2) is compared with that of models built for a permuted (randomly shuffled) response, using the original descriptor pool and the original model-building procedure.
One powerful test of a machine-learning model is y-randomization, also known as y-scrambling. The real model is compared with alternative models generated from datasets in which the property values y are repeatedly and randomly reassigned amongst the instances. Randomization breaks the true chemical link between the features x and the output property y, so that there is no meaningful signal left to model. If the machine-learning method can still produce good validation statistics for the randomized models, we should be highly suspicious, because it must be modeling noise rather than signal.
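The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the synthetic descriptors, the ordinary least-squares model, and the helper names `r_squared` and `y_randomization` are all assumptions made for the example. In a real study, the full model-building procedure (including any descriptor selection) would be repeated for every shuffled response.

```python
import numpy as np


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return the training R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def y_randomization(X, y, n_trials=100, seed=0):
    """Return R^2 of the real model and R^2 of models fitted to shuffled y.

    Hypothetical helper: X is left untouched, only y is permuted.
    """
    rng = np.random.default_rng(seed)
    r2_real = r_squared(X, y)
    r2_random = np.array(
        [r_squared(X, rng.permutation(y)) for _ in range(n_trials)]
    )
    return r2_real, r2_random


# Synthetic data (assumption): 60 "compounds", 3 descriptors, a linear signal plus noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + data_rng.normal(scale=0.3, size=60)

r2_real, r2_random = y_randomization(X, y)
print(f"real R2: {r2_real:.3f}")
print(f"randomized R2: mean {r2_random.mean():.3f}, max {r2_random.max():.3f}")
```

If the randomized R2 values come close to the real one, the modeling procedure is probably fitting noise.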
In this test, random multiple linear regression (MLR) models are generated by randomly shuffling the dependent variable while keeping the independent variables unchanged. The resulting QSAR models are expected to have low R2 and Q2 values across several trials, which confirms that the developed QSAR models are robust and not the result of chance correlation. Another parameter, cRp2, is also calculated; it should be greater than 0.5 to pass this test.
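As a rough sketch of the threshold check, the snippet below assumes that the parameter meant here is the cRp2 statistic commonly reported in the QSAR literature, and that it is defined as cRp2 = R × sqrt(R2 − Rr2), where Rr2 is the mean R2 of the randomized models. Neither the formula nor the function name `c_rp2` comes from the text above, and the input values are hypothetical.

```python
import math


def c_rp2(r2_model, r2_random_mean):
    """Assumed definition: cRp2 = R * sqrt(R2 - Rr2).

    r2_model        R2 of the original (non-randomized) model
    r2_random_mean  mean R2 over the y-randomized models (Rr2)
    Negative differences are clamped to zero (an assumption of this sketch).
    """
    diff = max(r2_model - r2_random_mean, 0.0)
    return math.sqrt(max(r2_model, 0.0)) * math.sqrt(diff)


# Hypothetical values, for illustration only.
value = c_rp2(r2_model=0.81, r2_random_mean=0.05)
passes = value > 0.5  # threshold quoted in the text
print(f"cRp2 = {value:.3f}, passes: {passes}")
```

Under this definition, the statistic stays close to R2 when the randomized models explain almost nothing, and drops toward zero as their mean R2 approaches that of the real model.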
References
J. Chem. Inf. Model. 2007, 47, 2345-2357
WIREs Comput Mol Sci 2014, 4:468–481. doi: 10.1002/wcms.1183