Data preparation

This section covers topics related to feature engineering, feature selection, data validation, scaling, and more.

Quality of a dataset
  Data validation
  Standardizing SMILES
  References
Feature engineering
  Molecular descriptors
    Further reading
    References
    Contents
      RDKit
        Set up RDKit
        Basic usage
        Computing descriptors/fingerprints
        References
      MACCS fingerprints
        Computing MACCS keys
        References
      Morgan ECFP fingerprints
        Computing Morgan ECFPx
        References
      Mordred
  Measure molecular similarity
    Computing similarity from ECFP4
    Another, more complete, approach
    References
Feature selection
  Filter methods
    Collinearity (correlation metric)
  Wrapper methods
    Recursive feature elimination
      Drawbacks
    Boruta
      Advantages
      Drawbacks
    References
  Embedded methods
    Selection by model importance
      Advantages
      Drawbacks
  Dimensionality reduction methods
  Hybrid methods
Train, validation, and test sets
  Training set
  Validation set
  Testing set
  Conducting the split in sklearn
Scaling
  Is feature scaling always necessary?
  Normalization
  Standardization
  Scaling data before or after train/test split