
Data preparation

This section covers how to prepare data for machine learning: assessing dataset quality, validating data, engineering and selecting features, splitting data into training, validation, and test sets, and scaling features. A minimal sketch of the split-then-scale workflow follows the contents list below.

  • Quality of a dataset
  • Data validation
    • Standardizing SMILES
    • References
  • Feature engineering
    • Molecular descriptors
    • Further reading
    • References
    • Contents
      • RDKit
        • Set up RDKit
        • Basic usage
        • Computing descriptors/fingerprints
        • References
      • MACCS fingerprints
        • Computing MACCS keys
        • References
      • Morgan ECFP fingerprints
        • Computing Morgan ECFPx
        • References
      • Mordred
      • Measure molecular similarity
        • Computing similarity from ECFP4
        • Another, more complete, approach
        • References
  • Feature selection
    • Filter methods
      • Collinearity (correlation metric)
    • Wrapper methods
      • Recursive feature elimination
        • Drawbacks
      • Boruta
        • Advantages
        • Drawbacks
      • References
    • Embedded methods
      • Selection by model importance
        • Advantages
        • Drawbacks
    • Dimensionality reduction methods
    • Hybrid methods
  • Train, validation, and test sets
    • Training set
    • Validation set
    • Testing set
    • Conducting the split in sklearn
  • Scaling
    • Is feature scaling always necessary?
    • Normalization
    • Standardization
    • Scaling data before or after train/test split
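
As a preview of how the last two topics fit together, here is a minimal sketch of splitting a dataset before scaling it, so that test-set statistics never leak into the scaler. It uses scikit-learn with randomly generated placeholder features and targets, not the datasets used elsewhere in this book.

```python
# Minimal sketch: split first, then scale, using scikit-learn.
# X and y are random placeholders standing in for a real feature matrix and target.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))   # placeholder feature matrix (100 samples, 5 features)
y = rng.normal(size=100)        # placeholder target values

# Split before scaling so the test set does not influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit scaler on training data only
X_test_scaled = scaler.transform(X_test)        # apply the training-set statistics to the test set
```

Why the scaler is fitted on the training set only, and not on the full dataset, is discussed in the page on scaling data before or after the train/test split.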


By José Aniceto
