RDKit

Set up RDKit

Installing RDKit with conda:

$ conda install -c rdkit rdkit

Using in a Jupyter Notebook:

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem

from rdkit.Chem.Draw import IPythonConsole

Basic usage

Get an RDKit molecule object from a SMILES string. From the molecule object we can draw structures, compute fingerprints and properties, and so on.

smiles = 'COC(=O)c1c[nH]c2cc(OC(C)C)c(OC(C)C)cc2c1=O'
mol = Chem.MolFromSmiles(smiles)
print(mol)

# <rdkit.Chem.rdchem.Mol object at 0x000001F84A4CEE90>
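As a small illustration of the properties mentioned above, the sketch below reads a few values from a molecule object. Ethanol (`CCO`) is used here only as an illustrative stand-in, not a molecule from this page, so the values are easy to check by hand.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# Ethanol is an illustrative molecule chosen for this sketch.
mol = Chem.MolFromSmiles('CCO')

n_heavy_atoms = mol.GetNumAtoms()               # counts heavy atoms; hydrogens are implicit
formula = rdMolDescriptors.CalcMolFormula(mol)  # molecular formula as a string
mol_wt = Descriptors.MolWt(mol)                 # average molecular weight
canonical = Chem.MolToSmiles(mol)               # canonical SMILES

print(n_heavy_atoms, formula, round(mol_wt, 2), canonical)
```

Note that `GetNumAtoms()` counts only heavy atoms unless hydrogens have been added explicitly, for example with `Chem.AddHs`.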

Reading a list of SMILES:

smiles = [
    'N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1',
    'c1ccc2c(c1)ccc1c2ccc2c3ccccc3ccc21',
    'C=C(C)C1Cc2c(ccc3c2OC2COc4cc(OC)c(OC)cc4C2C3=O)O1',
    'ClC(Cl)=C(c1ccc(Cl)cc1)c1ccc(Cl)cc1'
]

mols = [Chem.MolFromSmiles(smi) for smi in smiles]
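`Chem.MolFromSmiles` returns `None` when a SMILES string cannot be parsed, so with larger lists it can help to separate the failures before going further. A minimal sketch, using a short hypothetical list in which `'C1CC'` (an unclosed ring) is deliberately invalid:

```python
from rdkit import Chem

# Hypothetical input; the last entry is intentionally malformed.
candidates = ['CCO', 'c1ccccc1', 'C1CC']

# Keep each SMILES paired with its parse result so failures can be traced.
parsed = [(smi, Chem.MolFromSmiles(smi)) for smi in candidates]

valid_mols = [mol for _, mol in parsed if mol is not None]
failed = [smi for smi, mol in parsed if mol is None]

print(len(valid_mols), failed)
```

RDKit also logs a parse error for each invalid string, which is usually enough to see what went wrong.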

Draw molecules in a grid:

from rdkit.Chem import Draw

Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200))
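The grid can also carry a label under each structure through the `legends` argument. The sketch below uses four simple, illustrative molecules and a hypothetical output file name (`mol_grid.png`). It assumes the call returns a PIL image, which is typically the case outside a notebook, and skips saving otherwise.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Illustrative molecules and labels, not taken from the list above.
names = ['ethanol', 'benzene', 'acetic acid', 'pyridine']
smiles_list = ['CCO', 'c1ccccc1', 'CC(=O)O', 'c1ccncc1']
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

# One legend per molecule, in the same order as the molecules.
img = Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200), legends=names)

# In a notebook the result may be an IPython image without a save() method,
# so only save when the returned object supports it.
if hasattr(img, 'save'):
    img.save('mol_grid.png')  # hypothetical file name
```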

Using PandasTools to allow molecule objects in dataframes:

import pandas as pd
from rdkit.Chem import PandasTools

url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'

df = pd.read_csv(url)

PandasTools.AddMoleculeColumnToFrame(df, smilesCol='smiles')

This adds a column (named ROMol by default) to the dataframe, containing an rdchem.Mol object for each row.

To draw the structures in a grid:

PandasTools.FrameToGridImage(df.head(8), legendsCol='logSolubility', molsPerRow=4)

To add new property columns, use the pandas map method:

df['n_Atoms'] = df['ROMol'].map(lambda x: x.GetNumAtoms())
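The same pattern extends to other per-molecule properties. Here is a minimal sketch on a tiny hypothetical table, so it runs without downloading anything. The `MolWt` and `LogP` column names are illustrative choices, while `Descriptors.MolWt` and `Descriptors.MolLogP` are RDKit functions.

```python
import pandas as pd
from rdkit.Chem import Descriptors, PandasTools

# Tiny hypothetical table; the 'smiles' column name mirrors the example above.
demo = pd.DataFrame({'smiles': ['CCO', 'c1ccccc1', 'CC(=O)O']})
PandasTools.AddMoleculeColumnToFrame(demo, smilesCol='smiles')

# Each map call applies one function to every Mol in the ROMol column.
demo['n_Atoms'] = demo['ROMol'].map(lambda m: m.GetNumAtoms())
demo['MolWt'] = demo['ROMol'].map(Descriptors.MolWt)
demo['LogP'] = demo['ROMol'].map(Descriptors.MolLogP)

print(demo[['smiles', 'n_Atoms', 'MolWt', 'LogP']])
```

If some SMILES fail to parse, the corresponding ROMol entries will be empty and should be dropped before mapping.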

Computing descriptors/fingerprints

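RDKit has a variety of built-in functionality for generating molecular fingerprints and descriptors. The example below uses the descriptastorus package to compute the RDKit 2D descriptors for every molecule in the dataframe: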

from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
#https://github.com/bp-kelley/descriptastorus

generator = MakeGenerator(("RDKit2D",))

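# process() returns one list per molecule; the first element is dropped here
# because it is a status flag rather than a descriptor value
rdkit2d = [generator.process(smi)[1:] for smi in df['smiles']]

# Column names from the generator; the first one belongs to the status flag
rdkit2d_name = [name for name, numpy_type in generator.GetColumns()]

rdkit2d_df = pd.DataFrame(rdkit2d, index=df.index, columns=rdkit2d_name[1:])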

rdkit2d_df.shape
# (8221, 200)
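Fingerprints can be generated with RDKit alone. The following is a minimal sketch of Morgan (circular) fingerprints, assuming a reasonably recent RDKit release that provides the `rdFingerprintGenerator` module. The radius of 2 and size of 2048 bits are common but arbitrary choices, and the two alcohols are illustrative molecules.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# radius and fpSize are illustrative settings, not values from this page.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

mol_a = Chem.MolFromSmiles('CCO')    # ethanol
mol_b = Chem.MolFromSmiles('CCCO')   # 1-propanol

# Bit-vector fingerprints of fixed length fpSize
fp_a = morgan_gen.GetFingerprint(mol_a)
fp_b = morgan_gen.GetFingerprint(mol_b)

# Tanimoto similarity ranges from 0 (no shared bits) to 1 (identical bit sets)
similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)
print(fp_a.GetNumBits(), round(similarity, 3))
```

The same generator can be applied to each entry of a dataframe's ROMol column to build a fingerprint matrix for modeling.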