MACCS fingerprints
A chemical fingerprint is a list of binary values (0 or 1) which characterize a molecule. There are several ways to create the list. Here we describe the widely used MACCS (Molecular ACCess System) keys.
The MACCS keys are a set of questions about a chemical structure, for instance:
Are there fewer than 3 oxygens?
Is there a S-S bond?
Is there a ring of size 4?
Is at least one F, Cl, Br, or I present?
The result of this is a list of binary values – either true (1) or false (0). This list of values for a given chemical structure is called the MACCS key fingerprint for that structure.
Here’s an example. If the molecule is cyclobutane (SMILES C1CCC1), then the answers to those questions are:
0 oxygens < 3 oxygens → True
no S-S bond → False
there is a ring of size 4 → True
there are no halogens → False
The answers are frequently written as a list of bits (also called a bitstring). The bitstring for this molecule is 1010.
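The worked example above can be sketched in plain Python. This is only an illustration: the `molecule` dictionary is a hypothetical, hand-written description of cyclobutane (its atoms, bonds, and ring sizes are typed in by hand, not derived from the SMILES string), and the four functions are simplified stand-ins for the real MACCS key definitions.

```python
# Toy, hand-written description of cyclobutane (C1CCC1). A real toolkit
# would derive atoms, bonds, and rings from the SMILES string instead.
molecule = {
    "atoms": ["C", "C", "C", "C"],
    "bonds": [(0, 1), (1, 2), (2, 3), (3, 0)],  # pairs of atom indices
    "ring_sizes": [4],
}

HALOGENS = {"F", "Cl", "Br", "I"}


def fewer_than_3_oxygens(mol):
    return mol["atoms"].count("O") < 3


def has_s_s_bond(mol):
    atoms = mol["atoms"]
    return any(atoms[i] == "S" and atoms[j] == "S" for i, j in mol["bonds"])


def has_ring_of_size_4(mol):
    return 4 in mol["ring_sizes"]


def has_halogen(mol):
    return any(atom in HALOGENS for atom in mol["atoms"])


# Ask the questions in a fixed order; each answer becomes one bit.
questions = [fewer_than_3_oxygens, has_s_s_bond, has_ring_of_size_4, has_halogen]
answers = [question(molecule) for question in questions]
bitstring = "".join("1" if answer else "0" for answer in answers)
print(bitstring)  # 1010
```

The order of the questions is fixed, so the same position in the bitstring always refers to the same question, which is what makes fingerprints of different molecules comparable.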
RDKit implements the 166 public MACCS keys (fragment definitions). The resulting binary fingerprint answers 166 fragment-related questions, one bit per question. RDKit returns it as a 167-bit vector: bit 0 is unused, so that bit i corresponds to key i.
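As a quick check of the fingerprint length, here is a minimal sketch for a single molecule, assuming RDKit is installed. It reuses cyclobutane from the example above; the variable names are only illustrative.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("C1CCC1")  # cyclobutane
fp = MACCSkeys.GenMACCSKeys(mol)    # an RDKit bit vector

num_bits = fp.GetNumBits()      # length of the bit vector
on_bits = list(fp.GetOnBits())  # indices of the keys answered "yes"
bitstring = fp.ToBitString()    # the fingerprint as a string of 0s and 1s

print(num_bits)
print(on_bits)
```

The indices in `on_bits` are the MACCS key numbers that matched the molecule; index 0 never appears because that bit is not tied to any key.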
Computing MACCS keys
In the following we assume you have a dataframe (df) with a column (df['ROMol']) containing the molecules as rdchem.Mol objects.
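import pandas as pd
from rdkit.Chem import MACCSkeys

# One 167-bit fingerprint per molecule (bit 0 is unused; keys are bits 1-166)
maccs = [MACCSkeys.GenMACCSKeys(mol) for mol in df['ROMol']]
# Convert each bit vector to a plain list of 0/1 integers
maccs_lists = [list(fp) for fp in maccs]
# Column names MACCS_0 ... MACCS_166
maccs_name = [f'MACCS_{i}' for i in range(167)]
maccs_df = pd.DataFrame(maccs_lists, index=df.index, columns=maccs_name)
maccs_df.shape
# (8221, 167)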
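The snippet above relies on an existing df. To try it end to end, the following sketch builds a tiny dataframe first. The `SMILES` column, the three example molecules, and the helper name `maccs_frame` are assumptions made for illustration; they are not part of the original dataset.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import MACCSkeys


def maccs_frame(df, mol_col="ROMol"):
    """Return a dataframe of MACCS bits, one row per molecule in df[mol_col]."""
    fps = [MACCSkeys.GenMACCSKeys(mol) for mol in df[mol_col]]
    columns = [f"MACCS_{i}" for i in range(167)]
    return pd.DataFrame([list(fp) for fp in fps], index=df.index, columns=columns)


# Hypothetical input: three small molecules given as SMILES strings.
df = pd.DataFrame({"SMILES": ["C1CCC1", "CCO", "ClCCl"]})
df["ROMol"] = df["SMILES"].apply(Chem.MolFromSmiles)

maccs_df = maccs_frame(df)
print(maccs_df.shape)  # (3, 167)
```

Keeping the index of the input dataframe makes it straightforward to join the fingerprint columns back onto the original table. Note that `Chem.MolFromSmiles` returns None for a SMILES string it cannot parse, so real data may need such rows filtered out before the fingerprints are computed.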