| Title: | Task-Oriented Cheminformatics in R Using 'RDKit' via 'Python' |
|---|---|
| Description: | A task-oriented R interface to the 'RDKit' <https://www.rdkit.org> library through its 'Python' API via 'reticulate'. The package offers high-level cheminformatics functionality, including molecule parsing, descriptor calculation, and fingerprint generation without replicating the native structure of 'RDKit'. |
| Authors: | Andrey Samokhin [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-0223-6087>) |
| Maintainer: | Andrey Samokhin <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.1 |
| Built: | 2026-06-04 07:32:52 UTC |
| Source: | https://github.com/andreysamokhin/rdkitpyr |
Used in examples and tests to determine whether the Python module
rdkit is available via reticulate.
This function is exported to support examples and tests. It is not part of the stable user-facing API and may change without notice.
The result is cached for the duration of the R session to avoid repeated
calls to reticulate::py_module_available().
.IsRdkitAvailable(initialize = TRUE)
initialize |
A logical value. If |
A logical value indicating whether the rdkit Python module is
available.
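A minimal sketch of how this check might be used to guard code that needs RDKit. The package name `rdkitpyr` is an assumption inferred from the repository URL above; it is not stated explicitly in this manual.

```r
# Assumption: the package is installed under the name 'rdkitpyr'
# (inferred from the repository URL, not confirmed by this manual).
has_rdkit <- requireNamespace("rdkitpyr", quietly = TRUE) &&
  isTRUE(rdkitpyr::.IsRdkitAvailable())

if (has_rdkit) {
  # Safe to call functions that need the Python module
  print(rdkitpyr::CalculateExactMass("CCO"))
} else {
  message("RDKit is not available; skipping the example.")
}
```

Because `&&` short-circuits, the package function is only evaluated when the package namespace can actually be loaded.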
Calculate all molecular descriptors available in RDKit for a set of molecules.
The descriptors are calculated using the CalcMolDescriptors()
function from the rdkit.Chem.Descriptors module in RDKit.
The set of returned descriptors may depend on the installed RDKit version.
Each molecule is represented by a full set of descriptor values returned
as a data frame. Invalid molecules are represented by rows containing
NA values. Row order is preserved so that the output aligns with
the input.
CalculateAllDescriptors(mols, verbose = FALSE)
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
verbose |
A logical value. If |
A data frame with one row per molecule and one column per descriptor.
Elements corresponding to invalid molecules are returned as NA.
Original RDKit descriptor names are used.
Additionally, the "valid" attribute is attached to indicate which
molecules were successfully processed.
    # Calculate all RDKit descriptors
    smiles <- c("CCO", "c1ccccc1", "invalid_molecule")
    desc <- CalculateAllDescriptors(smiles)

    # Inspect names of first three descriptors
    names(desc)[1:3]
    #> "MaxAbsEStateIndex" "MaxEStateIndex" "MinAbsEStateIndex"

    # Display molecular weight, LogP, TPSA, and molar refractivity
    desc[c("MolWt", "MolLogP", "TPSA", "MolMR")]
    #>    MolWt MolLogP  TPSA   MolMR
    #> 1 46.069 -0.0014 20.23 12.7598
    #> 2 78.114  1.6866  0.00 26.4420
    #> 3     NA      NA    NA      NA

    # Check which molecules were successfully processed
    attr(desc, "valid")
    #> TRUE TRUE FALSE
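The "valid" attribute can be used to drop rows for molecules that failed to parse. The sketch below uses a hypothetical helper, `KeepValidRows()`, which is not part of the package, and a small synthetic data frame standing in for real descriptor output.

```r
# Hypothetical helper (not part of the package): keep only the rows that
# correspond to successfully processed molecules.
KeepValidRows <- function(x) {
  valid <- attr(x, "valid")
  stopifnot(is.logical(valid), length(valid) == nrow(x))
  x[valid, , drop = FALSE]
}

# Synthetic stand-in for the output of CalculateAllDescriptors()
desc <- data.frame(MolWt = c(46.069, 78.114, NA),
                   TPSA  = c(20.23, 0.00, NA))
attr(desc, "valid") <- c(TRUE, TRUE, FALSE)

clean <- KeepValidRows(desc)
nrow(clean)
#> 2
```

The attribute is read before subsetting, so the helper does not rely on custom attributes surviving the `[` operation. The same helper should also work for the fingerprint matrices described below, since they carry the same attribute.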
Calculate the exact mass (monoisotopic mass) for a set of molecules.
The calculation is performed using the ExactMolWt() function from
the rdkit.Chem.Descriptors module in RDKit.
CalculateExactMass(mols, verbose = FALSE)
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
verbose |
A logical value. If |
A numeric vector containing the exact mass for each molecule.
Elements corresponding to invalid molecules are returned as NA.
    # Calculate exact mass for a set of molecules
    smiles <- c("CCO", "c1ccccc1", "invalid_molecule")
    CalculateExactMass(smiles)
    #> 46.04186 78.04695 NA
Calculate MACCS (Molecular ACCess System) fingerprints for a set of molecules. Each fingerprint is a fixed-length binary vector representing the presence or absence of predefined structural features.
The fingerprints are calculated using the GetMACCSKeysFingerprint()
function from the rdkit.Chem.rdMolDescriptors module in RDKit.
Invalid molecules are represented by NA vectors. Row order is
preserved so that the output aligns with the input.
CalculateMaccsFingerprints(mols, verbose = FALSE)
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
verbose |
A logical value. If |
A matrix of integers (0 or 1) with one row per molecule and 167 columns
(MACCS keys). Rows corresponding to invalid molecules contain NA.
Additionally, the "valid" attribute is attached to indicate which
molecules were successfully processed.
    # Calculate MACCS fingerprints
    smiles <- c("CCO", "c1ccccc1", "invalid_molecule")
    fps <- CalculateMaccsFingerprints(smiles)

    # Get the number of fingerprints (columns)
    ncol(fps)
    #> 167

    # Check which molecules were successfully processed
    attr(fps, "valid")
    #> TRUE TRUE FALSE
Calculate the molecular weight (average mass) for a set of molecules.
The calculation is performed using the MolWt() function from the
rdkit.Chem.Descriptors module in RDKit.
CalculateMolecularWeight(mols, verbose = FALSE)
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
verbose |
A logical value. If |
A numeric vector containing the molecular weight for each molecule.
Elements corresponding to invalid molecules are returned as NA.
    # Calculate average molecular weight for a set of molecules
    smiles <- c("CCO", "c1ccccc1", "invalid_molecule")
    CalculateMolecularWeight(smiles)
    #> 46.069 78.114 NA
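The difference between exact (monoisotopic) mass and average molecular weight can be illustrated with plain arithmetic. The sketch below does not call RDKit. It sums rounded atomic masses for ethanol and benzene, on the assumption that this precision is enough to reproduce the values shown in the examples above.

```r
# Atom counts for ethanol (C2H6O) and benzene (C6H6)
counts <- rbind(ethanol = c(C = 2, H = 6, O = 1),
                benzene = c(C = 6, H = 6, O = 0))

# Mass of the most abundant isotope of each element (12C, 1H, 16O), in Da
monoisotopic <- c(C = 12.000000, H = 1.00782503, O = 15.99491462)
# Standard (abundance-weighted) atomic weights, in Da
average <- c(C = 12.011, H = 1.008, O = 15.999)

exact_mass <- drop(counts %*% monoisotopic)
mol_weight <- drop(counts %*% average)

round(exact_mass, 5)
#> 46.04186 78.04695
round(mol_weight, 3)
#> 46.069 78.114
```

The exact mass is what a high-resolution mass spectrometer measures for the lightest isotopologue, while the average weight is the quantity used for stoichiometry.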
Calculate Morgan (circular) fingerprints for a set of molecules.
The fingerprints are calculated using the GetMorganGenerator()
function from the rdkit.Chem.rdFingerprintGenerator module in RDKit.
Invalid molecules are represented by NA vectors. Row order is
preserved so that the output aligns with the input.
CalculateMorganFingerprints( mols, radius = 3L, fp_size = 2048L, count_simulation = FALSE, include_chirality = FALSE, use_bond_types = TRUE, include_ring_membership = TRUE, verbose = FALSE )
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
radius |
An integer value. Bond radius defining the size of circular substructures. |
fp_size |
An integer value. Number of bits in the fingerprint. |
count_simulation |
A logical value. If set, use count simulation while generating the fingerprint. |
include_chirality |
A logical value. If set, chirality information will be added to the generated fingerprint. |
use_bond_types |
A logical value. If set, bond types will be included as a part of the default bond invariants. |
include_ring_membership |
A logical value. If set, the ring membership of each atom is included in the atom invariants. |
verbose |
A logical value. If |
A matrix of integers (0 or 1) with one row per molecule and fp_size
columns. Rows corresponding to invalid molecules contain NA.
Additionally, the "valid" attribute is attached to indicate which
molecules were successfully processed.
    # Calculate Morgan fingerprints
    smiles <- c("CCO", "c1ccccc1", "invalid_molecule")
    fps <- CalculateMorganFingerprints(smiles)

    # Get the number of fingerprints (columns)
    ncol(fps)
    #> 2048

    # Check which molecules were successfully processed
    attr(fps, "valid")
    #> TRUE TRUE FALSE
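A common next step with binary fingerprints is pairwise Tanimoto similarity. The sketch below is base R only and uses a hypothetical helper, `TanimotoMatrix()`, which is not part of the package. It assumes the matrix layout described above: one row per molecule, 0/1 entries, and all-NA rows for invalid molecules.

```r
# Hypothetical helper (not part of the package): Tanimoto similarity
# between all pairs of rows of a binary fingerprint matrix.
TanimotoMatrix <- function(fps) {
  valid <- stats::complete.cases(fps)      # rows without NA
  bits <- fps[valid, , drop = FALSE]
  common <- tcrossprod(bits)               # bits set in both fingerprints
  n_set <- rowSums(bits)                   # bits set in each fingerprint
  either <- outer(n_set, n_set, "+") - common
  sim <- matrix(NA_real_, nrow(fps), nrow(fps))
  sim[valid, valid] <- common / either     # NaN if both fingerprints are empty
  sim
}

# Synthetic 4-bit fingerprints; the third row mimics an invalid molecule
fps <- rbind(c(1L, 1L, 0L, 0L),
             c(1L, 0L, 1L, 0L),
             c(NA, NA, NA, NA))
sim <- TanimotoMatrix(fps)
sim[1, 2]
#> 0.3333333
```

Rows and columns for invalid molecules stay NA, so the similarity matrix keeps the same alignment with the input as the fingerprint matrix itself.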
Calculate RDKit topological (path-based) fingerprints for a set of molecules.
The fingerprints are calculated using the RDKFingerprint() function
from the rdkit.Chem.rdmolops module in RDKit.
Invalid molecules are represented by NA vectors. Row order is
preserved so that the output aligns with the input.
CalculateRdkitFingerprints( mols, min_path = 1L, max_path = 7L, fp_size = 2048L, n_bits_per_hash = 2L, use_hydrogens = TRUE, target_density = 0, min_size = 128L, branched_paths = TRUE, use_bond_order = TRUE, verbose = FALSE )
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
min_path |
An integer value. Minimum number of bonds to include in the subgraphs. |
max_path |
An integer value. Maximum number of bonds to include in the subgraphs. |
fp_size |
An integer value. Number of bits in the fingerprint. |
n_bits_per_hash |
An integer value. Number of bits to set per path. |
use_hydrogens |
A logical value. Include paths involving hydrogens in the fingerprint if the molecule has explicit hydrogens. |
target_density |
A numeric value. Fold the fingerprint until this minimum density has been reached. |
min_size |
An integer value. The minimum size the fingerprint will be folded to when
trying to reach |
branched_paths |
A logical value. If set, both branched and unbranched paths will be used in the fingerprint. |
use_bond_order |
A logical value. If set, bond orders will be used in the path hashes. |
verbose |
A logical value. If |
A matrix of integers (0 or 1) with one row per molecule and fp_size
columns. Rows corresponding to invalid molecules contain NA.
Additionally, the "valid" attribute is attached to indicate which
molecules were successfully processed.
    # Calculate RDKit fingerprints
    smiles <- c("CCO", "c1ccccc1", "invalid_molecule")
    fps <- CalculateRdkitFingerprints(smiles)

    # Get the number of fingerprints (columns)
    ncol(fps)
    #> 2048

    # Check which molecules were successfully processed
    attr(fps, "valid")
    #> TRUE TRUE FALSE
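The interplay of target_density and min_size can be pictured with a small base R sketch of fingerprint folding. This illustrates the general idea only (halve the vector and OR the two halves until the bit density is high enough or the size floor is reached). It is not RDKit's actual implementation, and `FoldOnce()` and `FoldToDensity()` are hypothetical helpers.

```r
# Hypothetical helpers (not part of the package or of RDKit).
# Fold a 0/1 vector of even length in half by OR-ing the two halves.
FoldOnce <- function(bits) {
  half <- length(bits) %/% 2L
  as.integer(bits[seq_len(half)] | bits[half + seq_len(half)])
}

# Keep folding while the density is below the target and the folded
# vector would not become shorter than min_size.
FoldToDensity <- function(bits, target_density, min_size) {
  while (mean(bits) < target_density && length(bits) %/% 2L >= min_size) {
    bits <- FoldOnce(bits)
  }
  bits
}

bits <- integer(16)
bits[c(1, 4, 9)] <- 1L               # density 3/16
folded <- FoldToDensity(bits, target_density = 0.25, min_size = 4L)
length(folded)
#> 8
mean(folded)
#> 0.25
```

Bits 1 and 9 collide after one fold, which is why the folded vector has two set bits rather than three. With the default target_density = 0 no folding is triggered, because the density can never fall below zero.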
Convert molecules to InChI strings.
InChI identifiers can also be provided as input, in which case the output is expected to be identical to the input. This is mainly useful for testing RDKit's round-trip consistency on challenging molecules.
ConvertToInchi(mols, verbose = FALSE)
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
verbose |
A logical value. If |
A character vector. InChI strings. Elements that cannot be converted are
returned as NA.
    # Convert a vector of SMILES to InChI identifiers
    smiles <- c("CC", "CCC")
    ConvertToInchi(smiles)
    #> "InChI=1S/C2H6/c1-2/h1-2H3"
    #> "InChI=1S/C3H8/c1-3-2/h3H2,1-2H3"

    # Providing InChI as input returns identical output
    ConvertToInchi("InChI=1S/CH4/h1H4")
    #> "InChI=1S/CH4/h1H4"
Convert molecules to InChIKey strings.
Conversion of an InChI string to an InChIKey relies on the IUPAC library, allowing conversion without creating intermediate RDKit Mol objects.
ConvertToInchikey(mols, verbose = FALSE)
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
verbose |
A logical value. If |
A character vector. InChIKey strings. Elements that cannot be converted are
returned as NA.
    # Convert a vector of InChI to InChIKey identifiers
    inchi <- c("InChI=1S/C2H6/c1-2/h1-2H3", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")
    ConvertToInchikey(inchi)
    #> "OTMSDBZUPAUEDD-UHFFFAOYSA-N"
    #> "UHOVQNZJYSORNB-UHFFFAOYSA-N"

    # Convert a vector of SMILES to InChIKey identifiers
    smiles <- c("CC", "c1ccccc1")
    ConvertToInchikey(smiles)
    #> "OTMSDBZUPAUEDD-UHFFFAOYSA-N"
    #> "UHOVQNZJYSORNB-UHFFFAOYSA-N"
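InChIKeys are often used to find duplicate structures. The base R sketch below works on a plain character vector of keys, such as the output of ConvertToInchikey(), and treats NA (failed conversions) as never duplicated. The variable names are illustrative only.

```r
# Keys as they might be returned for four inputs; NA marks a failed conversion
keys <- c("OTMSDBZUPAUEDD-UHFFFAOYSA-N",
          "UHOVQNZJYSORNB-UHFFFAOYSA-N",
          NA,
          "OTMSDBZUPAUEDD-UHFFFAOYSA-N")

# Flag repeated structures, ignoring failed conversions
is_duplicate <- !is.na(keys) & duplicated(keys)
is_duplicate
#> FALSE FALSE FALSE TRUE

# The first 14 characters hash the molecular skeleton (connectivity), so
# comparing them groups molecules that differ only in later layers
# such as stereochemistry or isotopic labeling.
skeleton <- substr(keys, 1L, 14L)
skeleton[1]
#> "OTMSDBZUPAUEDD"
```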
Convert molecules to SMILES strings.
SMILES strings can be provided as input to obtain their canonical form or to remove stereochemistry.
ConvertToSmiles( mols, isomeric = TRUE, kekule = FALSE, canonical = TRUE, explicit_bonds = FALSE, explicit_hydrogens = FALSE, verbose = FALSE )
mols |
A character vector of SMILES or InChI strings, or a list of RDKit Mol objects. |
isomeric |
A logical value. If |
kekule |
A logical value. If |
canonical |
A logical value. If |
explicit_bonds |
A logical value. If |
explicit_hydrogens |
A logical value. If |
verbose |
A logical value. If |
A character vector. SMILES strings. Elements that cannot be converted are
returned as NA.
    # Convert a vector of InChI identifiers to canonical SMILES
    inchi <- c("InChI=1S/C2H6/c1-2/h1-2H3", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")
    ConvertToSmiles(inchi)
    #> "CC"
    #> "c1ccccc1"

    # Convert a vector of SMILES to SMILES with Kekulized aromatic bonds
    smiles <- c("c1ccccc1", "c1ccc2ccccc2c1")
    ConvertToSmiles(smiles, kekule = TRUE)
    #> "C1=CC=CC=C1"
    #> "C1=CC=C2C=CC=CC2=C1"
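Canonicalization makes it possible to recognize different spellings of the same structure. The sketch below groups input SMILES by their canonical form. It assumes the package is installed under the name `rdkitpyr` (inferred from the repository URL) and skips the RDKit call when it is not available.

```r
# Assumption: the package is installed under the name 'rdkitpyr'.
has_rdkit <- requireNamespace("rdkitpyr", quietly = TRUE) &&
  isTRUE(rdkitpyr::.IsRdkitAvailable())

# Three spellings of ethanol and one of benzene
smiles <- c("CCO", "OCC", "C(O)C", "c1ccccc1")

if (has_rdkit) {
  canonical <- rdkitpyr::ConvertToSmiles(smiles)
  # Inputs that share a canonical string describe the same structure
  print(split(smiles, canonical))
} else {
  message("RDKit is not available; skipping the example.")
}
```

Setting isomeric = FALSE in the same call would additionally merge inputs that differ only in stereochemistry, as noted in the description above.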
Get details about the Python interpreter and the numpy and rdkit
packages.
GetPythonInfo(verbose = TRUE)
verbose |
A logical value. If |
Invisibly return a named list with the following components:
Full path to the Python executable.
Version of the Python interpreter.
Installed version of the numpy package.
Installed version of the rdkit package.
If applicable, indicates whether the Python
interpreter was forced via RETICULATE_PYTHON, use_*(), or
py_require().
    # Print information about the Python environment
    GetPythonInfo()

    # Access programmatically
    py_env <- GetPythonInfo(verbose = FALSE)
    py_env$python_version
    py_env$rdkit_version
Parse SMILES and InChI strings into RDKit Mol objects
This function converts a character vector of molecular representations
(SMILES or InChI) into a list of RDKit Mol objects (Python-backed pointers
via reticulate). The resulting objects can be reused in subsequent
operations without repeated conversion from SMILES or InChI. This is
particularly useful when multiple cheminformatics tasks are performed on
the same set of molecules, improving efficiency by avoiding repeated
parsing steps.
ParseMolecules(mols, verbose = FALSE)
mols |
A character vector of SMILES or InChI strings. |
verbose |
A logical value. If |
A list of RDKit Mol objects.
    # Convert a vector of SMILES to RDKit Mol objects
    mols <- ParseMolecules(c("CC", "CCC"))
    print(mols[[1L]])
    #> <rdkit.Chem.rdchem.Mol object at 0x000001CC4D60F4C0>

    # Convert a list of RDKit Mol objects to InChI identifiers
    ConvertToInchi(mols)
    #> "InChI=1S/C2H6/c1-2/h1-2H3"
    #> "InChI=1S/C3H8/c1-3-2/h3H2,1-2H3"
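The parse-once, reuse-many pattern described above might look as follows. This is a sketch that assumes the package is installed under the name `rdkitpyr` (inferred from the repository URL). It only combines functions documented in this manual.

```r
# Assumption: the package is installed under the name 'rdkitpyr'.
has_rdkit <- requireNamespace("rdkitpyr", quietly = TRUE) &&
  isTRUE(rdkitpyr::.IsRdkitAvailable())

input <- c("CCO", "c1ccccc1")

if (has_rdkit) {
  # Parse the strings a single time ...
  mols <- rdkitpyr::ParseMolecules(input)
  # ... and reuse the Mol objects for several tasks
  result <- data.frame(
    input      = input,
    smiles     = rdkitpyr::ConvertToSmiles(mols),
    inchikey   = rdkitpyr::ConvertToInchikey(mols),
    exact_mass = rdkitpyr::CalculateExactMass(mols)
  )
  print(result)
} else {
  message("RDKit is not available; skipping the example.")
}
```

Since the Mol objects are Python-backed pointers, they are presumably only valid within the current R session and would need to be re-created from the original strings after a restart.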
A small reference set of chemical compounds extracted from the PubChem database. Despite its limited size, the dataset covers a broad range of chemical properties, including:
neutral, charged, and radical species;
aromatic and aliphatic compounds;
molecules containing heteroatoms;
isotopically labeled compounds;
stereochemistry;
species with disconnected fragments.
test_compounds
A data frame.
pubchem_cid |
PubChem compound identifier (CID). |
name |
Compound name (as reported in the PubChem database). |
smiles |
SMILES identifier. |
inchi |
InChI identifier. |
inchikey |
InChIKey identifier. |
n_labeled_atoms |
Number of isotopically labeled atoms. |
charge |
The total charge of a molecule. |
exact_mass |
Exact monoisotopic mass. |
molecular_weight |
Average molecular weight. |
formula |
Molecular formula. |
formula_isotopes |
Molecular formula with explicit isotopes. |
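One possible use of this dataset is as a regression check: compute a property with the package and compare it with the stored reference column. The sketch below assumes the package is installed under the name `rdkitpyr` and that the dataset can be loaded with `data()`. Both are assumptions not confirmed by this manual.

```r
# Assumption: the package is installed under the name 'rdkitpyr' and
# ships 'test_compounds' as a loadable dataset.
has_rdkit <- requireNamespace("rdkitpyr", quietly = TRUE) &&
  isTRUE(rdkitpyr::.IsRdkitAvailable())

if (has_rdkit) {
  utils::data("test_compounds", package = "rdkitpyr", envir = environment())
  computed <- rdkitpyr::CalculateExactMass(test_compounds$smiles)
  # Largest absolute deviation from the reference values, ignoring failures
  max_abs_diff <- max(abs(computed - test_compounds$exact_mass), na.rm = TRUE)
  print(max_abs_diff)
} else {
  message("RDKit is not available; skipping the example.")
}
```

The same pattern applies to the inchikey and molecular_weight columns, using ConvertToInchikey() and CalculateMolecularWeight() respectively.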