Molecular modeling made easy
datamol.io is an open-source toolkit that simplifies molecular processing and featurization workflows for ML scientists in drug discovery.
Already used by scientists in leading organizations
Discover the next generation of open-source tools for molecular modeling
Accelerate molecular processing workflows
Datamol is an elegant, RDKit-powered Python library optimized for molecular machine learning workflows.
Highly intuitive
A familiar Pythonic API with good defaults by design. Get started in one line.
Start Now
Powerful
Seamlessly integrated with RDKit to support you in every step. Built-in parallelization to accelerate your workflows.
Experience Efficiency
Modern I/O
Read and write multiple formats (SDF, XLSX, CSV) out of the box.
Try Now
import datamol as dm
# Common functions
mol = dm.to_mol("O=C(C)Oc1ccccc1C(=O)O", sanitize=True)
fp = dm.to_fp(mol)
selfies = dm.to_selfies(mol)
inchi = dm.to_inchi(mol)
# Standardize and sanitize
mol = dm.to_mol("O=C(C)Oc1ccccc1C(=O)O")
mol = dm.fix_mol(mol)
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(mol)
# Generate conformers
smiles = "O=C(C)Oc1ccccc1C(=O)O"
mol = dm.to_mol(smiles)
mol_with_conformers = dm.conformers.generate(mol)
# Easy IO
mols = dm.read_sdf("s3://my-awesome-data-lake/smiles.sdf", as_df=False)
dm.to_sdf(mols, "gs://data-bucket/smiles.sdf")
An open-source hub of molecular featurizers
Spending too much time searching for the right featurizer? Don’t know which featurizers are most effective? Molfeat makes it easy to evaluate and implement a wide range of featurizers directly into your workflow.
Incredibly simple
You don't need much to get started with molfeat.
Start Now
Unrivaled diversity of featurizers
Descriptors, 2D/3D pharmacophores, graph featurization. You name it, we have it.
Explore
Extendable
Something missing? Contribute your own featurizer.
Contribute
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from molfeat.store.modelstore import ModelStore
# Load some dummy data
data = dm.data.freesolv().sample(100).smiles.values
# Featurize a single molecule
calc = FPCalculator("ecfp")
calc(data[0])
# Define a parallelized featurization pipeline
mol_transf = MoleculeTransformer(calc, n_jobs=-1)
mol_transf(data)
# Easily save and load featurizers
mol_transf.to_state_yaml_file("state_dict.yml")
mol_transf = MoleculeTransformer.from_state_yaml_file("state_dict.yml")
mol_transf(data)
# List all available featurizers
store = ModelStore()
store.available_models
# Find a featurizer and learn how to use it
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage()
Filter by medicinal chemistry rules
Want to apply constraints intelligently to prioritize more drug-like molecules? Medchem provides a simple, uniform way to filter molecules the way medicinal chemists do.
Consistent application
Access most common alerts, filters, and rules through one API.
Start Now
Apply Eli Lilly, Novartis rules and more
More than 20 different rules for you to consider.
Explore
Quickly triage compounds at scale
Run in parallel in processes or threads.
Accelerate your search
import datamol as dm
import pandas as pd
import medchem as mc
# Create the filter object
rfilter = mc.rules.RuleFilters(
    # You can specify a rule as a string or as a callable
    rule_list=["rule_of_five", "rule_of_oprea", "rule_of_cns", "rule_of_leadlike_soft"],
    # You can specify a custom list of names
    rule_list_names=["rule_of_five", "rule_of_oprea", "rule_of_cns", "rule_of_leadlike_soft"],
)
# Load a dataset
data = dm.data.solubility()
# Apply rule filters on a list of molecules
results = rfilter(
    mols=data["mol"].tolist(),
    n_jobs=-1,
    progress=True,
    progress_leave=True,
    scheduler="auto",
    keep_props=False,
    fail_if_invalid=True,
)
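For intuition about what a rule such as `rule_of_five` checks, and how a rule can be applied across threads, here is a minimal standalone sketch. It assumes the commonly quoted Lipinski thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and uses hand-written, hypothetical descriptor values. It does not call medchem, and medchem's own implementation may differ in detail.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one molecule (hypothetical values below)."""
    name: str
    mw: float    # molecular weight in Da
    logp: float  # octanol-water partition coefficient
    n_hbd: int   # number of hydrogen-bond donors
    n_hba: int   # number of hydrogen-bond acceptors


def passes_rule_of_five(d: Descriptors) -> bool:
    # Commonly quoted Lipinski thresholds; an assumption, not medchem's code.
    return d.mw <= 500 and d.logp <= 5 and d.n_hbd <= 5 and d.n_hba <= 10


# Illustrative inputs only; these are not measured data.
molecules = [
    Descriptors("small_polar", mw=180.2, logp=1.2, n_hbd=1, n_hba=4),
    Descriptors("large_greasy", mw=720.9, logp=6.8, n_hbd=2, n_hba=9),
    Descriptors("many_donors", mw=410.5, logp=0.3, n_hbd=7, n_hba=10),
]

# Executor.map preserves input order, so flags line up with `molecules`.
with ThreadPoolExecutor(max_workers=2) as pool:
    flags = list(pool.map(passes_rule_of_five, molecules))

kept = [m.name for m, ok in zip(molecules, flags) if ok]
print(kept)
```

Only the first entry satisfies all four conditions here; the second fails on weight and logP, the third on donor count.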
Evaluate your models meaningfully
What does a good model look like for chemistry and biology? Splito provides powerful methods for splitting datasets in a way that reflects your downstream application.
Efficient processing
Split your dataset with only two lines.
Start Now
Get better generalization
Compare splitting methods to see how representative they are.
Compare splits
Explore chemical space
Easily visualize your train and test distribution.
Get Intuitions
import datamol as dm
from splito import ScaffoldSplit
# Load some data
data = dm.data.chembl_drugs()
# Initialize a splitter
splitter = ScaffoldSplit(smiles=data["smiles"].tolist(), n_jobs=-1, test_size=0.2, random_state=111)
# Generate indices for training set and test set
train_idx, test_idx = next(splitter.split(X=data.smiles.values))
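The idea behind a scaffold split can be sketched without any chemistry library: molecules that share a scaffold are kept together, so no scaffold appears in both the training and the test set. The sketch below assumes scaffolds have already been computed and uses hypothetical scaffold labels. The `scaffold_split` helper is illustrative only and is not splito's implementation, which may assign groups differently.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, test_size=0.2, seed=111):
    """Split indices so that each scaffold group lands entirely in one set.

    `scaffolds` holds one precomputed scaffold label per molecule.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Shuffle whole groups (sorted first so the result is reproducible).
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test_target = round(test_size * len(scaffolds))
    train_idx, test_idx = [], []
    for key in keys:
        # Add whole groups to the test set until it reaches the target size.
        if len(test_idx) < n_test_target:
            test_idx.extend(groups[key])
        else:
            train_idx.extend(groups[key])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["benzene", "benzene", "pyridine", "indole", "indole",
             "indole", "pyridine", "furan", "benzene", "furan"]
train_idx, test_idx = scaffold_split(scaffolds, test_size=0.2)

train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
print(train_scaffolds & test_scaffolds)  # empty: no scaffold is shared
```

Because groups are never broken up, the test set can end up slightly larger than `test_size` suggests; that trade-off is what keeps the evaluation free of scaffold overlap.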