Molecular modeling made easy

datamol.io is an open-source toolkit that simplifies molecular processing and featurization workflows for ML scientists in drug discovery.


Already used by scientists in leading organizations

Discover the next generation of open-source tools for molecular modeling

datamol

A Python library built on top of RDKit.

With extensive documentation and tons of tutorials, Datamol streamlines your experience for molecular data workflows.

molfeat

An open-source hub of molecular featurizers.

Access a diverse range of molecular featurizers in a single package. Rapidly test and evaluate which featurizer is best for your workflow.

medchem

A library to prioritize compounds at large scale.

Medchem is a Python library that provides a range of medicinal chemistry filters for use cases relevant to drug discovery.

splito

A machine learning dataset splitting library for life sciences.

Splito provides a wide range of chemistry- and biology-specific machine learning splitting algorithms.

Accelerate molecular processing workflows

Datamol is an elegant, RDKit-powered Python library optimized for molecular machine learning workflows.

Highly intuitive
A familiar Pythonic API with good defaults by design. Get started in one line.
Powerful
Seamlessly integrated with RDKit to support you in every step. Built-in parallelization to accelerate your workflows.
Modern I/O
Read and write multiple formats (sdf, xlsx, csv) with out-of-the-box support.

import datamol as dm

# Common functions
mol = dm.to_mol("O=C(C)Oc1ccccc1C(=O)O", sanitize=True)
fp = dm.to_fp(mol)
selfies = dm.to_selfies(mol)
inchi = dm.to_inchi(mol)

# Standardize and sanitize
mol = dm.to_mol("O=C(C)Oc1ccccc1C(=O)O")
mol = dm.fix_mol(mol)
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(mol)

# Generate conformers
smiles = "O=C(C)Oc1ccccc1C(=O)O"
mol = dm.to_mol(smiles)
mol_with_conformers = dm.conformers.generate(mol)

# Easy IO
mols = dm.read_sdf("s3://my-awesome-data-lake/smiles.sdf", as_df=False)
dm.to_sdf(mols, "gs://data-bucket/smiles.sdf")

An open-source hub of molecular featurizers

Spending too much time searching for the right featurizer? Don’t know which featurizers are most effective? Molfeat makes it easy to evaluate and implement a wide range of featurizers directly into your workflow.

Incredibly simple
You don't need much to get started with molfeat.
Unrivaled diversity of featurizers
Descriptors, 2D/3D pharmacophores, graph featurization. You name it, we have it.
Extendable
Something missing? Contribute your own featurizer.

import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from molfeat.store.modelstore import ModelStore

# Load some dummy data
data = dm.data.freesolv().sample(100).smiles.values

# Featurize a single molecule
calc = FPCalculator("ecfp")
calc(data[0])

# Define a parallelized featurization pipeline
mol_transf = MoleculeTransformer(calc, n_jobs=-1)
mol_transf(data)

# Easily save and load featurizers
mol_transf.to_state_yaml_file("state_dict.yml")
mol_transf = MoleculeTransformer.from_state_yaml_file("state_dict.yml")
mol_transf(data)

# List all available featurizers
store = ModelStore()
store.available_models

# Find a featurizer and learn how to use it
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage()

Medchem

Filter by medicinal chemistry rules

Want to apply constraints intelligently to prioritize more drug-like molecules? Medchem provides an easy, uniform way to filter compounds the way medicinal chemists do.

Consistent application
Get most alerts, filters and rules through one API.
Apply Eli Lilly, Novartis rules and more
More than 20 different rules for you to consider.
Quickly triage compounds at scale
Run in parallel using processes or threads.

import datamol as dm
import medchem as mc

# Create the filter object
rfilter = mc.rules.RuleFilters(
    # You can specify a rule as a string or as a callable
    rule_list=["rule_of_five", "rule_of_oprea", "rule_of_cns", "rule_of_leadlike_soft"],
    # You can specify a custom list of names
    rule_list_names=["rule_of_five", "rule_of_oprea", "rule_of_cns", "rule_of_leadlike_soft"],
)

# Load a dataset
data = dm.data.solubility()

# Apply rule filters on a list of molecules
results = rfilter(
    mols=data["mol"].tolist(),
    n_jobs=-1,
    progress=True,
    progress_leave=True,
    scheduler="auto",
    keep_props=False,
    fail_if_invalid=True,
)
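
Curious what a rule such as "rule_of_five" actually checks? The sketch below applies the classic Lipinski thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) to descriptor values typed in by hand, and runs the check in a small thread pool. It is an illustration only and not Medchem's implementation: the `Descriptors` container, the compound names, and all numeric values are made up for this example, and it uses the strict variant in which all four conditions must hold.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one molecule (hypothetical container)."""

    name: str
    mw: float    # molecular weight in Da
    logp: float  # octanol-water partition coefficient
    n_hbd: int   # number of hydrogen-bond donors
    n_hba: int   # number of hydrogen-bond acceptors


def passes_rule_of_five(d: Descriptors) -> bool:
    """Strict Lipinski check: every threshold must be satisfied."""
    return d.mw <= 500 and d.logp <= 5 and d.n_hbd <= 5 and d.n_hba <= 10


# Made-up descriptor values, for illustration only.
candidates = [
    Descriptors("small_polar_compound", mw=180.2, logp=1.3, n_hbd=1, n_hba=4),
    Descriptors("large_lipophilic_compound", mw=812.0, logp=7.4, n_hbd=6, n_hba=14),
]

# pool.map preserves input order, so flags line up with candidates.
with ThreadPoolExecutor(max_workers=2) as pool:
    flags = list(pool.map(passes_rule_of_five, candidates))

results = {d.name: ok for d, ok in zip(candidates, flags)}
print(results)
```

In a real workflow the descriptors would be computed from the molecules themselves rather than entered by hand.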

Splito

Evaluate your models meaningfully

What does a good model look like for chemistry and biology? Splito provides powerful methods for splitting datasets in ways that are meaningful for your downstream application.

Efficient processing
Split your dataset with only two lines.
Get better generalization
Compare splitting methods to see how representative they are.
Explore chemical space
Easily visualize your train and test distribution.

import datamol as dm
from splito import ScaffoldSplit


# Load some data
data = dm.data.chembl_drugs()

# Initialize a splitter
splitter = ScaffoldSplit(smiles=data["smiles"].tolist(), n_jobs=-1, test_size=0.2, random_state=111)

# Generate indices for training set and test set
train_idx, test_idx = next(splitter.split(X=data.smiles.values))
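
Why split by scaffold at all? The idea is that molecules sharing a scaffold never end up on both sides of the split, so the test set probes chemistry the model has not seen. The sketch below shows that idea with the standard library only. It is a simplified greedy variant and not Splito's algorithm: the `group_split` helper and the scaffold labels are hypothetical, and the labels are assumed to have been computed beforehand.

```python
from collections import defaultdict


def group_split(groups, test_size=0.2):
    """Split indices so that no group appears in both train and test.

    Greedy variant: the largest groups fill the training set first, and
    any group that no longer fits goes to the test set.
    """
    by_group = defaultdict(list)
    for idx, key in enumerate(groups):
        by_group[key].append(idx)

    n_test = round(len(groups) * test_size)
    n_train = len(groups) - n_test

    train_idx, test_idx = [], []
    # Sort by size (descending), then by label, so the result is reproducible.
    ordered = sorted(by_group.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, members in ordered:
        if len(train_idx) + len(members) <= n_train:
            train_idx.extend(members)
        else:
            test_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical scaffold labels, one per molecule.
scaffolds = [
    "benzene", "benzene", "benzene", "benzene",
    "pyridine", "pyridine",
    "indole", "indole",
    "furan",
    "thiophene",
]

train_idx, test_idx = group_split(scaffolds, test_size=0.2)
print(train_idx, test_idx)
```

Because whole groups are moved at once, the realized test fraction can differ from `test_size` when group sizes do not divide evenly.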

We’re only just getting started

datamol.io is creating a new, simplified experience for ML scientists working on molecular modeling.

Create a PR, work on an issue, or interact with us on Twitter to let us know what features you want.

