Repurpose: A Python-based platform for reproducible similarity-based drug repurposing

Over the past years, many methods for similarity-based (a.k.a. knowledge-based, guilt-by-association-based) drug repurposing have been proposed, yet most of these studies do not provide the code or the model used in the study. To improve reproducibility, we present a Python platform offering

drug feature data parsing and similarity calculation
data balancing
(disjoint) cross validation
classifier building

Using this platform, we investigate the effect of using unseen data in the test set in similarity-based classification.

See Jupyter (IPython) Repurpose Notebook for reproducing the analysis presented in the manuscript and example runs.

See DDI Notebook for the analysis of drug-drug interaction prediction using drug similarity.

Requirements

The Python platform has the following dependencies:

Installing & running tests

Just download (i.e. clone) the files to your computer; no additional installation is required. Several test cases for the methods in utilities.py are provided in tests.py. To run these, type

python tests.py

It should give output similar to the following:

......

Ran 6 tests in 0.002s

OK

Data sets

The data sets used in the analysis are freely available online

We have modified these data sets slightly for parsing in Python by

converting all drug, disease and side effect terms to lowercase
removing the quotations and making the text tab delimited
adding the 'Drug' text to the header

These modified files are available under the data/ folder.
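The modifications listed above can be sketched roughly as follows. This is only an illustration, not the script used to produce the files in data/: the function name `normalize_matrix_text` is hypothetical, and it assumes the raw file is comma-delimited, quotes its terms with double quotes, and has a header that lists only the column terms (no cell for the drug column).

```python
import csv
import io


def normalize_matrix_text(raw_text, delimiter=","):
    """Return a tab-delimited, lowercase copy of a quoted matrix file.

    Hypothetical helper: assumes the header row holds only column terms,
    so a 'Drug' cell is prepended to label the first column.
    """
    # csv.reader strips the double quotes around each field for us.
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    rows = [row for row in reader if row]
    header, body = rows[0], rows[1:]

    lines = ["\t".join(["Drug"] + [term.strip().lower() for term in header])]
    for row in body:
        drug = row[0].strip().lower()
        values = [value.strip() for value in row[1:]]
        lines.append("\t".join([drug] + values))
    return "\n".join(lines) + "\n"


raw = '"Headache","Nausea"\n"Aspirin",1,0\n"Ibuprofen",0,1\n'
print(normalize_matrix_text(raw), end="")
```

If the raw files use a different delimiter or header layout, only the `delimiter` argument and the header handling would need to change.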

We have also retrieved pharmacokinetic drug-drug interaction (DDI) information from the DrugBank database (v4.5.0) and mapped the drugs to the data sets above.

Usage

To run the code with the default parameters defined in default.ini in the src/ directory, type

config_file="default.ini"
config_section="DEFAULT"
python main.py -c "$config_file" -s "$config_section"

Alternatively, to use the check_ml method, which builds a machine learning classifier to predict drug-disease associations using a cross-validation scheme, include the following in your Python code

import ml
ml.check_ml(data, n_run, knn, n_fold, n_proportion, n_subset, model_type, prediction_type, features, recalculate_similarity, disjoint_cv, split_both = False, output_file = None, model_fun = None, verbose = False, n_seed = None)

data can be loaded using the following function

import utilities
data = utilities.get_data(drug_disease_file, drug_side_effect_file, drug_structure_file, drug_target_file, drug_interaction_file=None)
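To make the input format concrete, here is a minimal sketch of reading one tab-delimited binary matrix (first column 'Drug', one column per term) and comparing two drugs by their feature profiles. It does not reproduce `utilities.get_data`: the helper names are hypothetical, cells are assumed to hold the literal strings `1` and `0`, and the Jaccard index is used purely for illustration, since the similarity measure used by the platform is not stated here.

```python
import csv
import io


def read_binary_matrix(handle):
    """Map each drug to the set of terms marked '1' in its row (hypothetical helper)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    terms = header[1:]  # skip the 'Drug' column label
    drug_to_terms = {}
    for row in reader:
        if not row:
            continue
        drug, values = row[0], row[1:]
        drug_to_terms[drug] = {t for t, v in zip(terms, values) if v == "1"}
    return drug_to_terms


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


text = "Drug\theadache\tnausea\trash\naspirin\t1\t1\t0\nibuprofen\t0\t1\t1\n"
profiles = read_binary_matrix(io.StringIO(text))
print(jaccard(profiles["aspirin"], profiles["ibuprofen"]))
```

With a real file, `io.StringIO(text)` would be replaced by an open file handle for one of the files under data/.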

See the Repurpose Notebook for several use cases on repurposing drugs using chemical, target profile and side effect similarity. For drug-drug interaction prediction using drug similarity, see the DDI Notebook.

Customizing the experimental settings

The configuration information for the experiments is in default.ini. The paths of the data files have to be defined based on your local file structure.

Parameters in default.ini:

drug_disease_file: File containing drug-disease associations (a binary matrix where rows are drugs, columns are diseases)
drug_side_effect_file: File containing drug-side effect associations (a binary matrix where rows are drugs, columns are side effects)
drug_structure_file: File containing drug-chemical substructure mapping (a binary matrix where rows are drugs, columns are substructures)
drug_target_file: File containing drug-target mapping (a binary matrix where rows are drugs, columns are targets)
output_file: File in which the output AUC and AUPRC values are stored
random_seed: A number to use as the seed for the random package functions (set it to an integer for reproducibility; if -1, the output varies depending on the random selection)
model_type: Machine learning model used to build the classifier, one of svm | logistic | knn | tree | rf | gbc
prediction_type: Whether the classifier is built to predict drug-disease ('disease') or drug-side effect ('side effect') associations
features: Features used to build the classifier, a combination of chemical | target | phenotype
disjoint: Whether the cross-validation folds contain overlapping drugs (True) or not (False)
pairwise_disjoint: Whether the cross-validation folds should group both of the pairs within the same group
recalculate_similarity: Whether to recalculate the k-NN based drug-disease and drug-side effect association scores within training and test sets (True: recalculate, default; False: do not recalculate)
knn: Number of most similar drugs to consider when calculating the drug-disease and drug-side effect association scores
n_fold: Number of cross-validation folds
n_proportion: Proportion of negative instances compared to positives (e.g., 2 means for each positive instance there are 2 negative instances)
n_subset: If not -1, it uses a random subset of size n_subset of the positive instances (to reduce the computational time for large data sets)
n_run: Number of repetitions of the cross-validation analysis
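As an illustration of how such a configuration might be read, the sketch below parses an INI fragment with Python's standard `configparser` module. It is not the parsing code in main.py. The file paths and all values are made up rather than the shipped defaults, the `|` separator for `features` is an assumption, and mapping a `random_seed` of -1 to `None` is one possible reading of the description above.

```python
import configparser

# Illustrative values only; adjust the (hypothetical) paths to your local copy.
EXAMPLE_INI = """
[DEFAULT]
drug_disease_file = ../data/drug_disease.dat
output_file = ../data/validation.dat
random_seed = 52345
model_type = logistic
prediction_type = disease
features = chemical|target|phenotype
disjoint = True
knn = 20
n_fold = 10
n_proportion = 2
n_subset = -1
n_run = 10
"""


def load_settings(ini_text, section="DEFAULT"):
    """Parse the INI text and convert each parameter to a Python type."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    values = parser[section]
    seed = values.getint("random_seed")
    return {
        "model_type": values["model_type"],
        "prediction_type": values["prediction_type"],
        # Assumption: feature names are separated by '|'.
        "features": values["features"].split("|"),
        "disjoint": values.getboolean("disjoint"),
        "knn": values.getint("knn"),
        "n_fold": values.getint("n_fold"),
        "n_proportion": values.getint("n_proportion"),
        "n_subset": values.getint("n_subset"),
        "n_run": values.getint("n_run"),
        # Assumption: -1 means "do not fix the seed".
        "n_seed": None if seed == -1 else seed,
    }


settings = load_settings(EXAMPLE_INI)
print(settings["model_type"], settings["n_fold"], settings["features"])
```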

Customizing the methods

Data balancing and cross validation (in utilities.py)

balance_data_and_get_cv(pairs, classes, n_fold, n_proportion, n_subset=-1, disjoint=False, split_both=False, n_seed=None)

Input parameters:
pairs: all possible drug-disease pairs
classes: labels of these drug-disease associations (1: known, 0: unknown)
n_fold: number of cross-validation folds
n_proportion: proportion of negative instances compared to positives (e.g., 2 means for each positive instance there are 2 negative instances)
n_subset: if not -1, it uses a random subset of size n_subset of the positive instances (to reduce the computational time for large data sets)
disjoint: whether the cross-validation folds contain overlapping drugs (True) or not (False)
split_both: whether the cross-validation folds should group both of the pairs within the same group
n_seed: number to feed to the random generator for reproducibility (of the cross-validation folds)

Output: This function returns (pairs, classes, cv) after balancing the data and creating the cross-validation folds. cv is the cross validation iterator containing train and test splits defined by the indices corresponding to elements in the pairs and classes lists.
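The idea behind balancing and (disjoint) fold creation can be sketched in plain Python as below. This is a simplified stand-in, not the implementation in utilities.py: the function name `balance_and_split` is hypothetical, `n_subset` and `split_both` are omitted, folds are not stratified, and it assumes that `disjoint=True` means all pairs of a given drug are confined to a single fold.

```python
import random


def balance_and_split(pairs, classes, n_fold, n_proportion, disjoint=False, seed=None):
    """Subsample negatives, then return (pairs, classes, cv) with index-based folds."""
    rng = random.Random(seed)
    positives = [i for i, c in enumerate(classes) if c == 1]
    negatives = [i for i, c in enumerate(classes) if c == 0]

    # Keep n_proportion negatives per positive (or all negatives if too few).
    n_negative = min(len(negatives), n_proportion * len(positives))
    kept = positives + rng.sample(negatives, n_negative)
    rng.shuffle(kept)
    pairs = [pairs[i] for i in kept]
    classes = [classes[i] for i in kept]

    if disjoint:
        # Assign whole drugs to folds so no drug spans train and test.
        drugs = sorted({drug for drug, _ in pairs})
        rng.shuffle(drugs)
        fold_of_drug = {drug: k % n_fold for k, drug in enumerate(drugs)}
        fold_of = [fold_of_drug[drug] for drug, _ in pairs]
    else:
        # Pairs were shuffled above, so round-robin assignment is random.
        fold_of = [k % n_fold for k in range(len(pairs))]

    cv = []
    for fold in range(n_fold):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        cv.append((train, test))
    return pairs, classes, cv


pairs = [(drug, disease) for drug in "abcdef" for disease in ("x", "y", "z")]
classes = [1 if disease == "x" else 0 for _, disease in pairs]
pairs, classes, cv = balance_and_split(pairs, classes, n_fold=3, n_proportion=1,
                                       disjoint=True, seed=7)
print(len(pairs), sum(classes), len(cv))
```

As in the description above, each element of `cv` holds train and test indices into the returned `pairs` and `classes` lists.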

Classifier model (in utilities.py)

get_classification_model(model_type, model_fun = None)

Input parameters:
model_type: custom | svm | logistic | knn | tree | rf | gbc
model_fun: the function implementing the classifier when model_type is custom

The allowed values for model_type are custom, svm, logistic, knn, tree, rf, gbc,
corresponding to a custom model provided in model_fun by the user or the default
models in Scikit-learn for support vector machine, logistic regression,
k-nearest-neighbor, decision tree, random forest and gradient boosting
classifiers, respectively.

Output: Returns the classifier object that provides fit and predict_proba methods.
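A custom model only needs to expose the fit and predict_proba methods mentioned above. The toy classifier below is a dependency-free sketch of that interface. The class name `PriorClassifier` is hypothetical, and it is an assumption that `model_fun` is called without arguments and returns an unfitted object. The two-column output ordered as [P(class 0), P(class 1)] follows the Scikit-learn convention; real models would typically return NumPy arrays rather than lists.

```python
class PriorClassifier:
    """Toy model: predicts the training-set class frequencies for every sample."""

    def fit(self, X, y):
        n = len(y)
        n_positive = sum(1 for label in y if label == 1)
        # Fall back to 0.5 if there is no training data at all.
        self.p_positive_ = n_positive / n if n else 0.5
        return self

    def predict_proba(self, X):
        # One [P(class 0), P(class 1)] row per input sample.
        return [[1.0 - self.p_positive_, self.p_positive_] for _ in X]


def model_fun():
    """Factory for the custom model (assumed to take no arguments)."""
    return PriorClassifier()


clf = model_fun().fit([[0], [1], [2], [3]], [1, 0, 0, 0])
print(clf.predict_proba([[5], [6]]))
```

Given the signature above, such a factory would presumably be passed as `get_classification_model("custom", model_fun)`, or through the `model_fun` argument of `check_ml`.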

Citation

Guney E., REPRODUCIBLE DRUG REPURPOSING: WHEN SIMILARITY DOES NOT SUFFICE. Pac Symp Biocomput. 2016;22:132-143. Pubmed
