Topological Data Analysis (II)

Image by the Author

Persistent homology

Figure 1. Steps in persistent homology.
import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import Scaler, PersistenceEntropy, BettiCurve

""" Connectivity information:
0-dimensional homology, β0 or H0, measures connected components (clusters);
1-dimensional homology, β1 or H1, measures loops; and
2-dimensional homology, β2 or H2, measures voids (empty spaces). """
homology_dimensions = [0, 1, 2]
VR = VietorisRipsPersistence(
    homology_dimensions=homology_dimensions,
    coeff=3,
    n_jobs=-1
)
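Before applying this to our molecule, it helps to see VietorisRipsPersistence on a toy cloud where we know the answer. The snippet below (the synthetic data and variable names are mine, not from the notebook) samples a noisy circle, so we expect one dominant connected component and one long-lived loop:

# sample 100 noisy points on a unit circle: one cluster, one loop expected
theta = np.random.uniform(0, 2 * np.pi, 100)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
circle += np.random.normal(scale=0.05, size=circle.shape)

vr = VietorisRipsPersistence(homology_dimensions=[0, 1])
diag = vr.fit_transform(circle[None, :, :])   # shape: (1, n_features, 3)

# each row is (birth, death, homology_dimension); in dimension 1 one point
# should have a much longer lifetime than the rest: the circle's loop
h1_toy = diag[0][diag[0][:, 2] == 1]
print(h1_toy[np.argmax(h1_toy[:, 1] - h1_toy[:, 0])])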
# plot the persistence diagram
diagram = VR.fit_transform(np.array(proj_3d)[None, :, :])
VR.fit_transform_plot(np.array(proj_3d)[None, :, :])
# scale the persistence diagram
diagramScaler = Scaler()
scaled = diagramScaler.fit_transform(diagram)
diagramScaler.plot(scaled, sample=0)
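Scaler behaves like any scikit-learn transformer: fit derives a single rescaling factor from the diagram amplitudes, and transform divides the birth and death values by it. Assuming that default behaviour, a quick sanity check looks like this:

# the fitted rescaling factor exposed by giotto-tda's Scaler
print(diagramScaler.scale_)
# birth/death columns of the scaled diagram should be the originals over scale_
assert np.allclose(scaled[..., :2], diagram[..., :2] / diagramScaler.scale_)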
# persistence entropy
persistence_entropy = PersistenceEntropy()
# calculate the topological feature matrix
feat_matrix = persistence_entropy.fit_transform(diagram)
# expected shape: (n_point_clouds, n_homology_dims)
feat_matrix.shape
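Persistence entropy condenses each homology dimension of a diagram into a single number: with lifetimes ℓᵢ = dᵢ − bᵢ normalized to pᵢ = ℓᵢ / Σⱼ ℓⱼ, it is the Shannon entropy −Σᵢ pᵢ log₂ pᵢ. A hand-rolled sketch for one dimension (the library may differ in edge-case handling, but the value should match feat_matrix up to such details):

def persistence_entropy_1d(points):
    """Shannon entropy of normalized lifetimes for one homology dimension."""
    lifetimes = points[:, 1] - points[:, 0]
    lifetimes = lifetimes[lifetimes > 0]      # drop zero-lifetime padding
    p = lifetimes / lifetimes.sum()
    return -(p * np.log2(p)).sum()

h0 = diagram[0][diagram[0][:, 2] == 0]        # H0 points of our diagram
print(persistence_entropy_1d(h0))             # compare with feat_matrix[0, 0]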
# plot a sample from a collection of Betti curves
BC = BettiCurve()
y_betti_curves = BC.fit_transform(diagram)
BC.plot(y_betti_curves)
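A Betti curve simply counts, for each filtration value ε, how many features of a given dimension are alive (born at or before ε and not yet dead). Reading a single value off the diagram directly makes that concrete:

def betti_at(points, eps):
    """Betti number at filtration value eps for one homology dimension."""
    return int(np.sum((points[:, 0] <= eps) & (points[:, 1] > eps)))

h1 = diagram[0][diagram[0][:, 2] == 1]   # loop features of our diagram
print(betti_at(h1, 0.5))                 # β1 at ε = 0.5: loops alive at that scale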
Figure 2. Persistence diagram.
Figure 3. Betti curves.
Figure 4. Loops and connected components.

Quantitative analysis: train a classifier

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
import pandas as pd

# split features/target, then binarize y_train, y_test to create class labels
X_train, X_test, y_train, y_test = train_test_split(
    clean.drop(columns=['pIC50']),
    np.array(clean['pIC50']),
    test_size=0.25,
    random_state=666
)
label_train = pd.cut(y_train, bins=[-3.0, 0.0, 1.5], labels=[0, 1])
label_test = pd.cut(y_test, bins=[-3.0, 0.0, 1.5], labels=[0, 1])
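One caveat with pd.cut: any pIC50 value outside (-3.0, 1.5] is silently mapped to NaN, and an imbalanced split can make a high accuracy look better than it is. A quick check before training (illustrative, not from the original notebook):

# verify nothing fell outside the bins and inspect the class balance
print(pd.Series(label_train).isna().sum())
print(pd.Series(label_train).value_counts())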
# PIPELINE 1: binary classification
steps = [
    ("persistence", VietorisRipsPersistence(metric="euclidean",
                                            homology_dimensions=[0, 1, 2],
                                            n_jobs=-1)),
    ("entropy", PersistenceEntropy()),
    ("model", RandomForestClassifier(n_estimators=500)),
]
pipeline1 = Pipeline(steps, verbose=False)
# PIPELINE 2: regression
steps2 = [
    ("persistence", VietorisRipsPersistence(metric="euclidean",
                                            homology_dimensions=homology_dimensions,
                                            n_jobs=6)),
    ("entropy", PersistenceEntropy()),
    ("model", RandomForestRegressor(n_estimators=2000)),
]
pipeline2 = Pipeline(steps2)
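Note that, as defined, both pipelines expect point clouds as input: calling fit on the pipeline itself would push every sample through Vietoris-Rips persistence and persistence entropy before the forest ever sees it. A minimal end-to-end sketch on synthetic clouds (shapes and labels are made up purely to show the call pattern; random clouds carry no real signal):

# 20 "molecules", each a cloud of 30 points in 3-D, with random binary labels
X_pc = np.random.rand(20, 30, 3)
y_lbl = np.random.randint(0, 2, size=20)

pipeline1.fit(X_pc, y_lbl)            # diagrams -> entropies -> random forest
print(pipeline1.predict(X_pc).shape)  # (20,)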
pipeline1["model"].fit(np.array(X_train), np.array(label_train))
y_pred1 = pipeline1["model"].predict(np.array(X_test))
test_mse1 = accuracy_score(np.array(label_test), y_pred1)
print(f'Classification Accuracy Score = {test_mse1*100:.1f} %')
Result:
Classification Accuracy Score = 91.3 %
pipeline2["model"].fit(np.array(X_train),np.array(y_train))
y_pred2 = pipeline2["model"].predict(np.array(X_test))
test_mse2 = mean_squared_error(np.array(y_test), y_pred2)
print(f'Regression Mean Squared Error = {test_mse2:.3f}')
Result:
Regression Mean Squared Error = 0.407
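An MSE of 0.407 is hard to judge without the scale of the target; comparing it against the spread of y_test (equivalently, reporting R²) puts it in context:

from sklearn.metrics import r2_score

rmse = test_mse2 ** 0.5
print(f'RMSE = {rmse:.3f}, R^2 = {r2_score(np.array(y_test), y_pred2):.3f}')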
# baseline: the same random forest without the topological pipeline wrapper
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train, label_train)
y_std_pred = rf.predict(X_test)
# mean squared error on 0/1 labels is the misclassification rate, not accuracy
test_err_std = mean_squared_error(label_test, y_std_pred)
print(f'Misclassification Rate = {test_err_std*100:.1f} %')
Result:
Misclassification Rate = 8.7 %

Now we could scream 'WOW': 91.3% accuracy with the topological pipeline versus an apparent 8.7% with a standard sklearn.ensemble forest. Before celebrating, look at the second script again: it reports the mean squared error on binary labels, which for 0/1 predictions is exactly the misclassification rate. An error rate of 8.7% means the standard random forest also gets 91.3% of the test molecules right, so the two classifiers are in fact on par.
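The identity is easy to verify: for 0/1 labels the squared error of a prediction is 0 when it is right and 1 when it is wrong, so the mean squared error is exactly 1 − accuracy:

acc_std = accuracy_score(label_test, y_std_pred)
print(acc_std, 1.0 - mean_squared_error(label_test, y_std_pred))  # identical values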

Predicting blinded data in the original dataset

# predict activity for the blinded molecules with both models
y_pred_blinded = pipeline2["model"].predict(np.array(predict.drop(columns=['pIC50'])))
y_pred_blinded_2 = rf.predict(np.array(predict.drop(columns=['pIC50'])))
y_pred_blinded  # the regressor's float predictions; the classifier returns 0/1 labels
Result:
array([-1.24751886, -1.73934086, -1.52643676, -1.79271169, -0.49413739,
        0.63826836, -0.74593707, -0.77189003, -0.54458047])
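Rather than reading the array by eye, we can pick out the candidate programmatically (the index variable is mine):

active_idx = np.where(y_pred_blinded > 0)[0]
print(active_idx, y_pred_blinded[active_idx])  # the single molecule above 0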
Figure 5. The only molecule in the blind dataset with predicted pIC50 > 0.

Conclusions

