Topological Data Analysis (II)

Javier Marin
7 min readMar 30, 2021


Image by the Author

This article is the continuation of Topological Data Analysis - Part 1 (https://javier-marin.medium.com/topological-data-analisys-f8ff5fa0703a). We recommend reading the first part, where we introduced some topology concepts and the dataset.

One can’t do mathematics for more than ten minutes without grappling, in some way or other, with the slippery notion of equality.

Barry Mazur

Persistent homology

Let’s remember the routine we introduced in part 1: we come across a topological space and we need to find its fundamental group. But our data set is an unfamiliar space, and it is too difficult to look at explicit loops and relations. So we look for another space that is homotopy equivalent to ours and whose fundamental group is much easier to compute. Since both spaces are homotopy equivalent, we know that the fundamental group of our space is isomorphic to the fundamental group of the new space.
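In symbols, this is the standard fact from algebraic topology we are relying on (stated here for convenience, not something specific to our dataset): a homotopy equivalence induces an isomorphism of fundamental groups,

$$f : X \xrightarrow{\ \simeq\ } Y \quad \Longrightarrow \quad f_* : \pi_1(X, x_0) \xrightarrow{\ \cong\ } \pi_1(Y, f(x_0)).$$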

Figure 1. Steps in persistent homology.

Steps in persistent homology are illustrated in figure 1. First, we define a point cloud. Then we use a filtration method to create simplicial complexes and, finally, we identify the topological signatures of the data (for example, Betti numbers). gtda allows us to perform this calculation with ease.

  1. The first step is to instantiate a VietorisRipsPersistence transformer and compute so-called persistence diagrams for this collection of point clouds. The Vietoris-Rips complex is a filtration method, and VietorisRipsPersistence computes the persistent simplicial homology of our metric space. The result is what we call persistence diagrams.
  2. Compute Betti curves and persistence entropy from the persistence diagrams to find topological signatures.
import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import Scaler, PersistenceEntropy, BettiCurve

""" Connectivity information
0-dimensional homology (β0 or H0) measures clusters;
1-dimensional homology (β1 or H1) measures loops; and
2-dimensional homology (β2 or H2) measures voids (empty spaces) """
homology_dimensions = [0, 1, 2]
VR = VietorisRipsPersistence(
    homology_dimensions=homology_dimensions,
    coeff=3,
    n_jobs=-1
)
# compute and plot the persistence diagram (gtda expects a collection of point clouds)
point_cloud = np.array(proj_3d)[None, :, :]
diagram = VR.fit_transform(point_cloud)
VR.fit_transform_plot(point_cloud)
# scale the persistence diagram
diagramScaler = Scaler()
scaled = diagramScaler.fit_transform(diagram)
diagramScaler.plot(scaled, sample=0)
# persistence entropy: calculate the topological feature matrix
persistence_entropy = PersistenceEntropy()
feat_matrix = persistence_entropy.fit_transform(diagram)
feat_matrix.shape  # expected shape: (n_point_clouds, n_homology_dims)
# plot a sample from a collection of Betti curves
BC = BettiCurve()
y_betti_curves = BC.fit_transform(diagram)
BC.plot(y_betti_curves)

The results of steps one and two can be seen in figures 2 and 3. In figure 2 we see the persistence diagram we obtained with VietorisRipsPersistence. The axes show the “birth” and “death” of the path-connected components.

Figure 2. Persistence diagram.
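To read the diagram programmatically rather than from the plot, we can inspect the array returned by fit_transform (a minimal sketch reusing the diagram variable from the snippet above; in gtda each row is a birth/death/dimension triple):

# each row of the diagram is (birth, death, homology_dimension)
births, deaths, dims = diagram[0][:, 0], diagram[0][:, 1], diagram[0][:, 2]
lifetimes = deaths - births  # long-lived features stand out from the noise near the diagonal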

We can see that H0 (or β0) refers to the number of connected components (zeroth-level connectivity information), H1 (or β1) is the number of one-dimensional or “circular” holes (level-one connectivity information), and H2 (or β2) is the number of two-dimensional “voids” or “cavities”. We consider points on the dotted line as noise. For β0 (in orange) we can see that all components are born at 0. The highest point corresponds to the limit of our point cloud. All orange points are very close except the last three, which are slightly separated. Green points correspond to β1. We see basically two points above the noise level, corresponding to two circles. Remember from part 1? When we represented our point cloud we identified two “loops” in the data (figure 4).

Figure 3. Betti curves.

Our persistence diagram now confirms that we have two loops (figure 4). Betti curves are shown in figure 3. The filtration parameter (the growing radius used for the filtration) is shown on the x-axis, and the Betti number on the y-axis. Betti curves are available for zeroth-level, level-one, and level-two connectivity, so for each value of the filtration parameter we have a vector of three Betti numbers (β0, β1, β2).
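As a quick sanity check (a sketch reusing y_betti_curves from the earlier snippet; the exact number of bins depends on BettiCurve’s n_bins parameter), the Betti curve array holds one curve per homology dimension, i.e. a (β0, β1, β2) vector at every sampled filtration value:

# one Betti curve per homology dimension, sampled over n_bins filtration values
print(y_betti_curves.shape)  # e.g. (1, 3, 100): (n_point_clouds, n_homology_dims, n_bins)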

As seen in figure 4, the persistence diagram and Betti curves illustrate two loops in the point cloud and no level-two connectivity information (β2 = 0). We can see that with a filtration parameter value of 1 we have connected almost all of the points in the point cloud at zeroth-level connectivity. There are only a few remaining points that require a higher filtration parameter (the higher orange points slightly separated from the rest in the persistence diagram). The highest value corresponds to the highest point in the persistence diagram, which is equal to the “maximum” distance (or minimum “closeness”) between points in the cloud.

Figure 4. Loops and connected components.
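To make the “two loops” reading concrete, here is a small self-contained sketch (not part of the original analysis; all names are illustrative) that samples two noisy circles and checks that Vietoris-Rips persistence reports two prominent H1 features:

import numpy as np
from gtda.homology import VietorisRipsPersistence

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
circle1 = np.stack([np.cos(theta), np.sin(theta)], axis=1)        # loop centred at (0, 0)
circle2 = np.stack([np.cos(theta) + 3.0, np.sin(theta)], axis=1)  # loop centred at (3, 0)
cloud = np.vstack([circle1, circle2]) + rng.normal(scale=0.05, size=(400, 2))

vr = VietorisRipsPersistence(homology_dimensions=[0, 1])
diag = vr.fit_transform(cloud[None, :, :])

# keep the H1 points and count those living well above the noise level
h1 = diag[0][diag[0][:, 2] == 1]
lifetimes = h1[:, 1] - h1[:, 0]
print((lifetimes > 0.5).sum())  # expected: 2 prominent loops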

Our dataset is characterized by three single topological signature values: the Betti numbers. We found a relationship between these numbers and the filtration parameter, or “closeness”, between points. We don’t care which metric lives in our data set, because we calculate distances between elements using an arbitrary “closeness” metric function. This is why category theory [1] is important in mathematics: it helps one to think about relations between apparently unrelated objects. The notion that there are fundamental relations between elements just waiting to be found drives the strength of Topological Data Analysis.
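Because the filtration only needs a notion of “closeness”, the same machinery works with any distance function. A minimal sketch (assuming the proj_3d point cloud from the earlier snippets; the metric choices here are only illustrative):

from sklearn.metrics import pairwise_distances
from gtda.homology import VietorisRipsPersistence

# swap the built-in metric...
vr_cosine = VietorisRipsPersistence(metric="cosine", homology_dimensions=[0, 1, 2])
diagram_cosine = vr_cosine.fit_transform(np.array(proj_3d)[None, :, :])

# ...or pass a precomputed distance matrix built with any custom "closeness" function
D = pairwise_distances(np.array(proj_3d), metric="manhattan")
vr_pre = VietorisRipsPersistence(metric="precomputed", homology_dimensions=[0, 1, 2])
diagram_pre = vr_pre.fit_transform(D[None, :, :])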

Quantitative analysis: train a classifier

In part one, we saw how TDA provides some essential qualitative knowledge around the concept of “closeness”. In this section, we formalize the computation of the data’s topological signatures as a means of characterizing the data. With this rationale we can also obtain quantitative information that we can use to perform a statistical classification.

gtda, through its gtda.pipeline module, extends scikit-learn’s pipelines in order to perform such computations. We have built a basic pipeline that allows us to use sklearn classification algorithms to obtain these statistics.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from gtda.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    clean.drop(columns=['pIC50']), np.array(clean['pIC50']),
    test_size=0.25, random_state=666
)
# binarize y_train and y_test to create class labels
label_train = pd.cut(y_train, bins=[-3.0, 0.0, 1.5], labels=[0, 1])
label_test = pd.cut(y_test, bins=[-3.0, 0.0, 1.5], labels=[0, 1])
# PIPELINE 1. Binary classification
steps = [
    ("persistence", VietorisRipsPersistence(metric="euclidean",
                                            homology_dimensions=[0, 1, 2], n_jobs=-1)),
    ("entropy", PersistenceEntropy()),
    ("model", RandomForestClassifier(n_estimators=500)),
]
pipeline1 = Pipeline(steps, verbose=False)
# PIPELINE 2. Regression
steps2 = [
    ("persistence", VietorisRipsPersistence(metric="euclidean",
                                            homology_dimensions=homology_dimensions, n_jobs=6)),
    ("entropy", PersistenceEntropy()),
    ("model", RandomForestRegressor(n_estimators=2000)),
]
pipeline2 = Pipeline(steps2)

First we compute the persistence diagrams and their persistence entropy. Then we look for a statistical model in sklearn.ensemble to carry out a classification task and obtain its corresponding accuracy score. To perform this classification we have chosen RandomForestClassifier and RandomForestRegressor. For the classifier (pipeline 1) we have first binarized our data, assigning 1 to “pIC50” values higher than zero and 0 to negative “pIC50” values. For pipeline 2 we have used RandomForestRegressor to predict the values of the “pIC50” feature.
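For reference, persistence entropy summarizes a diagram whose points are (bᵢ, dᵢ) by the Shannon entropy of the normalized lifetimes, giving one number per homology dimension (this is the standard definition; the logarithm base is an implementation detail):

$$p_i = \frac{d_i - b_i}{\sum_j (d_j - b_j)}, \qquad E = -\sum_i p_i \log p_i .$$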

from sklearn.metrics import accuracy_score, mean_squared_error

pipeline1["model"].fit(np.array(X_train), np.array(label_train))
y_pred1 = pipeline1["model"].predict(np.array(X_test))
test_acc1 = accuracy_score(np.array(label_test), y_pred1)
print(f'Classification Accuracy Score = {test_acc1*100:.1f} %')
Result:
Classification Accuracy Score = 91.3 %
pipeline2["model"].fit(np.array(X_train), np.array(y_train))
y_pred2 = pipeline2["model"].predict(np.array(X_test))
test_mse2 = mean_squared_error(np.array(y_test), y_pred2)
print(f'Regression Mean Squared Error = {test_mse2:.3f}')
Result:
Regression Mean Squared Error = 0.407

The classification accuracy is quite interesting, while the regression mean squared error can easily be improved. With pipeline 1 we have classified the dataset into molecules that are more or less effective against the SARS-CoV-2 virus, and pipeline 2 allows us to predict the value of the “pIC50” feature (refer to part one of this article for more info). We have used topological information to perform a statistical classification of our dataset. Let’s check which result we would get performing this classification directly with sklearn.ensemble on the dataset. We are going to use the same RandomForestClassifier.

rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train, label_train)
y_std_pred = rf.predict(X_test)
test_mse_std = mean_squared_error(label_test, y_std_pred)
print(f'Classification Accuracy Score = {test_mse_std*100:.1f} %')
Result:
Classification Accuracy Score = 8.7 %

Now we could scream ‘WOW’. Classification accuracy using topological data is 91.3%, whereas using a standard sklearn.ensemble algorithm directly on the dataset the accuracy drops to 8.7%.

Predicting blinded data in the original dataset

The dataset provided by the Government of India as part of their Drug Discovery Hackathon 2020 has nine blinded molecules, i.e., molecules for which the “pIC50” feature values are not provided. We are going to use pipeline 2 to estimate the pIC50 values of these molecules.

# estimate pIC50 for the blinded molecules with the regression pipeline (pipeline 2)
y_pred_blinded = pipeline2["model"].predict(np.array(predict.drop(columns=['pIC50'])))
y_pred_blinded
Result:
array([-1.24751886, -1.73934086, -1.52643676, -1.79271169, -0.49413739,
0.63826836, -0.74593707, -0.77189003, -0.54458047])

Our model predicts a positive value (pIC50 = 0.64) only for the 6th blinded molecule. The rest of the molecules seem to show very low efficacy against the SARS-CoV-2 virus. In figure 5 we can meet this molecule. Its IUPAC (International Union of Pure and Applied Chemistry) name is (5-chloropyridin-3-yl) 4-chlorobenzoate.

Figure 5. Only molecule in blind dataset with pIC50 > 0

Conclusions

We have seen that classification tasks on datasets can gain accuracy if we use data representations in topological spaces. Mathematical objects in topological spaces provide a richer understanding of the relations between apparently unrelated objects than other techniques. Recently, researchers have also explored providing topological knowledge to deep neural networks in order to increase classification accuracy [2], [3]. In our examples, we showed that working with topological spaces increases classification accuracy beyond regular algorithms. In addition, we presented examples of qualitative information relevant to the “arbitrary closeness” principle and its benefits in maximizing the value of quantitative information.
In our example, we showed how this “closeness” principle can help organic chemistry and biochemistry experts decide which molecules and groups are more appropriate to work with in terms of virus-blocking capabilities.
We used the giotto-tda library, a cutting-edge Python package for topological data analysis.

References:

[1] E. Riehl, Category theory in context. Courier Dover Publications, 2017.

[2] Christopher Olah, Neural networks, manifolds, and topology, Blog post (2014).

[3] Gunnar Carlsson and Rickard Brüel Gabrielsson, Topological approaches to deep learning, Topological data analysis, 2020, pp. 119–146.
