Topological Data Analysis (I)

Discovering data’s qualitative information using machine learning

Javier Marin
16 min read · Mar 29, 2021
Image source: Elesey/Shutterstock

In this article, we present the findings of evaluating a novel machine learning approach for qualitative data processing, stressing its importance in complementing quantitative analysis. The approach focuses on extracting information from the topological structure of data. Topology can be thought of as a kind of “qualitative geometry” that frees us from concerns such as exact lengths or angles. Data topology rests on the basic idea of replacing data objects with their network of relationships.

The mapper algorithm, developed by Gurjeet Singh, Facundo Mémoli, and Gunnar Carlsson (4) for the simplification and visualization of high-dimensional data sets, will be introduced in the first part. In the second part, we will use persistent homology methods to quantify topological features of shapes and functions in a dataset, obtaining both qualitative and quantitative information. In our examples we will use a dataset made publicly available by the Government of India as part of their Drug Discovery Hackathon 2020.

Github repo with code and data

Topological Data Analysis

Algebra is generous; she often gives more than is asked for.

Jean d’Alembert (1717–1783)

Topological Data Analysis (TDA) is a new approach that deals with data from a different point of view. According to Gunnar Carlsson (1), there are good reasons to adopt this state-of-the-art orientation. First, purely quantitative analysis misses important information encoded in data. Second, in many systems it is not clear how significant the actual distances between data points are, so a metric is not always justified. TDA does not depend on a particular choice of metric; it uses distances and clusters only to build a representation of the data in a topological space.

TDA combines algebraic topology and other tools from pure mathematics to allow a useful study of the shape of the data. High dimensionality severely restricts our ability to visualize data sets, which is why TDA can improve our ability to illustrate and analyze them. The most widely discussed topological features of data include connected components, tunnels, voids, etc.

The mathematical formalism developed for TDA involves advanced geometry and topology techniques. An intuitive definition of these techniques is not always possible, but we have sought simple explanations in order to understand the basics. For a complete introduction to TDA we suggest reading Gunnar Carlsson’s work (1).

Simplicial complexes, the origin of everything

Connecting points in a space by growing a radius around each of them creates geometric objects called simplices. Intuitively, a simplicial complex structure on a space is an expression of the space as a union of points, intervals, triangles, and other higher-dimensional analogues (1). This process of progressively connecting points is called a filtration. The filtration controls how data points are assembled into simplicial complexes, turning a set of points into a topological space. The building blocks of a simplicial complex are called simplices (plural of simplex).

Construction of a simplicial complex (image from https://towardsdatascience.com/the-shape-that-survives-the-noise-f0a2a89018c6)

Persistent homology basics

Homology refers to a class of methods that use simplicial complexes to study topology. It allows us to infer topological information from the intrinsic geometric properties of objects. As we already pointed out in the introduction, we replace the role of data objects with their network of relationships. The decomposition of a space into path-connected components is a qualitative property of that space. As the figure above shows, connections between points appear as the radius increases: the first connections draw lines from point to point, and with higher radius values new structures appear, such as loops. Let’s use an example to explain what homology is.

Figure 1. Geometric objects homology.

We have two different spaces that are internally path-connected, but they are qualitatively different: the number “8” on the right has two loops, while the number “4” on the left has only one. This property is preserved under continuous deformations of the space (shown in a friendly way in figure 1-b). While the distances between points may change, the local “arbitrary closeness” is unaffected; when you forget about distance and only remember this arbitrary closeness, the deformed spaces are the same object.

Some mathematicians define homology as “counting” or “detecting” holes in varying dimensions. I found a description from Eric Weisstein that helped me understand this idea:

“A hole in a mathematical object is a topological structure which prevents the object from being continuously shrunk to a point.”

The loops remain after deforming both spaces in figure 1-b, and neither of these objects can be shrunk to a point. Information about loops and higher-dimensional analogues in a space is known as connectivity information. Path-connected components (lines) are regarded as zeroth-level connectivity information, loops as level-one connectivity information, voids as level-two connectivity information, and so on.

The idea behind persistence comes from not choosing a single radius to build a simplicial complex: instead, we use a range of increasing values in order to detect features that “persist” across different radius values. Persistent homology methods increase the radius and look for analogues that prevail during the process. The useful relationships involve continuous maps between the different geometric objects, and therefore become a manifestation of the notion of functoriality: the idea that invariants should be attached not just to the objects being studied, but also to the maps between these objects. We could naturally say that connectivity information in the space of geometric objects is conveyed to the topological space.
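
To make persistence concrete, here is a minimal sketch (a toy example, not the article’s dataset) using gtda’s VietorisRipsPersistence, which grows a radius around each point and records when components and loops appear and disappear:

import numpy as np
from gtda.homology import VietorisRipsPersistence

# A toy point cloud: points sampled on a noisy circle, so we expect
# one connected component (H0) and one persistent loop (H1)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
circle += rng.normal(scale=0.05, size=circle.shape)

# Track homology dimensions 0 (components) and 1 (loops) along the filtration
vr = VietorisRipsPersistence(homology_dimensions=[0, 1])
diagrams = vr.fit_transform(circle[None, :, :])  # (1, n_features, 3): birth, death, dimension
print(diagrams.shape)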

Figure 2. Definition of a functor. Image from https://www.math3ma.com/blog/what-is-a-functor-part-1

Carlsson holds that functoriality is central to TDA (1). A functor 𝐹 can be described as some data satisfying certain properties. In figure 2, 𝑓 is an isomorphism in 𝐶, and 𝐹 respects the composition 𝐹(𝑔∘𝑓)=𝐹(𝑔)∘𝐹(𝑓). There is a functor that assigns to each topological space a group known as the fundamental group. The fundamental group is an invariant of a topological space. For example, the loops in figure 1 (in both spaces “4” and “8”) are invariants.

The input data for a persistent homology computation is represented as a point cloud (2), i.e., a set of data points identified by a given coordinate system. Point clouds are finite sets of points equipped with a distance function.
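
A hedged illustration of this definition: a point cloud is just an array of coordinates plus a pairwise distance function.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# A point cloud: 100 points identified by a 40-dimensional coordinate system
points = np.random.rand(100, 40)

# ...equipped with a distance function (here, the Euclidean metric)
distances = squareform(pdist(points, metric='euclidean'))
print(distances.shape)  # (100, 100) matrix of pairwise distances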

Betti numbers and persistence entropy

Betti numbers (denoted by β) formalize the count of independent k-dimensional surfaces in a space. For example, β0 is the number of connected components (zeroth-level connectivity information), β1 is the number of one-dimensional or “circular” holes (level-one connectivity information), and β2 is the number of two-dimensional “voids” or “cavities”. Together, the Betti numbers form a vector of integers (β0, β1, β2, …) describing the space, called the data’s topological signature. In figure 3, object “4” has signature β0=1, β1=1, i.e. (1, 1), and object “8” has β0=1, β1=2, i.e. (1, 2). Betti numbers vary with the value of the radius (also called the filtration or scale parameter, denoted by ε).

Figure 3. Topological signatures of numbers.

The representation of the different filtration parameter values together with the observed signatures is called a persistence diagram (3). Finally, persistence entropy is a Shannon-like entropy used to quantify the information found during TDA’s filtration process.
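
As a small self-contained sketch (reusing the noisy-circle toy cloud from above, not the drug dataset), gtda can turn persistence diagrams into Betti curves, the Betti numbers as functions of ε, and into one persistence entropy value per homology dimension:

import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import BettiCurve, PersistenceEntropy

# Toy noisy-circle point cloud, batched as (1, n_points, n_dims)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
cloud = np.stack([np.cos(angles), np.sin(angles)], axis=1)[None, :, :]
diagrams = VietorisRipsPersistence(homology_dimensions=[0, 1]).fit_transform(cloud)

# Betti curves: beta_0 and beta_1 tracked along the filtration parameter epsilon
betti_curves = BettiCurve(n_bins=50).fit_transform(diagrams)  # shape (1, 2, 50)

# Persistence entropy: a Shannon-like summary of the lifetimes in each dimension
entropy = PersistenceEntropy().fit_transform(diagrams)        # shape (1, 2)
print(entropy)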

Mapper algorithm

The mapper algorithm is a topological multi-step method to visualize geometric objects (4). Visualization means qualitative representation, so mapper is not a data analysis algorithm in itself. It can be used to generate a topological graph that captures the salient features of the data. The advantages of mapper are its insensitivity to the choice of metric, its well-understood sensitivity to parameter changes, and its multiscale approach.

mapper begins by filtering the data in order to reveal interesting properties of the point cloud. The second step is to cover the topological space, mapping the space through a continuous function; this is also known as a covering projection or covering map. The final step is partial clustering: applying standard clustering algorithms to subsets of the data and then studying the interaction between the partial clusters. After these steps we get a low-dimensional image which is easy to understand and which can point to areas of interest (4); a toy sketch of the whole procedure follows the figure below. Keep in mind that every identity morphism in the point cloud will be present in the final mapper graph, which brings up the already mentioned concepts of functoriality and the fundamental group (5).

Figure 5. Mapper algorithm steps. The notion of functoriality.
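
To make these steps concrete before turning to gtda’s implementation, here is a toy, hand-rolled sketch of mapper (our own simplification, not the gtda code used later): filter, cover with overlapping intervals, cluster within each interval, and connect clusters that share points.

import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

def toy_mapper(X, n_intervals=10, overlap_frac=0.3, eps=0.5):
    f = X[:, 0]                                  # 1. filter: project onto first coordinate
    lo, hi = f.min(), f.max()
    length = (hi - lo) / n_intervals
    graph, members, node_id = nx.Graph(), {}, 0
    for i in range(n_intervals):                 # 2. cover with overlapping intervals
        a = lo + i * length - overlap_frac * length
        b = lo + (i + 1) * length + overlap_frac * length
        idx = np.where((f >= a) & (f <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=2).fit(X[idx]).labels_  # 3. partial clustering
        for lab in set(labels) - {-1}:           # one graph node per partial cluster
            members[node_id] = set(idx[labels == lab])
            graph.add_node(node_id)
            node_id += 1
    for u in members:                            # 4. nerve: link nodes sharing points
        for v in members:
            if u < v and members[u] & members[v]:
                graph.add_edge(u, v)
    return graph, members

# Example: a noisy circle collapses to a loop-shaped graph
theta = np.linspace(0, 2 * np.pi, 200)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1) + 0.05 * np.random.randn(200, 2)
g, members = toy_mapper(X)
print(g.number_of_nodes(), g.number_of_edges())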

Hands on with COVID-19 drug discovery data set

The Government of India made this dataset publicly accessible as part of their Drug Discovery Hackathon. It includes a list of drugs that have been tested, together with their effectiveness against Sars-CoV-2. D. Agrawal completed this list by adding chemical details of the molecules obtained from PubChem. The final dataset includes 100 checked molecules, each with 40 features. Among the features is pIC50, the negative logarithm of the half maximal inhibitory concentration (IC50), an index used to measure the efficacy of a substance in inhibiting a specific biological or biochemical function. Our goal is to find a classification method for these molecules according to their effectiveness against Sars-CoV-2. The dataset also contains five blinded molecules, which do not show a pIC50 value.

Figure 6. Overview of the data set

Aside from the information shown in Figure 6, the dataset also includes features such as “CID”, “SMILES”, “MolecularFormula”, “InChI”, “InChIKey”, and “IUPACName”, which enable us to visualize molecules and even draw them in 2d or 3d. The “pIC50” feature will be our target: higher pIC50 values indicate that a molecule is more selective against Sars-CoV-2.

Figure 7. Target feature

Python libraries

We will use the Giotto-tda library to perform the computations. gtda is a high-performance topological machine learning toolbox written in Python. The real advantage of using gtda, in my opinion, is its direct integration with scikit-learn and its support for tabular data, time series, graphs, and images. Both assets make it simple to construct machine learning pipelines.

STEP 1. Building a point cloud

We have a high-dimensional space (100 x 40) that can be considered a point cloud. To visualize this dataset, we have two options: visualize only the first two or three features (for a 2d or 3d plot) and process the high-dimensional data with mapper, or perform a dimensionality reduction to two or three components. Let’s begin with the second strategy.

Despite gtda’s integration with scikit-learn, we are going to use another library for dimensionality reduction: Uniform Manifold Approximation and Projection (UMAP). It is a recent technique with many advantages over other algorithms, most notably increased speed and better preservation of the global structure of the data. umap is a stochastic algorithm that uses graph layout algorithms to arrange data in a low-dimensional space. umap builds a high-dimensional graph representation of the data and then optimizes a low-dimensional graph to be as structurally similar to it as possible. In order to construct the initial high-dimensional graph, umap builds something called a fuzzy simplicial complex: a representation of a weighted graph, with edge weights representing the likelihood that two points are connected. As with a simplicial complex, umap extends a radius from each point, connecting points when those radii overlap; it then “fuzzes” the graph by decreasing the probability of connection as the radius grows. Behind the scenes, by requiring that each point be connected to at least one of its nearest neighbors, umap ensures that local structure is kept in balance with global structure (5).

import umap
import plotly.express as px

# All feature columns up to and including the target
features = clean.loc[:, :'pIC50']

# UMAP hyperparameters
n_neighbors = 10
min_dist = 0.5

# 2d and 3d projections of the feature space (target column excluded)
umap_2d = umap.UMAP(n_neighbors=n_neighbors,
                    n_components=2,
                    min_dist=min_dist,
                    init='random',
                    random_state=0)
umap_3d = umap.UMAP(n_neighbors=n_neighbors,
                    n_components=3,
                    min_dist=min_dist,
                    init='random',
                    random_state=0)
proj_2d = umap_2d.fit_transform(clean.drop(columns='pIC50'))
proj_3d = umap_3d.fit_transform(clean.drop(columns='pIC50'))

# Scatter plots colored by the target feature pIC50
fig_2d = px.scatter(proj_2d, x=0, y=1,
                    color=clean['pIC50'],
                    labels={'color': 'pIC50'})
fig_3d = px.scatter_3d(proj_3d, x=0, y=1, z=2,
                       color=clean['pIC50'],
                       labels={'color': 'pIC50'})
fig_2d.update_layout(title='UMAP projection 2D and 3D')
fig_3d.update_traces(marker_size=5)
fig_2d.update_layout({'plot_bgcolor': 'aliceblue', 'paper_bgcolor': 'white'}, template='plotly_white')
fig_3d.update_layout({'plot_bgcolor': 'aliceblue', 'paper_bgcolor': 'white'}, template='plotly_white')
fig_2d.show()
fig_3d.show()
Figure 8. Dataset weighted graph with UMAP. Color scale corresponds to pIC50 values

Figure 8 shows the result of reducing our data set to three components with umap. Points are colored by pIC50 value, so yellow and orange points represent the most effective molecules. These points are not easily separable from the blue-scale points in the point cloud. But there is something curious in the point cloud: we can see at least two loops! As we have seen before, simplicial complexes give us a method to compute the persistent homology of point clouds.

Figure 9. Dataset weighted graph with UMAP. Color scale corresponds to binarized pIC50 values

Figure 9 shows the same plot, but with point colors corresponding to binarized pIC50 values: positive values in orange and negative values in blue. With this view we note that the orange points are concentrated mainly in one region. But the most important thing: we have two loops, and we have effective points (in orange) very close to each loop.

STEP 2. Visualization with Mapper algorithm

We come across a topological space (our data set) and we need to find its fundamental group. But our data set is an unfamiliar space, and it is too difficult to look for explicit loops and relations. So we look for another space that is homotopy equivalent to ours and whose fundamental group is much easier to compute. Since both spaces are homotopy equivalent, we know the fundamental group of our space is isomorphic to the fundamental group of the new space.

gtda provides the features needed to run the mapper algorithm on point clouds. Mapper has a tricky tuning process, so we will briefly describe its set of hyperparameters. Following the steps shown in figure 5, we have to:

  1. Choose a filter function,
  2. define a covering,
  3. and compute clustering.

Filter function

gtda offers three alternatives: projection, entropy, and eccentricity. My suggestion for a point cloud is to choose eccentricity, as it offers better filtration resolution. The default distance metric for eccentricity is the Euclidean distance, and we keep this default.

Covering

We have two options: CubicalCover for multi-dimensional data and OneDimensionalCover for one-dimensional data. CubicalCover involves two important choices: the number of intervals and the overlap fraction.

If we look at the example in figure 5, the space is covered with four intervals. A higher number of intervals means more points to cluster and more nodes in the graph. The overlap fraction sets the boundary between two intervals. We are interested in having at least ten or fifteen intervals to get a rich data representation. To avoid hiding local structures in the data, we choose overlap fractions between 20 and 40%.

Cluster

The default clustering algorithm is scikit-learn’s DBSCAN. The most important choice we have to make is the epsilon parameter eps, that is, the maximum distance between two samples for one to be considered in the neighborhood of the other. The results of the mapper algorithm depend strongly on this parameter. Due to its importance, we are going to use a metric to check the separation distance between the resulting DBSCAN clusters: in our example we use silhouette analysis.
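
Before fixing eps, one reasonable approach (a sketch under the assumption that the cleaned DataFrame is called clean, as in the rest of the article) is to sweep a range of eps values and keep the one with the best silhouette score on the non-noise points:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

data = clean.drop(columns='pIC50')

best_eps, best_score = None, -1.0
for eps in np.arange(2.0, 20.0, 1.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit(data).labels_
    mask = labels != -1                       # ignore noise points
    if len(set(labels[mask])) < 2:            # silhouette needs at least 2 clusters
        continue
    score = silhouette_score(data[mask], labels[mask])
    if score > best_score:
        best_eps, best_score = eps, score

print('best eps: %s (silhouette %.3f)' % (best_eps, best_score))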

Mapper

The mapper algorithm is assembled as a pipeline with the parameters described above:

# Build a pipeline for the mapper algorithm
make_mapper_pipeline(filter_func=filter_func,
                     cover=cover,
                     clusterer=clusterer)

For plotting the graph we can choose between a static mapper plot and an interactive one. The interactive mapper allows interactivity on the pipeline parameters. For our example we are going to use plot_interactive_mapper_graph. With this option we can choose the graph layout, as specified in the igraph library, and also the variable we want to visualize as color; in our case we use the continuous pIC50 value.
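
For reference, here is a minimal sketch of the interactive call (argument names follow the gtda version used in this article; pipe_mapper and data are defined in the code below):

from gtda.mapper import plot_interactive_mapper_graph

# Interactive version: widgets allow live tuning of cover and clusterer parameters
fig = plot_interactive_mapper_graph(pipe_mapper, data,
                                    layout='fruchterman_reingold',
                                    color_variable=clean['pIC50'])
fig  # renders as an interactive widget in a Jupyter notebook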

""" 1. Define filter function – can be any scikit-learn transformer.It is returning a selection of columns of the data """filter_func = Eccentricity(metric= 'euclidean') #Eccentricities of points in a point cloud or abstract metric space.""" 2. Define cover """
cover = CubicalCover(n_intervals=30, overlap_frac=0.3)
""" 3. Choose clustering algorithm – default is DBSCAN """
clusterer = DBSCAN(eps=8, min_samples=3, metric='euclidean')
""" 4. Initialise pipeline """
pipe_mapper = make_mapper_pipeline(
filter_func=filter_func, cover=cover, clusterer=clusterer, verbose=False, n_jobs=-1
)

Cluster silhouette analysis:

from sklearn import metrics

# Check the clustering performance
data = clean.drop(columns='pIC50')
db = clusterer.fit(data)
labels = db.labels_

# Number of clusters found, ignoring noise points (label -1)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)

""" The best value of the silhouette score is 1 and the worst value is -1.
Values near 0 indicate overlapping clusters. Negative values generally
indicate that a sample has been assigned to the wrong cluster, as a
different cluster is more similar. """
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(proj_3d, labels))

Check result:

Estimated number of clusters: 1
Estimated number of noise points: 88
Silhouette Coefficient: 0.003

Plot graph:

from gtda.mapper import plot_static_mapper_graph

# Color nodes with a red-blue scale and plot the static mapper graph
plotly_params = {"node_trace": {"marker_colorscale": "RdBu"}}
fig = plot_static_mapper_graph(
    pipe_mapper, data, layout='fruchterman_reingold', color_by_columns_dropdown=True,
    color_variable=clean['pIC50'], node_scale=20, plotly_params=plotly_params
)
fig.show(config={'scrollZoom': True})

In a mapper graph, node size represents the number of points in the node, and node color represents the average pIC50 value of the points it contains.

Figure 10 illustrates the resulting graph. We see three connected components and four non-connected nodes. There is a main component 1 (which groups most of the points, judging by its node sizes) with two flares at the end: one for lower pIC50 values (flare 1) and another for higher pIC50 values (flare 2). In component 2, we can see a scale of three nodes going from lower to higher pIC50 values. Component 3 has no apparent order and looks like a set of connected nodes with different numbers of points.

Figure 10. Mapper graph of the Sars-CoV-2 drug discovery data set.

One advantage of interactive visualization is that we can choose to color nodes by any feature in the dataset. In figure 10 we can see the same graph colored by “pIC50”, “HydrophobeCount3D”, and “DonorCount3D”. We can therefore explore the graph feature by feature, looking for correlations and getting data insights. Also, by clicking on each node, we can access the data points it contains in order to work with information directly from the point cloud. As an example, we have chosen some points from the most interesting nodes and selected a feature for visualizing molecules: one of the features included in the dataset is SMILES, which stands for Simplified Molecular Input Line Entry Specification. With SMILES we can plot 2d or 3d drawings of the molecules. We have selected some sample points from relevant nodes and represented them as molecule drawings (we have used ChemDraw and molview.org for the representation). The results are the following:
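
As a hedged alternative to ChemDraw or molview.org, the same kind of 2d drawing can be generated programmatically with the RDKit library (assuming it is installed; the SMILES string below is only a placeholder, in practice take one from the dataset’s “SMILES” column):

from rdkit import Chem
from rdkit.Chem import Draw

# Placeholder SMILES (aspirin); substitute a value from the dataset
smiles = 'CC(=O)Oc1ccccc1C(=O)O'
mol = Chem.MolFromSmiles(smiles)
Draw.MolToFile(mol, 'molecule.png', size=(300, 300))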

Figure 11
Figure 12

Figures 12 and 13 show the resulting graph colored by “pIC50” and “DonorCount3D”, a count of hydrogen-bond donors. We can see that pIC50 is inversely related to DonorCount3D, illustrating that, in general, compounds that are H-bond donors are less effective against Sars-CoV-2. It means that molecules with some highly polar hydrogen atom bonded to a strongly electronegative atom (nitrogen N, oxygen O, or fluorine F) do not inhibit the virus as effectively. It has been demonstrated that a single hydrogen-bond interaction can decide the potency of drug-like molecules for a target when all other interactions stay constant (6). Hydrogen bonds can be categorized by distance: 2.2–2.5 Å as strong, mostly covalent; 2.5–3.2 Å as moderate, mostly electrostatic; and 3.2–4.0 Å as weak, mostly electrostatic (7). So, adding a “distance” feature to the dataset could reveal more details about the role of H-bonds in molecule effectiveness against the Sars-CoV-2 virus.
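
As a tiny illustration of that categorization (a hypothetical helper of our own; only the thresholds come from (7)):

def hbond_strength(donor_acceptor_distance):
    """Categorize a hydrogen bond by donor-acceptor distance in angstroms (7)."""
    if 2.2 <= donor_acceptor_distance < 2.5:
        return 'strong, mostly covalent'
    if 2.5 <= donor_acceptor_distance < 3.2:
        return 'moderate, mostly electrostatic'
    if 3.2 <= donor_acceptor_distance <= 4.0:
        return 'weak, electrostatic'
    return 'outside the typical hydrogen-bond range'

print(hbond_strength(2.8))  # moderate, mostly electrostatic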

Figure 13. Molecule scale resulting from the graph.

One distinguishing feature of mapper is that it shows both clusters and links (abstract distances) at the same time. Component 2 in Figure 10 depicts a scale of pIC50 values across three associated nodes and their corresponding points. Figure 13 shows several molecule drawings from node samplings, sorted by pIC50 value and “distance” to other molecules. Figures 14 and 15 show 2d and 3d representations of certain molecules found in the most important nodes of the main branch and both flares. The idea is that we can classify these molecules based on pIC50 (or some other feature) and assume that these molecules are closer together than others.

Figure 14
Figure 15
Figure 16. Distances between molecules

Figure 16 depicts an example of qualitative data. We see molecules that are linked and molecules that are not. Non-connected molecules (those belonging to nodes that are not connected to any other node) are farther away from the other molecules, whereas connected molecules are closer; we have shown a selection of molecules from these nodes. We must note that “distance” here is a matter of interpretation: there are often many equally valid notions of distance in the same space. The benefit of TDA is that we do not have to worry about this choice to obtain insights. This depiction can be very useful for experts to understand visually which reaction mechanisms of drugs against the Sars-CoV-2 virus may occur.

Conclusions

There is a fascinating line of inquiry in designing potential antivirals. Drug classification based on tested effectiveness against Sars-CoV-2 is a critical problem today. We have a dataset that includes a list of medications and their corresponding effectiveness measurements, and we classified these potential drugs based on certain characteristics (40 features in total). As a consequence, we have a large dataset and a classic classification problem: we ask a machine learning classifier for an accurate model, so that when we test a new potential drug, it can return its efficiency rate based on what the model has learned.

TDA might be able to provide us with additional information. It can provide answers to the why and how questions, allowing a reinterpretation of the findings. TDA can also provide us with quantitative data (see the second part of this article). TDA, and qualitative analysis in general, is useful when machine learning classification models are not sufficiently accurate.

References:

[1] G. Carlsson, Topology and data, Bull. Amer. Math. Soc. 46 (2009), 255–308. https://doi.org/10.1090/S0273-0979-09-01249-X

[2] Cohen-Steiner, D., Edelsbrunner, H. & Harer, J. Stability of Persistence Diagrams. Discrete Comput Geom 37, 103–120 (2007). https://doi.org/10.1007/s00454-006-1276-5

[3] H. Edelsbrunner, D. Letscher, A. Zomorodian, Topological Persistence and Simplification, 2000.

[4] G. Singh, F. Mémoli and G. Carlsson, Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007.

[5] E. Riehl, Category theory in context, Courier Dover Publications, 2017.

[6] Bauer, C.A., Schneider, G. & Göller, A.H. Machine learning models for hydrogen bond donor and acceptor strengths using large and diverse training data generated by first-principles interaction free energies. J Cheminform 11, 59 (2019). https://doi.org/10.1186/s13321-019-0381-4

[7] Jeffrey, George A.; An introduction to hydrogen bonding, Oxford University Press, 1997.

Github repository:

Readers can find the code and data used for this article at https://github.com/Javihaus/KAGGLE-COVID-19-Drug-Discovery.
