Database page: ‘A dynamic programming algorithm for a maximum \(s\)-clique set on trees’

Authors: José Alberto Fernández-Zepeda, Alejandro Flores-Lamas, Matthew Hague, Joel Antonio Trejo-Sánchez.

Description

This webpage contains the dataset for the experiments in the article “A dynamic programming algorithm for a maximum \(s\)-clique set on trees”. This presentation-only paper is currently under consideration for acceptance for The 16th International Conference on Reachability Problems (RP’22).

Abstract

Given an undirected graph \(G = (V_G, E_G)\), a clique \(C\) is a complete subgraph of \(G\). In social networks analysis, the unit distance of the clique makes it challenging to model certain social concepts. For those applications, a relaxed variant of the clique is more appropriate. A \(s\)-clique \(Q\) is a maximal subgraph of \(G\), such that the distance in \(G\) between any pair of vertices of \(Q\) is less than or equal to some positive integer \(s\). In this sense, a clique is also a \(1\)-clique. The maximum \(s\)-clique problem consists of finding a \(s\)-clique with the greatest amount of vertices in \(G\). Such a problem is NP-hard for arbitrary graphs and any \(s\). In this work, we propose a dynamic programming algorithm that solves this problem on a tree of order \(n\) in \(O(s \cdot n)\) time, for \(s \geq 2\). Our algorithm improves, theoretically and experimentally, the performance of previous algorithms that compute a maximum \(s\)-clique on trees.

Keywords: \(s\)-clique, Graph algorithms, Trees and Dynamic programming.

Data documentation

The data set consists of six classes of trees.

The subset \({\cal T}_{22\_16}\) from Professor Brendan McKay. We chose the dataset ‘tree22.16.txt’ containing \(12\,761\) different trees where each graph has \(22\) vertices and a diameter of \(16\). The experiments used this set with values of \(k \in \{8, 9, \ldots, 16\}\). In Table 2, such a set appears only in Row 2 since the cardinality of this set is large and the order of each graph is small.
The subset \({\cal T}_{Ph}\) is a set of phylogenetic trees chosen from the catalogue of gene phylogenies PhylomeDB. The file ‘phylomones/phylomone_0003/ best_trees.txt.gz’ contains \(5\,722\) phylogenetic trees. The order of the largest tree is \(297\), and the graphs’ diameters range from \(2\) to \(64\). The experiments used this set with values of \(k\) from the set \(\{2, 3, 4, 7, 8, 13, 16, 25,\) \(32, 49, 64\}\). In Table 2, such a set appears only in Row since the cardinality of this set is large and the order of each graph is small.
The graph \(T_D\) is a doubly logarithmic tree of height 5 (\(|V_{T_D}| = 119\,041\)). The experiments used this graph with values of \(k = \{4, 5\}\). See JáJá, J.: An introduction to parallel algorithms. Addison Wesley Longman Publishing Co., Inc., USA (1992)
The graph \(T_L\) is a linear tree with \(10\,000\) vertices. We could think of \(T_L\) as an ‘opposite’ of \(T_D\) since \(T_L\) is high and each level has precisely one vertex. The experiments used this graph with values of \(k\) from the set \(\{10, 100, 1\,000, 10\,000\}\). See path_graph(n, create_using=None).
The graph \(T_B\) is a full balanced binary tree with \(2^{16}\) leaves. The experiments used this tree with values of \(k = \{5,\) \(10,\) \(15,\) \(20,\) \(25,\) \(30\}\). See balanced_tree(r, h, create_using=None).
Trees \(T_{\eta\mathrm{e}{6}}\), for \(\eta \in \{0.3, 0.5, 0.75, 1\}\), and \(\mathrm{e}{6} = 10^6\): these graphs are uniformly random trees built with the instruction ‘nx.random_tree(…)’ from NetworkX. Their orders are \(0.3\times 10^6,\) \(0.5 \times 10^6,\) \(0.75 \times 10^6,\) and \(1 \times 10^6\) vertices, respectively. For each tree, we used the following values of \(k = \{10, 100, 500\}\). See random_tree(n, seed=None, create_using=None)

Table 1. Source files (in GML format) for the cases of studies; files are zipped.

Case of study	File	Case of study	File
\({\cal T}_{22\_16}\)	T_22_16.zip	\(T_L\)	T_L.zip
\({\cal T}_{Ph}\)	T_Ph.zip	\(T_B\)	T_B.zip
\(T_D\)	T_D.zip	\(T_{\eta\mathrm{e}{6}}\)	T_eta.zip
All files	DB.zip

Metrics

We used the wall-clock running time required by Schäfer’s¹ and Max-DkT² implementations. We reported and compared the time required for computing an MDkC³ size on each input graph; i.e., we did not account for operations such as reading/writing files, data structure initialisation, and MDkC vertices identification. Since both algorithms are correct, we did not consider the size of the MDkC.

Computing environment

We implemented both algorithms with Python⁴ 3.10 and the Python package NetworkX⁵ version 2.8. Experiments ran on a 2012 MacBook Pro @2.3 GHz Quad-Core Intel Core i7, running a 64 bit macOS Catalina version 10.15.7 with a total memory of 8 GB (1600 MHz DDR3). This computer executed each algorithm implementation sequentially without any resource allocation. The graph format is GML⁶.

Experimental results

Table 2. Wall-clock running time of Schäfer’s implementation and MaxDkT on the six cases of studies. We denote ‘timeouts’ and ‘not applicable’ by ‘—’ and ‘n/a’, respectively.

		Running time	in seconds
Row	Graph	\(t_{\text{Sch{\"a}fer}}\)	\(t_{\text{Max-D}k\text{T}}\)	\(\frac{t_{\text{Sch{\"a}fer}}}{t_{\text{Max-D}k\text{T}}}\)
1	\({\cal T}_{22\_16}\)	132.4492	60.8545	2.1
2	\(T_D,\; k = 4\)	109.9832	1.8577	59.2
3	\(T_D,\; k = 5\)	116.5945	1.5734	74.1
4	\(T_L, \; k = 10\)	1.5289	0.5584	2.7
5	\(T_L, \; k = 100\)	12.6688	4.8969	2.5
6	\(T_L, \; k = 1\,000\)	602.9626	45.2492	13.3
7	\(T_L, \; k = 10\,000\)	19,859.2017	241.7339	82.1
8	\(T_B, \; k = 5\)	117.9299	2.0445	57.6
9	\(T_B, \; k = 10\)	121.7708	1.9870	61.2
10	\(T_B, \; k = 15\)	125.8261	1.9811	63.5
11	\(T_B, \; k = 20\)	125.4565	1.9592	64.0
12	\(T_B, \; k = 25\)	126.7537	1.9809	63.9
13	\(T_B, \; k = 30\)	148.4265	1.9831	74.8
14	\({\cal T}_{Ph}\)	719.4773	185.4435	3.8
15	\(T_{0.3\mathrm{e}{6}},\; k = 10\)	1,022.6843	9.4978	107.7
16	\(T_{0.3\mathrm{e}{6}},\; k = 100\)	1,439.42	17.3046	83.1
17	\(T_{0.3\mathrm{e}{6}},\; k = 500\)	5,989.8246	74.7689	80.1
18	\(T_{0.5\mathrm{e}{6}},\; k = 10\)	2,945.4381	15.6059	188.7
19	\(T_{0.5\mathrm{e}{6}},\; k = 100\)	3,654.6010	44.1117	82.8
20	\(T_{0.5\mathrm{e}{6}},\; k = 500\)	—	2,661.9666	n/a
21	\(T_{0.75\mathrm{e}{6}},\; k = 10\)	6,436.4398	22.7611	282.7
22	\(T_{0.75\mathrm{e}{6}},\; k = 100\)	7,807.3684	94.7391	82.4
23	\(T_{0.75\mathrm{e}{6}},\; k = 500\)	—	5,657.6981	n/a
24	\(T_{1\mathrm{e}{6}},\; k = 10\)	11,529.7250	31.8786	361.6
25	\(T_{1\mathrm{e}{6}},\; k = 100\)	14,167.638	198.9311	71.2
26	\(T_{1\mathrm{e}{6}},\; k = 500\)	—	11,422.2738	n/a

Table 3. Additional information and raw data.

Case of study	File	Case of study	File
\({\cal T}_{22\_16}\)	T_22_16.tsv	\(T_L\)	T_L.tsv
\({\cal T}_{Ph}\)	T_Ph.tsv	\(T_B\)	T_L.tsv
\(T_D\)	T_D.tsv	\(T_{\eta\mathrm{e}{6}}\)	Random.tsv

Note: In the .tsv files, the column labelled as ‘diameter’ represents the value for the integer \(k\).

License

Free

Schäfer, A.: Exact algorithms for s-club finding and related problems. Ph.D. thesis, Friedrich-Schiller-University Jena (2009).↩︎
Our dynamic programming algorithm for a maximum distance \(s\)-clique set on trees.↩︎
A maximum distance \(s\)-clique on trees.↩︎
Van Rossum, G., Drake, F.L.: Python 3 Reference Manual. CreateSpace, Scotts Valley, CA (2009)↩︎
Hagberg, A., Swart, P., S Chult, D.: Exploring network structure, dynamics, and function using networkx. Tech. rep., Los Alamos National Lab.(LANL), Los Alamos, NM (United States) (2008)↩︎
Himsolt, M., 1997. GML: A portable graph file format. Technical report, Universitat Passau.↩︎