Authors: José Alberto Fernández-Zepeda, Alejandro Flores-Lamas, Matthew Hague, Joel Antonio Trejo-Sánchez.
This webpage contains the dataset for the experiments in the article “A dynamic programming algorithm for a maximum \(s\)-clique set on trees”. This presentation-only paper is currently under consideration for acceptance for The 16th International Conference on Reachability Problems (RP’22).
Given an undirected graph \(G = (V_G, E_G)\), a clique \(C\) is a complete subgraph of \(G\). In social networks analysis, the unit distance of the clique makes it challenging to model certain social concepts. For those applications, a relaxed variant of the clique is more appropriate. A \(s\)-clique \(Q\) is a maximal subgraph of \(G\), such that the distance in \(G\) between any pair of vertices of \(Q\) is less than or equal to some positive integer \(s\). In this sense, a clique is also a \(1\)-clique. The maximum \(s\)-clique problem consists of finding a \(s\)-clique with the greatest amount of vertices in \(G\). Such a problem is NP-hard for arbitrary graphs and any \(s\). In this work, we propose a dynamic programming algorithm that solves this problem on a tree of order \(n\) in \(O(s \cdot n)\) time, for \(s \geq 2\). Our algorithm improves, theoretically and experimentally, the performance of previous algorithms that compute a maximum \(s\)-clique on trees.
Keywords: \(s\)-clique, Graph algorithms, Trees and Dynamic programming.
The data set consists of six classes of trees.
The subset \({\cal T}_{22\_16}\) from Professor Brendan McKay. We chose the dataset ‘tree22.16.txt’ containing \(12\,761\) different trees where each graph has \(22\) vertices and a diameter of \(16\). The experiments used this set with values of \(k \in \{8, 9, \ldots, 16\}\). In Table 2, such a set appears only in Row 2 since the cardinality of this set is large and the order of each graph is small.
The subset \({\cal T}_{Ph}\) is a set of phylogenetic trees chosen from the catalogue of gene phylogenies PhylomeDB. The file ‘phylomones/phylomone_0003/ best_trees.txt.gz’ contains \(5\,722\) phylogenetic trees. The order of the largest tree is \(297\), and the graphs’ diameters range from \(2\) to \(64\). The experiments used this set with values of \(k\) from the set \(\{2, 3, 4, 7, 8, 13, 16, 25,\) \(32, 49, 64\}\). In Table 2, such a set appears only in Row since the cardinality of this set is large and the order of each graph is small.
The graph \(T_D\) is a doubly logarithmic tree of height 5 (\(|V_{T_D}| = 119\,041\)). The experiments used this graph with values of \(k = \{4, 5\}\). See JáJá, J.: An introduction to parallel algorithms. Addison Wesley Longman Publishing Co., Inc., USA (1992)
The graph \(T_L\) is a linear tree with \(10\,000\) vertices. We could think of \(T_L\) as an ‘opposite’ of \(T_D\) since \(T_L\) is high and each level has precisely one vertex. The experiments used this graph with values of \(k\) from the set \(\{10, 100, 1\,000, 10\,000\}\). See path_graph(n, create_using=None).
The graph \(T_B\) is a full balanced binary tree with \(2^{16}\) leaves. The experiments used this tree with values of \(k = \{5,\) \(10,\) \(15,\) \(20,\) \(25,\) \(30\}\). See balanced_tree(r, h, create_using=None).
Trees \(T_{\eta\mathrm{e}{6}}\), for \(\eta \in \{0.3, 0.5, 0.75, 1\}\), and \(\mathrm{e}{6} = 10^6\): these graphs are uniformly random trees built with the instruction ‘nx.random_tree(…)’ from NetworkX. Their orders are \(0.3\times 10^6,\) \(0.5 \times 10^6,\) \(0.75 \times 10^6,\) and \(1 \times 10^6\) vertices, respectively. For each tree, we used the following values of \(k = \{10, 100, 500\}\). See random_tree(n, seed=None, create_using=None)
Table 1. Source files (in GML format) for the cases of studies; files are zipped.
Case of study | File | Case of study | File |
---|---|---|---|
\({\cal T}_{22\_16}\) | T_22_16.zip | \(T_L\) | T_L.zip |
\({\cal T}_{Ph}\) | T_Ph.zip | \(T_B\) | T_B.zip |
\(T_D\) | T_D.zip | \(T_{\eta\mathrm{e}{6}}\) | T_eta.zip |
All files | DB.zip |
We used the wall-clock running time required by Schäfer’s1 and Max-DkT2 implementations. We reported and compared the time required for computing an MDkC3 size on each input graph; i.e., we did not account for operations such as reading/writing files, data structure initialisation, and MDkC vertices identification. Since both algorithms are correct, we did not consider the size of the MDkC.
We implemented both algorithms with Python4 3.10 and the Python package NetworkX5 version 2.8. Experiments ran on a 2012 MacBook Pro @2.3 GHz Quad-Core Intel Core i7, running a 64 bit macOS Catalina version 10.15.7 with a total memory of 8 GB (1600 MHz DDR3). This computer executed each algorithm implementation sequentially without any resource allocation. The graph format is GML6.
Table 2. Wall-clock running time of Schäfer’s implementation and MaxDkT on the six cases of studies. We denote ‘timeouts’ and ‘not applicable’ by ‘—’ and ‘n/a’, respectively.
Running time | in seconds | |||
---|---|---|---|---|
Row | Graph | \(t_{\text{Sch{\"a}fer}}\) | \(t_{\text{Max-D}k\text{T}}\) | \(\frac{t_{\text{Sch{\"a}fer}}}{t_{\text{Max-D}k\text{T}}}\) |
1 | \({\cal T}_{22\_16}\) | 132.4492 | 60.8545 | 2.1 |
2 | \(T_D,\; k = 4\) | 109.9832 | 1.8577 | 59.2 |
3 | \(T_D,\; k = 5\) | 116.5945 | 1.5734 | 74.1 |
4 | \(T_L, \; k = 10\) | 1.5289 | 0.5584 | 2.7 |
5 | \(T_L, \; k = 100\) | 12.6688 | 4.8969 | 2.5 |
6 | \(T_L, \; k = 1\,000\) | 602.9626 | 45.2492 | 13.3 |
7 | \(T_L, \; k = 10\,000\) | 19,859.2017 | 241.7339 | 82.1 |
8 | \(T_B, \; k = 5\) | 117.9299 | 2.0445 | 57.6 |
9 | \(T_B, \; k = 10\) | 121.7708 | 1.9870 | 61.2 |
10 | \(T_B, \; k = 15\) | 125.8261 | 1.9811 | 63.5 |
11 | \(T_B, \; k = 20\) | 125.4565 | 1.9592 | 64.0 |
12 | \(T_B, \; k = 25\) | 126.7537 | 1.9809 | 63.9 |
13 | \(T_B, \; k = 30\) | 148.4265 | 1.9831 | 74.8 |
14 | \({\cal T}_{Ph}\) | 719.4773 | 185.4435 | 3.8 |
15 | \(T_{0.3\mathrm{e}{6}},\; k = 10\) | 1,022.6843 | 9.4978 | 107.7 |
16 | \(T_{0.3\mathrm{e}{6}},\; k = 100\) | 1,439.42 | 17.3046 | 83.1 |
17 | \(T_{0.3\mathrm{e}{6}},\; k = 500\) | 5,989.8246 | 74.7689 | 80.1 |
18 | \(T_{0.5\mathrm{e}{6}},\; k = 10\) | 2,945.4381 | 15.6059 | 188.7 |
19 | \(T_{0.5\mathrm{e}{6}},\; k = 100\) | 3,654.6010 | 44.1117 | 82.8 |
20 | \(T_{0.5\mathrm{e}{6}},\; k = 500\) | — | 2,661.9666 | n/a |
21 | \(T_{0.75\mathrm{e}{6}},\; k = 10\) | 6,436.4398 | 22.7611 | 282.7 |
22 | \(T_{0.75\mathrm{e}{6}},\; k = 100\) | 7,807.3684 | 94.7391 | 82.4 |
23 | \(T_{0.75\mathrm{e}{6}},\; k = 500\) | — | 5,657.6981 | n/a |
24 | \(T_{1\mathrm{e}{6}},\; k = 10\) | 11,529.7250 | 31.8786 | 361.6 |
25 | \(T_{1\mathrm{e}{6}},\; k = 100\) | 14,167.638 | 198.9311 | 71.2 |
26 | \(T_{1\mathrm{e}{6}},\; k = 500\) | — | 11,422.2738 | n/a |
Table 3. Additional information and raw data.
Case of study | File | Case of study | File |
---|---|---|---|
\({\cal T}_{22\_16}\) | T_22_16.tsv | \(T_L\) | T_L.tsv |
\({\cal T}_{Ph}\) | T_Ph.tsv | \(T_B\) | T_L.tsv |
\(T_D\) | T_D.tsv | \(T_{\eta\mathrm{e}{6}}\) | Random.tsv |
Note: In the .tsv files
, the column labelled as ‘diameter’ represents the value for the integer \(k\).
Schäfer, A.: Exact algorithms for s-club finding and related problems. Ph.D. thesis, Friedrich-Schiller-University Jena (2009).↩︎
Our dynamic programming algorithm for a maximum distance \(s\)-clique set on trees.↩︎
A maximum distance \(s\)-clique on trees.↩︎
Van Rossum, G., Drake, F.L.: Python 3 Reference Manual. CreateSpace, Scotts Valley, CA (2009)↩︎
Hagberg, A., Swart, P., S Chult, D.: Exploring network structure, dynamics, and function using networkx. Tech. rep., Los Alamos National Lab.(LANL), Los Alamos, NM (United States) (2008)↩︎
Himsolt, M., 1997. GML: A portable graph file format. Technical report, Universitat Passau.↩︎