Below we provide information and pointers to datasets that are either already represented as a graph, or are relational in nature and lend themselves to a graph representation. Ultimately, we plan to evolve this resource into a collection of graph datasets that can be used as a testbed or benchmark for graph-based algorithms.
Before we list the datasets, we note that issues of file format and semantics will need to be addressed in order to make this collection useful for comparisons.
The Mutagenesis data describes several chemical compounds and classifies them as mutagenic or not mutagenic. A Prolog format version of the data is available, which consists of 188 examples in a regression-friendly format (mutagenesis-f.pl) and a regression-unfriendly format (mutagenesis-u.pl).
Another version of the mutagenesis data is available as part of the 2000 PAKDD Challenge. This data is in the Structure Data File (SDFile) format, which is popular in chemistry.
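As a rough illustration of how SDFile data lends itself to a graph representation, the sketch below reads the atom and bond blocks of each record and returns vertex labels and labeled edges. It assumes the fixed-width V2000 connection-table layout (counts on the fourth line, records terminated by `$$$$`). The sample record and the function names `split_sdf` and `parse_record` are hypothetical and do not come from the challenge data.

```python
def split_sdf(text):
    """Yield each SDFile record as a list of lines (records end with '$$$$')."""
    record = []
    for line in text.splitlines():
        if line.startswith("$$$$"):
            yield record
            record = []
        else:
            record.append(line)


def parse_record(lines):
    """Return (atoms, bonds) for one V2000 record.

    atoms: list of element symbols, in file order
    bonds: list of (atom_index_1, atom_index_2, bond_type), 1-based indices
    """
    counts = lines[3]                      # line 4 holds the atom/bond counts
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    # Element symbol sits in columns 32-34 of each atom line.
    atoms = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    bonds = []
    for i in range(n_bonds):
        row = lines[4 + n_atoms + i]
        bonds.append((int(row[0:3]), int(row[3:6]), int(row[6:9])))
    return atoms, bonds


# Hypothetical two-atom record, built with format strings to keep columns exact.
SAMPLE = "\n".join([
    "example-1", "  hypothetical", "",
    "%3d%3d  0  0  0  0  0  0  0  0999 V2000" % (2, 1),
    "%10.4f%10.4f%10.4f %-3s 0  0" % (0.0, 0.0, 0.0, "C"),
    "%10.4f%10.4f%10.4f %-3s 0  0" % (1.4, 0.0, 0.0, "O"),
    "%3d%3d%3d  0" % (1, 2, 1),
    "M  END", "$$$$",
])

for rec in split_sdf(SAMPLE):
    print(parse_record(rec))   # (['C', 'O'], [(1, 2, 1)])
```

Atoms become vertices labeled by element and bonds become edges labeled by bond type. Data fields that follow `M  END` in a real file are ignored by this sketch.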
The Predictive Toxicology Challenge (PTC) data describes chemical compounds labeled by whether or not they cause cancer in rats and mice. The data is available at the above link.
The authors have developed a program that converts the atomic structure within Protein Data Bank (PDB) files into SUBDUE graph format. The source code and sample data are available here. More protein data is available from the Protein Data Bank (PDB).
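For readers unfamiliar with the target format, the following sketch writes a small labeled graph in what we understand to be the usual SUBDUE text conventions: `v <id> <label>` for a vertex and `u <id1> <id2> <label>` for an undirected edge, with vertex ids starting at 1. This is an assumption about the format rather than a copy of the authors' converter, and the function name `to_subdue` and the three-atom example are hypothetical.

```python
def to_subdue(vertex_labels, edges):
    """Serialize a graph as SUBDUE-style text.

    vertex_labels: list of labels; vertex ids are assigned 1, 2, 3, ...
    edges: list of (id1, id2, label) tuples, written as undirected edges
    """
    lines = []
    for vid, label in enumerate(vertex_labels, start=1):
        lines.append(f"v {vid} {label}")
    for id1, id2, label in edges:
        lines.append(f"u {id1} {id2} {label}")
    return "\n".join(lines) + "\n"


# Hypothetical fragment of a protein backbone: N - CA - C
text = to_subdue(["N", "CA", "C"], [(1, 2, "bond"), (2, 3, "bond")])
print(text, end="")
```

A real PDB conversion would also have to decide which atoms are bonded, since PDB files do not list most bonds explicitly. That step is outside the scope of this sketch.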
Citation graphs have become a popular domain for graph-based data mining. We provide pointers to two sets of citation data that can be converted to graph form.
The MovieLens database is collaborative-filtering data collected from a movie recommender system. It contains information about users, movies, and users' ratings of the movies. The data is available here.
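One natural graph view of this data is bipartite: users and movies are vertices, and each rating is a labeled edge between a user and a movie. The sketch below assumes tab-separated rating lines of the form user id, movie id, rating, timestamp, as in the MovieLens 100K ratings file. The sample lines are made up, and the function names are hypothetical.

```python
from collections import defaultdict


def ratings_to_edges(lines):
    """Turn 'user<TAB>movie<TAB>rating<TAB>timestamp' lines into labeled edges."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, movie, rating, _timestamp = line.split("\t")
        # Prefix ids so user 10 and movie 10 remain distinct vertices.
        edges.append((f"u{user}", f"m{movie}", int(rating)))
    return edges


def raters_by_movie(edges):
    """Index the bipartite graph from the movie side."""
    index = defaultdict(list)
    for user, movie, rating in edges:
        index[movie].append((user, rating))
    return dict(index)


# Made-up sample lines in the assumed layout.
sample = [
    "1\t10\t5\t874965758",
    "1\t20\t3\t876893171",
    "2\t10\t4\t878542960",
]
edges = ratings_to_edges(sample)
print(raters_by_movie(edges)["m10"])   # [('u1', 5), ('u2', 4)]
```

Keeping the rating as an edge label preserves the information that a graph-based algorithm would need in order to distinguish strong from weak preferences.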
The Internet Movie Database (IMDb) contains information about movies, actors, directors, etc. See their page on interfaces for more information on obtaining the data.
The authors have developed a program that extracts movie data from the IMDb and converts it to a SUBDUE-formatted graph. See the readme.txt file and the source code imdb2graph.c.
We are still looking for data on web structure. The authors have developed some web crawlers that can convert portions of the web to a graph, which can be made available upon request.
Synthetically generated graphs offer the ability to control various properties (e.g., size and average degree). However, generating graphs with specific properties (e.g., a power-law degree distribution) is not always easy. We need to answer two questions:
Here are some existing synthetic graph datasets.
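To make the notion of a controllable generator concrete, here is a minimal sketch of a preferential-attachment generator in the style of the Barabási-Albert model, which is commonly used to obtain an approximately power-law degree distribution. It is offered only as one possible approach, not as the method behind any dataset listed here. The function name and the parameters `n` (number of vertices), `m` (edges added per new vertex), and `seed` are our own choices.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Return an edge list for an n-vertex graph grown by preferential attachment.

    Starts from a star on vertices 0..m, then each new vertex attaches to m
    distinct existing vertices chosen with probability proportional to degree.
    """
    if not (n > m >= 1):
        raise ValueError("need n > m >= 1")
    rng = random.Random(seed)
    edges = []
    pool = []  # each vertex appears once per incident edge (degree-weighted)
    for v in range(m):
        edges.append((v, m))
        pool += [v, m]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for target in chosen:
            edges.append((new, target))
            pool += [new, target]
    return edges


edges = preferential_attachment(n=50, m=2, seed=1)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
print(len(edges), max(degree.values()))
```

Size is controlled by `n` and average degree (roughly `2m`) by `m`. Fixing `seed` makes a generated graph reproducible, which matters if the graph is to serve as a benchmark.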