Yibo's Home

Collection and Streaming of Graph Datasets

Yibo Yao

Advisor: Dr. Larry Holder

School of Electrical Engineering and Computer Science

Washington State University, Pullman, WA 99164

I have collected several graph datasets from various application domains. These datasets are dynamic: nodes and edges are added, deleted, or modified over time. I converted the original datasets into GraphML format and streamed them as time series, using a time-window size suited to each dataset. This work is intended to help people understand the implicit relations among entities within each dataset and to facilitate the development of graph mining algorithms for dynamic graphs.

The GraphML-Attributes extension mechanism is used to describe the properties of entities and their relations in these datasets. In most of the collected datasets, the only dynamics are additions of nodes and edges over time, so each GraphML representation declares an attribute named 'modification' whose default value is 'add' for every node and edge. Some datasets, however, exhibit both additions and deletions of nodes and edges. For those, I explicitly attach a key/data label that sets the 'modification' attribute to 'delete' when a node or an edge is removed in a given time window.
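As a rough illustration of the 'modification' attribute described above, the following sketch builds one time window as a GraphML tree using Python's standard library. The function name `build_window`, the key ids `mod_n` and `mod_e`, and the graph id `window` are hypothetical names chosen for this example; the distributed source code may structure its output differently.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def build_window(nodes, edges, deleted_nodes=(), deleted_edges=()):
    """Build a <graphml> element for one time window (illustrative only)."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)

    # Declare 'modification' once for nodes and once for edges.
    # The <default> child makes 'add' the value of every element
    # that carries no explicit <data> label.
    for key_id, domain in (("mod_n", "node"), ("mod_e", "edge")):
        key = ET.SubElement(root, "key", {
            "id": key_id,
            "for": domain,
            "attr.name": "modification",
            "attr.type": "string",
        })
        ET.SubElement(key, "default").text = "add"

    graph = ET.SubElement(root, "graph", id="window", edgedefault="directed")

    # Added nodes and edges rely on the default and need no <data> child.
    for n in nodes:
        ET.SubElement(graph, "node", id=str(n))
    # Deleted nodes override the default explicitly.
    for n in deleted_nodes:
        node = ET.SubElement(graph, "node", id=str(n))
        ET.SubElement(node, "data", key="mod_n").text = "delete"

    for src, dst in edges:
        ET.SubElement(graph, "edge", source=str(src), target=str(dst))
    for src, dst in deleted_edges:
        edge = ET.SubElement(graph, "edge", source=str(src), target=str(dst))
        ET.SubElement(edge, "data", key="mod_e").text = "delete"

    return root


if __name__ == "__main__":
    window = build_window(["a", "b"], [("a", "b")],
                          deleted_nodes=["c"], deleted_edges=[("b", "c")])
    print(ET.tostring(window, encoding="unicode"))
```

Relying on the default keeps addition-only windows compact, since only removed elements carry an explicit label.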

The report can be found here. The following sections describe the datasets in more detail. I have also distributed the Python source code that I used to perform the streaming.

Autonomous System

This Autonomous System (AS) dataset was downloaded from the Stanford SNAP project and was originally collected by the University of Oregon Route Views Project. It contains 733 daily instances spanning an interval of 785 days, from Nov 8, 1997 to Jan 2, 2000. Each daily instance describes a who-talks-to-whom communication network in the system. I have converted the original data files into 733 GraphML instances with a 1-day time-window size.

  • Data link: AS@SNAP (dynamics: addition and deletion of nodes and edges)
  • Source code: AS.py (available time-window size: day)
  • Read me: readme
  • Sample GraphML: sample
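Because the AS data arrive as daily snapshots with both additions and deletions, one way to obtain the per-window changes is to compare consecutive snapshots. The sketch below is a minimal illustration and not necessarily what AS.py does; it assumes each snapshot is a list of undirected edges and that a node exists in a snapshot only if it appears in at least one edge. The function name `diff_snapshots` is hypothetical.

```python
def diff_snapshots(prev_edges, curr_edges):
    """Return the nodes and edges added or deleted between two snapshots."""
    # Treat edges as undirected: (2, 1) and (1, 2) are the same edge.
    prev = {tuple(sorted(e)) for e in prev_edges}
    curr = {tuple(sorted(e)) for e in curr_edges}

    # Assumption: the node set is derived from the edge endpoints.
    prev_nodes = {n for e in prev for n in e}
    curr_nodes = {n for e in curr for n in e}

    return {
        "add_nodes": sorted(curr_nodes - prev_nodes),
        "delete_nodes": sorted(prev_nodes - curr_nodes),
        "add_edges": sorted(curr - prev),
        "delete_edges": sorted(prev - curr),
    }


if __name__ == "__main__":
    day1 = [(1, 2), (2, 3)]
    day2 = [(2, 1), (3, 4)]
    print(diff_snapshots(day1, day2))
```

In this toy example, node 4 and edge (3, 4) would be labeled 'add' and edge (2, 3) would be labeled 'delete' in the second day's window.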
Citation Network
  1. HighPh
     The HepPh dataset was downloaded from the Stanford SNAP project and was originally released for the 2003 KDD Cup. The papers in this dataset were collected from the arXiv e-print archive and cover the period from Feb 1992 to Mar 2002 (122 months).

  2. HighTh
     The HepTh dataset was downloaded from the Stanford SNAP project and was originally released for the 2003 KDD Cup. The papers in this dataset were collected from the arXiv e-print archive and cover the period from Feb 1992 to Mar 2002 (122 months).

  3. USPatents
     The original data files were downloaded from NBER (the National Bureau of Economic Research). The data comprise detailed information on almost 3 million U.S. patents granted between January 1963 and December 1999, and all citations made to these patents between 1975 and 1999 (over 16 million).

Movie Database
  1. MovieLens
     This dataset was downloaded from the MovieLens 1M Dataset distributed by the GroupLens research group. It contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

  2. Hetrec
     The original data files were downloaded from the HetRec 2011 Dataset, an extension of the MovieLens 10M dataset published by the GroupLens research group. The original dataset links the movies of the MovieLens dataset with their corresponding web pages at the Internet Movie Database (IMDb) and the Rotten Tomatoes movie review system, and it contains detailed information about the movies (actors, actresses, directors, countries, genres, etc.).

Social Network Growth

The Social Network Growth data consists of three independent datasets: Facebook-Growth, Flickr-Growth and Youtube-Growth. The original files were collected from the Online Social Network Research group. These growth datasets focus on the ways in which new user-user links are created.

Data link: Social Networks (dynamics: addition of nodes and edges)
Read me: ReadMe
  1. Facebook
  2. Flickr
  3. Youtube
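
For addition-only data such as these growth datasets, streaming amounts to grouping link-creation events into time windows. The following is a minimal sketch of that grouping, assuming each event is a (source, target, Unix timestamp) triple; that record layout, the function name `bucket_by_window`, and the two window sizes shown are assumptions made for illustration, not a description of the distributed code.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_by_window(events, window="day"):
    """Group (src, dst, unix_ts) events by calendar day or month (UTC)."""
    formats = {"day": "%Y-%m-%d", "month": "%Y-%m"}
    fmt = formats[window]  # raises KeyError for an unsupported window size

    buckets = defaultdict(list)
    for src, dst, ts in events:
        label = datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)
        buckets[label].append((src, dst))

    # Zero-padded labels sort chronologically as plain strings.
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    events = [("u1", "u2", 0), ("u2", "u3", 3600), ("u1", "u3", 86400)]
    for label, links in bucket_by_window(events, "day").items():
        print(label, links)
```

Each bucket can then be written out as one GraphML instance, with every node and edge in it carrying the default 'add' modification.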
Tencent Weibo

The original data files were downloaded from KDD Cup 2012. The data represent a sampled snapshot of Tencent Weibo users' preferences for various items, covering the recommendations made to users and their follow-relation history.

Yahoo! Instant Message

The original dataset is provided as part of the Yahoo! Webscope program for use solely under the terms of a signed Yahoo! Data Sharing Agreement. It contains data generated by a small subset of Yahoo! Messenger users from different zip codes over 28 days starting on April 1, 2008, with some modifications and additions by Yahoo! Research.