Node Classification in Uncertain Graphs

Source Code, Datasets, and Comparative Experimental Results

Node Classification in Uncertain Graphs

Michele Dallachiesa, Charu Aggarwal, and Themis Palpanas

In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model, and show the benefits of incorporating uncertainty in the classification process as a first-class citizen. The experimental results demonstrate the effectiveness of our approach.

Conference Publication

Michele Dallachiesa, Charu Aggarwal, Themis Palpanas. Node Classification in Uncertain Graphs. International Conference on Scientific and Statistical Database Management (SSDBM), Aalborg, Denmark, June 2014.

Source Code

You may freely use this code for research purposes, provided that you acknowledge the authors with the following reference:

Michele Dallachiesa, Charu Aggarwal, Themis Palpanas. Node Classification in Uncertain Graphs. International Conference on Scientific and Statistical Database Management (SSDBM), Aalborg, Denmark, June 2014.

@INPROCEEDINGS{NodeClassifUncertainGraphsSSDBM,
   AUTHOR={Michele Dallachiesa and Charu Aggarwal and Themis Palpanas},
   TITLE={{Node Classification in Uncertain Graphs}},
   BOOKTITLE={International Conference on Scientific and Statistical Database Management (SSDBM)},
   ADDRESS={Aalborg, Denmark},
   YEAR=2014}

Zip file with source code for all the algorithms used in the paper.

Real Datasets

1) DBLP dataset downloaded on Aug13,2012 from http://dblp.uni-trier.de/xml/

2) dblp-orig.xml is the original file. dblp.xml has been obtained with:
$ sed 's/></>#</g' dblp.xml | tr '#' '\n' > dblp1.xml
to split 1+ lines with multiple tags, not supported by our ultra-fast ultra-naive xml parser.
an example is "</proceedings><article md...". We expect "<article..." to always start on a new line.

3) there may be articles with zero authors, e.g.,

<article mdate="2011-12-29" key="tr/trier/MI92-17" publtype="informal publication">
<title>18. Workshop über Komplexitätstheorie, effiziente Algorithmen und Datenstrukturen, Universität Trier, 20. Oktober 1992 (Abstracts)</title>
<journal>Universität Trier, Mathematik/Informatik, Forschungsbericht</journal>
<volume>92-17</volume>
<year>1992</year>
</article>

4) Some stats:

Total number of processed articles: 850128
Graph stats: 677098 nodes, 4122556 edges, 14 labels.
Total number of labeled nodes: 153022 (~22%)
Total number of components: 73588

5) Nodes are authors. Nodes are labeled using conference keywords in different research fields of computer science.

6) To generate dblp.dat, edit xml2dat.py, set dblpPathnamePrefix = "dblp", and exec $ ./xml2dat.py with no parameters.

7) Total number of nodes that can be reached from labeled nodes: 547857 (~81%). This means that ~19% of the nodes can be discarded/ignored.