Nearest Neighbor Networks

Welcome to NNN - we hope you find our software useful! NNN is licensed under the Creative Commons Attribution 2.5 license, which means you can use it for pretty much anything as long as you cite us. The relevant publication is:

Huttenhower, C., Flamholz, A., Landis, J., Sahi, S., Myers, C., Olszewski, K., Hibbs, M., Siemers, N., Troyanskaya, O., Coller, H., "Nearest Neighbor Networks: Clustering Expression Data Based on Gene Neighborhoods", BMC Bioinformatics 8:250, 2007

You can find more information about the NNN algorithm and its performance in this paper, or you can check out the labs involved in its creation:

Now, on to the good part...

Table of Contents

  1. I just want to cluster genes!
    1. java.lang.OutOfMemoryError
  2. Graphical interface
    1. Main panel
    2. Advanced options
  3. Command line interface
  4. Input and output formats
    1. Inputs
    2. Outputs
  5. Source code
  6. Version history

I just want to cluster genes!

If you just want to get up and running with NNN, you need two things:

NNN is fairly generous in its parsing of PCL files; specifically, it will ignore an EWEIGHT row if present (but it doesn't care if there's no EWEIGHT row), and it will attempt to guess the number of initial non-data columns (in other words, in addition to the first ID column identifying your genes, you can include any number of labeling columns such as NAME, GWEIGHT, and so forth). Missing values will be handled correctly, but it is recommended that you first filter and/or impute missing data for better clustering performance.

When you start up NNN, you should see its main interface. To cluster your data:

  1. Click the "Open" button and browse to your PCL file.
  2. Adjust the four options (distance measure, clique size g, neighborhood size n, and biconnection processing) to best suit your data set.
  3. Click the "Cluster" button and keep an eye on the progress bar! The initial 2/3 are the slowest; the last 1/3 is just biconnection processing (if selected), which is fast.
  4. Once your clustering is ready, you can save it in either TreeView or PCL format.

If you made it this far successfully, congratulations - you've got Nearest Neighbor Networks! If the output doesn't quite meet your expectations, fret not - see below for additional options that can be manipulated to further improve clustering performance.

java.lang.OutOfMemoryError

Under certain circumstances (generally only when using the command line interface), you might get a java.lang.OutOfMemoryError while using NNN. This is spectacularly easy to fix, however; from the command line (the Command Prompt on Windows, Terminal on Mac OS, or your console of choice on Linux), start up NNN using the command:

java -Xmx1024m -jar nnn<your version number here>.jar

You can substitute larger numbers than 1024 if you have lots of free memory sitting around; well-behaved JVMs won't use the extra memory unless they need it.

Graphical interface

Main panel

The elements of the default graphical interface are (hopefully) fairly self-explanatory, at least if you've read our paper! In order, these are:

  1. Distance Measure. Three options are included by default for measuring gene pair similarity: Pearson correlation, Euclidean distance (L2 norm), and Manhattan distance (L1 norm). If you're interested in seeing more distance measures, let us know!
  2. Group Size (g). As described in the paper, this is the clique size NNN will search for in its preliminary interaction network while combining overlapping cliques into clusters. Values below two don't make much sense, and values above six will be slooooow for any reasonably sized data set. In general, there's no huge benefit in increasing g above five, and decreasing it will only hurt performance.
  3. Neighborhood Size (n). As described in the paper, this is the number of nearest neighbors NNN will use while forming its preliminary mutual nearest neighbor interaction network. This is the parameter most relevant to performance tuning, and it is directly related to the number of conditions in your data set. Data sets with more conditions may benefit from a larger n, but too large a value will result in overlarge clusters. Conversely, too small an n will generate lots of tiny clusters. Larger values of n will make things somewhat slower, but not critically so in general.
  4. Biconnection Processing. If this option is activated, NNN will split clusters containing cut-vertices into multiple connected components. In other words, this removes hubs from the clusters that may be joining clusters that are otherwise functionally unrelated. It's a fast process, so there's rarely any reason to deactivate it.
  5. The progress bar and status bar indicate, well, the progress and status of clustering, respectively. The progress bar will advance during a clustering run, and the status bar will display various text that's potentially relevant to the status of data input and clustering results.
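To make the roles of the distance measure and the neighborhood size n more concrete, here is a minimal Python sketch of a mutual nearest neighbor network. It is not NNN's implementation: the function names are hypothetical, Pearson correlation is turned into a distance as 1 - r (an assumption), missing values are not handled, and the clique search controlled by g is left out entirely.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed conversion from similarity to distance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # treat a flat profile as uncorrelated
    return 1.0 - cov / (sx * sy)

def euclidean_distance(x, y):
    """L2 norm of the difference between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    """L1 norm of the difference between two profiles."""
    return sum(abs(a - b) for a, b in zip(x, y))

def mutual_nearest_neighbors(profiles, n, distance=pearson_distance):
    """Return gene pairs in which each gene is among the other's n nearest neighbors."""
    genes = sorted(profiles)
    nearest = {}
    for g in genes:
        others = [h for h in genes if h != g]
        others.sort(key=lambda h: distance(profiles[g], profiles[h]))
        nearest[g] = set(others[:n])
    return {(g, h) for g in genes for h in nearest[g] if g < h and g in nearest[h]}

# Made-up expression profiles: A/B rise together, C/D fall together.
profiles = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.1, 2.1, 2.9, 4.2],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [3.9, 3.2, 2.1, 0.8],
}
edges = mutual_nearest_neighbors(profiles, n=1)
```

With n=1 only the two tight pairs are linked; raising n admits weaker neighbors, which is why an overly large n tends to merge clusters.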

Advanced options

If you activated the "Advanced" checkbox under the "File" menu, NNN displays a few additional options. These extra options are:

  1. Skip Columns. If NNN fails to correctly guess the column in which the expression data begins in your PCL file, you can force it to skip a specific number of columns. This is the number of columns to skip after the first column, which must contain unique gene IDs. For example, if your column headers were GID, NAME, GWEIGHT, Time point 1, Time point 2, and so forth, the skip value should be two (to account for NAME and GWEIGHT).
  2. Noise Deviation. In data sets containing a large number of genes, it is sometimes difficult to balance the neighborhood size to retrieve many medium sized clusters (rather than a few huge ones or a bunch of tiny ones). NNN offers the option of using a very basic simulated annealing approach by adding some noise to the initial neighborhood distance calculations, in effect "jittering apart" the nearest neighbors that aren't particularly tight (and thus probably not functionally meaningful) and breaking up overlarge clusters. This parameter controls the standard deviation of normally distributed random noise to be added to the distance calculations - which means it should be small! A deviation of 0.05 or 0.1 is generally sufficient for Pearson correlation, while an appropriate value for other measures will depend on the characteristics of your data.
  3. Include unclustered genes. By default, when saving a TreeView file, NNN only includes genes that have been included in at least one NNN cluster. If this option is enabled, all genes will be included in the TreeView file (although only NNN clusters will be colored). Activating this option can increase save times (since a full hierarchical clustering needs to be calculated). The PCL output format always includes all genes.
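The Noise Deviation option can be pictured with a short Python sketch. This is an illustration under assumptions, not NNN's code: the add_noise function and the example distances are hypothetical, and exactly where NNN applies the noise is not specified beyond "the initial neighborhood distance calculations".

```python
import random

def add_noise(distances, deviation, seed=None):
    """Return a copy of a pairwise distance dict with N(0, deviation) noise added to each value."""
    rng = random.Random(seed)
    return {pair: d + rng.gauss(0.0, deviation) for pair, d in distances.items()}

# Made-up Pearson-style distances: one tight pair and two loose, nearly tied pairs.
distances = {("A", "B"): 0.02, ("A", "C"): 0.45, ("B", "C"): 0.47}
jittered = add_noise(distances, deviation=0.05, seed=1)
```

A deviation of 0.05 is small relative to the gap between the tight pair and the loose ones, so the tight pair will usually stay closest, while the ordering of the two loose pairs can easily flip from run to run.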

Command line interface

In addition to the graphical interface, NNN has a command line interface that 95% of its users probably won't care about. But if you're interested in getting into the nitty gritty of NNN's capabilities, read on! Using the command line, you can obtain additional output formats or provide additional input to influence NNN's clustering behavior.

If you call NNN with no command line arguments, you'll get the graphical interface described above. But if you provide any command line arguments, the console-based version will run instead. In particular, providing the "-h" argument produces the following information:

 -g N       : Group size (5)
              0 produces NN distance only
 -n N       : Minimum neighborhood size (20)
 -N N       : Maximum neighborhood size (n)
 -e N       : Neighborhood size step (5)

 -m MEASURE : Distance measure (Pearson)
              Pearson, Euclidean, Manhattan
 -d N       : Deviation of added Gaussian noise (0)
 -z FILE    : Precalculated distance file

 -s         : Separate biconnected components (true)
 -l         : Remove overlarge networks (true)
 -u         : Hierarchically cluster genes that aren't clustered by NNN (false)

 -i FILE    : Input file (standard in)
 -k N       : Columns to skip after the initial gene ID in input PCL (auto)
 -o FILE    : Output file (standard out)
 -t FILE    : Produce a TreeView CDT/GTR (none)

 -p FILE    : TFBS profile PCL file (none)
 -K N       : Columns to skip after the initial gene ID in profile PCL (auto)
 -b N       : Weight of TFBS profiles (0)

Many of these options are available through the GUI, and the default behavior of the command line interface is essentially identical to the default behavior of the GUI (default values are indicated in parentheses). However, there are two significantly different paradigms supported by the command line:

There are a lot of options there, so let's try to explain them:
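For scripted runs, a command line can be assembled from the flags in the help text above. The following Python sketch is only an illustration: build_nnn_command is a hypothetical helper, the data file names are made up, and the jar name keeps the placeholder from the memory section because the actual file name depends on your version. It only uses flags that take an explicit value (-i, -o, -g, -n, -m).

```python
def build_nnn_command(jar_path, input_pcl, output_pcl, group_size=5,
                      neighborhood=20, measure="Pearson", max_memory_mb=1024):
    """Assemble an argument list for the NNN console interface."""
    return [
        "java", f"-Xmx{max_memory_mb}m", "-jar", jar_path,
        "-i", input_pcl,          # input PCL file
        "-o", output_pcl,         # annotated PCL output
        "-g", str(group_size),    # group (clique) size
        "-n", str(neighborhood),  # minimum neighborhood size
        "-m", measure,            # distance measure
    ]

# Replace the placeholder with the jar file you actually downloaded.
jar_path = "nnn<your version number here>.jar"
command = build_nnn_command(jar_path, "expression.pcl", "clustered.pcl", neighborhood=15)
print(" ".join(command))
# To run it from Python: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.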

Input and output formats

Inputs

NNN's primary input format is the PCL file, a standard tab-separated tabular microarray data format. In brief, the first row contains headers labeling each column. The second row may contain an EWEIGHT line listing relative weights of individual conditions (which NNN ignores), or it may contain the first data record. Each subsequent row represents a single gene or probe's data record, consisting of an initial unique ID, zero or more label columns (such as NAME, GWEIGHT, and so forth), and one or more data columns (individual microarray conditions).

NNN will attempt to guess the correct number of label columns to skip; in general, PCL files of the form:

GID	NAME	GWEIGHT	Condition 1	Condition 2	...

and PCL files of the form:

GID	NAME	Custom label 1	Custom label 2	GWEIGHT	Condition 1	Condition 2	...

and PCL files of the form:

GID	Condition 1	Condition 2	...

will all work, so long as the custom label values aren't all numbers (since that will trick NNN's guesser into thinking they're expression values). NNN should deal equally well with PCL files resulting from one- or two-channel arrays, although you should ensure that one-channel array values are log transformed before clustering (this is a semi-standard practice anyhow).
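To show why all-numeric label columns can fool a guesser, here is a small Python sketch of one possible heuristic. It is an assumption for illustration, not NNN's actual logic: the data are taken to begin at the first column after the ID that is not named GWEIGHT and whose non-empty values are all numeric. The function names and gene rows are made up.

```python
def is_number(text):
    """True if the text parses as a floating point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def guess_skip_columns(rows):
    """Guess how many label columns follow the ID column.

    rows is a list of lists of strings with the header first.
    An EWEIGHT row, if present, is ignored.
    """
    header = rows[0]
    gene_rows = [r for r in rows[1:] if r and r[0].upper() != "EWEIGHT"]
    for col in range(1, len(header)):
        if header[col].upper() == "GWEIGHT":
            continue  # numeric, but known to be a label column
        values = [r[col] for r in gene_rows if col < len(r) and r[col].strip()]
        if values and all(is_number(v) for v in values):
            return col - 1
    return len(header) - 1

text = ("GID\tNAME\tGWEIGHT\tCondition 1\tCondition 2\n"
        "EWEIGHT\t\t\t1\t1\n"
        "G1\talpha\t1\t0.12\t-0.40\n"
        "G2\tbeta\t1\t\t0.33\n")
rows = [line.split("\t") for line in text.splitlines()]
skip = guess_skip_columns(rows)
```

Under this heuristic a custom label column holding only numbers would be mistaken for the first condition, which is the situation where setting the skip value by hand is needed.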

NNN will correctly parse missing values in its input PCLs, although too many missing values may degrade clustering performance. We recommend removing any gene with too many missing values (>30% of the conditions) and imputing any that remain.
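The recommended preprocessing can be sketched in a few lines of Python. This is a minimal illustration, not part of NNN: filter_and_impute is a hypothetical function, missing values are represented as None, and filling gaps with the gene's own mean is an assumption chosen for brevity (more careful imputation methods exist).

```python
def filter_and_impute(profiles, max_missing_fraction=0.3):
    """Drop genes with too many missing values and fill the rest with the gene's mean."""
    cleaned = {}
    for gene, values in profiles.items():
        if not values:
            continue
        present = [v for v in values if v is not None]
        missing = len(values) - len(present)
        if not present or missing / len(values) > max_missing_fraction:
            continue  # too sparse to keep
        mean = sum(present) / len(present)
        cleaned[gene] = [mean if v is None else v for v in values]
    return cleaned

# Made-up profiles over four conditions.
profiles = {
    "G1": [0.5, None, 1.5, 1.0],   # 25% missing: kept and imputed
    "G2": [None, None, 0.2, 0.4],  # 50% missing: removed
    "G3": [0.1, 0.2, 0.3, 0.4],    # complete: unchanged
}
cleaned = filter_and_impute(profiles)
```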

Finally, NNN uses a custom DAT or DAB format for advanced users wishing to provide precalculated distances through the command line interface. For more information, see the command line section above and the source code section below.

Outputs

The NNN GUI and command line interfaces both provide two main output formats. The graphical interface defaults to a .cdt/.gtr file pair suitable for viewing in Java TreeView. The command line interface defaults to a single .pcl file annotated with one new column, "Networks", indicating which clusters (if any) each gene was placed into. Both of these formats are standards that are described in detail elsewhere:

The command line interface also offers a pairwise output format for advanced users of the form:

GENE1	GENE2	SCORE
GENE1	GENE3	SCORE
GENE2	GENE3	SCORE

and so forth. Each row contains a pair of gene identifiers and some score (distance, neighborhood size, etc.) between them. Each line is unique (i.e. each gene pair is contained only once in the file), and each column is tab separated. This format is sometimes referred to as a .dat file, the textual version of a .dab file (see the source section for details).
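Because the textual pairwise format is so simple, it is easy to read from a script. The Python sketch below is an assumption-laden illustration rather than NNN's own reader: read_dat is a hypothetical function, it expects exactly three tab-separated columns with no header line, and it treats a pair as unordered when checking uniqueness.

```python
import io

def read_dat(handle):
    """Parse GENE1, GENE2, SCORE lines into a dict keyed by sorted gene pair."""
    scores = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        gene1, gene2, score = line.split("\t")
        pair = tuple(sorted((gene1, gene2)))
        if pair in scores:
            raise ValueError(f"duplicate pair: {gene1}, {gene2}")
        scores[pair] = float(score)
    return scores

# Made-up scores in the layout shown above.
example = io.StringIO("GENE1\tGENE2\t0.91\nGENE1\tGENE3\t0.15\nGENE2\tGENE3\t0.22\n")
scores = read_dat(example)
```

For a real file, pass an open file handle instead of the StringIO object, for example `with open(path) as handle: scores = read_dat(handle)`.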

Source code

NNN is provided with source code containing fairly extensive JavaDoc and non-JavaDoc comments, so we won't spend much time on it here. However, as an overview, the code consists of three main pieces:

The troyanskaya.lib package in particular contains several classes which may be of general use, including:

Feel free to use any of these classes in your own applications, modified or unmodified, so long as you cite us as per the Creative Commons Attribution 2.5 license. Thanks for being interested enough in NNN to make it through all of these details, and happy clustering!

Version history