Variance-modeled Posterior Inference with Regional Exponentials
Updated 19 February 2005
VAMPIRE is a collection of Java tools designed to perform Bayesian statistical analysis of gene expression array data. At present, it consists of command-line console tools that perform each of the procedures outlined in a paper currently under review for Bioinformatics.
Three Java-based tools are available for download and analysis of array data. The first tool, VarianceModel, determines the amounts of expression-dependent and expression-independent variance. With this information, the second tool, BayesExpSimulator, models the distribution of gene expression measurements across the array. The third tool, BayesExpSignificanceTest, then performs a statistical test to determine whether individual "genes" on the array are differentially expressed.
Since the computational procedures are rather complex (and probably unoptimized), the combined execution time of all three tools is on the order of 12 hours on an Athlon XP 1800+ running Windows XP for an array with 15000 features. For the initial release of VAMPIRE, the interface is console-driven. In future releases, we hope to implement a simpler graphical interface, but for now, the tutorial should guide you through the necessary steps.
Make sure that the Java runtime is installed on the local machine. Consult your system administrator if you do not have access to install the JRE. Java can be downloaded directly from Sun at http://java.sun.com.
The tutorial files can be downloaded here.
The only data format currently understood by VAMPIRE is CSV (comma-separated values). Each experimental condition should have its own CSV file, in which each column holds one replicate and each row represents one array feature. Only raw data should be included in these files, with no header row and no label column. Rows must be consistent across the CSV files: each feature of the array should have the same row number in every file.
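The format rules above can be checked before starting a long run. The following Python sketch is not part of VAMPIRE; the function names are hypothetical, and it assumes only what is stated above (numeric cells, no labels, same number of rows in every file).

```python
import csv
import io


def load_raw_matrix(handle):
    """Parse a header-less CSV of raw values into a list of float rows.

    Raises ValueError on an empty row or on any cell that is not a number,
    which is how a stray header row or label column would show up.
    """
    rows = []
    for line_no, record in enumerate(csv.reader(handle), start=1):
        if not record:
            raise ValueError(f"row {line_no} is empty")
        try:
            rows.append([float(cell) for cell in record])
        except ValueError:
            raise ValueError(
                f"row {line_no} contains a non-numeric cell"
            ) from None
    return rows


def check_condition_files(named_matrices):
    """Return the shared feature (row) count, or raise ValueError.

    named_matrices maps a file name to the rows returned by load_raw_matrix.
    """
    if not named_matrices:
        raise ValueError("no matrices supplied")
    counts = {name: len(rows) for name, rows in named_matrices.items()}
    if len(set(counts.values())) != 1:
        raise ValueError(f"feature counts differ between files: {counts}")
    for name, rows in named_matrices.items():
        widths = {len(row) for row in rows}
        if len(widths) > 1:
            raise ValueError(f"{name} has rows of differing length: {widths}")
    return next(iter(counts.values()))


# Illustrative in-memory data (two features, three replicates per condition).
control = load_raw_matrix(io.StringIO("100,110,90\n20,22,18\n"))
experimental = load_raw_matrix(io.StringIO("300,330,270\n21,19,20\n"))
feature_count = check_condition_files(
    {"control": control, "experimental": experimental}
)
```

To check real files, pass an open file object instead of `io.StringIO`, for example `load_raw_matrix(open("exp.csv", newline=""))`.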
For the sake of convenience, a sample data set is included in the tutorial download, and is located in the tutorial\ subdirectory. This data set contains three control and three experimental replicates for an array with an abbreviated set of 1000 features.
First, decompose the variance into expression-dependent and expression-independent
components for both the control and experimental treatments. VarianceModel
attempts to find the best cutoff threshold automatically by scanning
regression estimates of the variance parameters across candidate cutoffs. It
chooses the middle of the broadest interval over which the regression
estimates are "stable". Once a cutoff is chosen, the variance parameters are
estimated by Markov Chain Monte Carlo (MCMC) simulation. To do this, move to
the tutorial\ directory and execute:
This should produce the following output:
java bicb.bayes.VarianceModel exp.csv
VarianceModel quickly surveyed the regression estimates of A and B across a range of possible percentile gene expression cutoffs. The parameter estimates were found to be most stable between percentile cutoffs of 37.50 and 65.00 for the control data set. The middle of this interval was chosen, which corresponded to a gene expression of 528.967 for the control data set. At this cutoff, both the regression estimate and the MCMC estimate of the variance parameters were computed.
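The cutoff choice described above (the middle of the broadest interval of stable estimates) can be illustrated with a short Python sketch. This is not VarianceModel's code: the stability criterion used here (consecutive estimates differing by at most a relative tolerance) is an assumption, the function names are hypothetical, and the estimate values are invented for illustration. Only the interval end points 37.50 and 65.00 come from the tutorial output.

```python
def broadest_stable_interval(cutoffs, estimates, tolerance):
    """Return (low, high) cutoffs bounding the widest run of stable estimates.

    Two neighbouring estimates count as stable (an assumed criterion) when
    they differ by at most `tolerance` times the larger of their magnitudes.
    """
    if not cutoffs or len(cutoffs) != len(estimates):
        raise ValueError("cutoffs and estimates must be non-empty and equal in length")
    best_start, best_end = 0, 0
    run_start = 0
    for i in range(1, len(cutoffs)):
        previous, current = estimates[i - 1], estimates[i]
        scale = max(abs(previous), abs(current))
        stable = scale == 0 or abs(current - previous) <= tolerance * scale
        if not stable:
            run_start = i  # the stable run is broken; start a new one here
        if cutoffs[i] - cutoffs[run_start] > cutoffs[best_end] - cutoffs[best_start]:
            best_start, best_end = run_start, i
    return cutoffs[best_start], cutoffs[best_end]


def midpoint_cutoff(cutoffs, estimates, tolerance):
    """Return the middle of the broadest stable interval."""
    low, high = broadest_stable_interval(cutoffs, estimates, tolerance)
    return (low + high) / 2


# Invented estimates: unstable at the extremes, flat between 37.5 and 65.
percentile_cutoffs = [25.0, 37.5, 50.0, 65.0, 80.0]
parameter_estimates = [1.00, 0.50, 0.51, 0.50, 0.90]

interval = broadest_stable_interval(percentile_cutoffs, parameter_estimates, 0.05)
chosen = midpoint_cutoff(percentile_cutoffs, parameter_estimates, 0.05)
print(interval, chosen)  # (37.5, 65.0) 51.25
```

The sketch returns a percentile (51.25 here); converting that percentile to an expression value such as 528.967 depends on the data set and is left to VarianceModel.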
Second, model the distribution of gene expression across the arrays, using the
parameter estimates from VarianceModel. We will specify output files for
the results, groups_control.txt and groups_exp.txt.
Execute the following:
The following results should appear:
-o groups_exp.txt 0.0541 14408.80 exp.csv
The features on the array are partitioned into expression "regions" within which the variance is presumed to be essentially constant. Within each of these regions, the marginal density of the hyperparameter, lambda, is simulated by MCMC to obtain its most likely value. This value is displayed for each expression region and is also written to the files we specified, groups_control.txt and groups_exp.txt.
Finally, run the statistical test to determine which genes, if any, are
differentially expressed between the control and experimental treatments.
BayesExpSignificanceTest reads its settings from a configuration file. A
sample configuration file (bst_config.cfg) has
been included in the tutorial. Upon execution, the following results
should be displayed:
The results show that 17 genes are differentially expressed.
Opening sigtest_d, we observe that features 200, 241, 242, 336, 337, 393, 416, 607, 763, and 864 are all differentially expressed. We can then look at sigtest_m and observe that feature 200 corresponds to a 3-fold upregulation and feature 241 corresponds to a 4.3-fold downregulation in RNA.
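As a rough cross-check of a reported fold change, the raw replicate values for a feature can be compared directly. The Python sketch below is independent of VAMPIRE and does not read sigtest_d or sigtest_m (their formats are not assumed here). It takes the ratio of mean experimental to mean control intensity, which is only a naive estimate and need not match the value VAMPIRE reports. The function names are hypothetical, and feature numbers are assumed to be 1-based row numbers.

```python
from statistics import mean


def fold_change(control_rows, exp_rows, feature_number):
    """Ratio of mean experimental to mean control value for one feature.

    feature_number is assumed to be the 1-based row number shared by the
    control and experimental data.
    """
    index = feature_number - 1
    if not 0 <= index < min(len(control_rows), len(exp_rows)):
        raise IndexError(f"feature {feature_number} is out of range")
    control_mean = mean(control_rows[index])
    if control_mean == 0:
        raise ValueError("control mean is zero; the ratio is undefined")
    return mean(exp_rows[index]) / control_mean


def describe(ratio):
    """Phrase a positive ratio as an n-fold up- or downregulation."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio >= 1:
        return f"{ratio:.1f}-fold upregulation"
    return f"{1 / ratio:.1f}-fold downregulation"


# Invented values for a single feature with three replicates per condition.
control_rows = [[100.0, 110.0, 90.0]]
exp_rows = [[300.0, 330.0, 270.0]]
ratio = fold_change(control_rows, exp_rows, 1)
summary = describe(ratio)
```

With real data, `control_rows` and `exp_rows` would be the parsed contents of the control and experimental CSV files, and the feature number would be one of those listed in sigtest_d.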
This work was supported by grants from the National Institutes of Health, NIDDK R01-DK33651 (JMO), NIDDK K01-DK62025 (DSW), and NIGMS K54 GM62114 (SS), a grant from Pfizer La Jolla (JMO), and a grant from the Hillblom Foundation (JMO and SS). AH is a graduate student in the UCSD Medical Scientist Training Program and is supported by a Fellowship from the Whitaker Foundation and the UCSD MSTP Training Grant T35-GM07198.