VAMPIRE
Variance-modeled Posterior Inference with Regional Exponentials  

Updated 19 February 2005


 

PEOPLE
    A Hsiao
    DS Worrall
    JM Olefsky
    S Subramaniam


CONTACT
    A Hsiao
    S Subramaniam

Introduction

VAMPIRE is a collection of Java tools designed to perform Bayesian statistical analysis of gene expression array data. At present, it consists of command-line console tools that perform each of the procedures outlined in a paper currently under review for Bioinformatics.

Three Java-based tools are available for download and for the analysis of array data. The first tool, VarianceModel, estimates how much of the variance depends on gene expression and how much is independent of it. With this information, the second tool, BayesExpSimulator, models the distribution of gene expression measurements across the array. The third tool, BayesExpSignificanceTest, then performs a statistical test to determine whether individual "genes" on the array are differentially expressed.
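To make the distinction between the two kinds of variance concrete, here is a minimal sketch that assumes a two-component model in which the variance at expression level x is A*x^2 + B. This functional form is an assumption made for illustration only; this page does not state the model that VarianceModel fits. The class name, method names, and the numbers in main are hypothetical.

```java
/** Illustrative two-component variance model (assumed form, not taken from VAMPIRE). */
final class TwoComponentVariance {
    private final double a; // expression-dependent coefficient (assumed to scale with x^2)
    private final double b; // expression-independent (constant) variance

    TwoComponentVariance(double a, double b) {
        this.a = a;
        this.b = b;
    }

    /** Variance at expression level x under the assumed model A*x^2 + B. */
    double varianceAt(double x) {
        return a * x * x + b;
    }

    /** Expression level at which both components contribute equally: sqrt(B/A). */
    double crossover() {
        return Math.sqrt(b / a);
    }

    public static void main(String[] args) {
        // Hypothetical parameter values, chosen only to show the two regimes.
        TwoComponentVariance model = new TwoComponentVariance(0.04, 10000.0);
        for (double x : new double[] {50, 500, 5000}) {
            System.out.printf("x = %.0f  variance = %.1f%n", x, model.varianceAt(x));
        }
        System.out.printf("components are equal near x = %.1f%n", model.crossover());
    }
}
```

Under this assumed form, the constant term dominates at low expression and the expression-dependent term dominates at high expression.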

Because the computational procedures are rather complex (and probably unoptimized), the combined execution time of all three tools is on the order of 12 hours on an Athlon XP 1800+ running Windows XP for an array with 15000 features.  For the initial release of VAMPIRE, the interface is console-driven.  In future releases we hope to provide a simpler graphical interface; for now, the tutorial should guide you through the necessary steps.

Tutorial

Installation

Make sure that the Java runtime is installed on the local machine. Consult your system administrator if you do not have access to install the JRE. Java can be downloaded directly from Sun at http://java.sun.com.

The following packages are used by VAMPIRE and must also be placed on the Java classpath.

  Hydra   a free Java package for Markov-chain Monte Carlo simulation
  COLT   a free Java package for numerical computing
  VAMPIRE   the vampire JAR file
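One way to confirm that the JAR files are visible on the classpath is to try loading a class from each. The sketch below is a hypothetical helper, not part of VAMPIRE. The class name bicb.bayes.VarianceModel is the one used later in this tutorial; cern.colt.matrix.DoubleMatrix2D is assumed here as a representative COLT class. No Hydra class is listed because this page does not name one.

```java
import java.util.ArrayList;
import java.util.List;

/** Reports which of the given classes cannot be loaded from the current classpath. */
final class ClasspathCheck {
    /** Returns the subset of classNames that cannot be resolved. */
    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> notFound = missing(
                "bicb.bayes.VarianceModel",          // from the VAMPIRE JAR
                "cern.colt.matrix.DoubleMatrix2D");  // assumed representative COLT class
        System.out.println(notFound.isEmpty() ? "All classes found." : "Missing: " + notFound);
    }
}
```

Run it with the same classpath you intend to use for the VAMPIRE tools; any name it reports as missing points to a JAR that is not yet on the classpath.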

The tutorial files can be downloaded here.

Preparing Data

The only data format currently understood by VAMPIRE is CSV (comma-separated values).  Each experimental condition should have its own CSV file, in which each column holds one replicate and each row holds one array feature.  Only raw data should be included in these files, with no column or row labels.  Take care that the rows are consistent across CSV files: each feature of the array must have the same row number in every file.
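The layout rules above can be checked before a long run. The following is a minimal sketch of such a pre-flight check; the class CsvCheck is hypothetical and not part of VAMPIRE, and it assumes plain unquoted numeric cells. The file names in main are those of the tutorial data set.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical pre-flight check for the CSV layout described above. */
final class CsvCheck {
    /**
     * Returns the number of rows, or throws IllegalArgumentException if a cell is
     * not numeric (for example a header or row label) or a row has a different width.
     */
    static int countRows(List<String> lines) {
        int columns = -1;
        for (int i = 0; i < lines.size(); i++) {
            String[] cells = lines.get(i).split(",", -1);
            if (columns < 0) {
                columns = cells.length;
            } else if (cells.length != columns) {
                throw new IllegalArgumentException("row " + (i + 1) + ": expected " + columns + " columns");
            }
            for (String cell : cells) {
                try {
                    Double.parseDouble(cell.trim());
                } catch (NumberFormatException e) {
                    throw new IllegalArgumentException("row " + (i + 1) + ": not numeric: '" + cell + "'");
                }
            }
        }
        return lines.size();
    }

    public static void main(String[] args) throws IOException {
        int controls = countRows(Files.readAllLines(Paths.get("controls.csv")));
        int exp = countRows(Files.readAllLines(Paths.get("exp.csv")));
        if (controls != exp) {
            throw new IllegalStateException("row counts differ: " + controls + " vs " + exp);
        }
        System.out.println(controls + " features in each file");
    }
}
```

Equal row counts do not prove that the features are in the same order in both files; that still has to be ensured when the files are exported.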

Running VAMPIRE

For the sake of convenience, a sample data set is included in the tutorial download, and is located in the tutorial\ subdirectory.  This data set contains three control and three experimental replicates for an array with an abbreviated set of 1000 features.

First, decompose the variance into expression-dependent and expression-independent components for both the control and experimental treatments. VarianceModel attempts to find the best cutoff threshold automatically by scanning regression estimates of the variance parameters, and chooses the middle of the broadest interval over which the regression estimates are "stable".  Once a cutoff is chosen, the variance parameters are estimated by Markov chain Monte Carlo (MCMC) simulation. To do this, move to the tutorial\ directory and execute:
java bicb.bayes.VarianceModel controls.csv
java bicb.bayes.VarianceModel exp.csv

This should produce the following output:
java bicb.bayes.VarianceModel controls.csv
Stable percentiles: 37.50 to 65.00
Data Threshold: 528.9666666666667
Regression Estimate (A B): (0.03758893015099287 18101.377186367325)
MCMC Estimate (A B): (0.03787423394959204 12325.25602062349)

java bicb.bayes.VarianceModel exp.csv
Stable percentiles: 32.50 to 55.00
Data Threshold: 361.26666666666665
Regression Estimate (A B): (0.047794997872169696 31775.7359952095)
MCMC Estimate (A B): (0.05407905816095158 14408.803141916735)

VarianceModel surveyed the regression estimates of A and B across a range of candidate percentile cutoffs for gene expression. For the control data set, the parameter estimates were most stable between the 37.50th and 65.00th percentiles. The middle of this interval was chosen, which corresponds to a gene expression value of 528.967.  For the experimental data set, the stable interval ran from the 32.50th to the 55.00th percentile, giving a threshold of 361.267.  At each cutoff, both the regression estimate and the MCMC estimate of the variance parameters were computed.
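The idea of picking the middle of the broadest stable interval can be sketched as follows. The stability criterion used here (each estimate changes by at most a relative tolerance from the previous one) is an assumption for illustration; this page does not describe how VarianceModel decides that estimates are "stable". The class, the tolerance, and the sample estimates are all hypothetical.

```java
/** Hypothetical illustration of choosing the middle of the broadest stable interval. */
final class StableInterval {
    /**
     * Returns {startIndex, endIndex} (inclusive) of the longest run in which each
     * estimate differs from its predecessor by at most relTol in relative terms.
     * Expects a non-empty array; ties are resolved in favour of the earliest run.
     */
    static int[] broadestStableRun(double[] estimates, double relTol) {
        int bestStart = 0, bestEnd = 0, start = 0;
        for (int i = 1; i < estimates.length; i++) {
            double change = Math.abs(estimates[i] - estimates[i - 1]);
            boolean stable = change <= relTol * Math.abs(estimates[i - 1]);
            if (!stable) {
                start = i; // a jump ends the current run; begin a new one here
            }
            if (i - start > bestEnd - bestStart) {
                bestStart = start;
                bestEnd = i;
            }
        }
        return new int[] {bestStart, bestEnd};
    }

    /** Middle of an interval, e.g. of two percentile cutoffs. */
    static double midpoint(double lo, double hi) {
        return (lo + hi) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up percentile cutoffs and estimates, not VAMPIRE output.
        double[] percentiles = {30.0, 35.0, 40.0, 45.0, 50.0, 55.0};
        double[] estimates = {1.0, 2.0, 2.01, 2.02, 2.03, 5.0};
        int[] run = broadestStableRun(estimates, 0.05);
        double mid = midpoint(percentiles[run[0]], percentiles[run[1]]);
        System.out.println("stable from " + percentiles[run[0]] + " to "
                + percentiles[run[1]] + ", midpoint " + mid);
    }
}
```

For the control data set in the tutorial, midpoint(37.50, 65.00) gives the 51.25th percentile as the chosen cutoff.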

Second, model the distribution of gene expression across the arrays, using the (rounded) MCMC parameter estimates from VarianceModel. We will specify the output files for the results as groups_control.txt and groups_exp.txt.  Execute the following:
java bicb.bayes.BayesExpSimulator -o groups_control.txt 0.0379 12325.26 controls.csv
java bicb.bayes.BayesExpSimulator -o groups_exp.txt 0.0541 14408.80 exp.csv

The following results should appear:
java bicb.bayes.BayesExpSimulator -o groups_control.txt 0.0379 12325.26 controls.csv
Log-tx accuracy threshold: 275.74429589399534
Untransformed Segments: 3
Transformed Segments: 5

expression group: 0
log-transformed: false
min, max: 0.0, 180.33434208169263
probe sets: 266
estimated var: 12662.841849984101
lambda: -0.002227235280101113

expression group: 1
log-transformed: false
min, max: 180.33434208169263, 261.3292891382012
probe sets: 72
estimated var: 14101.505550072361
lambda: 0.04011952940773122

expression group: 2
log-transformed: false
min, max: 261.3292891382012, 378.7021183715545
probe sets: 82
estimated var: 16168.998819687753
lambda: 0.0038652860085709975

expression group: 3
log-transformed: true
min, max: 378.7021183715545, 592.5834220820889
probe sets: 109
estimated var: 0.09128758343378765
lambda: 4.051250443979697

expression group: 4
log-transformed: true
min, max: 592.5834220820889, 858.7349622937744
probe sets: 96
estimated var: 0.061517977377167354
lambda: 3.079219099712571

expression group: 5
log-transformed: true
min, max: 858.7349622937744, 1244.4251863723866
probe sets: 93
estimated var: 0.04905032140372649
lambda: 1.0355134115395865

expression group: 6
log-transformed: true
min, max: 1244.4251863723866, 1803.3434208169263
probe sets: 67
estimated var: 0.04332420990589762
lambda: 2.8251006949398887

expression group: 7
log-transformed: true
min, max: 1803.3434208169263, Infinity
probe sets: 215
estimated var: 0.03808523724516124
lambda: 1.0789732014262141

java bicb.bayes.BayesExpSimulator -o groups_exp.txt 0.0541 14408.80 exp.csv
Log-tx accuracy threshold: 314.25794096772614
Untransformed Segments: 3
Transformed Segments: 5

expression group: 0
log-transformed: false
min, max: 0.0, 163.19816605834123
probe sets: 240
estimated var: 14841.040248715626
lambda: -0.005953042865101398

expression group: 1
log-transformed: false
min, max: 163.19816605834123, 236.49661086386087
probe sets: 75
estimated var: 16533.95656599919
lambda: 0.03178862613571836

expression group: 2
log-transformed: false
min, max: 236.49661086386087, 342.7161487225166
probe sets: 94
estimated var: 18519.782827357827
lambda: 0.05574751029122116

expression group: 3
log-transformed: true
min, max: 342.7161487225166, 536.2734940223601
probe sets: 102
estimated var: 0.13004735883129484
lambda: 8.399143971834272

expression group: 4
log-transformed: true
min, max: 536.2734940223601, 777.134124087339
probe sets: 102
estimated var: 0.08873944122936486
lambda: 9.628798174776064

expression group: 5
log-transformed: true
min, max: 777.134124087339, 1126.1743374469563
probe sets: 101
estimated var: 0.07029424336776093
lambda: 3.2883741017508887

expression group: 6
log-transformed: true
min, max: 1126.1743374469563, 1631.9816605834121
probe sets: 62
estimated var: 0.06181934500930775
lambda: 3.1652523756672566

expression group: 7
log-transformed: true
min, max: 1631.9816605834121, Infinity
probe sets: 224
estimated var: 0.05429282501203127
lambda: 1.0026159887332047

The features on the array are partitioned into expression "regions" within which the variance is presumed to be essentially constant. Within each region, the marginal density of the hyperparameter, lambda, is simulated by MCMC to obtain its most likely value. This value is displayed for each expression region and is also written to the output files we specified, groups_control.txt and groups_exp.txt.
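To see which region a given expression value falls into, the group boundaries printed above can be searched directly. The sketch below is a hypothetical helper, not part of VAMPIRE; the upper bounds are copied from the control output, and treating each range as half-open (min inclusive, max exclusive) is an assumption, since the output does not say how boundary values are assigned.

```java
import java.util.Arrays;

/** Hypothetical lookup of the expression group for a value, using the control output. */
final class ExpressionGroups {
    // Upper bounds of groups 0..6 from the control run; group 7 has no upper bound.
    static final double[] UPPER = {
        180.33434208169263, 261.3292891382012, 378.7021183715545, 592.5834220820889,
        858.7349622937744, 1244.4251863723866, 1803.3434208169263
    };

    /** Index of the group whose [min, max) range contains x (half-open ranges assumed). */
    static int groupOf(double x) {
        int pos = Arrays.binarySearch(UPPER, x);
        // When x is not a boundary, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        for (double x : new double[] {100.0, 500.0, 5000.0}) {
            System.out.println("expression " + x + " falls in group " + groupOf(x));
        }
    }
}
```

In the control run, groups 0 to 2 are untransformed and groups 3 to 7 are log-transformed, so the group index also tells you on which scale the variance was estimated.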

Finally, run the statistical test to determine which genes, if any, are differentially expressed between the control and experimental treatments. You will need to point BayesExpSignificanceTest at a configuration file; a sample configuration file (bst_config.cfg) is included in the tutorial.  Upon execution, the following results should be displayed:
java bicb.bayes.BayesExpSignificanceTest -r -o sigtest bst_config.cfg
Variances are recomputed from posterior means.
Using Bonferroni-corrected threshold of: 2.5E-5
Comparing exp.csv to controls.csv: 17 significant changes.

The results show that 17 genes are differentially expressed.
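For readers unfamiliar with the Bonferroni correction mentioned in the output, the per-comparison threshold is the overall significance level divided by the number of comparisons. The sketch below illustrates the arithmetic only. The significance level and comparison count in main are hypothetical; the values that produce 2.5E-5 in the tutorial depend on bst_config.cfg, whose contents are not shown on this page. Using a strict "less than" comparison is also an assumption.

```java
/** Hypothetical illustration of a Bonferroni-corrected significance threshold. */
final class Bonferroni {
    /** Per-comparison threshold: overall level alpha divided by the number of comparisons. */
    static double threshold(double alpha, int comparisons) {
        if (comparisons <= 0) {
            throw new IllegalArgumentException("comparisons must be positive");
        }
        return alpha / comparisons;
    }

    /** Counts p-values strictly below the threshold (strictness is an assumption). */
    static int countSignificant(double[] pValues, double threshold) {
        int count = 0;
        for (double p : pValues) {
            if (p < threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        double cutoff = threshold(0.05, 1000);       // hypothetical alpha and count
        double[] pValues = {1e-6, 4e-5, 0.01, 0.3};  // made-up p-values
        System.out.println("threshold = " + cutoff);
        System.out.println("significant = " + countSignificant(pValues, cutoff));
    }
}
```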

Three files are generated by BayesExpSignificanceTest with the following file names:
  sigtest_m   a list of "true means" computed by VAMPIRE for each feature in the array
  sigtest_p   a list of p-values for each gene expression comparison performed
  sigtest_d   a list of "down-regulated" (D) or "up-regulated" (U) genes given the specified significance threshold

Opening sigtest_d, we observe that features 200, 241, 242, 336, 337, 393, 416, 607, 763, and 864 are all differentially expressed.  We can then look at sigtest_m and observe that feature 200 corresponds to a 3-fold up-regulation and feature 241 to a 4.3-fold down-regulation in RNA.
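A fold change such as those quoted above is simply the ratio of the two estimated means, reported as its reciprocal when the experimental value is lower. The sketch below shows that arithmetic; the class is hypothetical, the means in main are made-up numbers, and the layout of sigtest_m is not described on this page, so reading the file is left out.

```java
import java.util.Locale;

/** Hypothetical helper that expresses two mean expression values as a fold change. */
final class FoldChange {
    /** Describes the change from control to experimental, e.g. "3.0-fold up". */
    static String describe(double controlMean, double experimentalMean) {
        if (controlMean <= 0 || experimentalMean <= 0) {
            throw new IllegalArgumentException("means must be positive");
        }
        double ratio = experimentalMean / controlMean;
        return ratio >= 1.0
                ? String.format(Locale.ROOT, "%.1f-fold up", ratio)
                : String.format(Locale.ROOT, "%.1f-fold down", 1.0 / ratio);
    }

    public static void main(String[] args) {
        // Made-up means, chosen to mirror the fold changes quoted in the text.
        System.out.println(describe(100.0, 300.0));
        System.out.println(describe(430.0, 100.0));
    }
}
```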

Acknowledgements

This work was supported by grants from the National Institutes of Health, NIDDK R01-DK33651 (JMO), NIDDK K01-DK62025 (DSW), and NIGMS K54 GM62114 (SS), a grant from Pfizer La Jolla (JMO), and a grant from the Hillblom Foundation (JMO and SS).  AH is a graduate student in the UCSD Medical Scientist Training Program and is supported by a Fellowship from the Whitaker Foundation and the UCSD MSTP Training Grant T35-GM07198.