In addition to posted FAQs and help issues, we are happy to add
any general questions about using the web page and interpreting the data. Send your questions via our comments
page and we'll post the answers here.
Queries can be made in several different ways. On the front page, the code for a specific gene may be entered. Alternatively, it is possible to explore the signatures in a particular region by clicking on the chromosome map at the bottom. Some of our other organism-specific pages offer alternative ways to access the data, which will be implemented here as soon as possible.
MPSS stands for Massively Parallel Signature Sequencing, a technique invented and commercialized by Solexa, Inc. of Hayward, California. MPSS and related technologies have been described in publications by Brenner et al. (Nature Biotechnol. [2000] 18:630-634, and PNAS [2000] 97:1665-1670). Like SAGE (Serial Analysis of Gene Expression), MPSS identifies short sequence signatures produced from a defined position within an mRNA, and the relative abundance of these signatures in a given library represents a quantitative estimate of expression of that gene. The MPSS signatures are 17 bp in length, and can uniquely identify >95% of all genes in rice.
SBS stands for sequencing by synthesis, a technique invented and commercialized by Illumina, Inc. of Hayward, California. The millions of short reads produced from a single channel of the sequencing reaction are a perfect technological match for deep sequencing of small RNAs. Since most small RNAs are 21-24 nt but the reads are ~35 nt, we trim off the adapter at the 3' end before we display the data.
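As a rough illustration of the trimming step described above, here is a minimal sketch in Python. It is not the pipeline used for this site. The adapter sequence is a made-up placeholder, and the function name and the `min_overlap` parameter are assumptions introduced only for this example. The sketch removes a full adapter found anywhere in the read, or a partial adapter (at least `min_overlap` bases) at the 3' end of the read.

```python
# Hypothetical adapter: a placeholder, NOT the adapter from any real kit.
HYPOTHETICAL_ADAPTER = "AGCTTAGGCATCCGTAAGCT"


def trim_3prime_adapter(read, adapter=HYPOTHETICAL_ADAPTER, min_overlap=4):
    """Return the read with the 3' adapter (full or partial) removed.

    Exact matching only; a real pipeline would also tolerate sequencing errors.
    """
    for i in range(len(read)):
        tail = read[i:]
        # Case 1: the whole adapter is present starting at position i.
        if tail.startswith(adapter):
            return read[:i]
        # Case 2: the read ends partway through the adapter.
        if len(tail) >= min_overlap and adapter.startswith(tail):
            return read[:i]
    # No adapter found: return the read unchanged.
    return read


small_rna = "TGACAGAAGAGAGTGAGCAC"                 # toy 20 nt insert
read = small_rna + HYPOTHETICAL_ADAPTER[:15]     # 35 nt read, adapter cut off
print(trim_3prime_adapter(read))
```

The `min_overlap` guard is a design choice: without it, a read that happens to end in the first base or two of the adapter would be shortened by chance.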
Two major types of small RNAs (21 to 24 nucleotides in size), known as small interfering RNAs (siRNAs) and microRNAs (miRNAs), are present in a wide variety of eukaryotic organisms. Small RNAs play important regulatory roles in most eukaryotes, but only a small proportion of these molecules have been identified. In order to investigate the full complexity of small RNAs, we adapted the different technologies (starting with the older "MPSS") for the sequencing of these molecules and have demonstrated the technology in rice. Most of the different sequences represent small interfering RNAs (siRNAs) that match repetitive sequences, intergenic regions, and genes. We have now made the small RNA data available on our website, where small RNAs are indicated by black triangles and repeats are shown as colored blocks in the background of the viewer. We hope that you find this to be a useful addition to the rice genomic data.
A good place to start is the chromosome viewer, where you can surf the chromosomes and see how the tags align with genomic data. Also, read through these FAQs to get a better understanding of how the data work and what tools we offer.
The viewer is launched from the image of the chromosomes, under the basic query page. Clicking on these chromosomes takes you to a second-level view, and allows you to zoom one more level to a 100 kb region. In this final page, the viewer shows an image above indicating the location of the current image relative to the entire chromosome, the centromere, and the telomeres. You can click on this image to display a different region of the chromosome. Below, the image shows the ORFs on the top strand (Watson) in red, and those on the bottom strand (Crick) in blue. The tRNAs are green, rRNAs are beige, snRNAs are dark purple, other RNA genes are gray, transposons are yellow and LTRs are fuchsia. The 5' ends of the top and bottom strands are indicated in red on the left and right ends, respectively. A legend is provided at the bottom of the page. The chromosomal coordinates are listed at the left and right of each line (in multiples of 20,000 bp).
We are extremely grateful to Mike Cherry and his group at the Saccharomyces Genome Database.
They provided the source code for their SAGE viewer. We re-wrote this in PHP and modified
it for rice, but couldn't have done it without using their code as a guide.
This is relevant only when there is mRNA expression data on the site, which not all of our sites have. If you see colored triangles pointing left and right, that's mRNA data. There are four entry points for mRNA analysis. 1) Enter the gene identifier in the basic query page. This will take you to the chromosome viewer, showing that gene and the flanking genomic DNA. 2) Enter a BAC clone in the basic query page. This will take you to that clone in the chromosome viewer. 3) Enter the sequence on the Query by Sequence page. This will extract the potential tags from your gene sequence and compare them against the database. 4) Find the gene in the chromosome viewer by clicking through to its physical location.
Signatures are normalized to a nice round number like 1 million or 2 million to facilitate comparisons among libraries. The number of signatures per library depends on sequencing results, and in our case the total number of signatures has been ~2 million per library. The expression of a gene is measured by determining the abundance of tags derived from the transcript in a given library. Normalization is necessary to ensure that comparisons across libraries accurately reflect biological differences and not merely differences in the total number of tags sequenced.
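The normalization described above amounts to simple scaling. The sketch below shows the arithmetic in Python. The function name, the dictionary layout, and the toy counts are assumptions for illustration only, not our actual database code.

```python
def normalize_to_tpm(raw_counts, scale=1_000_000):
    """Scale raw signature counts so that the library sums to `scale`.

    raw_counts maps each signature to the number of times it was sequenced.
    """
    total = sum(raw_counts.values())
    if total == 0:
        raise ValueError("library contains no signatures")
    return {sig: count * scale / total for sig, count in raw_counts.items()}


# Toy libraries of different depth (made-up numbers).
library_a = {"GATCCAGGACAAGGAAG": 40, "GATCAAACCCGGGTTTA": 1_999_960}
library_b = {"GATCCAGGACAAGGAAG": 30, "GATCAAACCCGGGTTTA": 1_499_970}

print(normalize_to_tpm(library_a)["GATCCAGGACAAGGAAG"])
print(normalize_to_tpm(library_b)["GATCCAGGACAAGGAAG"])
```

In this toy case the first signature has different raw counts (40 and 30) but the same normalized abundance in both libraries, because the libraries differ only in sequencing depth.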
The position of the MPSS tag for a given gene or transcript should coincide with the first Sau3A site 5' of the polyadenylation site. There are only two biological reasons for variation in the location of this site: small variations in the polyadenylation site, shifting it 5' or 3' of a Sau3A site (and thus splitting the total abundance count for that transcript among two or more signatures); or alternative polyadenylation due to alternative splicing and variable stop codons. The latter case is easier to demonstrate if you find two abundant signatures separated by an unused or less-used potential signature.
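To make the "first Sau3A site 5' of the polyadenylation site" rule concrete, here is a small Python sketch that lists every potential 17 bp signature (GATC plus 13 bp) in a transcript and picks the 3'-most one. The function names and the toy transcript are invented for this example. It assumes the input is the sense strand of the mRNA, ending at the poly-A site.

```python
def potential_signatures(transcript, length=17, site="GATC"):
    """Return (position, signature) for every Sau3A site with a full-length tag."""
    seq = transcript.upper()
    hits = []
    start = seq.find(site)
    while start != -1:
        signature = seq[start:start + length]
        if len(signature) == length:      # skip sites too close to the 3' end
            hits.append((start, signature))
        start = seq.find(site, start + 1)
    return hits


def expected_signature(transcript, length=17):
    """Return the 3'-most potential signature, or None if there is no site."""
    hits = potential_signatures(transcript, length)
    return hits[-1] if hits else None


# Toy transcript (made up): two Sau3A sites, sense strand, 5' to 3'.
toy = "AAA" + "GATCCAGGACAAGGAAG" + "TTTT" + "GATCAAACCCGGGTTTA" + "AACCC"
print(potential_signatures(toy))
print(expected_signature(toy))
```

Passing `length=20` gives the 20 bp version of each signature. A transcript with no GATC returns `None`, which matches the "no Sau3A site = no signature" case.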
The signatures in our MPSS libraries were sequenced to both 17 bp and 20 bp; the 20 bp signatures are exactly the same libraries with a simple 3 bp extension of sequence from every bead in the library. While 20 bp is longer and should therefore provide more specificity for some genes, the extra 3 bp also slightly increases the probability of failure of the sequencing reaction due to palindromic sequences in the signature. We are currently investigating the difference between the 17 and 20 bp signatures and should have more details in the future.
The output of MPSS is similar to SAGE, but the method of obtaining the data is dramatically different. SAGE uses concatemerized tags that are sequenced using a traditional automated DNA sequencing method. SAGE tags are ~9 to 14 bp in length, and a good library may contain ~50,000 tags. A good place to read about SAGE is the NCBI SAGEmap web site. In contrast, MPSS uses a novel cloning and sequencing method whereby hundreds of thousands of sequences are obtained simultaneously by sequencing off of beads using a technique of enzymatic digestion and hybridization. This method is described in more detail (with a nice movie) at the Solexa home page. The libraries we obtained from Solexa contain ~2 million signatures per library.
Actually, it's unlike either the oligo or cDNA microarrays. If you're familiar with SAGE, it's a very similar concept, but the signatures are longer (17-20 bp) and we obtain many more of them (more than a million signatures per library). The cost is high, so it's expensive to analyze many timepoints or treatments. Solexa is presently working to bring this cost down to more reasonable levels (contact them for cost details). There is quite a bit of research value in the first few libraries for genome annotation, given that the expression levels are extremely precisely determined and the sequence of the signature is associated with this expression level information. Because such large numbers of signatures are sequenced, you can also find genes expressed at extremely low levels that may never show up in an EST library or be detectable on a microarray. Genes discovered using MPSS may be analyzed using other methods (quantitative PCR, microarrays, etc.) under a range of different conditions.
These unusual locations may include:
1) At the 5' end of the gene, either upstream or very close to the ATG.
2) Signatures found in the introns.
3) Anti-sense signatures, or those found on the "wrong" strand.
4) Signatures in a genomic region that don't match an annotated gene.
5) In multiple locations in the genome.
The answers:
1) An MPSS signature may occur anywhere in a transcript or 3' untranslated region (3'UTR). Therefore, a signature just upstream of a gene may result from the next gene upstream. You should check to see how far away this other gene is. If a signature appears unusually close to the ATG, remember that the location of the signature depends on the number of Sau3A sites in the gene. If only one potential signature is found in a gene + 3'UTR, then this is the signature that would show up in the library, independent of the position of the signature within the gene.
2) Signatures should not match to introns because they are derived from mRNA; if you find a signature in an intron that shows evidence of expression, the best explanation is alternative poly-adenylation. In other words, the signature that you observed is most likely part of an un-annotated exon or 3'UTR.
3) Anti-sense signatures may indicate the presence of an anti-sense transcript, although they could also result from mis-priming of the oligo-dT during first strand cDNA synthesis.
4) Signatures that fall outside of any annotated gene may correspond to an un-annotated gene. However, remember that expressed genes produce MPSS signatures in the 3' UTR, so you should look upstream of the signature to see if the next most 5' gene could be contributing those signatures.
5) Not all signatures are unique in the genome. About 90% occur only once in the genome, while about 10% occur two or more times. Signatures that are not unique may result from duplications of genomic regions or of individual genes. It is always worthwhile to check any signatures that you're studying to make sure that they are NOT duplicated (this can be done in our sequence-based query page). We have determined the complete set of duplicated signatures, and these are marked in our database. If you are interested in a signature that is duplicated, remember that most signatures should be found at the first Sau3A site 5' of the poly-A site; with this assumption, you may be able to identify which of the duplicated potential signatures was most likely to be transcribed and identified by MPSS.
Several problems can prevent identification of a specific gene in the libraries. These are all sequence-specific and so only affect a small percentage of the total genes. However, if your gene is one of those that are affected, you're out of luck. (Sorry!) The potential problems are:
1) Not all genes (including 3'UTRs) contain a Sau3A site (GATC). Sau3A was the enzyme used for our libraries, so a site is necessary for your gene of interest to appear in this MPSS analysis. Solexa may use other restriction sites, but for this database, no Sau3A site = no signature.
2) Some signatures identified from the mRNA may span splice sites in the genomic sequence. Currently, we cannot insert gaps in signatures to span these sites. The only way to remedy this is to extract potential signatures from a full-length cDNA, then compare these to the genomic sequence to see if any of them occur close to splice sites. The 5' and 3' ends can be obtained from the RIKEN full-length cDNA collection; the 3' ends may be useful for identifying novel MPSS signatures. The full inserts of these cDNAs have been sequenced by the SSP consortium, and the SSP sequences can be accessed via the Salk Institute web interface.
3) Sequence artifacts affect a small number of signatures. If a potential MPSS signature (e.g. GATC+13 or +16 bp) contains a palindrome that is in frame with the 4 bp sequencing frame, a hairpin structure may block sequencing of this signature. To get around this problem, Solexa splits the library and sequences in two frames (+2 and +4), so the same palindrome is not exposed in each sequencing run. After the sequencing runs, the results are compared; signatures that are significantly different between the two runs are 'corrected' based on an interpretation of the failure. Please see the question about the 2 and 4 step sequencing.
4) Sequencing errors in the genomic sequence. These are rare but exist; for example, the gene At4g05320 (UBQ10) has (had?) an error in its 3' most signature, making it difficult to measure its expression. The genomic sequence of the signature is GATCCAGGACAAGGA_GG in the databases, whereas the correct sequence is GATCCAGGACAAGGAAG. If you find such errors, please report them to the databases!
5) If there is an unusually long distance (e.g. > 800 bp) between the 3' end of the transcript (e.g. poly-A site) and the first Sau3A site (GATC), then these genes may not appear in the library.
6) Solexa performs a size selection on the cDNA at around 400 bp, eliminating most transcripts or fragments below this threshold. Although these may be quite interesting, biologically, they will likely be less abundant in the library.
MPSS offers several advantages over microarrays, although cost and turn-around time are certainly not included in those advantages.
Microarray expression analysis is subject to inherent limitations that include: sensitivity to the quantity of RNA hybridized to the chip; background intensities rivaling signals for weakly expressed transcripts; and difficulty distinguishing between closely related sequences. MPSS is less sensitive to RNA quantities, because as many molecules are sequenced as needed. Background is significantly lower in MPSS, because it results only from errors in sequencing; we use a cutoff of ~3 TPM as the background. MPSS can often but not always distinguish closely related sequences; because it depends on sequencing, it is only necessary that the gene have a UNIQUE 17 bp signature in the proper location relative to its closest relatives.
Microarrays require standardization and calibration to ensure proper comparisons of hybridization patterns are made across diverse tissues or chips. While there is no substitute for proper experimental design, MPSS provides a quantitative assessment of differential expression without the need for repetition or standardization in each experiment; the sum of the tags is a direct assessment of the abundance of each transcript. For accuracy in both methods, the treatment of the tissue must be performed correctly and no bias must be introduced in cDNA library construction.
Yes. As good as it is, there are errors in the genomic sequence. If one of these rare errors falls in the signature, you may not be looking for the right signature.
Background is relatively low in MPSS. Background is essentially the erroneous identification of signatures that are assigned to a particular gene. These may result from errors during the sequencing process that mis-call a particular base. With the genomic sequence, we can 'filter' most of the errors by comparing the signatures and removing those that aren't found in the genome. If by chance an erroneous signature is found in the genome, errors are infrequent enough that such mis-identified signatures have a very low abundance. We use a cutoff of ~3-5 TPM to separate low background from signal.
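The two-step filter described above (drop signatures not found in the genome, then apply a TPM cutoff) can be sketched as follows. This is an illustrative Python sketch, not our production code. The function names, the toy genome, and the abundances are all made up, and it assumes exact matching of 17 bp GATC-anchored signatures on both strands.

```python
def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def genome_signatures(genome, length=17, site="GATC"):
    """Collect every potential signature found on either strand of the genome."""
    found = set()
    forward = genome.upper()
    for strand in (forward, reverse_complement(forward)):
        start = strand.find(site)
        while start != -1:
            signature = strand[start:start + length]
            if len(signature) == length:
                found.add(signature)
            start = strand.find(site, start + 1)
    return found


def filter_signatures(tpm_by_signature, genome, cutoff_tpm=3.0):
    """Keep signatures that match the genome and reach the TPM cutoff."""
    known = genome_signatures(genome)
    return {sig: tpm for sig, tpm in tpm_by_signature.items()
            if sig in known and tpm >= cutoff_tpm}


# Toy genome and library (made-up values).
toy_genome = "AAA" + "GATCCAGGACAAGGAAG" + "TTT" + "GATCAAACCCGGGTTTA" + "CCC"
observed = {
    "GATCCAGGACAAGGAAG": 120.0,   # genomic, above cutoff: kept
    "GATCAAACCCGGGTTTA": 2.0,     # genomic, below cutoff: treated as background
    "GATCCAGGACAAGGATG": 50.0,    # one-base miscall, not in genome: removed
}
print(filter_signatures(observed, toy_genome))
```

The cutoff defaults to 3 TPM here; raising `cutoff_tpm` to 5 gives the stricter end of the range mentioned above.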
Because the MPSS signature abundances have been normalized to transcripts per million (TPM), differential expression may be detected by applying a simple binomial test that produces a P-value, with the P-value indicating the strength of the evidence that the two abundances are different. The smaller the P-value, the stronger the evidence that the expression levels in the two libraries differ.
MPSS is a digital method of analyzing gene expression; in other words, it is based on a direct count of sequences from a given cDNA library, using sequence tags to determine the abundance of cognate transcripts in the library. In digital expression analyses, comparisons between large libraries facilitate the detection of significant differential expression for genes expressed at low levels. For a gene expressed at a given rate, increasing the sampling size leads to higher tag counts, and allows more stringent statistical inferences to be made for the same proportional variation (Audic and Claverie, 1997). To cite an example from Audic and Claverie (1997), variation from 4 to 12 counts is enough to be significant at P < 0.05, and a variation from 7 to 21 is significant at P < 0.01. The power of the test is increased by adjusting the sample size, and more reliable inferences can be made from tags that are identified with a higher absolute frequency (Audic and Claverie, 1997). A given sample size will target all of the genes for which the tags occur at reasonable frequencies. There is no theoretical limit to the detection of small variations in the comparison of digital expression patterns (Audic and Claverie, 1997).
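For readers who want to try this kind of comparison themselves, here is a minimal Python sketch of a one-sided binomial test on two tag counts. It is one plausible reading of "a simple binomial test" and may differ in detail from the test used on this site or from the exact formulation of Audic and Claverie (1997). The function name and parameters are assumptions for this example. Under the null hypothesis of equal expression, each of the observed tags falls in library B with probability size_b / (size_a + size_b), which is 0.5 for libraries normalized to the same total.

```python
from math import comb


def upper_tail_p(count_a, count_b, size_a=1_000_000, size_b=1_000_000):
    """One-sided P-value: chance of seeing count_b or more tags in library B,
    given count_a + count_b tags in total and no true expression difference."""
    n = count_a + count_b
    q = size_b / (size_a + size_b)   # expected share of tags from library B
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(count_b, n + 1))


# The two count pairs quoted from Audic and Claverie (1997).
print(upper_tail_p(4, 12))
print(upper_tail_p(7, 21))
```

With equal library sizes, the two examples give P-values of roughly 0.038 and 0.006, in line with the P < 0.05 and P < 0.01 thresholds quoted above. Both are a threefold change, but the pair with the higher counts gives the smaller P-value, which is the point about sampling depth.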
SBS is a digital method of analyzing gene expression or small RNAs; in other words, it is based on a direct count of sequences from a given cDNA library, using sequence tags to determine the abundance of cognate transcripts in the library. In digital expression analyses, comparisons between large libraries facilitate the detection of significant differential expression for genes expressed at low levels. It is far better at detecting low levels of transcripts than RNA gel blots, simply because it's possible to sequence so deeply.
The data is absolutely free and publicly available. This web page and the data contained therein represent your tax dollars at work. The research is funded by the National Science Foundation, Plant Genome Research Program.
Please tell the funding agency and your congressman if you think it's worthwhile (or not). You are welcome to publish using this data, but we'd like to know, only so that we can measure the utility of the data! The more people use it, and publish with it, the better. Please send us an email if you have found the data to be useful. And please cite one of our publications that describe this work. The most appropriate paper to cite might be our description of the website in Nucleic Acids Research (Nakano et al., 2006).
Please contact Illumina
(www.illumina.com). This is their business, and
they perform the sequencing for a fee.