Functions to annotate yeast genome data


Given a GEO accession number for a yeast data set and the extensions for the names of the annotation data files available from the Yeast Genome web site, the functions generate a data package containing annotation data for the yeast genes in the GEO data set.


yeastAnn(base = "", yGenoUrl =
                 yGenoNames =
                 "literature_curation/gene_association.sgd"), toKeep =
                 list(c(6, 1), c(9, 2, 5, 6, 8, 11, 3), c(2, 5, 7)),
                 colNames = list(c("sgdid", "pmid"), c("sgdid",
                 "genename", "chr", "chrloc", "chrori", "description",
                 "alias"), c("sgdid", "go")), seps = c("\t", "\t",
                 "\t"), by = "sgdid")
getProbe2SGD(probe2ORF = "", yGenoUrl =
             fileName = "literature_curation/",
             toKeep = c(1, 7), colNames = c("orf", "sgdid"), sep = "\t",
             by = "orf")
procYeastGeno(baseURL =
"", fileName,
toKeep, colNames, seps = "\t")
getGEOYeast(GEOAccNum, GEOUrl =
"", geoCols = c(1, 8),
yGenoUrl = "") 
formatGO(gos, evis)
formatChrLoc(chr, chrloc, chrori)
           yGenoName = "chromosomal_feature/", sep = "\t")  


base a matrix with two columns. The first column holds probe ids and the second holds their mappings to the SGD ids used by all the Yeast Genome data files. If base = "", the whole genome is mapped using a data file that contains mappings between all the ORFs and SGD ids
GEOAccNum a character string for the accession number given by GEO for a yeast data set
GEOUrl a character string for the url that contains a common CGI for all the GEO data. Currently it is
geoCols a vector of integers for the column numbers of the source file from GEO that maps yeast probe ids to ORF ids
yGenoUrl a character string for the url of the directory on the Yeast Genome web site that contains the directories for yeast annotation data. Currently it is
baseURL see yGenoUrl
yGenoNames a vector of character strings for the names of the yeast annotation data files. Each string can be appended to yGenoUrl to make a complete url for a data file
fileName a character string for the extension part of the source data file that can be used to map target genes to SGD ids
toKeep a list of vectors of integers giving the column numbers of the yeast genome data files that will be kept when the data files are processed. The length of toKeep must be the same as that of yGenoNames (one vector for each file)
colNames a list of vectors of character strings for the names to be given to the columns kept when processing the data. Again, the length of colNames must be the same as that of yGenoNames
seps a vector of characters for the separators used by the data files included in yGenoNames
sep singular version of seps
by a character string for the column that is common to all the data files to be processed. The column is used to merge the separate data files
probe2ORF a matrix with mappings of yeast target genes to ORF ids that in turn can be mapped to SGD ids
gos a vector of character strings for GO ids retrieved from the Yeast Genome Project
evis a vector of character strings for the evidence codes associated with the GO ids
chr a vector of character strings for chromosome numbers
chrloc a vector of integers for chromosomal locations
chrori a vector of characters, each either w or c, giving the strand of the yeast chromosome
srcUrl a character string for the url where the source yeast genome data are stored
yGenoName a character string for the name of the yeast genome file to be processed


To merge files, the system first maps the target genes in the base file to SGD ids and then uses the SGD ids to map the target genes to annotation data from the different sources.
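
As a rough illustration of this merging step, the sketch below joins toy tables on a shared sgdid column with base R's merge(). The helper name merge_by_sgdid, the data frames, and all of their values are invented for the example and are not taken from the package; the actual functions may merge differently.

```r
# Hypothetical helper, not part of the package: left-join a list of
# annotation tables onto a base table through a shared key column.
merge_by_sgdid <- function(base, annotations, by = "sgdid") {
  Reduce(function(x, y) merge(x, y, by = by, all.x = TRUE),
         annotations, base)
}

# Toy inputs (invented values): probe-to-SGD mappings plus two
# annotation tables that share the "sgdid" column.
probe2sgd <- data.frame(probe = c("probe_a", "probe_b"),
                        sgdid = c("S01", "S02"),
                        stringsAsFactors = FALSE)
literature <- data.frame(sgdid = c("S01", "S02"),
                         pmid = c("1001", "1002"),
                         stringsAsFactors = FALSE)
features <- data.frame(sgdid = "S01",
                       genename = "GENE1",
                       stringsAsFactors = FALSE)

merged <- merge_by_sgdid(probe2sgd, list(literature, features))
print(merged)
```

Here all.x = TRUE keeps probes that have no entry in one of the annotation tables (their missing fields become NA). That is a choice made for this sketch, not documented behaviour of yeastAnn.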

formatGO adds leading 0s to GO ids when needed and then appends the evidence code to the end of each GO id following a "@".
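
The formatting described for formatGO might look roughly like the sketch below. The function name format_go_sketch, the fixed width of seven digits, and the "GO:" prefix are assumptions made for illustration; the text above only states that leading zeros are added and that the evidence code follows a "@".

```r
# Hypothetical re-implementation for illustration; not the package's code.
format_go_sketch <- function(gos, evis, width = 7) {
  # Keep only the digits of each id, then left-pad with zeros to `width`
  # (the width of 7 is an assumption of this sketch).
  digits <- gsub("[^0-9]", "", as.character(gos))
  padded <- formatC(as.integer(digits), width = width,
                    flag = "0", format = "d")
  # The "GO:" prefix is also an assumption; the evidence code follows "@".
  paste0("GO:", padded, "@", evis)
}

format_go_sketch(c("5739", "GO:0003674"), c("IDA", "ND"))
# "GO:0005739@IDA" "GO:0003674@ND"
```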

formatChrLoc assigns a + or - sign to chrloc depending on whether the corresponding chrori is w or c and then appends chr to the end of chrloc following a "@".
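
A minimal sketch of the formatting described for formatChrLoc follows. The function name format_chrloc_sketch is hypothetical, and the sketch assumes that w maps to + and c maps to -, matched without regard to case; the real function may treat other values differently.

```r
# Hypothetical re-implementation for illustration; not the package's code.
format_chrloc_sketch <- function(chr, chrloc, chrori) {
  # Assumption: "w" (Watson strand) gives "+", anything else gives "-".
  strand <- ifelse(toupper(chrori) == "W", "+", "-")
  # sprintf("%d") avoids scientific notation for large integer locations,
  # which paste() could produce for values such as 100000L.
  sprintf("%s%d@%s", strand, as.integer(chrloc), as.character(chr))
}

format_chrloc_sketch(c("1", "12"), c(1807L, 459000L), c("W", "C"))
# "+1807@1" "-459000@12"
```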

getGEOYeast gets yeast data from GEO for the columns specified.


yeastAnn returns a matrix with target genes annotated by data from the selected columns of the different data sources.
getProbe2SGD returns a matrix with mappings between target genes and SGD ids.
procYeastGeno returns a data matrix.
formatGO returns a vector of character strings.
formatChrLoc returns a vector of character strings.
getGEOYeast returns a matrix with the number of columns specified.


The functions are part of the Bioconductor project at Dana-Farber Cancer Institute to provide bioinformatics functionalities through R.


Jianhua Zhang


# The following code will take a while to run and is turned off 
yeastData <- yeastAnn(GEOAccNum = "GPL90")

