Two resources that compile information from the many, many yeast expression microarray experiments that have been published:

SPELL (Serial Pattern of Expression Levels Locator): Looks for genes that are highly correlated in their expression patterns to the set of genes that you input as your search query. Delivers this nice graphical output showing the red-green “heat” map of the fold changes for each gene in each microarray experiment. On the other hand, my P.I. says that it’s prone to false positives: almost any set of genes you input will give you a result, regardless of whether it’s meaningful or not.

YFGdb (Yeast Functional Genomics Database): Has not only most of the .pcl files for the microarray experiments that are curated there, but also SGD Lite (a faster way to browse the yeast genome), Yeast SNP Genome Browser, and data for download.

Also, just for my own reference, here are all the types of annotations given in SGD for yeast genome features:

  • ARS
  • binding site
  • CDS
  • chromosome
  • centromere
  • external transcribed spacer region
  • five prime UTR intron
  • gene
  • gene cassette
  • internal transcribed spacer region
  • intron
  • long terminal repeat
  • LTR retrotransposon
  • ncRNA
  • noncoding exon
  • nucleotide match
  • pseudogene
  • region
  • repeat region
  • rRNA
  • snoRNA
  • snRNA
  • telomere
  • transposable element gene
  • tRNA

gsub(pattern, replacement, x)

I don’t know why it took me so long to discover that this function existed in R.

There was a New York Times article on the growing popularity of R as statistical software: Data Analysts Captivated by R’s Power. I think the title is a little misleading: R is extremely versatile for all sorts of statistical analysis, but it’s not necessarily the most powerful tool, since many other programming languages process data more quickly. That being said, I do like using R, and its syntax feels really intuitive to me.

I also found a post on Andrew Gelman’s blog on good programming habits in R: Style guide for R code. I don’t follow a lot of these guidelines, although I really should.

While looking up how to implement generalized linear models (GLM) in R, I came across this really useful resource: Generalised Linear Models. I think the site also has links to course materials on other statistical methods.

a : b
Corresponds to the term for interaction between variables a and b.

a * b
Corresponds to a + b + a:b

Really counterintuitive. I worry that I messed up my previous analysis by not realizing this point of syntax sooner.

Lessons in efficiency while coding Perl:

  • Use regular expressions to split tab-delimited fields in large data files.
  • Minimize number of elements in arrays and hashes.
  • Use indexing variables.
  • Use the seek function rather than iteratively reading through file lines.

Lessons in efficiency while coding R:

  • Avoid loops; use sapply (or variant thereof).
  • Label row names and column names of data frames for ease of subsetting.
  • Download packages that have useful functions instead of attempting to reinvent the wheel.

Just noticed this paper from the Walter lab.

Aragón et al., 2008: 3′ UTR regulatory element targets HAC1 mRNA to Ire1p oligomers in the ER to be spliced for translation.

I haven’t updated here in a while, but hopefully that will change once I start studying for my qualifying exams. Speaking of which, I need to start narrowing down to a specific research question for my outside proposal.

CTRL-C to interrupt (terminate) a process.

CTRL-Z to suspend a process and switch back to the shell.

fg to resume a suspended program in the foreground (bg to resume it in the background).

& to run program in the background (append to end of the command).

renice +19 PID to reduce process priority to minimum (useful for keeping a long-running job from competing with other processes for CPU time).
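
A minimal sketch of how the last two commands fit together in a script, assuming a Unix-like system where renice accepts the -n and -p options. Here sleep 30 is just a stand-in for a real long-running job, and the variable names are made up for illustration.

```shell
# Start a stand-in long-running job in the background with a trailing &.
sleep 30 &
pid=$!    # $! holds the PID of the most recently started background job

# Drop the job to the lowest scheduling priority (niceness 19).
if renice -n 19 -p "$pid" > /dev/null; then
    status="reniced"
else
    status="renice failed"
fi
echo "PID $pid: $status"

# Clean up the stand-in job so it does not linger.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

Raising the niceness of your own process does not need root; lowering it again usually does.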

Some useful Unix commands:

ssh -X username@hostname
The -X option enables X11 forwarding, which means that you can remotely open OpenOffice documents or GNU Emacs windows or R graphics devices.

sort -k i,i -k j,j filename
The sort command can sort on multiple fields, specified in order of precedence by the -k option. (Append n to a key, e.g. -k j,jn, to sort that field by numerical value.)
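
Here is a small sketch of a two-key sort on a made-up tab-delimited file (the chromosome names and positions are placeholders, not real data). It sorts on the first field as text and then on the second field numerically. A plain text sort on the second field would put 1500 ahead of 90.

```shell
# Hypothetical sample data: chromosome name and position, tab-separated.
tmp=$(mktemp)
printf 'chrII\t200\nchrI\t1500\nchrI\t90\nchrII\t35\n' > "$tmp"

# A literal tab to use as the field separator.
tab=$(printf '\t')

# Sort by field 1 as text, then by field 2 as a number.
# LC_ALL=C keeps the text ordering independent of the locale.
sorted=$(LC_ALL=C sort -t "$tab" -k 1,1 -k 2,2n "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

Writing each key as i,i (rather than just i) restricts the key to that one field instead of running to the end of the line.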

cat filename | sed 's/find_text/replace_text/g' > output
Global search and replace using the stream editor in Unix. The same substitution works in vim, but it needs a % prefix to apply to every line (:%s/find_text/replace_text/g).
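
A quick sketch of what the g flag does, using a made-up two-line file (the gene names and the words being swapped are placeholders). Note that sed can read the file directly, so piping through cat is optional.

```shell
# Hypothetical input: the second line contains the search text twice.
tmp=$(mktemp)
printf 'YAL001C up\nYAL002W up up\n' > "$tmp"

# With g, every match on each line is replaced.
replaced=$(sed 's/up/induced/g' "$tmp")

# Without g, only the first match on each line is replaced.
first_only=$(sed 's/up/induced/' "$tmp")

printf '%s\n' "$replaced"
printf '%s\n' "$first_only"

rm -f "$tmp"
```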

I downloaded the Molecular Biologist’s Toolbar for Firefox from Bitesize Bio today, and I’m quite pleased with it so far. It’s nice to have an email notifier for both of my email accounts (with automatic log-in), and I have quick access to various web-based tools that I use quite often (especially sequence related tools like primer design and reverse complement). I hadn’t known that there were tools for “cleaning up” DNA sequence (basically, removing numbers and spaces), which is enormously useful and saves me the trouble of writing code for it myself.

Bitesize Bio also had a post up on biology-related iPhone applications. I don’t have an iPhone, but I do own a Mac, which is why I found this post on Free Mac Software for Molecular Biologists to be useful. I already use Papers and Geneious (albeit in its Linux form), but I’m interested in trying out Mekentosj’s other programs, especially Lab Assistant.

Your Lab Data has a nice set of iGoogle gadgets that I currently use. They include Primer3, restriction enzyme site finders, melting temperature calculator, etc.

Several interesting papers found today, including:

Collins et al., 2008: The authors identified, in a small-molecule screen, a new drug that lengthens lifespan in C. elegans. Ethosuximide works by inhibiting chemosensory neurons, which apparently regulate aging. (Genetic mutants in chemosensory neurons also have extended lifespans.) Probably not going to read this paper in much more detail, but I thought the result was interesting. The hypothesis proposed by the authors is that inhibiting chemosensory perception means that the worms are unable to find food and thus reduce their dietary intake.

Breitling et al., 2008: A critical look at the identification of eQTL “hotspots”, which are basically polymorphisms that are responsible for variation in the expression of a large number of genes (usually by trans effects). They argue that many of the “hotspots” identified thus far are due to coregulated genes: a putative “hotspot” does not directly affect the expression of many genes, but rather affects a few genes that in turn affect the expression of other genes with related function.

There are many nongenetic mechanisms that can create strongly correlated clusters of functionally related genes. On the one hand, such clusters may be a result of a concerted response to some uncontrolled environmental factor. On the other hand, dissected tissue samples can contain slightly varying fractions of individual cell types, leading to cell-type–specific gene clusters, which vary in a correlated manner.

They propose a better method for assessing the statistical significance of a potential eQTL. They also go on to speculate why genuine “hotspots” are so difficult to identify (more so, they claim, than eQTL that affect variation in their own expression, i.e. by cis effects) and often have small effect sizes.

This rarity of convincing hotspots in genetical genomics studies is intriguing. It could be due to the limited power of the initial studies, but it could also have a more profound reason. For example, it might well be that biological systems are so robust against subtle genetic perturbations that the majority of heritable gene expression variation is effectively “buffered” and does not lead to downstream effects on other genes, protein, metabolites, or phenotypes. Experimental evidence for phenotypic buffering of protein coding polymorphisms is well established.

In fact, it has been shown that phenotypic buffering is a general property of complex gene-regulatory networks. Also, if small heritable changes in transcript levels were transmitted unbuffered throughout the system, there would be a grave danger that genetic recombination would lead to unhealthy combinations of alleles and, consequently, to systems failure. Hotspots with large pleiotropic effects are thus more likely to be removed by purifying selection. If, as thus expected, common alleles are predominantly buffered by the robust properties of the system and hence largely inconsequential for the rest of the molecules in the system, this will have profound consequences for the design and interpretation of genetical genomics studies of complex diseases.

Which, if you think about it, is the cis versus trans argument all over again (complete with the discussion of rare disease alleles that was also made in the Lemos et al. paper that I discussed in the last post).

Lee et al., 2008: A paper that tries to predict complex phenotypes that will result from a particular combination of SNPs! Isn’t that biology’s Holy Grail: to predict phenotype from genotype alone? Well, okay, I overstate the case: they do go in with a certain set of phenotypes they’re trying to predict, and they know something about the inheritance of that particular phenotype, whether it be the number of causal QTL or the heritability. I can’t assess how impressive their results were, but they do claim to be able to predict unobserved phenotypes for individuals based on just the genotype data alone, and they also seem to map the QTL affecting the trait with fairly good resolution.

Mitrophanov et al., 2008: Really need to come back to this paper, but it talks about the evolution of a feedforward network motif that is mediated at the post-translational level in an antibiotic resistance pathway in bacteria.

Berry & Gasch, 2008: Another stress response paper from the Gasch lab. A labmate referred me to this paper, which addresses the troubling issue brought up by those deletion library competition experiments that found that genes needed to survive stresses were not the same genes that changed expression in response to stress. The authors of this paper argue that the genes involved in the transcriptional response are not essential for initial stress survival but rather for protecting the cell against future/chronic stress. Makes me think there’s promise for my pet idea of “cellular memory” in stress responses…a pet idea that has no experimental justification whatsoever.