Heterosis Mouse Project
The project was funded by NSF EAGER grant # 1248090.
The hypothesis for this project is presented in "Goff SA (2011) A unifying theory for general multigenic heterosis: energy efficiency, protein metabolism, and implications for molecular breeding. New Phytol 189: 923-937."
  • The dataset is 20 libraries (5 strains x 4 tissues). The strains are two mouse inbreds (C57BL/6J (B6) and BALB/cByJ (Bc)) and three B6/Bc hybrids (two young and one old). The tissues are brain, kidney, liver and muscle.
  • The libraries were sequenced by postdoc Qi Cai (Goff lab) in collaboration with Arizona Genomics Institute.
  • The reads were aligned using the iPlant cyberinfrastructure and transferred to the Soderlund lab, where further processing was performed with the Allele Workbench (AW) and the results entered into the mouse AW database.
  • The AW software was used to select ASE (allele specific expression) transcripts, and the sequence pairs (i.e. the two inbred sequences) have been sent to the Cheng lab to test for folding.
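The library design above (5 strains x 4 tissues) can be sketched as a simple cross product. This is only an illustration: the strain and tissue labels below, and the `library_names` helper, are hypothetical and are not the identifiers used in the mouse AW database.

```python
from itertools import product

# Hypothetical labels; the source names the strains and tissues but not
# the actual library identifiers.
STRAINS = ["B6", "Bc", "hybrid_young_1", "hybrid_young_2", "hybrid_old"]
TISSUES = ["brain", "kidney", "liver", "muscle"]


def library_names(strains=STRAINS, tissues=TISSUES):
    """Return one label per strain/tissue combination."""
    return [f"{strain}_{tissue}" for strain, tissue in product(strains, tissues)]


libraries = library_names()
print(len(libraries))
```

Each of the two inbreds and three hybrids contributes one library per tissue, which gives the 20 libraries in the dataset.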
The data can be queried from the following two Java applets:


  1. The C57BL/6J reference sequence (GRCm38.p2) was downloaded from GenBank.
  2. The reference annotation was downloaded from Ensembl.
  3. The alternative BALB/cJ (a strain closely related to cByJ) SNPs and Indels were downloaded from the Sanger Centre.
  4. The C57BL/6J protein sequences were downloaded from Ensembl.


  1. fastx_trimmer and Sickle were used for trimming.
  2. TopHat aligned the reads to the genome for subsequent variant processing.
    Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105-1111.
  3. SAMtools mpileup determined the read coverage for the variants.
    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079.
  4. GATK was used to find SNPs unique to cByJ, since the downloaded SNPs and Indels are from the BALB/cJ genome.
    DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43: 491-498.
  5. Ensembl Variant Effect Predictor determined the effect of the SNPs: (1) the consequence (e.g. missense), and (2) SIFT and PolyPhen scores for changes to protein sequences (referred to as 'damaging' in HW).
    McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069-2070.
  6. STAR aligned the reads to the genome for subsequent read counting, as it reports the multi-mapped reads that eXpress requires.
    Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29: 15-21.
  7. eXpress assigned reads to transcripts, counting both total reads (for TCW) and transcript allele reads (for HW).
    Roberts A, Pachter L (2013) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10: 71-73.
  8. edgeR computed the differential expression between the reference and alternative counts for both the HW SNPs and reads, and between TCW transcript libraries.
    Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139-140.
  9. The alternative transcripts were created using a script in the RSEM package (the reference transcripts were downloaded from Ensembl).
    Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12: 323.
  10. BEDTools was used for various reformatting tasks.
    Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841-842.
  11. The AW pipeline was used to perform the parts of the processing not covered by the above software packages, such as masking the SNPs in the reference genome and converting the mpileup output to variant coverage numbers. The AW Java build interface was used to compute allele imbalance and build the database. The AW Java query interface was used to analyze the allele-specific expression. Freely available at AW.
    Soderlund C, Nelson W, Goff S (2014) Allele Workbench: transcriptome pipeline and interactive graphics for allele-specific expression. PLoS ONE.
  12. TCW was used to find differential expression between libraries. Freely available at TCW.
    Soderlund C, Nelson W, Willer M, Gang DR (2013) TCW: transcriptome computational workbench. PLoS ONE 8: e69401.
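Two of the AW pipeline steps listed above, masking SNPs in the reference genome and converting mpileup output to variant coverage numbers, can be illustrated with a minimal Python sketch. This is not the AW implementation: the function names `mask_snps` and `count_alleles` are hypothetical, and the parsing assumes the standard SAMtools pileup base column ('.' and ',' for reference matches, '^' plus one character for a read start, '$' for a read end, '+N'/'-N' followed by N bases for indels).

```python
import re


def mask_snps(sequence, snp_positions):
    """Return the sequence with each 1-based SNP position replaced by 'N'."""
    bases = list(sequence)
    for pos in snp_positions:
        bases[pos - 1] = "N"
    return "".join(bases)


def count_alleles(pileup_bases, ref, alt):
    """Count reads supporting the ref and alt alleles in one pileup column.

    Assumes the SAMtools pileup encoding; this is an illustration,
    not the AW conversion code.
    """
    # Remove read-start markers: '^' and the mapping-quality character after it.
    s = re.sub(r"\^.", "", pileup_bases)
    # Remove read-end markers.
    s = s.replace("$", "")

    calls = []
    i = 0
    while i < len(s):
        if s[i] in "+-":
            # Indel: '+' or '-', a decimal length, then that many bases.
            j = i + 1
            while j < len(s) and s[j].isdigit():
                j += 1
            i = j + int(s[i + 1:j])
        else:
            calls.append(s[i])
            i += 1

    # '.' and ',' mean "matches the reference base". Against a masked ('N')
    # reference the ref allele appears as an explicit letter, so count both.
    ref_count = sum(c in ".," or c.upper() == ref.upper() for c in calls)
    alt_count = sum(c.upper() == alt.upper() for c in calls)
    return ref_count, alt_count


masked = mask_snps("ACGTACGT", [2, 5])
counts = count_alleles("^].,A$a+2AGt.-1Cg", ref="G", alt="A")
print(masked, counts)
```

Counting explicit reference letters as well as '.' and ',' is a design choice made here so that the same function works whether or not the SNP position was masked before alignment.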