Compgen Tool Suite

Problem Set 1/2:

In this problem set you will generate a genome alignment and examine it for genomic variants. Parts of this problem set require significant computation time, so you are advised to start well before the 11/18 deadline. You are allowed to collaborate with your classmates; however, you must turn in your own work.

You will be aligning a paired-end short-read dataset to the mouse genome (GRCm38, also known as mm10) using the bowtie2 aligner. The dataset can be found at /proj/mcmillanlab/BCB716F21/sequence/CC035_MRCA_R*.fastq.gz. These FASTQ files are compressed, hence the .gz file extension. The aligner can read compressed files directly.
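Because the FASTQ files are gzip-compressed, it can be handy to peek at them without writing an uncompressed copy to disk. The sketch below is only an illustration: it builds a tiny, made-up two-read FASTQ file in a temporary directory (the file name demo_R1.fastq.gz and the helper count_reads are hypothetical) and assumes the usual four-lines-per-read FASTQ layout. The same commands could be pointed at the real read files.

```shell
# Build a tiny, made-up compressed FASTQ file so the example runs anywhere.
demo_dir=$(mktemp -d)
demo_fastq="$demo_dir/demo_R1.fastq.gz"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$demo_fastq"

# Show the first record (4 lines) without decompressing to disk.
gzip -cd "$demo_fastq" | head -n 4

# Count reads, assuming each FASTQ record occupies exactly four lines.
count_reads() {
    lines=$(gzip -cd "$1" | wc -l)
    echo $((lines / 4))
}
count_reads "$demo_fastq"
```

Counting reads in the full dataset decompresses the whole file, so expect it to take a while.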

You will also need to specify a reference genome to perform your alignment. Typically, these are provided as compressed indices that are specially formatted for a particular aligner. In your case you should use /proj/mcmillanlab/BCB716F21/genome/Mouse/bowtie2/genome. This is not a single file but a prefix shared by the set of files that make up the reference genome index.
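To make the idea of an index prefix concrete, here is a minimal sketch that checks whether the files behind a prefix exist. It assumes the standard Bowtie 2 small-index layout of six files (.1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, .rev.2.bt2). Whether the course index follows exactly this layout is an assumption, and check_bt2_index is a hypothetical helper name.

```shell
# Report which of the expected Bowtie 2 index files exist for a prefix.
# Assumes a small index (.bt2); very large genomes use .bt2l instead.
check_bt2_index() {
    prefix="$1"
    missing=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ -e "$prefix.$suffix" ]; then
            echo "found   $prefix.$suffix"
        else
            echo "missing $prefix.$suffix"
            missing=1
        fi
    done
    # Exit status is 0 only when every expected file was found.
    return "$missing"
}

# On longleaf you might run:
# check_bt2_index /proj/mcmillanlab/BCB716F21/genome/Mouse/bowtie2/genome
```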

I have prepared a SLURM script for performing the alignments. You should use this script because you are unlikely to stay logged in for the duration of the alignment process. Moreover, the interactive terminals on longleaf are not intended for running large jobs. The script can be found at /proj/mcmillan/runBowtie2. I suggest that you copy this file to your home directory and submit your SLURM job from there.

The alignment process will generate many large temporary files. You should avoid creating these files in your home directory. Instead, you should save them on the scratch disk partition provided on longleaf. Each user has their own scratch file area. For example, if your ONYEN is "guest", then your scratch partition would be /pine/scr/g/u/guest. I suggest that you create a directory there called alignments using the following command: mkdir /pine/scr/g/u/guest/alignments
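The "guest" example suggests that the two directory levels under /pine/scr are the first and second letters of the ONYEN. Under that assumption (inferred from the single example, not stated explicitly), a small helper can build the path for any ONYEN. The function name scratch_dir is hypothetical.

```shell
# Derive a scratch path from an ONYEN, assuming the two intermediate
# directories are the ONYEN's first and second letters (as in "guest").
scratch_dir() {
    onyen="$1"
    first=$(printf '%s' "$onyen" | cut -c1)
    second=$(printf '%s' "$onyen" | cut -c2)
    printf '/pine/scr/%s/%s/%s\n' "$first" "$second" "$onyen"
}

echo "$(scratch_dir guest)/alignments"
# prints /pine/scr/g/u/guest/alignments

# On longleaf, with your own ONYEN in place of guest:
# mkdir -p "$(scratch_dir guest)/alignments"
```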

The first SLURM command that you will need to run is:

sbatch runBowtie2 /proj/mcmillanlab/BCB716F21/genome/Mouse/bowtie2/genome -1 /proj/mcmillanlab/BCB716F21/sequence/CC035_MRCA_R1.fastq.gz -2 /proj/mcmillanlab/BCB716F21/sequence/CC035_MRCA_R2.fastq.gz /pine/scr/g/u/guest/alignments/CC035.sam
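Since the command is long and the output path must be changed to your own scratch directory, it may help to assemble it from variables and print it before submitting. The sketch below only prints the command (a dry run). The variable names and the alignment_command helper are hypothetical, and SCRATCH should be replaced with your own scratch path.

```shell
# Assemble the alignment command from its parts and print it (no job is submitted).
# Argument order follows the sbatch command shown above.
GENOME=/proj/mcmillanlab/BCB716F21/genome/Mouse/bowtie2/genome
READS=/proj/mcmillanlab/BCB716F21/sequence
SCRATCH=/pine/scr/g/u/guest        # placeholder: replace with your own scratch path

alignment_command() {
    echo sbatch runBowtie2 "$GENOME" \
        -1 "$READS/CC035_MRCA_R1.fastq.gz" \
        -2 "$READS/CC035_MRCA_R2.fastq.gz" \
        "$SCRATCH/alignments/CC035.sam"
}

alignment_command
```

Once the printed line looks right, copy it to the prompt on longleaf to submit the job.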

This first step will likely take many hours. You can monitor the progress of the job using the command squeue -u guest. When the alignment finishes, a file with status information will be created with a name like slurm-*.out. If the alignment fails, you can examine this file to debug your script.
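A minimal sketch of that monitoring routine follows. It assumes the slurm-*.out files are written to the directory you submitted from (the SLURM default) and that your login name on longleaf is your ONYEN. show_slurm_logs is a hypothetical helper name.

```shell
# Print the last few lines of every SLURM log found in a directory.
show_slurm_logs() {
    dir="${1:-.}"
    for log in "$dir"/slurm-*.out; do
        [ -e "$log" ] || continue      # no logs yet: the pattern matched nothing
        echo "== $log =="
        tail -n 20 "$log"
    done
}

# squeue exists only on the cluster, so guard the call.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$(id -un)"
fi
show_slurm_logs .
```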

When the alignment finishes, you should have a SAM file in your alignments scratch directory. You can examine this file using commands like head, more, and grep, but it is a bit unwieldy. The next steps are to convert it to a BAM file, sort the BAM file, and create an index. These steps also probably take too long to execute on the command line (cumulatively they will require 2-3 hours). I have provided a second SLURM batch file to perform them, /proj/mcmillanlab/runSamToBam, which should be invoked as follows:
sbatch runSamToBam /pine/scr/g/u/guest/alignments/CC035
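The contents of runSamToBam are not shown here. As a rough sketch of what the three steps (convert, sort, index) typically look like, the example below prints the corresponding samtools commands for a given path prefix. Using samtools at all is an assumption, as are the output names CC035.bam and CC035.sorted.bam; the provided script may differ on both counts. sam_to_bam_steps is a hypothetical helper that prints commands rather than running them.

```shell
# Print the three conversion steps for a path prefix (dry run only).
# Output file names are illustrative assumptions, not the script's actual names.
sam_to_bam_steps() {
    prefix="$1"
    echo "samtools view -b -o $prefix.bam $prefix.sam"        # SAM -> BAM
    echo "samtools sort -o $prefix.sorted.bam $prefix.bam"    # sort by coordinate
    echo "samtools index $prefix.sorted.bam"                  # writes a .bai index
}

sam_to_bam_steps /pine/scr/g/u/guest/alignments/CC035
```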

Your scratch directory should now contain a SAM file, two BAM files (one unsorted and one sorted), and a BAI index file. You can use these to examine your alignment.
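Since the SAM file is plain text, head and grep are enough for a first look. The sketch below creates a tiny, made-up SAM file so the commands can be tried anywhere, then separates header lines (which start with "@") from alignment records. The reads and the helper names are hypothetical. On longleaf you would point the same commands at your own CC035.sam.

```shell
# Create a tiny, made-up SAM file so the commands can be tried anywhere.
demo_sam=$(mktemp)
{
    printf '@HD\tVN:1.6\tSO:unsorted\n'
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf 'read1\t99\tchr1\t100\t42\t4M\t=\t200\t104\tACGT\tIIII\n'
    printf 'read1\t147\tchr1\t200\t42\t4M\t=\t100\t-104\tTTGA\tIIII\n'
    printf 'read2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} > "$demo_sam"

# Header lines start with "@"; every other line is one alignment record.
count_header_lines() { grep -c '^@' "$1"; }
count_records() { grep -vc '^@' "$1"; }

# Show the first alignment record, skipping the header.
grep -v '^@' "$demo_sam" | head -n 1
count_header_lines "$demo_sam"
count_records "$demo_sam"
```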

More instructions to come
