Pipeline for Unitary Gene Loss Identification

Yi Zhao, Center for Bioinformatics, Peking University; zhaoy@mail.cbi.pku.edu.cn
Ge Gao,  Center for Bioinformatics, Peking University; gaog@mail.cbi.pku.edu.cn

2014.01.16

Outline

Description

Most of current study to identify gene loss events are based on mutated genomic region of lost gene, also called relic or pseudogene, which can only find gene loss event leaving degraded segments of lost gene. To make a comprehensive profiling of unitary gene loss events in different species or node of Poaceae clade, we developed a new pipeline which can identify unitary gene loss event with or without existing relic after gene loss. This pipeline is written in Perl and wrapped in Bash. To run this pipeline, you need Bash shell, Perl language, and some other software packages, i.e. PseudoPipe, GeneWise, GMAP and BLAST. The whole pipeline is shown below, and it is labeled with amount of filtered gene loss events at each step of unitary gene loss detection in Poaceae clade.

Abbreviation

Prerequisite

  1. First, please make sure which species you want to detect unitary gene loss for. Because this pipeline detects unitary gene loss based on orthologous relationships among multiple species, you should consider several other species related with the one for unitary gene loss detection. Then, you should prepare the orthologous groups containing orthologs in species you selected at the beginning. In this pipeline for unitary gene loss detection in Poaceae clade, we connect to the EnsemblPlants MySQL database ensembl_compara_plants_15_68, retrieve orthologous relationships associated with four Poaceae species, i.e. brachypodium, rice, sorghum and maize, and three outgroup species, i.e. Arabidopsis, poplar and grape, then construct orthologous groups containing orthologs in them.

    Example for the file named orthologous_groups.tsv, which contains constructed orthologous groups. The first column is the name of each constructed orthologous group, showing the group number, amount of members in each species and amount of total members in the group. All of the member names are shown in the following columns. Part of this file is shown below.

    orthologous_group_name	orthologous_group_member
    orthologous_group_1_containing_osm_0_bdi_0_sbi_0_zm_0_viv_1_ath_1_pop_0_total_2	ATMG00030	Vv13s0073g00590
    orthologous_group_2_containing_osm_2_bdi_0_sbi_0_zm_8_viv_2_ath_2_pop_1_total_15	ATMG00060	POPTR_0011s02220	GRMZM5G894515	GRMZM2G405584	GRMZM2G137648	GRMZM2G176216	GRMZM2G358205	GRMZM2G309173	LOC_Os10g21418	LOC_Osp1g00880	GRMZM2G364278	GRMZM2G455403	Vv00s0246g00140	Vv00s0246g00150	ATCG01010
    orthologous_group_3_containing_osm_1_bdi_0_sbi_0_zm_2_viv_0_ath_1_pop_0_total_4	ATMG00070	LOC_Osm1g00590	GRMZM5G839924	GRMZM2G442129
    orthologous_group_4_containing_osm_0_bdi_0_sbi_0_zm_0_viv_0_ath_1_pop_1_total_2	ATMG00080	POPTR_0043s00280
    orthologous_group_5_containing_osm_3_bdi_0_sbi_0_zm_1_viv_0_ath_1_pop_1_total_6	ATMG00090	GRMZM5G861791	LOC_Osm1g00260	LOC_Os12g34054	LOC_Os12g34124	POPTR_0161s00210
    orthologous_group_6_containing_osm_1_bdi_0_sbi_1_zm_3_viv_0_ath_1_pop_0_total_6	ATMG00110	GRMZM5G889905	GRMZM5G860481	LOC_Osm1g00620	GRMZM2G350284	Sb01g047260
    orthologous_group_7_containing_osm_0_bdi_0_sbi_0_zm_0_viv_0_ath_1_pop_0_total_1	ATMG00120
    orthologous_group_8_containing_osm_0_bdi_0_sbi_0_zm_0_viv_0_ath_1_pop_0_total_1	ATMG00150
    orthologous_group_9_containing_osm_1_bdi_0_sbi_0_zm_3_viv_1_ath_1_pop_0_total_6	ATMG00160	Vv00s0438g00010	LOC_Osm1g00330	GRMZM2G109332	GRMZM5G862955	GRMZM2G142109
    
  2. Create a directory (e.g unitary_gene_loss_detection) to run the script unitary_gene_loss_detection_pipeline.sh here.

  3. Copy a file named forbidden_gene_list.tsv, which contains names of forbidden genes in the directory unitary_gene_loss_detection. If so, orthologous groups containing these genes will be filtered out before unitary gene loss detection. In our research, we took transposable-element, chloroplast and mitochondrion related genes as the forbidden ones.

    Example for the file named forbidden_gene_list.tsv. Each gene name each line and the format of listed gene names should be consistent with that in the constructed orthologous groups. Part of this file is shown below.

    LOC_Os03g61120
    LOC_Os03g61890
    LOC_Os03g61940
    LOC_Os03g62650
    LOC_Os03g63330
    LOC_Os03g63950
    LOC_Os04g04230
    LOC_Os04g08350
    LOC_Os04g09900
    LOC_Os04g10060
    
  4. Copy a file containing synteny relationships among the species you selected at the beginning in the directory unitary_gene_loss_detection. In our research, we adopted the file EVS009_Supplemental_dataset_S1_pluslinks.tsv generated by Schnable et al. which contains synteny relationships among brachypodium, rice, sorghum and maize, as the synteny data.

    Example for the file named EVS009_Supplemental_dataset_S1_pluslinks.tsv. The first five columns are genes located in the same synteny region among genomes of brachypodium, rice, sorghum and maize, and the second five ones are genes located in the duplicated synteny region generated in pre-grass whole genome duplication event. Part of this file is shown below.

    Maize1	Maize2	Sorghum	Rice	Brachypodium	Maize1 (pre-grass duplicate)	Maize2 (pre-grass duplicate)	Sorghum (pre-grass duplicate)	Rice (pre-grass duplicate)	Brachypodium (pre-grass duplicate)	GEvo Link
    GRMZM2G030116||GRMZM2G359892	GRMZM2G162109	Sb09g023120||Sb09g016440	LOC_Os05g39500||LOC_Os05g28040	BRADI2G29110||BRADI2G22560	GRMZM2G034810	GRMZM2G135332	Sb03g038670	LOC_Os01g61310	BRADI2G53880	http://genomevolution.org/r/33qk
    GRMZM2G034810	GRMZM2G135332	Sb03g038670	LOC_Os01g61310	BRADI2G53880	GRMZM2G030116||GRMZM2G359892	GRMZM2G162109	Sb09g023120||Sb09g016440	LOC_Os05g39500||LOC_Os05g28040	BRADI2G29110||BRADI2G22560	http://genomevolution.org/r/33ql
    GRMZM2G316635	chr8:144728966-144766755:0:hit	Sb03g028900	LOC_Os01g44310	BRADI2G44490	GRMZM2G438895	chr8:72402481-72501008:1:gap	Sb09g029580	LOC_Os05g50370	BRADI2G14960	http://genomevolution.org/r/33qm
    GRMZM2G438895	chr8:72402481-72501008:1:gap	Sb09g029580	LOC_Os05g50370	BRADI2G14960	GRMZM2G316635	chr8:144728966-144766755:0:hit	Sb03g028900	LOC_Os01g44310	BRADI2G44490	http://genomevolution.org/r/33qn
    chr3:186211860-186243301:0:gap	chr8:170179600-170260580:0:gap	Sb03g037300	LOC_Os01g58760	BRADI2G52590	GRMZM2G175870	GRMZM2G033230	Sb09g024290	LOC_Os05g41540	BRADI2G21200	http://genomevolution.org/r/33qo
    GRMZM2G175870	GRMZM2G033230	Sb09g024290	LOC_Os05g41540	BRADI2G21200	chr3:186211860-186243301:0:gap	chr8:170179600-170260580:0:gap	Sb03g037300	LOC_Os01g58760	BRADI2G52590	http://genomevolution.org/r/33qp
    GRMZM2G175280||GRMZM2G332294	GRMZM2G149150	Sb07g025490||Sb07g025270	LOC_Os08g43090	BRADI3G41980||BRADI3G41820		GRMZM2G037910||GRMZM2G180847	Sb02g030170||Sb02g029870	LOC_Os09g34060||LOC_Os09g34880	BRADI4G35370||BRADI4G35240	http://genomevolution.org/r/33qq
    	GRMZM2G037910||GRMZM2G180847	Sb02g030170||Sb02g029870	LOC_Os09g34060||LOC_Os09g34880	BRADI4G35370||BRADI4G35240	GRMZM2G175280||GRMZM2G332294	GRMZM2G149150	Sb07g025490||Sb07g025270	LOC_Os08g43090	BRADI3G41980||BRADI3G41820	http://genomevolution.org/r/33qr
    chr1:214331670-214398517:0:gap	chr4:88916810-89321558:1:gap	Sb07g020960	LOC_Os08g33340	BRADI3G36460	chr7:104100162-104296780:1:gap	GRMZM2G105772	Sb02g024340	LOC_Os09g24200	BRADI4G29870	http://genomevolution.org/r/33qs
    
  5. Run PseudoPipe for species selected to detect unitary gene loss, and put the directory ppipe_output generated by PseudoPipe into the directory unitary_gene_loss_detection. To run PseudoPipe, you can refer to the README file. Please note that the name of sub-directory within ppipe_input and ppipe_output for detected species should be the abbreviation of different species.

  6. Create a BLAST protein database via formatdb, which is in BLAST package, containing the orthologous counterparts of lost genes, put it in a directory BLAST_database, and move the directory BLAST_database into the directory unitary_gene_loss_detection. The format of protein name in this database should be consistent with that in the constructed orthologous groups. In our pipeline, we create a BLAST protein database containing all longest proteins in four Poaceae species, i.e. brachypodium, rice, sorghum and maize, and three outgroup species, i.e. grape, Arbidopsis and poplar, named Poaceae_viv_ath_pop_longest

  7. Create a GMAP database containing the genome of species selected to detect unitary gene loss via gmap_setup, which is in GMAP package, put it in a directory GMAP_database, and move the directory GMAP_database into the directory unitary_gene_loss_detection. Please note that both of the name of sub-directory within GMAP_database containing the created genome database and the name of created genome database should be the abbreviation of different species.

The data necessary for unitary gene loss detection is shown below. In this manual, we take unitary gene loss detection in sorghum as the example.

Usage

You need download this contracted archive Unitary_Gene_Loss_Detection_Pipeline.tar.gz, uncompress it and add the directory containing all scripts to Linux PATH. Then you make a directory named unitary_gene_loss_detection, put necessary data files in it as mentioned in the prerequisite, and run this pipeline in it.

Major steps

  1. Construct orthologous group
  2. Identify candidate unitary gene loss event
  3. Classify candidate gene loss event as relic-retaining, relic-lacking or false-positive one
  4. Search disablers, i.e. frameshift and premature stop-codon, for relic-retaining gene loss event
  5. Locate relic-lacking gene loss event
  6. Integrate information for detected unitary gene loss event

Detailed scripts

1. Construct orthologous group

2. Identify candidate unitary gene loss events

3. Classify candidate gene loss event

4. Search disablers for relic-retaining gene loss event

5. Locate relic-lacking gene loss event

6. Integrate information for detected unitary gene loss event

Special Case

If you want to detect unitary gene loss event in the MRCA (Most Recent Common Ancestor) of multiple species, e.g. sorghum and maize, you should prepare PseudoPipe result and GMAP genome database for each species as shown below.

Then you can run our pipeline with the parameter sbi and zm in the directory unitary_gene_loss_detection, which is shown below.

Finally, you can determine in which species or MRCA the unitary gene loss event happened based on the output file unitary_gene_loss_info_for_sbi_zm.tsv. From the example shown below you can conclude that the unitary gene loss in orthologous group 8968 is happened in the MRCA of sorghum and maize, the unitary gene loss in orthologous group 4397 is happened in sorghum, while the unitary gene loss in orthologous group 10276 is happened in maize.

orthologous_group_number	orthologous_counterpart	chr	start	end	strand	frameshift_number	premature_stop-codon_number	unitary_gene_loss_classification	unitary_gene_loss_species
4397	LOC_Os02g08140	3	47955604	47956683	+	1	0	relic-retaining	sbi
8968	LOC_Os02g52510	4	63921086	63928674	-	2	0	relic-retaining	sbi
8968	LOC_Os02g52510	5	209754553	209757880	-	2	0	relic-retaining	zm
5146	LOC_Os03g10940	1	6133765	6136732	-	1	0	relic-retaining	sbi
10318	LOC_Os01g74500	3	74386656	74396272	+	NA	NA	relic-lacking	sbi
165	LOC_Os11g48060	4	57749194	57750836	+	2	0	relic-retaining	sbi
10276	LOC_Os10g28230	3	157041566	157041979	+	2	0	relic-retaining	zm
16364	LOC_Os01g55549	3	63381108	63384547	-	4	0	relic-retaining	sbi
11573	LOC_Os04g38470	10	121455394	121492570	+	NA	NA	relic-lacking	zm
9435	LOC_Os12g18650	6	49944588	49946702	-	2	0	relic-retaining	sbi

Appendix

Multiple alignment of promoter regions of relic and its orthologous counterparts

You can download the MAF_file.rar.