Yi Zhao, Center for Bioinformatics, Peking University; zhaoy@mail.cbi.pku.edu.cn
Ge Gao, Center for Bioinformatics, Peking University; gaog@mail.cbi.pku.edu.cn
Most of current study to identify gene loss events are based on mutated genomic region of lost gene, also called relic or pseudogene, which can only find gene loss event leaving degraded segments of lost gene. To make a comprehensive profiling of unitary gene loss events in different species or node of Poaceae clade, we developed a new pipeline which can identify unitary gene loss event with or without existing relic after gene loss. This pipeline is written in Perl and wrapped in Bash. To run this pipeline, you need Bash shell, Perl language, and some other software packages, i.e. PseudoPipe, GeneWise,GMAP and BLAST. The whole pipeline is shown below, and it is labeled with amount of filtered gene loss events at each step of unitary gene loss detection in Poaceae clade.
osm: Oryza sativa MSU version, i.e. rice
bdi: Brachypodium distachchyon, i.e. brachypodium
sbi: Sorghum bicolor, i.e. sorghum
zm: Zea mays, i.e. maize
viv: Vitis vinifera, i.e. grape
ath: Arabidopsis thaliana, i.e. Arabidopsis
pop: Populus trichocarpa, i.e. poplar
First, please make sure which species you want to detect unitary gene loss for. Because this pipeline detects unitary gene loss based on orthologous relationships among multiple species, you should consider several other species related with the one for unitary gene loss detection. Then, you should prepare the orthologous groups containing orthologs in species you selected at the beginning. In this pipeline for unitary gene loss detection in Poaceae clade, we connect to the EnsemblPlants MySQL database ensembl_compara_plants_15_68, retrieve orthologous relationships associated with four Poaceae species,i.e. brachypodium, rice, sorghum and maize, and three outgroup species, i.e. Arabidopsis, poplar and grape, then construct orthologous groups containing orthologs in them.
Example for the file named orthologous_groups.tsv, which contains constructed orthologous groups. The first column is the name of each constructed orthologous group, showing the group number, amount of members in each species and amount of total members in the group. All of the member names are shown in the following columns. Part of this file is shown below.
orthologous_group_name orthologous_group_member orthologous_group_1_containing_osm_0_bdi_0_sbi_0_zm_0_viv_1_ath_1_pop_0_total_2 ATMG00030 Vv13s0073g00590 orthologous_group_2_containing_osm_2_bdi_0_sbi_0_zm_8_viv_2_ath_2_pop_1_total_15 ATMG00060 POPTR_0011s02220 GRMZM5G894515 GRMZM2G405584 GRMZM2G137648 GRMZM2G176216 GRMZM2G358205 GRMZM2G309173 LOC_Os10g21418 LOC_Osp1g00880 GRMZM2G364278 GRMZM2G455403 Vv00s0246g00140 Vv00s0246g00150 ATCG01010 orthologous_group_3_containing_osm_1_bdi_0_sbi_0_zm_2_viv_0_ath_1_pop_0_total_4 ATMG00070 LOC_Osm1g00590 GRMZM5G839924 GRMZM2G442129 orthologous_group_4_containing_osm_0_bdi_0_sbi_0_zm_0_viv_0_ath_1_pop_1_total_2 ATMG00080 POPTR_0043s00280 orthologous_group_5_containing_osm_3_bdi_0_sbi_0_zm_1_viv_0_ath_1_pop_1_total_6 ATMG00090 GRMZM5G861791 LOC_Osm1g00260 LOC_Os12g34054 LOC_Os12g34124 POPTR_0161s00210 orthologous_group_6_containing_osm_1_bdi_0_sbi_1_zm_3_viv_0_ath_1_pop_0_total_6 ATMG00110 GRMZM5G889905 GRMZM5G860481 LOC_Osm1g00620 GRMZM2G350284 Sb01g047260 orthologous_group_7_containing_osm_0_bdi_0_sbi_0_zm_0_viv_0_ath_1_pop_0_total_1 ATMG00120 orthologous_group_8_containing_osm_0_bdi_0_sbi_0_zm_0_viv_0_ath_1_pop_0_total_1 ATMG00150 orthologous_group_9_containing_osm_1_bdi_0_sbi_0_zm_3_viv_1_ath_1_pop_0_total_6 ATMG00160 Vv00s0438g00010 LOC_Osm1g00330 GRMZM2G109332 GRMZM5G862955 GRMZM2G142109
Create a directory (e.g unitary_gene_loss_detection) to run the script unitary_gene_loss_detection_pipeline.sh here.
Copy a file named forbidden_gene_list.tsv, which contains names of forbidden genes in the directory unitary_gene_loss_detection. If so, orthologous groups containing these genes will be filtered out before unitary gene loss detection. In our research, we took transposable-element, chloroplast and mitochondrion related genes as the forbidden ones.
Example for the file named forbidden_gene_list.tsv. Each gene name each line and the format of listed gene names should be consistent with that in the constructed orthologous groups. Part of this file is shown below.
LOC_Os03g61120 LOC_Os03g61890 LOC_Os03g61940 LOC_Os03g62650 LOC_Os03g63330 LOC_Os03g63950 LOC_Os04g04230 LOC_Os04g08350 LOC_Os04g09900 LOC_Os04g10060
Copy a file containing synteny relationships among the species you selected at the beginning in the directory unitary_gene_loss_detection. In our research, we adopted the file EVS009_Supplemental_dataset_S1_pluslinks.tsv generated by Schnable et al. which contains synteny relationships among brachypodium, rice, sorghum and maize, as the synteny data.
Example for the file named EVS009_Supplemental_dataset_S1_pluslinks.tsv. The first five columns are genes located in the same synteny region among genomes of brachypodium, rice, sorghum and maize, and the second five ones are genes located in the duplicated synteny region generated in pre-grass whole genome duplication event. Part of this file is shown below.
Maize1 Maize2 Sorghum Rice Brachypodium Maize1 (pre-grass duplicate) Maize2 (pre-grass duplicate) Sorghum (pre-grass duplicate) Rice (pre-grass duplicate) Brachypodium (pre-grass duplicate) GEvo Link GRMZM2G030116||GRMZM2G359892 GRMZM2G162109 Sb09g023120||Sb09g016440 LOC_Os05g39500||LOC_Os05g28040 BRADI2G29110||BRADI2G22560 GRMZM2G034810 GRMZM2G135332 Sb03g038670 LOC_Os01g61310 BRADI2G53880 http://genomevolution.org/r/33qk GRMZM2G034810 GRMZM2G135332 Sb03g038670 LOC_Os01g61310 BRADI2G53880 GRMZM2G030116||GRMZM2G359892 GRMZM2G162109 Sb09g023120||Sb09g016440 LOC_Os05g39500||LOC_Os05g28040 BRADI2G29110||BRADI2G22560 http://genomevolution.org/r/33ql GRMZM2G316635 chr8:144728966-144766755:0:hit Sb03g028900 LOC_Os01g44310 BRADI2G44490 GRMZM2G438895 chr8:72402481-72501008:1:gap Sb09g029580 LOC_Os05g50370 BRADI2G14960 http://genomevolution.org/r/33qm GRMZM2G438895 chr8:72402481-72501008:1:gap Sb09g029580 LOC_Os05g50370 BRADI2G14960 GRMZM2G316635 chr8:144728966-144766755:0:hit Sb03g028900 LOC_Os01g44310 BRADI2G44490 http://genomevolution.org/r/33qn chr3:186211860-186243301:0:gap chr8:170179600-170260580:0:gap Sb03g037300 LOC_Os01g58760 BRADI2G52590 GRMZM2G175870 GRMZM2G033230 Sb09g024290 LOC_Os05g41540 BRADI2G21200 http://genomevolution.org/r/33qo GRMZM2G175870 GRMZM2G033230 Sb09g024290 LOC_Os05g41540 BRADI2G21200 chr3:186211860-186243301:0:gap chr8:170179600-170260580:0:gap Sb03g037300 LOC_Os01g58760 BRADI2G52590 http://genomevolution.org/r/33qp GRMZM2G175280||GRMZM2G332294 GRMZM2G149150 Sb07g025490||Sb07g025270 LOC_Os08g43090 BRADI3G41980||BRADI3G41820 GRMZM2G037910||GRMZM2G180847 Sb02g030170||Sb02g029870 LOC_Os09g34060||LOC_Os09g34880 BRADI4G35370||BRADI4G35240 http://genomevolution.org/r/33qq GRMZM2G037910||GRMZM2G180847 Sb02g030170||Sb02g029870 LOC_Os09g34060||LOC_Os09g34880 BRADI4G35370||BRADI4G35240 GRMZM2G175280||GRMZM2G332294 GRMZM2G149150 Sb07g025490||Sb07g025270 LOC_Os08g43090 BRADI3G41980||BRADI3G41820 http://genomevolution.org/r/33qr chr1:214331670-214398517:0:gap chr4:88916810-89321558:1:gap Sb07g020960 LOC_Os08g33340 BRADI3G36460 chr7:104100162-104296780:1:gap GRMZM2G105772 Sb02g024340 LOC_Os09g24200 BRADI4G29870 http://genomevolution.org/r/33qs
Run PseudoPipe for species selected to detect unitary gene loss, and put the directory ppipe_output generated by PseudoPipe into the directory unitary_gene_loss_detection. To run PseudoPipe, you can refer to the README file. Please note that the name of sub-directory within ppipe_input and ppipe_output for detected species should be the abbreviation of different species.
Create a BLAST protein database via formatdb, which is in BLAST package, containing the orthologous counterparts of lost genes, put it in a directory BLAST_database, and move the directory BLAST_database into the directory unitary_gene_loss_detection. The format of protein name in this database should be consistent with that in the constructed orthologous groups.In our pipeline, we create a BLAST protein database containing all longest proteins in four Poaceae species, i.e. brachypodium, rice, sorghum and maize, and three outgroup species, i.e. grape, Arbidopsis and poplar, named Poaceae_viv_ath_pop_longest
Create a GMAP database containing the genome of species selected to detect unitary gene loss via gmap_setup, which is in GMAP package, put it in a directory GMAP_database,and move the directory GMAP_database into the directory unitary_gene_loss_detection. Please note that both of the name of sub-directory within GMAP_database containing the created genome database and the name of created genome database should be the abbreviation of different species.
The data necessary for unitary gene loss detection is shown below. In this manual, we take unitary gene loss detection in sorghum as the example.
You need download this contracted archive Unitary_Gene_Loss_Detection_Pipeline.tar.gz,uncompress it and add the directory containing all scripts to Linux PATH. Then you make a directory named unitary_gene_loss_detection, put necessary data files in it as mentioned in the prerequisite, and run this pipeline in it.
Construct orthologous group
Identify candidate unitary gene loss event
Classify candidate gene loss event as relic-retaining, relic-lacking or false-positive one
Search disablers, i.e. frameshift and premature stop-codon, for relic-retaining gene loss event
Locate relic-lacking gene loss event
Integrate information for detected unitary gene loss event
Script: extract_orthologous_group.sh
Function: retrieve orthologous relationships from EnsemblPlants MySQL database and construct the orthologous groups
Output: orthologous_groups.tsv
Note: the format of orthologous_groups.tsv is introduced in the prerequisite
Script: detect_candidate_unitary_gene_loss_based_on_orthologous_group.sh
Function: identify candidate unitary gene loss events from constructed orthologous groups
Input: orthologous_groups.tsv & forbidden_gene_list.tsv
Output: sbi/orthologous_groups_with_candidate_unitary_gene_loss_in_sbi.tsv
Example for file named orthologous_groups_with_candidate_unitary_gene_loss_in_sbi.tsv, which contains candidate unitary gene loss event detected in sbi represented by orthologous group happened in it.
orthologous_group_193_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_2_pop_2_total_9 AT5G01460 Vv08s0007g02270 POPTR_0016s12350 POPTR_0006s10100 GRMZM5G871262 BRADI1G49650 GRMZM2G012284 LOC_Os06g03760 AT3G08930 orthologous_group_290_containing_osm_2_bdi_1_sbi_0_zm_1_viv_2_ath_2_pop_3_total_11 AT5G02502 BRADI1G21220 GRMZM2G079397 LOC_Os03g27424 LOC_Os07g42826 POPTR_0010s21410 Vv13s0019g00870 POPTR_0008s05330 POPTR_0008s05340 Vv08s0007g00090 AT3G12587 orthologous_group_797_containing_osm_2_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_7 AT5G08050 Vv17s0000g05250 POPTR_0015s06950 BRADI4G03820 LOC_Os12g38640 GRMZM2G074393 LOC_Os11g27300 orthologous_group_828_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_2_pop_2_total_8 AT5G08340 Vv10s0071g00290 POPTR_0007s07260 POPTR_0005s09120 LOC_Os03g58710 GRMZM2G166931 BRADI1G05116 AT5G23330 orthologous_group_834_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_2_pop_2_total_9 AT5G08390 POPTR_0008s00290 POPTR_0010s26190 Vv10s0071g01150 GRMZM2G406553 GRMZM2G160702 LOC_Os04g58130 BRADI5G26190 AT5G23430 orthologous_group_955_containing_osm_1_bdi_2_sbi_0_zm_2_viv_1_ath_1_pop_3_total_10 AT5G10020 POPTR_0007s06420 POPTR_0005s08470 Vv07s0031g01440 POPTR_0007s06430 BRADI1G20750 LOC_Os07g43350 GRMZM2G081857 GRMZM2G162781 BRADI1G21337 orthologous_group_1244_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_2_total_8 AT5G13470 Vv13s0158g00350 POPTR_0003s20270 POPTR_0001s05810 GRMZM2G172710 LOC_Os01g71262 BRADI2G60370 GRMZM2G024858 orthologous_group_1299_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT5G14080 Vv14s0068g01660 POPTR_0001s33450 LOC_Os03g56960 BRADI1G06650 GRMZM2G131820 orthologous_group_1318_containing_osm_1_bdi_1_sbi_0_zm_2_viv_2_ath_4_pop_5_total_15 AT5G14270 POPTR_0001s36360 Vv14s0171g00380 LOC_Os08g03360 GRMZM2G127299 GRMZM2G351382 BRADI3G14160 AT3G27260 AT3G01770 POPTR_0015s10380 POPTR_0015s10370 POPTR_0012s09600 POPTR_0012s09590 Vv17s0000g01830 AT5G63320 orthologous_group_2096_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_2_pop_2_total_8 AT5G23730 POPTR_0012s13290 POPTR_0015s13210 Vv16s0050g00020 GRMZM2G389155 LOC_Os02g02380 BRADI3G01440 AT5G52250 orthologous_group_2794_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_5_pop_2_total_11 AT5G39450 Vv07s0031g02420 POPTR_0005s06830 POPTR_0007s04530 GRMZM2G360352 LOC_Os08g01430 BRADI3G13220 AT5G39480 AT5G39460 AT5G39490 AT5G39470 orthologous_group_2873_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_2_total_8 AT5G40490 Vv14s0066g00480 POPTR_0001s35360 POPTR_0017s10590 GRMZM2G050218 AC233867.1_FG004 BRADI3G39090 LOC_Os08g38410 orthologous_group_2995_containing_osm_1_bdi_2_sbi_0_zm_1_viv_1_ath_1_pop_1_total_7 AT5G41880 Vv02s0025g01350 POPTR_0003s10380 LOC_Os05g29010 GRMZM2G439297 BRADI3G42937 BRADI3G03817 orthologous_group_3179_containing_osm_1_bdi_1_sbi_0_zm_7_viv_1_ath_1_pop_1_total_12 AT5G43850 POPTR_0008s15720 Vv05s0020g04070 GRMZM5G827939 GRMZM2G433446 BRADI3G26930 LOC_Os10g28360 GRMZM2G342829 GRMZM2G419070 GRMZM2G373040 GRMZM2G434572 AC233900.1_FG002 orthologous_group_3359_containing_osm_2_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_1_total_8 AT5G45790 POPTR_0016s03010 Vv08s0040g00250 BRADI1G57030 LOC_Os02g26550 LOC_Os12g03630 GRMZM2G180166 GRMZM2G089565 orthologous_group_3523_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT5G47710 Vv00s0531g00030 POPTR_0006s00650 POPTR_0016s00700 BRADI4G08660 LOC_Os09g07800 GRMZM2G312838 orthologous_group_3603_containing_osm_1_bdi_1_sbi_0_zm_1_viv_2_ath_2_pop_1_total_8 AT5G48590 POPTR_0002s24910 Vv00s0360g00020 Vv00s2037g00010 LOC_Os11g26890 BRADI4G19497 GRMZM2G171420 AT3G07310 orthologous_group_3965_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT5G52780 Vv16s0050g01260 POPTR_0004s07130 GRMZM5G889520 LOC_Os10g38910 BRADI3G31960 orthologous_group_3967_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT5G52800 POPTR_0017s01050 Vv16s0050g01420 LOC_Os07g28470 BRADI1G28353 GRMZM5G809078 orthologous_group_4233_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT5G55710 Vv19s0015g00520 POPTR_0004s20320 GRMZM2G317451 LOC_Os02g49470 BRADI3G56350 orthologous_group_4348_containing_osm_2_bdi_2_sbi_0_zm_3_viv_2_ath_3_pop_4_total_16 AT5G57060 Vv11s0016g01750 POPTR_0006s14830 POPTR_0018s05410 BRADI2G05310 LOC_Os01g08814 GRMZM2G174570 GRMZM2G099466 LOC_Os05g08854 BRADI2G33657 GRMZM5G856067 AT4G26060 Vv09s0002g01880 POPTR_0001s20590 AT1G54217 POPTR_0003s04100 orthologous_group_5995_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_1_total_7 AT4G13650 Vv08s0007g07510 POPTR_0013s05550 BRADI2G22840 GRMZM2G153275 GRMZM2G056996 LOC_Os12g36620 orthologous_group_6026_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_2_pop_3_total_9 AT4G14070 Vv08s0032g00050 POPTR_0017s08700 POPTR_0017s08690 POPTR_0001s32900 BRADI1G01847 GRMZM2G016774 LOC_Os03g62850 AT3G23790 orthologous_group_6308_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT4G17550 POPTR_0001s15240 POPTR_0003s08040 Vv00s0616g00010 LOC_Os02g43620 BRADI3G50397 GRMZM2G124136 orthologous_group_6704_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_1_total_7 AT4G24400 Vv18s0001g07980 POPTR_0002s10970 GRMZM2G383240 GRMZM2G021517 BRADI2G40500 LOC_Os01g35184 orthologous_group_6806_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT4G26330 POPTR_0001s03950 Vv06s0004g00210 LOC_Os03g06290 GRMZM2G377780 BRADI1G74547 orthologous_group_6818_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT4G26680 POPTR_0014s16390 Vv12s0057g00740 LOC_Os10g21920 BRADI3G03280 GRMZM2G029514 orthologous_group_6845_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT4G27340 Vv19s0014g03930 POPTR_0011s12780 BRADI2G13350 LOC_Os01g29409 GRMZM2G168959 orthologous_group_7093_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT4G30845 Vv11s0037g00830 POPTR_0006s19390 GRMZM2G113287 LOC_Os03g11250 BRADI1G70380 orthologous_group_7244_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_2_total_8 AT4G32820 Vv04s0008g00560 POPTR_0018s04800 POPTR_0006s25130 GRMZM2G412150 GRMZM2G420121 BRADI1G12217 LOC_Os03g48100 orthologous_group_7353_containing_osm_2_bdi_1_sbi_0_zm_2_viv_2_ath_3_pop_5_total_15 AT4G34180 Vv03s0180g00080 POPTR_0001s30890 POPTR_0001s30900 Vv03s0180g00070 POPTR_0001s30910 POPTR_0009s10010 LOC_Os08g23100 GRMZM2G061702 BRADI4G08340 LOC_Os09g02270 GRMZM2G138744 AT4G35220 AT1G44542 POPTR_0009s10020 orthologous_group_7576_containing_osm_2_bdi_1_sbi_0_zm_3_viv_1_ath_1_pop_2_total_10 AT4G37130 Vv07s0255g00010 POPTR_0007s11760 POPTR_0005s21290 LOC_Os09g28019 GRMZM2G024775 GRMZM2G134508 BRADI4G31937 LOC_Os03g55660 GRMZM2G345582 orthologous_group_7605_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT4G37660 Vv00s1232g00020 POPTR_0004s23300 BRADI1G20710 LOC_Os07g43310 GRMZM2G171501 orthologous_group_7646_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT4G38225 POPTR_0004s21720 Vv03s0038g03970 GRMZM2G321053 LOC_Os03g03670 BRADI1G76550 orthologous_group_8076_containing_osm_1_bdi_1_sbi_0_zm_3_viv_1_ath_1_pop_1_total_8 AT3G06290 Vv05s0020g02350 POPTR_0010s02900 BRADI1G19627 LOC_Os07g45160 GRMZM2G429213 GRMZM2G429231 GRMZM2G131389 orthologous_group_8510_containing_osm_1_bdi_1_sbi_0_zm_1_viv_2_ath_1_pop_7_total_13 AT3G13540 Vv08s0007g07230 POPTR_0019s05210 POPTR_0019s05200 POPTR_0019s05190 POPTR_0013s05300 POPTR_0013s05310 LOC_Os01g50110 BRADI2G47367 GRMZM2G395672 Vv06s0004g00570 POPTR_0001s04240 POPTR_0003s21120 orthologous_group_8527_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT3G13700 POPTR_0003s02970 Vv09s0002g02410 GRMZM2G063462 BRADI1G49340 LOC_Os06g05800 orthologous_group_8557_containing_osm_1_bdi_1_sbi_0_zm_1_viv_2_ath_1_pop_2_total_8 AT3G13990 Vv09s0002g01530 POPTR_0001s17070 POPTR_0003s06200 Vv04s0043g00960 LOC_Os03g29750 BRADI1G60080 GRMZM2G004736 orthologous_group_8698_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_2_pop_1_total_7 AT3G15780 Vv09s0002g06800 POPTR_0001s19310 GRMZM2G428356 BRADI2G13422 LOC_Os01g29740 AT1G52550 orthologous_group_8773_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT3G16890 Vv05s0077g01320 LOC_Os12g37100 BRADI4G04840 GRMZM2G420723 POPTR_0010s15110 POPTR_0019s13320 orthologous_group_9201_containing_osm_1_bdi_1_sbi_0_zm_1_viv_2_ath_1_pop_3_total_9 AT3G22670 POPTR_0010s09410 Vv05s0020g03830 Vv05s0020g03800 POPTR_0010s09390 POPTR_0010s09400 LOC_Os03g53490 BRADI1G08840 GRMZM2G078416 orthologous_group_9445_containing_osm_1_bdi_2_sbi_0_zm_1_viv_1_ath_1_pop_3_total_9 AT3G26380 Vv01s0137g00630 POPTR_0010s05840 POPTR_0008s18920 POPTR_0010s05830 BRADI2G13530 LOC_Os01g33420 GRMZM2G028307 BRADI2G13520 orthologous_group_9841_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_2_pop_4_total_11 AT3G46540 POPTR_0009s03580 POPTR_0001s24610 Vv06s0004g04590 BRADI2G54607 LOC_Os01g62370 GRMZM2G463415 GRMZM2G036120 POPTR_0008s05570 POPTR_0010s21160 AT1G08670 orthologous_group_9853_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT3G46790 Vv06s0004g05500 POPTR_0009s04240 GRMZM2G380195 BRADI2G54927 LOC_Os01g62910 orthologous_group_9960_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT3G49142 Vv16s0050g02250 POPTR_0005s00730 LOC_Os03g13830 AC155390.2_FG002 BRADI1G68480 orthologous_group_9963_containing_osm_1_bdi_1_sbi_0_zm_1_viv_4_ath_1_pop_1_total_9 AT3G49170 POPTR_0015s02060 Vv00s2397g00010 Vv00s1944g00010 Vv00s2304g00010 Vv16s0050g02690 GRMZM2G462625 LOC_Os05g12130 BRADI2G32290 orthologous_group_9973_containing_osm_1_bdi_1_sbi_0_zm_5_viv_1_ath_1_pop_2_total_11 AT3G49400 POPTR_0015s00760 Vv16s0098g01030 POPTR_0012s02880 GRMZM2G457467 GRMZM2G155565 BRADI1G55216 LOC_Os01g50690 GRMZM2G160046 GRMZM2G458266 GRMZM2G036726 orthologous_group_10063_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT3G51500 Vv04s0044g01980 POPTR_0011s14940 GRMZM2G326545 BRADI3G06830 LOC_Os02g10010 orthologous_group_10152_containing_osm_2_bdi_2_sbi_0_zm_2_viv_2_ath_1_pop_2_total_11 AT3G52770 Vv08s0007g03480 LOC_Os12g05560 LOC_Os11g05430 BRADI4G42250 BRADI4G42480 AC194104.3_FG010 AC198781.3_FG002 POPTR_0006s08320 POPTR_0010s24410 Vv13s0106g00690 orthologous_group_10489_containing_osm_1_bdi_1_sbi_0_zm_3_viv_1_ath_1_pop_2_total_9 AT3G57600 POPTR_0016s05360 POPTR_0006s05320 Vv08s0007g08150 GRMZM2G419901 LOC_Os03g07830 GRMZM2G399098 GRMZM2G376255 BRADI1G72990 orthologous_group_10901_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT3G63370 Vv00s0179g00220 POPTR_0005s23690 BRADI5G12970 AC203841.3_FG002 LOC_Os04g39410 orthologous_group_11049_containing_osm_1_bdi_1_sbi_0_zm_2_viv_2_ath_2_pop_1_total_9 AT2G03130 Vv01s0011g01450 Vv00s2478g00020 POPTR_0003s07240 BRADI2G17250 LOC_Os05g48050 GRMZM2G101042 GRMZM2G057388 AT1G70190 orthologous_group_11176_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT2G05810 POPTR_0014s15680 POPTR_0002s21720 Vv12s0059g01340 BRADI4G21740 LOC_Os09g05030 GRMZM2G411260 orthologous_group_11297_containing_osm_1_bdi_1_sbi_0_zm_3_viv_1_ath_1_pop_1_total_8 AT2G14260 Vv03s0088g01170 POPTR_0001s29460 LOC_Os05g43830 GRMZM2G176396 BRADI2G19980 GRMZM2G044663 AC199187.3_FG006 orthologous_group_11379_containing_osm_2_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_7 AT2G16630 Vv03s0038g01960 POPTR_0004s17640 GRMZM2G130813 BRADI3G00370 LOC_Os02g01190 LOC_Os05g27970 orthologous_group_11789_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT2G25950 Vv04s0008g00760 POPTR_0018s04960 GRMZM2G090647 BRADI2G41230 LOC_Os01g37832 orthologous_group_11933_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT2G28250 Vv06s0004g02640 POPTR_0004s22260 POPTR_0009s01560 GRMZM2G063897 LOC_Os02g02040 BRADI3G01137 orthologous_group_12256_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT2G33390 POPTR_0010s07760 Vv05s0049g01730 POPTR_0008s17060 BRADI1G75850 GRMZM2G009014 LOC_Os03g04545 orthologous_group_12352_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_2_total_8 AT2G34780 POPTR_0001s46800 POPTR_0011s16450 Vv04s0008g03110 LOC_Os01g56290 BRADI2G51070 GRMZM2G029785 GRMZM2G049759 orthologous_group_12423_containing_osm_1_bdi_1_sbi_0_zm_8_viv_1_ath_1_pop_1_total_13 AT2G35920 POPTR_0016s06940 Vv08s0032g01230 BRADI3G28477 GRMZM2G399212 GRMZM2G384369 LOC_Os10g33275 GRMZM2G052798 GRMZM2G119671 GRMZM5G806292 GRMZM5G863698 GRMZM2G395467 GRMZM2G381470 orthologous_group_12461_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_2_pop_1_total_8 AT2G36815 Vv08s0007g04370 POPTR_0006s12020 LOC_Os04g47110 BRADI5G18070 GRMZM2G423734 GRMZM2G091592 AT1G03910 orthologous_group_12579_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT2G39000 BRADI3G54710 LOC_Os02g56219 GRMZM2G055141 Vv13s0019g04360 POPTR_0008s03920 POPTR_0010s22970 orthologous_group_12610_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT2G39650 POPTR_0010s21120 POPTR_0008s05620 Vv13s0019g01100 BRADI2G62310 LOC_Os01g74250 GRMZM2G045779 orthologous_group_12652_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT2G40430 Vv13s0067g02250 POPTR_0008s07560 LOC_Os05g05220 GRMZM2G060987 BRADI2G36460 orthologous_group_12779_containing_osm_1_bdi_1_sbi_0_zm_3_viv_3_ath_1_pop_2_total_11 AT2G42760 Vv12s0028g01730 Vv12s0028g01720 LOC_Os06g35770 BRADI1G38210 GRMZM5G860176 GRMZM5G843446 GRMZM5G838496 Vv10s0003g01580 POPTR_0004s05860 POPTR_0011s07010 orthologous_group_12949_containing_osm_1_bdi_1_sbi_0_zm_4_viv_1_ath_1_pop_1_total_9 AT2G46200 Vv15s0046g01740 POPTR_0014s08710 BRADI2G07530 LOC_Os01g12670 GRMZM2G058366 GRMZM2G401784 GRMZM2G174489 GRMZM2G009438 orthologous_group_13030_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT2G47990 POPTR_0014s13130 POPTR_0002s22790 Vv07s0104g01060 AC212570.3_FG006 LOC_Os07g12320 BRADI1G53790 orthologous_group_13388_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_4_total_9 AT1G08600 POPTR_0019s03810 POPTR_0013s04600 POPTR_0019s03800 POPTR_0013s04610 Vv14s0036g01460 BRADI3G27970 LOC_Os10g31970 GRMZM2G163436 orthologous_group_13440_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_2_pop_2_total_8 AT1G09410 Vv14s0060g00670 POPTR_0013s00680 POPTR_0005s00800 GRMZM2G476593 LOC_Os03g20190 BRADI1G64080 AT1G56690 orthologous_group_13445_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_3_total_8 AT1G09450 Vv14s0060g00450 POPTR_0005s00540 POPTR_0013s00520 GRMZM2G102944 BRADI1G20067 LOC_Os07g44360 POPTR_0005s00530 orthologous_group_13617_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT1G11820 Vv10s0116g01640 LOC_Os07g32600 BRADI1G26510 GRMZM2G325008 POPTR_0021s00450 POPTR_0004s01110 orthologous_group_13836_containing_osm_2_bdi_1_sbi_0_zm_1_viv_1_ath_2_pop_2_total_9 AT1G16000 Vv09s0018g00560 GRMZM2G141499 LOC_Os12g36930 LOC_Os08g17060 BRADI3G19520 POPTR_0003s18170 POPTR_0001s03490 AT1G80890 orthologous_group_14056_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT1G19290 Vv18s0001g11850 POPTR_0014s03970 BRADI1G15820 LOC_Os05g19390 GRMZM2G312954 orthologous_group_14104_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT1G20060 Vv18s0001g14580 POPTR_0013s06490 GRMZM2G315594 LOC_Os02g01180 BRADI3G00360 orthologous_group_14257_containing_osm_2_bdi_2_sbi_0_zm_3_viv_1_ath_2_pop_1_total_11 AT1G22330 POPTR_0002s09830 Vv18s0001g05840 BRADI3G58710 GRMZM2G303915 GRMZM2G169617 BRADI1G45540 LOC_Os02g51890 GRMZM2G073700 LOC_Os06g11730 AT1G78260 orthologous_group_14620_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT1G29790 Vv10s0003g02780 POPTR_0001s34490 POPTR_0011s02820 GRMZM2G457147 BRADI3G34510 LOC_Os10g42770 orthologous_group_14917_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT1G42480 Vv18s0122g01020 POPTR_0005s27500 LOC_Os02g02524 BRADI3G01590 GRMZM2G150193 orthologous_group_15100_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_1_total_6 AT1G49245 POPTR_0013s06980 Vv15s0107g00300 LOC_Os07g26974 GRMZM5G885473 BRADI1G28680 orthologous_group_15612_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_2_total_8 AT1G62990 Vv02s0025g04630 POPTR_0001s08540 POPTR_0001s08550 LOC_Os03g03164 BRADI1G76970 GRMZM2G159431 GRMZM2G060507 orthologous_group_15993_containing_osm_1_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_1_total_7 AT1G71310 Vv18s0089g00930 POPTR_0019s09650 LOC_Os01g65560 BRADI2G56682 GRMZM2G005061 GRMZM2G120814 orthologous_group_16191_containing_osm_2_bdi_1_sbi_0_zm_2_viv_1_ath_1_pop_2_total_9 AT1G76200 Vv18s0001g00230 POPTR_0002s01410 BRADI1G30060 LOC_Os06g50970 GRMZM2G106607 POPTR_0005s26990 LOC_Os05g03150 GRMZM2G181269 orthologous_group_16194_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_3_total_8 AT1G76250 Vv18s0001g00130 POPTR_0005s27100 POPTR_0002s01270 POPTR_0007s09910 LOC_Os02g02730 BRADI3G01830 GRMZM2G006197 orthologous_group_16276_containing_osm_1_bdi_1_sbi_0_zm_4_viv_1_ath_1_pop_1_total_9 AT1G77930 Vv18s0001g04440 POPTR_0002s09120 LOC_Os07g43330 BRADI1G20730 GRMZM2G019419 GRMZM5G840361 AC211512.4_FG002 GRMZM2G054745 orthologous_group_16329_containing_osm_1_bdi_1_sbi_0_zm_1_viv_1_ath_1_pop_2_total_7 AT1G79070 Vv19s0027g00470 POPTR_0011s14760 POPTR_0008s04280 BRADI3G14270 LOC_Os08g03590 GRMZM2G390489
Script: classify_candidate_unitary_gene_loss.sh
Function: classify candidate gene loss event into relic-retaining, relic-lacking and false positive one
Input: sbi/orthologous_groups_with_candidate_unitary_gene_loss_in_sbi.tsv & sbi/ppipe_out
Output: sbi/classify_candidate_gene_loss/sbi/relic-[retaining/lacking].group
Example for the file named relic-retaining.group, which contains relic-retaining gene loss event represented by the number of orthologous group happened in it.
orthologous_group_number 10152 11297 11789 11933 12256 12352 12423 12652 1299 13388 13445 14104 14257 15612 15993 16276 16329 2873 2995 3967 4348 6026 6308 6704 6806 7093 7244 7576 8076 828 834 8510 8557 8773 9445 955 9973
Example for the file named relic-lacking.group, which contains relic-lacking gene loss event represented by the number of orthologous group happened in it.
orthologous_group_number 10489 10901 11049 12579 12949 1318 13440 13617 13836 14056 14620 14917 15100 16194 193 2096 290 5995 6818 7605 7646 8698 9201 9841 9853 9960 9963
Script: ensure_relic-retaining_unitary_gene_loss.sh
Function: search disablers, i.e. frameshift and premature stop-codon, for relics via GeneWise and select which containing at least one disabler as the final relic-retaining gene loss event
Input: sbi/classify_candidate_gene_loss/sbi/relic-retaining.group & sbi/ppipe_out
Note: In this step, we use GeneWise to detect disablers for relic-retaining gene loss event, so you need install it first. In addition, for running GeneWise, it is necessary to provide genomic sequence of relic and the protein sequence of its orthologous counterpart. In our pipeline, we extract genomic sequence via gmap_setup, which is in GMAP package, and protein sequence via fastacmd, which is in BLAST package.
Output: sbi/ensure_relic-retaining_gene_loss/relic-retaining_unitary_gene_loss_of_sbi.[group/info]
Example for the file named relic-retaining_unitary_gene_loss_of_sbi.info, which contains disablers information of relic-retaining unitary gene loss event in sorghum. The columns in this file display the number of orthologous group in which relic-retaining gene loss happened, orthologous counterpart of the lost gene, detailed genomic position of relic left after unitary gene loss (chromosome, start bp, end bp and strand) and its disablers information (number of frameshift and premature stop-codon sites).
orthologous_group_number orthologous_conterpart chr start end strand frameshift_number premature_stop-codon_number 8076 GRMZM2G429231 2 75118681 75121777 + 5 0 6806 GRMZM2G377780 1 69714301 69716778 + 1 0 9973 GRMZM2G155565 1 2696706 2698191 + 1 0 6026 GRMZM2G016774 1 1343902 1348595 + 1 0 12352 GRMZM2G049759 3 63753628 63758498 + 0 1 834 GRMZM2G160702 6 61013583 61019973 - 1 0 9445 GRMZM2G028307 8 6901429 6905071 + 1 0 11297 GRMZM2G176396 9 54842922 54845280 - 1 0 2873 GRMZM2G050218 7 63759369 63760561 + 3 1
Script: locate_relic-lacking_unitary_gene_loss.sh
Function: locate relic-lacking gene loss event in the genome based on synteny relationships
Input: sbi/classify_candidate_gene_loss/sbi/relic-lacking.group & EVS009_Supplemental_dataset_S1_pluslinks.tsv
Output: sbi/locate_relic-lacking_gene_loss/relic-lacking_unitary_gene_loss_of_sbi.[group/info]
Example for the file named relic-lacking_unitary_gene_loss_of_sbi.info, which contains position information of relic-lacking unitary gene loss event in sorghum. The column content is the same with file containing disablers information of relic-retaining unitary gene loss event in sorghum, but NA is labeled for disablers information.
orthologous_group_number orthologous_counterpart chr start end strand frameshift_number premature_stop-codon_number 12579 BRADI3G54710 4 66266780 66302303 + NA NA 13836 LOC_Os08g17060 7 20735518 22596183 + NA NA 14620 BRADI3G34510 1 48453206 48521448 - NA NA 14917 GRMZM2G150193 4 1084578 1094215 + NA NA 15100 LOC_Os07g26974 2 17105974 17323562 - NA NA 290 LOC_Os07g42826 2 73617420 73625727 + NA NA
Script: integrate_unitary_gene_loss_info.sh
Function: integrate information for unitary gene loss events, in spite of relic-retaining or relic-lacking ones
Input: sbi/[ensure_relic-retaining_gene_loss/locate_relic-lacking_gene_loss]/sbi/[relic-retaining/relic-lacking]_unitary_gene_loss_of_sbi.[group/info]
Output: unitary_gene_loss_info_for_sbi.tsv
Example for the file named unitary_gene_loss_info_for_sbi.tsv, which contains unitary gene loss information in sorghum. The columns in this file display the number of orthologous group in which unitary gene loss happened, orthologous counterpart of the lost gene, the detailed genomic position of relic left after unitary gene loss (chromosome, start bp, end bp and strand), disablers information of the left relic (number of frameshift and premature stop-codon sites), the classification of gene loss event and the species gene loss event happened in.
orthologous_group_number orthologous_counterpart chr start end strand frameshift_number premature_stop-codon_number unitary_gene_loss_classification unitary_gene_loss_species 11297 GRMZM2G176396 9 54842922 54845280 - 1 0 relic_retaining sbi 13836 LOC_Os08g17060 7 20735518 22596183 + NA NA relic_lacking sbi 6026 GRMZM2G016774 1 1343902 1348595 + 1 0 relic_retaining sbi 12352 GRMZM2G049759 3 63753628 63758498 + 0 1 relic_retaining sbi 14917 GRMZM2G150193 4 1084578 1094215 + NA NA relic_lacking sbi 12579 BRADI3G54710 4 66266780 66302303 + NA NA relic_lacking sbi 8076 GRMZM2G429231 2 75118681 75121777 + 5 0 relic_retaining sbi 6806 GRMZM2G377780 1 69714301 69716778 + 1 0 relic_retaining sbi 15100 LOC_Os07g26974 2 17105974 17323562 - NA NA relic_lacking sbi 9973 GRMZM2G155565 1 2696706 2698191 + 1 0 relic_retaining sbi 834 GRMZM2G160702 6 61013583 61019973 - 1 0 relic_retaining sbi 9445 GRMZM2G028307 8 6901429 6905071 + 1 0 relic_retaining sbi 2873 GRMZM2G050218 7 63759369 63760561 + 3 1 relic_retaining sbi 14620 BRADI3G34510 1 48453206 48521448 - NA NA relic_lacking sbi 290 LOC_Os07g42826 2 73617420 73625727 + NA NA relic_lacking sbi
If you want to detect unitary gene loss event in the MRCA (Most Recent Common Ancestor) of multiple species, e.g. sorghum and maize,you should prepare PseudoPipe result and GMAP genome database for each species as shown below.
Then you can run our pipeline with the parameter sbi and zm in the directory unitary_gene_loss_detection, which is shown below.
Finally, you can determine in which species or MRCA the unitary gene loss event happened based on the output file unitary_gene_loss_info_for_sbi_zm.tsv.From the example shown below you can conclude that the unitary gene loss in orthologous group 8968 is happened in the MRCA of sorghum and maize,the unitary gene loss in orthologous group 4397 is happened in sorghum, while the unitary gene loss in orthologous group 10276 is happened in maize.
orthologous_group_number orthologous_counterpart chr start end strand frameshift_number premature_stop-codon_number unitary_gene_loss_classification unitary_gene_loss_species 4397 LOC_Os02g08140 3 47955604 47956683 + 1 0 relic-retaining sbi 8968 LOC_Os02g52510 4 63921086 63928674 - 2 0 relic-retaining sbi 8968 LOC_Os02g52510 5 209754553 209757880 - 2 0 relic-retaining zm 5146 LOC_Os03g10940 1 6133765 6136732 - 1 0 relic-retaining sbi 10318 LOC_Os01g74500 3 74386656 74396272 + NA NA relic-lacking sbi 165 LOC_Os11g48060 4 57749194 57750836 + 2 0 relic-retaining sbi 10276 LOC_Os10g28230 3 157041566 157041979 + 2 0 relic-retaining zm 16364 LOC_Os01g55549 3 63381108 63384547 - 4 0 relic-retaining sbi 11573 LOC_Os04g38470 10 121455394 121492570 + NA NA relic-lacking zm 9435 LOC_Os12g18650 6 49944588 49946702 - 2 0 relic-retaining sbi
Multiple alignment of promoter regions of relic and its orthologous counterparts
You can download the MAF_file.rar.