Valeriy Y. Tarasov
Education/Job
Biophysics/Molecular Biology - wet labs.
Bioinformatics/Software development for molecular biology on Mac and iPhone - private.
Hobby: Physics - h-space theory (ToE)
Important. The assembly can be performed efficiently for up to several hundred reads, not more!
“open” button – opens files with the extensions .ab1, .abi, .txt.
“save” button – saves the selected contig consensus to a file with the .txt extension.
“delete” button – deletes the selected read files in the files table.
“clear” button – deletes all read files in the files table.
“print” button – prints the assembly graphic map or saves it to a PDF or EPS file.
Reads files table
Assembly – graphic map, assembled reads, consensus and coverage
Reads files table
Press the “open” button to open one or multiple read files (.txt, .ab1, .abi). The file names will appear in the column “File Name”. To delete files, select the rows and press the “delete” button. To clear the files table, press the “clear” button.
Assembly – graphic map, assembled reads, consensus and coverage
To start the assembly, select the minimal read overlap value, then press the “assemble” button. Try different values until the result is appropriate. To cancel the assembly, press the “cancel” button. The assembly result will appear as a set of contigs in the contigs table. Select a contig in the table to see the assembly map. In the map each arrow represents a sequence file. A green arrow corresponds to the original file sequence, and a red arrow represents its reverse complement sequence. Click on an arrow to see the corresponding file in the files table.
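The two notions used above, the reverse complement of a read and a minimal overlap between two reads, can be illustrated with a minimal Python sketch. It is not the program's assembly algorithm: the function names are hypothetical, only exact suffix/prefix matches are considered, and real reads would also need mismatch handling.

```python
# Illustrative only: function names and exact-match logic are assumptions,
# not the BioLabDonkey implementation.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when the longest exact overlap is shorter than `min_overlap`.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

print(reverse_complement("AACG"))                 # CGTT
print(overlap_length("ACGTTGCA", "TGCAGG", 3))    # 4
print(overlap_length("ACGTTGCA", "TGCAGG", 5))    # 0
```

Raising the minimal overlap makes accidental joins less likely but can split a contig when reads share only a short region, which is why different values may need to be tried.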
For .ab1 and .abi files, a double click on the selected file row opens a new window with the ABI chromatogram. If the contig length is less than 3000 bp, the read sequences and the consensus are displayed. For longer contigs, click and drag on the assembly map to see the read sequences and consensus in the corresponding region. To shift the view frame to the left or right, use the “<” and “>” buttons. To return to the whole contig view, use the “return” button.
Below the assembly map, each vertical red or green bar represents the assembly coverage for one nucleotide. Red bars mark nucleotide variations. On the assembly map these variations are displayed as vertical dashed red lines. Place the mouse pointer over a bar to see the coverage number and the nucleotide variation. This functionality is available when the contig frame is less than 3000 bp. Press the “save” button to save a contig consensus sequence to a .txt file. To open a consensus sequence in the “DNA1”, “DNA2” or “Alignment” tab, press the corresponding button.
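As an illustration of what the coverage bars show, the sketch below computes per-nucleotide coverage, flags positions with nucleotide variation, and derives a majority consensus. The input layout (a list of `(start, sequence)` pairs for reads already placed on the contig) and the function name are assumptions made for this example, not the program's internal data structures.

```python
from collections import Counter

def column_summary(placed_reads, contig_length):
    """Summarize each contig position from reads placed at known offsets.

    placed_reads: list of (start, sequence) pairs (hypothetical layout).
    Returns (coverage, has_variation, consensus).
    """
    columns = [Counter() for _ in range(contig_length)]
    for start, seq in placed_reads:
        for i, base in enumerate(seq):
            pos = start + i
            if 0 <= pos < contig_length:
                columns[pos][base] += 1
    coverage = [sum(c.values()) for c in columns]
    # A position counts as variable when reads disagree on the base.
    has_variation = [len(c) > 1 for c in columns]
    # Majority base per position; "N" where no read covers the position.
    consensus = "".join(c.most_common(1)[0][0] if c else "N" for c in columns)
    return coverage, has_variation, consensus

reads = [(0, "ACGT"), (2, "GTAA"), (1, "CCT")]
coverage, has_variation, consensus = column_summary(reads, 6)
print(coverage)       # [1, 2, 3, 3, 1, 1]
print(has_variation)  # [False, False, True, False, False, False]
print(consensus)      # ACGTAA
```

In this toy input the third position is covered by three reads that disagree (G, G, C), so it would correspond to a red bar.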
Prediction of protein secondary structures as sequence pattern search
The sequences of secondary structure elements can be extracted from PDB files, and the patterns generated from these sequences can be used to search for matches in a target sequence.
Problem with the prediction using patterns
A database of protein patterns in the standard 20-letter amino acid code is not feasible to generate or to use because there are too many pattern variants. For an alpha helix pattern 8 aa long (about two turns) the maximal number of pattern variants is 20 to the power of 8 (about 2.56 × 10^10).
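The size of the pattern space follows from a simple count: a pattern of length n over an alphabet of k symbols has at most k to the power of n variants. A short Python check, with a hypothetical helper name, compares the standard 20-letter code with a grouped code of 8 symbols:

```python
def max_pattern_variants(alphabet_size, pattern_length):
    """Upper bound on distinct patterns: alphabet_size ** pattern_length."""
    return alphabet_size ** pattern_length

standard = max_pattern_variants(20, 8)   # 20 amino acids, 8 aa pattern
grouped = max_pattern_variants(8, 8)     # 8 amino acid groups, 8 aa pattern

print(f"{standard:,}")        # 25,600,000,000
print(f"{grouped:,}")         # 16,777,216
print(standard // grouped)    # 1525
```

Grouping the amino acids therefore shrinks the upper bound by a factor of about 1500 for octamer patterns.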
Solution to the problem of the high number of patterns
Amino acids can be grouped according to their physicochemical properties: hydrophobicity, negative or positive charge, etc. The 20 amino acids can be split into 8 groups, and for an 8 aa alpha helix pattern in the new code the maximal number of variants is 16777216 (8 to the power of 8). The real number of alpha helix patterns should be smaller than that. The following grouping was used in BioLabDonkey Version 1.0:
Standard code – property – New code
1. V, I, L, W, F, C – very hydrophobic – W
2. A, M – less hydrophobic – V
3. N, Q, S, T, Y – polar neutral – O
4. D, E – negatively charged – N
5. K, R – positively charged – P
6. H – B
7. G – F
8. P – S
With this grouping the prediction produced more false positive results. A better outcome is seen with the following grouping, implemented in the BioLabDonkey update, Version 1.1 (from 23.10.2019):
Standard code – property – New code
1. V, I, L, F, C, M – hydrophobic – W
2. A – V
3. S, T – polar, hydroxylic – O
4. D, E – negatively charged – N
5. K, R – positively charged – P
6. H – B
7. G – F
8. P – S
9. W, Y – polar, aromatic – A
10. N, Q – polar, amidic – Q
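To make the recoding concrete, the Version 1.1 grouping listed above can be written as a lookup table that translates a sequence from the standard code into the new code. This is a minimal sketch: the names `GROUPS_V1_1`, `TO_NEW_CODE` and `encode` are hypothetical, and mapping unknown residues to "X" is an assumption of the example.

```python
# New-code letter -> standard amino acids, as in the Version 1.1 grouping.
GROUPS_V1_1 = {
    "W": "VILFCM",  # hydrophobic
    "V": "A",
    "O": "ST",      # polar, hydroxylic
    "N": "DE",      # negatively charged
    "P": "KR",      # positively charged
    "B": "H",
    "F": "G",
    "S": "P",
    "A": "WY",      # polar, aromatic
    "Q": "NQ",
}

# Invert the table: standard amino acid -> new-code letter.
TO_NEW_CODE = {aa: code for code, members in GROUPS_V1_1.items() for aa in members}

def encode(sequence):
    """Translate a protein sequence into the reduced 10-letter code.

    Residues outside the 20 standard amino acids become "X" (an assumption).
    """
    return "".join(TO_NEW_CODE.get(aa, "X") for aa in sequence.upper())

print(len(TO_NEW_CODE))       # 20
print(encode("MKTAYIAKQR"))   # WPOVAWVPQP
```

Note that the same letters mean different things in the two codes (for example, "W" is tryptophan in the standard code but the hydrophobic group in the new code), so encoded and unencoded sequences should not be mixed.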
Generation of database for alpha helices, beta strands and turns from PDB
Alpha helix. The minimal pattern size for an alpha helix was set to 8 aa, about two turns. An octamer is considered the minimum for a stable alpha helix. The octamer patterns were extracted from alpha helix regions in PDB.
Beta strand. The beta strand sequences were taken as they are present in PDB.
Turn. The turn patterns were set as sequences 4 aa long that include glycine or proline.
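One possible way to derive such patterns is sketched below. It assumes that octamers are taken as all overlapping 8-residue windows of a helix region, which the description above does not specify, and that a turn candidate is any 4-residue window containing G or P. The function names are hypothetical, and parsing of PDB files is left out.

```python
def helix_octamers(helix_sequence, size=8):
    """All overlapping windows of `size` residues from one helix region.

    Sliding-window extraction is an assumption of this sketch.
    Helices shorter than `size` yield no patterns.
    """
    return {
        helix_sequence[i:i + size]
        for i in range(len(helix_sequence) - size + 1)
    }

def turn_candidates(sequence, size=4):
    """All windows of `size` residues that contain glycine or proline."""
    windows = (sequence[i:i + size] for i in range(len(sequence) - size + 1))
    return {w for w in windows if "G" in w or "P" in w}

print(sorted(helix_octamers("AEELLKKLAE")))
# ['AEELLKKL', 'EELLKKLA', 'ELLKKLAE']
print(sorted(turn_candidates("AAGSAAA")))
# ['AAGS', 'AGSA', 'GSAA']
```

The same functions work on sequences already translated into the grouped code, provided the glycine and proline test uses the corresponding new-code letters.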
The database was generated from the following organisms: Saccharomyces cerevisiae, Helicobacter pylori, Klebsiella pneumoniae, E. coli, Mycobacterium tuberculosis, Pseudomonas aeruginosa, Salmonella typhimurium, Staphylococcus aureus, Streptococcus pneumoniae, Vibrio cholerae, Bacillus subtilis, Homo sapiens.
The number of generated patterns in the BioLabDonkey database Version 1.1 (from 23.10.2019): alpha helix – 200155, beta strand – 26054.
Following the mechanism of cotranslational folding, in which an alpha helix can form inside the ribosomal exit tunnel before the beta sheets do, the alpha helices were searched first. The sequence regions not occupied by alpha helices were then searched for beta strands. Finally, the sequence regions free of both alpha helices and beta strands were tested for turn patterns.
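The search order described above (helices first, then strands in the remaining regions, then turns in what is still free) can be sketched as follows. This is an illustration under assumptions, not the program's code: matching is exact substring search, the labels "H", "E", "T" and "-" are chosen for the example, and the function names are hypothetical.

```python
def mark_matches(sequence, patterns, label, labels):
    """Label every exact match of a pattern whose positions are still free.

    Positions already carrying the same label are allowed, so overlapping
    matches of one class merge into a longer element.
    """
    for pattern in patterns:
        size = len(pattern)
        start = sequence.find(pattern)
        while start != -1:
            if all(l in ("-", label) for l in labels[start:start + size]):
                labels[start:start + size] = [label] * size
            start = sequence.find(pattern, start + 1)

def predict(sequence, helices, strands, turns):
    """Assign helix (H), strand (E) and turn (T) labels in priority order."""
    labels = ["-"] * len(sequence)
    for label, patterns in (("H", helices), ("E", strands), ("T", turns)):
        mark_matches(sequence, patterns, label, labels)
    return "".join(labels)

print(predict("MLLEELLKKGPSGVVVVA", {"LLEELLKK"}, {"VVVV"}, {"GPSG"}))
# -HHHHHHHHTTTTEEEE-
print(predict("LLEELLKK", {"LLEELLKK"}, {"ELLK"}, set()))
# HHHHHHHH
```

In the second call the strand pattern matches inside a region already assigned to a helix, so it is ignored, which reflects the priority of helices over strands.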
Evaluation of the prediction accuracy (random examples)
1. Comparison of the prediction with the secondary structure from a PDB entry when the patterns database does not include the patterns from this entry. A good prediction is expected to have as few false negative and false positive secondary structures as possible.
D-ornithine/D-lysine decarboxylase from Salmonella typhimurium
Xenopus laevis MHC I complex
2. Comparison of the prediction with the secondary structure from a PDB entry when the patterns database includes the patterns from this entry. A good prediction is expected to have as few false positive secondary structures as possible.
F41 fragment of flagellin of Salmonella typhimurium
3. Comparison of the prediction with the results of other algorithms – machine learning–based techniques.
Comparison with Jpred4 (no similarity to sequences with known PDB) for “NgrC protein” (from Providencia stuartii plasmid pTC2)
Comparison with Jpred4 (no similarity to sequences with known PDB) for “hypothetical protein” (from Providencia stuartii plasmid pTC2)
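For readers who want to quantify such comparisons, one simple option is to count false positives and false negatives per residue between a predicted and a reference secondary structure string. The sketch below assumes both strings have equal length and use one label per residue ("H", "E", "T", "-"); the evaluation above compares secondary structure elements, so this per-residue count is only an approximation, and the function name is hypothetical.

```python
def count_errors(predicted, reference, label):
    """Per-residue false positives and false negatives for one label.

    Both strings must have one character per residue and equal length.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = list(zip(predicted, reference))
    false_positives = sum(p == label and r != label for p, r in pairs)
    false_negatives = sum(p != label and r == label for p, r in pairs)
    return false_positives, false_negatives

predicted = "HHHH--EE-"
reference = "HHH---EEE"
print(count_errors(predicted, reference, "H"))  # (1, 0)
print(count_errors(predicted, reference, "E"))  # (0, 1)
```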
Collection and Use of Personal Information
No personal information (data) that can be used to identify or contact a single person is collected.
Collection and Use of Non-Personal Information
No non-personal information (data) is collected. The user can generate database txt files for DNA feature annotation and protein secondary structure prediction. These files are stored inside the program and can be exported or imported by the user.
Cookies and Other Technologies
No cookies or other tracking technologies are used.
Disclosure to Third Parties and Service Providers
There is no disclosure to third parties or service providers.
The Existence of Automated Profiling
The program does not perform any user profiling.