Why and when should you prefer PSI-BLAST over blastp? Why do E-values differ in PSI-BLAST between the first search and the second iteration?

Bioinformatics Exam

1. (a) Why and when should you prefer PSI-BLAST over blastp?

(b) Why do E-values differ in PSI-BLAST between the first search and the second iteration?

(c) What does PSSM stand for? Which BLAST search program uses the PSSM?

(d) In what situation does one need to verify two putative homologs using pairwise alignment, and why?
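
For part (c), a toy calculation can make the idea of a position-specific scoring matrix concrete. The sketch below is not PSI-BLAST's actual procedure, which also weights sequences and uses data-dependent pseudocounts. It assumes a uniform background frequency and a fixed pseudocount, and `toy_pssm` is a hypothetical helper name chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        # Count only standard residues; gaps and other symbols are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Toy alignment (made up for illustration): column 2 is K, R, K.
aligned = ["MKV", "MRV", "MKV"]
pssm = toy_pssm(aligned)
print(round(pssm[1]["K"], 2), round(pssm[1]["R"], 2), round(pssm[1]["A"], 2))
```

The same residue gets a different score in each column, which is what distinguishes a PSSM from a fixed substitution matrix such as BLOSUM62.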

2. Use PSI-BLAST to find homologues of protein P14629 (select the UniProtKB/Swiss-Prot database), limited to vertebrates.
a. What threshold did you use and why?

b. What type of enzyme shows up in almost every description?

c. Find the protein with accession number Q9UQ84.2. What is the enzyme?

d. What is the E value for Q9UQ84.2?

e. Confirm homology between Q9UQ84.2 and P14629 using pairwise BLAST alignment. What is the E value?

f. Are they homologues based on that E value?

g. Why might BLAST not have found all homologs on the first iteration?

h. Other than pairwise alignment, describe what else you could have done on your computer with BLAST and alignment tools to try to confirm homology between Q9UQ84.2 and P14629.

i. Describe what else you could do in the wet lab to try to confirm homology between Q9UQ84.2 and P14629.

3. Retrieve the following five protein sequences in FASTA format. Be sure to include the top (descriptive) line, which carries the gi/accession numbers. You can use either number (gi or accession/version) to find the records.

Sequence #  GI        Accession  Histone variant (description)
1           22901744  AAN10051   H3.1
2           23664258  AAN39283   H3.2
3           55977062  P84243     H3.3
4           4504299   NP_003484  H3t
5           1345726   P49450     CENPA
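
The retrieval step can also be scripted instead of done by hand. The sketch below builds a request to the NCBI E-utilities `efetch` endpoint for the five accessions listed above. It assumes that NCBI's protein database resolves all five accessions (including the two Swiss-Prot ones) and that network access is available. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accessions taken from the table in question 3.
ACCESSIONS = ["AAN10051", "AAN39283", "P84243", "NP_003484", "P49450"]


def efetch_url(accessions):
    """Build an efetch URL that returns protein records as plain-text FASTA."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the FASTA records (needs network access)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_fasta(ACCESSIONS)` and writing the returned text to a file should give a single multi-FASTA file, description lines included, that can be pasted or uploaded in one step.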

Go to the MUSCLE website: http://www.ebi.ac.uk/Tools/msa/muscle/
Paste all five sequences (one paste) into the window and align the sequences. Alternatively, upload the saved file.
a. Some regions have more asterisks than other regions. What can you say about the protein regions with more asterisks (other than ‘there is a lot of identity’)?

b. Looking at the N-terminal 40 amino acids, which sequence differs most from the others, and what is its name?

c. Keeping the MUSCLE result available, do the same alignment with MAFFT (http://mafft.cbrc.jp/alignment/server/). Find the discrepancies between the two results. Where is the alignment different?

d. In PubMed, find a 2005 paper by Govin et al. with “histone H3” in the title. Open the full text. What figure in that paper resembles your alignment?

e. Does the alignment in that figure more closely match your MAFFT or your MUSCLE result?

f. Which of the original five proteins is the protein mentioned in the title of the Govin paper (look in the figure legend)?

g. What is CENPA? With what DNA sequences is CENPA associated?
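
For part (c), spotting discrepancies by eye is easy to get wrong, so a small script can help. The sketch below assumes both tools' results were saved in aligned-FASTA format with the same sequence identifiers. It reports, for each sequence, the first column at which the two gapped rows differ. This is a simple row-by-row check, not a full alignment-comparison metric, and the helper names and toy rows are made up for illustration.

```python
def parse_fasta(text):
    """Parse aligned FASTA text into {sequence id: gapped sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def first_differences(aln_a, aln_b):
    """Map each shared id to the first 1-based column where its rows differ."""
    diffs = {}
    for name in aln_a.keys() & aln_b.keys():
        row_a, row_b = aln_a[name], aln_b[name]
        if row_a == row_b:
            continue
        for col, (a, b) in enumerate(zip(row_a, row_b), start=1):
            if a != b:
                diffs[name] = col
                break
        else:
            # One row is a prefix of the other; they diverge just past its end.
            diffs[name] = min(len(row_a), len(row_b)) + 1
    return diffs


# Toy input standing in for the two saved results.
muscle_text = ">s1\nMAR-TKQ\n>s2\nMARSTKQ\n"
mafft_text = ">s1\nMART-KQ\n>s2\nMARSTKQ\n"
print(first_differences(parse_fasta(muscle_text), parse_fasta(mafft_text)))
```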

4. Align the attached hemagglutinin sequences using Clustal Omega and MUSCLE.
a. In Clustal Omega, open Jalview (Results Viewers tab). Look for the regions with heavy color (very conserved). At what position does the conserved region start?

b. Open Jalview (Result Summary Tab) after running MUSCLE. Where is the conserved region?

c. Which flu sequence(s) are much shorter than the others?

d. Which sequence has the longest C-terminal gap?

e. Do you think it’s wise to try several multiple alignment programs? Briefly explain.

5. Using the potato inhibitor protein below, run a PHI-BLAST search against the RefSeq database using the pattern [FYW]P[EQH][LIV](2)Gx(2)[STAGV]x(2)A. Set the threshold to 0.0001. On the organism line, limit your output to eudicots.
>tr|A8V3U6|A8V3U6_HORVD Chymotrypsin inhibitor2 OS=Hordeum vulgare var. distichum GN=CI2 PE=4 SV=1
MSSMEKKPEGVNIGAGDRQNQKTEWPELVGKSVEEAKKVILQDKPEAQIIVLPVGTIVTI
EYRIDRVRLFVDRLDNIAQVPRVG

a. What are eudicots?

b. What is the E-value of the best match?

c. At what amino acid position does the pattern begin in the query sequence (scroll to the alignment)?

d. Run a second iteration. How many of the sequences are new (approximate)?

e. Should PHI-BLAST/PSI-BLAST always continue to find new hits at every iteration? Briefly explain.

f. At what iteration are no more new sequences added?
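
As a cross-check for part (c), the pattern can be translated into a regular expression and searched against the query directly. The sketch below handles only the subset of PROSITE-style syntax that appears in the pattern above (bracketed residue sets, `x` for any residue, and `(n)` or `(n,m)` repeat counts). The function names are hypothetical, and the result should agree with the position shown in the PHI-BLAST alignment.

```python
import re

QUERY = (
    "MSSMEKKPEGVNIGAGDRQNQKTEWPELVGKSVEEAKKVILQDKPEAQIIVLPVGTIVTI"
    "EYRIDRVRLFVDRLDNIAQVPRVG"
)
PATTERN = "[FYW]P[EQH][LIV](2)Gx(2)[STAGV]x(2)A"


def _repeat(match):
    """Turn a PROSITE repeat such as (2) or (2,4) into {2} or {2,4}."""
    low, high = match.group(1), match.group(2)
    return "{" + low + ("," + high if high else "") + "}"


def phi_to_regex(pattern):
    """Translate the pattern subset described above into a Python regex."""
    regex = pattern.replace("-", "")                       # optional separators
    regex = re.sub(r"\((\d+)(?:,(\d+))?\)", _repeat, regex)
    return regex.replace("x", ".")                         # x = any residue


def pattern_start(sequence, pattern):
    """Return the 1-based start of the first match, or None if there is none."""
    match = re.search(phi_to_regex(pattern), sequence)
    return match.start() + 1 if match else None


print(phi_to_regex(PATTERN))
print(pattern_start(QUERY, PATTERN))
```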

6. Run multiple alignment programs using the attached glutathione peroxidases (gluperox).
Start with Clustal Omega.

a. Which sequences appear to have a longer N-terminus than the others?

b. An asterisk represents a conserved position that is identical in all sequences. How many such positions are present?

c. Compare the Clustal Omega output to the gluperox image. The researchers used ClustalW to produce that figure. How close is the output to what is seen in the figure? Can you find any major differences (other than order of sequences)?

d. The sequences all have one selenocysteine residue (U). At what amino acid position is selenocysteine (use gi number 285803075 as your reference sequence to determine the position)?

e. Is selenocysteine aligned perfectly by Clustal?

f. Name the amino acid that comes immediately before selenocysteine in most sequences. Then name the amino acid that follows selenocysteine in most sequences.

g. Now run MUSCLE. Use the file gluoperox_2 because MUSCLE does not work with selenocysteine (the U is replaced by X). Comment on how different the alignment looks from the Clustal alignment.

h. The X represents selenocysteine. Does the X align perfectly?

i. Now run T-Coffee (http://tcoffee.crg.cat/apps/tcoffee/do:regular). Is the T-Coffee output closer to the MUSCLE or the Clustal output?

j. Is the selenocysteine perfectly aligned? If not, where does it differ (what sequences)?
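
Counting asterisks (part b) and checking whether one residue sits in a single column (parts e, h and j) can both be scripted once the aligned rows are in hand. The sketch below assumes the rows are equal-length gapped strings taken from the alignment output. It marks a column with `*` only when every sequence has the same residue and no gap, which matches the definition given in part (b). The function names and toy rows are made up for illustration.

```python
def conservation_line(rows):
    """Return a string with '*' under fully identical, gap-free columns."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


def residue_columns(rows, residue):
    """Return the 1-based column of the first `residue` in each row (0 if absent)."""
    return [row.find(residue) + 1 for row in rows]


# Toy rows, not the real glutathione peroxidase alignment.
rows = ["MKU-LV", "MKUALV", "MRUALV"]
line = conservation_line(rows)
print(line.count("*"))
print(len(set(residue_columns(rows, "U"))) == 1)  # True if U shares one column
```

For the MUSCLE run, the same check applies with `"X"` in place of `"U"`.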
