Fig. 6. Immunofluorescence staining of insulin (green), Fem1b (red), and merged image demonstrating that Fem1b is expressed not only in insulin-positive β cells but also in insulin-negative non-β cells.

Fig. 7. Immunofluorescence staining with a combination of antibodies to glucagon and somatostatin (red) and Fem1b (green) and a merged image verifying expression of Fem1b within non-β cells.

The coimmunostaining with antibodies to glucagon and somatostatin, markers for α cells and δ cells, respectively, demonstrates that the Fem1b protein is also expressed in these non-β cells (Figure 7).

## **3. Fem1b gene search and alignment**

### **3.1 BLAST search**

The Basic Local Alignment Search Tool (BLAST) is an algorithm for comparing amino acid or nucleotide sequences. A BLAST search compares an unknown sequence against a library or database of known sequences and identifies the library sequences that resemble the unknown sequence above a certain similarity percentage [11] (usually 40%). This chapter gives an example that follows the discovery of a previously unknown fem1b gene in the mouse and performs a BLAST search of the human genome to see whether humans carry a similar fem1b gene. BLAST identifies sequences in the human genome that resemble the mouse fem1b gene on the basis of sequence similarity. Given the sequence of one fragment of the mouse gene (Figure 8), the BLAST software searches the human gene banks and finds similar genes. The official BLAST website is http://blast.ncbi.nlm.nih.gov/Blast.cgi.

Fig. 8. Unknown mouse gene.

When the results page appears, click the identifier with the highest score to see the following information. Here the highest score is 481. The score is calculated from the match quality and the length of the most similar segments shared by the unknown mouse gene and the target human fem1b gene.
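To illustrate how match quality and segment length both feed into a score, the following is a minimal sketch in Python. It is not the BLAST implementation: the function names and the reward and penalty values are assumptions chosen for illustration, and BLAST itself reports scores that are further derived from such raw values using statistical parameters.

```python
def ungapped_segment_score(query: str, subject: str,
                           reward: int = 1, penalty: int = -2) -> int:
    """Raw score of two equal-length, already aligned segments (no gaps).

    The reward and penalty values are illustrative assumptions.
    """
    if len(query) != len(subject):
        raise ValueError("segments must have the same length")
    # Each matching position adds the reward, each mismatch adds the penalty.
    return sum(reward if q == s else penalty for q, s in zip(query, subject))


def percent_identity(query: str, subject: str) -> float:
    """Percentage of aligned positions that are identical."""
    if len(query) != len(subject) or not query:
        raise ValueError("segments must be non-empty and of equal length")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)
```

Under this scheme a longer run of matching positions scores higher than a shorter one, and every mismatch lowers the total, which is the behaviour the paragraph above describes.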


When you scroll down the page, you reach a long listing of the human fem1b nucleotide sequence.

### **3.2 Sequence statistics analysis**

Sections of a nucleotide sequence with a particular percentage of A+T or C+G usually indicate intergenic parts of the sequence. Figure 9 plots the monomer densities and the combined monomer densities. Such a statistical plot can be used to determine whether the sequence has the characteristics of a protein-coding region.
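Combined monomer densities of this kind can be computed with a sliding window. The sketch below is one possible way to do it in Python; the function name and the default window length are assumptions, and the tool used to produce Figure 9 may compute its densities differently.

```python
def window_densities(seq: str, window: int = 20):
    """Yield (start, A+T fraction, C+G fraction) for every full sliding window."""
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        at = (chunk.count("A") + chunk.count("T")) / window
        gc = (chunk.count("C") + chunk.count("G")) / window
        yield start, at, gc
```

Plotting the two fractions against the window start position gives curves comparable in spirit to the combined densities described above.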

Figure 10 visualizes the nucleotide distribution. Figure 11 shows the codon distribution, with high counts of GAA, GAT, and AAC. These codons encode glutamate, aspartate, and asparagine, respectively. The corresponding bar chart of the amino acid distribution is displayed in Figure 12. The sequence is noticeably rich in leucine, alanine, and valine and poor in tryptophan, methionine, and proline.
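Codon and amino acid counts like those discussed above can be tallied as follows. This is a minimal sketch that assumes the standard genetic code and a sequence read in a single forward frame; the function names are hypothetical, and unknown codons are reported as "X" by assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(seq: str, frame: int = 0) -> Counter:
    """Count complete codons in the given forward frame (0, 1, or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def amino_acid_counts(seq: str, frame: int = 0) -> Counter:
    """Count the amino acids encoded by the codons; '*' marks a stop codon."""
    counts = Counter()
    for codon, n in codon_counts(seq, frame).items():
        counts[CODON_TABLE.get(codon, "X")] += n
    return counts
```

For example, `CODON_TABLE["GAA"]`, `CODON_TABLE["GAT"]`, and `CODON_TABLE["AAC"]` give the one-letter codes E, D, and N for glutamate, aspartate, and asparagine.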

Fig. 9. Human fem1b gene's monomer densities and A-T & C-G combined monomer densities.

Fig. 10. Human fem1b gene's nucleotide distribution (A: 542, C: 426, G: 462, T: 451).


Fig. 11. Human fem1b gene codon distribution.

Fig. 12. Amino acid distribution of human fem1b gene (A: 59, R: 35, N: 41, D: 38, C: 17, Q: 23, E: 38, G: 32, H: 27, I: 41, L: 70, K: 29, M: 12, F: 16, P: 15, S: 29, T: 34, W: 3, Y: 23, V: 45).

### **3.3 Open reading frame of Fem1b gene from both human and mouse**

An open reading frame (ORF) is a stretch of nucleotide sequence that contains no stop codon in a given reading frame. ORFs can be identified by examining each of the three possible reading frames on each strand. A protein-coding sequence must begin with a translation start codon, which is usually "ATG". Possible stop codons are "TAA", "TAG", and "TGA" [11]. Identifying the start and stop codons for translation determines the ORF in a given nucleotide sequence. Once an ORF is located for a gene or mRNA, the nucleotide sequence can be translated into its corresponding amino acid sequence. Figures 13 to 15 display the three reading frames of the human and mouse fem1b gene sequences. Both genes show the longest ORF in the first reading frame.
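The ORF search described here can be sketched in a few lines of Python. The sketch assumes the start codon "ATG" and the three stop codons listed above; the function names are hypothetical, and the software that produced Figures 13 to 15 may use different rules, for example for nested start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, start_codon: str = "ATG"):
    """Return (frame, start, end) for each ORF in the three forward frames.

    An ORF runs from the first start codon found to the next in-frame stop
    codon; `end` is the index just past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == start_codon:
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((frame, start, i + 3))
                start = None
    return orfs


def reverse_complement(seq: str) -> str:
    """Reverse complement, used to scan the opposite strand."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf(seq: str):
    """Longest forward-strand ORF as (frame, start, end), or None."""
    return max(find_orfs(seq), key=lambda orf: orf[2] - orf[1], default=None)
```

To cover all six reading frames, call `find_orfs` on the sequence and again on `reverse_complement(sequence)`.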

Dot plots are one of the easiest ways to look for similarity between two sequences. The diagonal line shown in Figure 16 indicates a good alignment between the human and mouse fem1b genes.
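A dot plot can be built by marking every pair of positions at which the two sequences agree. The following Python sketch uses exact matching of short windows; the function names and the default window length are assumptions, and it is not the program used to draw Figure 16.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1):
    """Boolean matrix whose cell [i][j] is True when the windows starting at
    seq_a[i] and seq_b[j] are identical."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix) -> str:
    """Text rendering of the matrix: '*' for a match, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )
```

Two similar sequences produce a run of marks along the main diagonal, and a larger window suppresses isolated chance matches.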


Fig. 13. First ORF of fem1b gene (left – mouse; right – human).


Fig. 14. Second ORF of fem1b gene (left – mouse; right – human).


Fig. 15. Third ORF of fem1b gene (left – mouse; right – human).

Fig. 16. Dot plot comparing the human and mouse amino acid sequences.

### **3.4 Sequence alignment**

### **3.4.1 Global alignment**

The Needleman-Wunsch algorithm, which was first published by Saul Needleman and Christian Wunsch in 1970 [12], performs a global alignment of two amino acid or nucleotide sequences. It was one of the first applications of dynamic programming to molecular sequence comparison. The following output was produced by applying the Needleman-Wunsch algorithm to the mouse and human nucleotide sequences.
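The dynamic programming idea behind global alignment can be sketched as follows. This is a minimal Python version with a linear gap penalty; the match, mismatch, and gap values are illustrative assumptions and are not the scoring parameters used for the output in this chapter.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because every position of both sequences must appear in the result, unmatched ends are paid for with gap penalties, which is what distinguishes a global alignment from a local one.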

### **3.4.2 Local alignment**

The Smith-Waterman algorithm was first published by Temple Smith and Michael Waterman in 1981 [13]. It is a well-known dynamic programming algorithm for local amino acid or nucleotide sequence alignment. Unlike global alignment, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. It is guaranteed to find the optimal local alignment with respect to the scoring method. However, the Smith-Waterman algorithm requires *O(mn)* time, where m and n are the lengths of the two input sequences. In practice, it has largely been replaced by the heuristic BLAST algorithm, which is much more efficient although not guaranteed to find the optimal alignments. The following output is from a local alignment of the mouse and human amino acid sequences using the Smith-Waterman algorithm.
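The local variant differs from global alignment in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than at the corner. The Python sketch below shows this under assumed match, mismatch, and linear gap values; it fills a full matrix, so it uses *O(mn)* time and memory, and it is not the implementation that produced the output in this chapter.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1):
    """Return (score, segment_a, segment_b) for one optimal local alignment."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at zero lets a new local alignment start anywhere.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Only the best-matching segments are returned, so dissimilar flanking regions do not affect the score.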
