Chromosome No: Start of pattern: End of pattern: No. of Patterns:

In summary, three sets of Perl scripts comprise *SNPpattern*: 1) grouping data scripts – to create separate data files for further downstream comparison analysis; 2) SNP allele block scripts – to find, count, and compare the SNP allele block patterns between any group of individuals; and 3) similarity scripts – to score the similarity between individuals based on an individual's entire SNP allele pattern. Table 8 encapsulates the primary function and


Fig. 4. Method for scoring SNPs. The figure shows scoring of the paternal chromosome (row 1) for only 2 of the 4 individuals.

ethnic groups (EG) called A, B and C. We then compare the SNP allele patterns along the full length of the genome for each animal within each EG. From the comparisons we then determine that EG A and EG B share the most similar SNP allele patterns. We therefore propose that EG A and EG B are more likely to carry the same causal variant allele by descent than EGs A and C, or B and C. The comparisons are based on similarity matrices in which a score is incremented by 1 each time a SNP allele of one individual matches the allele at the same SNP in another individual (Figure 4). Each individual is processed in turn. In the example, only the paternal chromosome (row 1 in this case) is scored. The user can choose to score row 1, row 2, or both row 1 *and* row 2 of the phased genotypes contained in the group input file(s).


Table 7. Example similarity matrix. A simple matrix constructed from the made-up data in Figure 4. Here we can see that individuals from ethnic groups A and D share the most similarity, whereas individuals from ethnic groups A and B, and from ethnic groups B and C, share the least. An individual's overall similarity to all other individuals in the group can be ranked according to its total similarity score. In this example, individuals in A are considered the most similar and individuals in B the least. In practice the similarity matrix is constructed from thousands of SNP allele pattern comparisons for hundreds of individuals.

Individual SNP allele pattern (ID A): 1 1 2 1 2

| ID | SNP allele pattern to compare | Score |
|---|---|---|
| B | 1 1 1 2 1 | 2 |
| C | 2 1 1 1 2 | 3 |
| D | 1 1 2 1 2 | 5 |

Individual SNP allele pattern (ID B): 1 1 1 2 1

| ID | SNP allele pattern to compare | Score |
|---|---|---|
| A | 1 1 2 1 2 | 2 |
| C | 2 1 1 1 2 | 2 |
| D | 1 1 2 1 2 | 2 |

etc.


| **ID** | **A** | **B** | **C** | **D** | **Total** |
|---|---|---|---|---|---|
| **A** | 5 | 2 | 3 | 5 | 16 |
| **B** | 2 | 5 | 2 | 2 | 11 |
| **C** | 3 | 2 | 5 | 3 | 14 |
| **D** | 5 | 2 | 3 | 5 | 15 |

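The scoring rule described above (add 1 each time two individuals carry the same allele at the same SNP) can be illustrated with a minimal Python sketch. This is not the *SNPpattern* Perl code: the function names `similarity_score` and `similarity_matrix` are hypothetical, and the sketch assumes that each individual is represented by a single row of phased alleles of equal length, using the made-up row 1 patterns from Figure 4.

```python
def similarity_score(pattern_a, pattern_b):
    """Count the SNP positions at which two allele patterns agree."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must cover the same number of SNPs")
    return sum(1 for a, b in zip(pattern_a, pattern_b) if a == b)


def similarity_matrix(patterns):
    """Return {id: {other_id: score}} for every pair of individuals."""
    return {
        i: {j: similarity_score(p, q) for j, q in patterns.items()}
        for i, p in patterns.items()
    }


# Row 1 (paternal chromosome) patterns from the made-up Figure 4 example.
patterns = {
    "A": [1, 1, 2, 1, 2],
    "B": [1, 1, 1, 2, 1],
    "C": [2, 1, 1, 1, 2],
    "D": [1, 1, 2, 1, 2],
}

matrix = similarity_matrix(patterns)
for individual, row in matrix.items():
    # One matrix row per individual, columns in the order A, B, C, D.
    print(individual, [row[other] for other in patterns])
```

Each individual is compared with every other individual in turn, which mirrors the "each individual is processed in turn" step; the diagonal simply holds the number of SNPs in the pattern.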

#### **3.4 Linking SNP allele block regions to genomic annotation**

One of the output files from a *SNPpattern* Perl script contains all SNP allele blocks in which the number of distinct SNP allele patterns is low or high. The script accepts a user-definable upper or lower pattern frequency threshold. For example, if a user enters a threshold of "<3", only SNP allele blocks with a distinct SNP allele pattern frequency of less than 3 are output. Likewise, if the user enters ">99", only SNP allele blocks with a distinct SNP allele pattern frequency of greater than 99 are output. Figure 4-4 shows an example of the output file. The output consists of a list with 4 columns: the number of the chromosome containing the SNP allele block (the genomic region of interest); the start and end genomic locations of the SNP allele block; and the number of distinct SNP allele patterns found within the SNP allele block for a group of individuals (only genomic regions where the number of patterns is below or above the user-defined threshold are listed), along with the average number of patterns per block.

The output file is intended as a starting point for a researcher to find biological meaning in regions identified as having low or high haplotype diversity. Biological meaning may help explain why the same alleles are conserved from generation to generation in some regions and not in others. In other words, why are there only 1 or 2 distinct SNP allele patterns in the same genomic region for all individuals in a group? Conversely, some regions have a large number of different SNP allele patterns, implying a recombination hotspot. Finding the underlying biology within the hotspot region may provide clues to the mechanism of recombination. The expectation is that the output list can be used for further downstream analysis, such as searching for annotation of the chromosome region within which the SNP allele block is located.
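The threshold filter described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the *SNPpattern* implementation: the helper names `parse_threshold` and `filter_blocks` are hypothetical, the block records are invented, and each block is assumed to be a tuple of chromosome number, start, end, and number of distinct patterns.

```python
import re


def parse_threshold(text):
    """Parse a threshold such as '<3' or '>99' into (operator, value)."""
    match = re.fullmatch(r"\s*([<>])\s*(\d+)\s*", text)
    if match is None:
        raise ValueError(f"expected '<N' or '>N', got {text!r}")
    return match.group(1), int(match.group(2))


def filter_blocks(blocks, threshold):
    """Keep blocks whose distinct-pattern count is below or above the threshold."""
    operator, value = parse_threshold(threshold)
    if operator == "<":
        return [block for block in blocks if block[3] < value]
    return [block for block in blocks if block[3] > value]


# Hypothetical blocks: (chromosome, start, end, number of distinct patterns).
blocks = [
    (1, 10_000, 25_000, 2),
    (1, 25_001, 40_000, 57),
    (2, 5_000, 18_000, 120),
    (3, 1_000, 9_000, 1),
]

low_diversity = filter_blocks(blocks, "<3")    # candidate conserved regions
high_diversity = filter_blocks(blocks, ">99")  # candidate recombination hotspots

for chromosome, start, end, n_patterns in low_diversity + high_diversity:
    print(chromosome, start, end, n_patterns)
```

The resulting rows have the same four columns as the output list, so they could be passed directly to a downstream annotation lookup keyed on chromosome and genomic interval.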

```