Exercise 003
# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/scripts/utils.py
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/all_sequences.fasta
Determination of sequence lengths
Load the FASTA file all_sequences.fasta using your own FASTA reader or biopython from the last exercise. Save all sequences in a list called sequences for the following exercises. Then determine the length of each sequence and store the results in a new list called sequence_lengths. Finally, print the lowest and highest values in this list.
# Function definitions for this exercise, to be re-used
# within the Jupyter notebook
def read_fasta(path: str) -> list[tuple[str, str]]:
    """Reads a FASTA file and returns a list of (header, sequence) tuples.

    Assumes that each sequence is stored on a single line.

    Args:
        path (str): Path to the FASTA file.

    Returns:
        list[tuple[str, str]]: List of tuples containing header and sequence.
    """
    records = []
    # The context manager closes the file once all lines are read
    with open(path, "r") as file:
        for line in file:
            line = line.strip()  # Remove the trailing newline
            if not line:
                continue  # Skip blank lines
            if line.startswith(">"):
                header = line.lstrip(">")
            else:
                records.append((header, line))
    return records


sequences = read_fasta("./all_sequences.fasta")
# "_" is a conventional name for an unpacked value that is not used (here: the header)
sequence_lengths = [len(seq) for _, seq in sequences]
# TODO: Explain underscores in tuple unpacking
print(f"Maximum sequence length: {max(sequence_lengths)}")
print(f"Minimum sequence length: {min(sequence_lengths)}")
Maximum sequence length: 2391
Minimum sequence length: 129
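The reader above treats every non-header line as a complete sequence, which fits files that store each sequence on a single line. Many FASTA files wrap sequences across several lines instead. The following is a minimal sketch of a reader that joins wrapped lines; the function name read_fasta_multiline and the demo file content are assumptions made for illustration and are not part of the exercise material.

```python
import os
import tempfile


def read_fasta_multiline(path: str) -> list[tuple[str, str]]:
    """Hypothetical variant of read_fasta that joins wrapped sequence lines."""
    records = []
    header = None
    chunks = []
    with open(path, "r") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # Ignore blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there is one
                if header is not None:
                    records.append((header, "".join(chunks)))
                header = line[1:]
                chunks = []
            else:
                chunks.append(line)
    # Do not forget the last record in the file
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Demo with a small, made-up FASTA file whose first sequence is wrapped
demo_text = ">seq1\nATGC\nGGTA\n>seq2\nTTAA\n"
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(demo_text)

demo_records = read_fasta_multiline(tmp.name)
os.remove(tmp.name)
print(demo_records)  # [('seq1', 'ATGCGGTA'), ('seq2', 'TTAA')]
```

For files with one line per sequence, such a reader should return the same records as the simpler version, so the lengths computed above would not change.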
Calculation of the GC-Content
Now calculate the GC content of each sequence by counting G and C and dividing the sum of both counts by the sequence length. Store the result for each sequence in a list called gc_content and print the highest and lowest values of that list.
Tips
- Each string has built-in methods that can, for instance, count specific characters. Make use of them!
def calculate_gc(sequence: str) -> float:
    """Calculates the GC content of a sequence.

    Args:
        sequence (str): A nucleotide sequence.

    Returns:
        float: The GC content of this sequence as a fraction between 0 and 1.
    """
    # Make sure everything is upper-case
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


gc_content = [calculate_gc(seq) for _, seq in sequences]
print(f"Maximum GC content: {max(gc_content)*100:.2f}%")  # :.2f formats the value with two decimal places
print(f"Minimum GC content: {min(gc_content)*100:.2f}%")
Maximum GC content: 76.39%
Minimum GC content: 22.83%
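As a sanity check, the formula can be applied to a short sequence by hand: ATGCGC contains two G and two C among six bases, so the GC content is 4 / 6 ≈ 66.67%. The sketch below repeats this calculation and adds a guard for empty sequences, for which the division would otherwise fail. The name gc_fraction is hypothetical and only used here.

```python
def gc_fraction(sequence: str) -> float:
    """Hypothetical helper: GC content as a fraction, 0.0 for empty input."""
    if not sequence:
        return 0.0  # Avoid dividing by zero
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


example = "ATGCGC"
print(f"GC content of {example}: {gc_fraction(example) * 100:.2f}%")  # 66.67%
```

Whether an empty sequence should yield 0.0 or raise an error is a design choice; returning 0.0 keeps list comprehensions over many sequences from stopping at a single empty record.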
Translation of a DNA sequence
Translate each sequence into its respective amino acid sequence. To do so, start by splitting the sequence into triplets. Then construct a new sequence by assigning the corresponding amino acid to each triplet. Store the results in a new list called proteins and print a single sequence of your choice.
There is a helper function to_triplets that takes care of splitting a gene into its triplets. To convert the triplets, use the dictionary CODON_TABLE, which maps each triplet to its amino acid.
Tips
- Initially you downloaded the utils.py file, which contains both the function and the codon table. Inspect the content of the file to read the docstring of the to_triplets function and the structure of CODON_TABLE.
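To illustrate the idea without relying on utils.py, the sketch below uses stand-ins: split_into_triplets and MINI_CODON_TABLE are hypothetical names, and the table holds only four codons of the standard genetic code. It assumes that stop codons are written as "_"; the actual to_triplets and CODON_TABLE in utils.py may be implemented differently.

```python
def split_into_triplets(sequence: str) -> list[str]:
    """Hypothetical stand-in: splits a sequence into consecutive triplets."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


# Tiny excerpt of the standard genetic code (assumption: "_" marks a stop codon)
MINI_CODON_TABLE = {"ATG": "M", "GCT": "A", "AAA": "K", "TGA": "_"}

gene = "ATGGCTAAATGA"
triplets = split_into_triplets(gene)
# Look up each triplet and join the one-letter codes into a protein string
protein = "".join(MINI_CODON_TABLE[triplet] for triplet in triplets)

print(triplets)  # ['ATG', 'GCT', 'AAA', 'TGA']
print(protein)   # MAK_
```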
from utils import to_triplets, CODON_TABLE


def to_amino_acid(sequence: str) -> str:
    """Translates a nucleotide sequence into an amino acid sequence.

    Args:
        sequence (str): A nucleotide sequence.

    Returns:
        str: The translated amino acid sequence.
    """
    BASES = {"A", "G", "C", "T"}
    # Only characters outside the alphabet are an error; a valid
    # sequence does not have to contain all four bases
    unknown_bases = set(sequence) - BASES
    if unknown_bases:
        raise ValueError(f"This sequence contains unknown bases! {unknown_bases}")
    if len(sequence) % 3 != 0:
        raise ValueError("The sequence length should be divisible by 3!")
    amino_seq = []
    triplets = to_triplets(sequence)
    for triplet in triplets:
        # Look up the amino acid that this triplet codes for
        amino_acid = CODON_TABLE[triplet]
        amino_seq.append(amino_acid)
    return "".join(amino_seq)


proteins = [to_amino_acid(seq) for _, seq in sequences]
proteins[-1]
'MNEVVQKEVLKLWQTGVTYPISDSSLVSPIQVVPKKGGITVVSNEKNELIPTRIVTGWRMCIDYRKLNEATRKDHFPLPFIDQMLERLAEHEYYCFLDGYSGYNQIVVDSKDQEITSFTSFKYLLTKKEYKPKLIRWVLLLQEFNIEIKDKNGAENKVADHLSRIPHEEGGAHQFKVNERFSDEQLMMIQESLWFADIANFKAIREFPTNINKHMRRKLLNEAKHYIWNEPYLFKKGVDGILRRCISQEKGQKVLWQCHRFAYGGHFSGERIVAKVLQCGFYWPTIFKDAKELVSRCNECQRASNLSKKNEMPQQFILELELFDVWGD_'
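When validating the alphabet of a sequence, it helps to keep in mind how set operations behave. The sketch below, with the hypothetical helper find_unknown_bases, uses a set difference to report only the characters that are not valid bases. A plain equality comparison between set(sequence) and the alphabet would also flag valid sequences that happen to lack one of the four bases.

```python
BASES = {"A", "G", "C", "T"}


def find_unknown_bases(sequence: str) -> set[str]:
    """Hypothetical helper: returns all characters that are not valid bases."""
    return set(sequence) - BASES


print(find_unknown_bases("ATGAAA"))  # Empty set: valid, although C is missing
print(find_unknown_bases("ATGNNX"))  # Contains N and X
print(set("ATGAAA") != BASES)        # True: equality is too strict here
```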