Exercise 003

In [ ]:

# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/scripts/utils.py
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/all_sequences.fasta

Determination of sequence lengths

Load the FASTA file all_sequences.fasta using your own FASTA reader or Biopython from the last exercise. Store all sequences in a list called sequences for the following exercises. Then determine the length of each sequence and store the results in a new list called sequence_lengths. Finally, print the lowest and highest values in this list.

In [2]:

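# Function definitions for this exercise to be re-used
# within the Jupyter notebook

def read_fasta(path: str) -> list[tuple[str, str]]:
    """Reads a FASTA file and returns a list of (header, sequence) tuples.

    Args:
        path (str): Path to the FASTA file.

    Returns:
        list[tuple[str, str]]: List of tuples containing header and sequence.
    """

    records = []
    header = None
    chunks = []

    # The context manager closes the file once reading is done
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # Skip empty lines
            if line.startswith(">"):
                # A new header completes the previous record
                if header is not None:
                    records.append((header, "".join(chunks)))
                header = line[1:]
                chunks = []
            else:
                # A sequence may span several lines
                chunks.append(line)

    # Store the last record of the file
    if header is not None:
        records.append((header, "".join(chunks)))

    return records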

In [3]:

sequences = read_fasta("./all_sequences.fasta")

# Each record is a (header, sequence) tuple. Unpacking it as "_, seq" keeps
# only the sequence; "_" is the conventional name for a value that is not used.
sequence_lengths = [len(seq) for _, seq in sequences]

print(f"Maximum sequence length: {max(sequence_lengths)}")
print(f"Minimum sequence length: {min(sequence_lengths)}")

Maximum sequence length: 2391
Minimum sequence length: 129
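
The `for _, seq in sequences` pattern used above relies on tuple unpacking: each record is a `(header, sequence)` tuple, and naming the first element `_` signals by convention that it is not needed. A minimal sketch with made-up records (the headers and sequences below are invented for illustration and are not taken from all_sequences.fasta):

```python
# Hypothetical records in the same (header, sequence) shape that read_fasta returns
example_records = [
    ("seq1 example", "ATGC"),
    ("seq2 example", "ATGCGC"),
]

# Unpack each tuple into two names; "_" marks the header as intentionally unused
example_lengths = [len(seq) for _, seq in example_records]

# Equivalent version without unpacking, using an index instead
same_lengths = [len(record[1]) for record in example_records]

print(example_lengths)  # [4, 6]
print(min(example_lengths), max(example_lengths))  # 4 6
```

Both list comprehensions give the same result; the unpacked form is usually easier to read because the name `seq` says what is being measured.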

Calculation of the GC content

Now calculate the GC content of each sequence: count the occurrences of G and C and divide their sum by the sequence length. Store the results in a list called gc_content and print the highest and lowest values in that list.

Tips

Every string provides built-in methods, for instance for counting specific characters. Make use of them!

In [4]:

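def calculate_gc(sequence: str) -> float:
    """Calculates the GC content of a sequence.

    Args:
        sequence (str): A nucleotide sequence.

    Returns:
        float: The GC content of this sequence as a fraction between 0 and 1.

    Raises:
        ValueError: If the sequence is empty.
    """

    # An empty sequence would otherwise cause a division by zero
    if not sequence:
        raise ValueError("Cannot calculate the GC content of an empty sequence!")

    # Make sure everything is upper-case
    sequence = sequence.upper()

    return (sequence.count("G") + sequence.count("C")) / len(sequence)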

In [11]:

gc_content = [calculate_gc(seq) for _, seq in sequences]

# ":.2f" formats the value with two decimal places
print(f"Maximum GC content: {max(gc_content)*100:.2f}%")
print(f"Minimum GC content: {min(gc_content)*100:.2f}%")

Maximum GC content: 76.39%
Minimum GC content: 22.83%
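
As a small worked example of the calculation, the sketch below counts G and C in a short, made-up sequence with `str.count` and formats the resulting fraction as a percentage. The sequence and the variable names are illustrative assumptions and not part of the exercise data.

```python
# Made-up sequence for illustration: 10 bases, of which 3 are G and 3 are C
example_sequence = "ATGCGCATGC"

g_count = example_sequence.count("G")
c_count = example_sequence.count("C")

# GC content as a fraction of the sequence length: (3 + 3) / 10
gc_fraction = (g_count + c_count) / len(example_sequence)

# ":.2f" formats a float with exactly two decimal places
print(f"GC content: {gc_fraction * 100:.2f}%")  # GC content: 60.00%
```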

Translation of a DNA sequence

Translate each sequence into its amino acid sequence. To do so, start by splitting each sequence into triplets. Then use these triplets to construct a new sequence by assigning the corresponding amino acid to each triplet. Store the results in a new list called proteins and print a single sequence of your choice.

The helper function to_triplets takes care of splitting a gene into its triplets. To convert the triplets, use the dictionary CODON_TABLE, which maps each triplet to its amino acid.

Tips

The utils.py file you downloaded at the beginning contains both the function and the codon table. Inspect the file to read the docstring of the to_triplets function and to see the structure of CODON_TABLE.
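
The idea behind both helpers can be sketched without utils.py. The function `split_into_triplets` and the dictionary `MINI_CODON_TABLE` below are hypothetical stand-ins, not the contents of utils.py; the real `to_triplets` and `CODON_TABLE` may differ in signature and structure. The table covers only four codons of the standard genetic code, and using `_` as the symbol for the stop codon is an assumption.

```python
def split_into_triplets(sequence: str) -> list[str]:
    """Splits a sequence into consecutive, non-overlapping triplets (hypothetical helper)."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


# Tiny excerpt of the standard genetic code; "_" for stop is an assumption
MINI_CODON_TABLE = {
    "ATG": "M",  # Methionine (start)
    "GCC": "A",  # Alanine
    "AAA": "K",  # Lysine
    "TGA": "_",  # Stop codon
}

# Made-up gene with four codons
example_gene = "ATGGCCAAATGA"

triplets = split_into_triplets(example_gene)
protein = "".join(MINI_CODON_TABLE[triplet] for triplet in triplets)

print(triplets)  # ['ATG', 'GCC', 'AAA', 'TGA']
print(protein)   # MAK_
```

A dictionary lookup raises a `KeyError` for a triplet that is not in the table, which is one reason to validate the input sequence before translating it.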

In [6]:

from utils import to_triplets, CODON_TABLE

In [9]:

def to_amino_acid(sequence: str) -> str:
    """Translates a nucleotide sequence into an amino acid sequence.

    Args:
        sequence (str): A nucleotide sequence.

    Returns:
        str: The translated amino acid sequence.

    Raises:
        ValueError: If the sequence contains unknown bases or its
            length is not divisible by 3.
    """

    BASES = {"A", "G", "C", "T"}

    # Only bases other than A, G, C and T are an error; a valid
    # sequence does not need to contain all four bases
    unknown_bases = set(sequence) - BASES

    if unknown_bases:
        raise ValueError(f"This sequence contains unknown bases! {unknown_bases}")

    if len(sequence) % 3 != 0:
        raise ValueError("The sequence length should be divisible by 3!")

    amino_seq = []
    triplets = to_triplets(sequence)

    for triplet in triplets:
        amino_acid = CODON_TABLE[triplet]
        amino_seq.append(amino_acid)

    return "".join(amino_seq)

In [12]:

proteins = [to_amino_acid(seq) for _, seq in sequences]
proteins[-1]

Out[12]:

'MNEVVQKEVLKLWQTGVTYPISDSSLVSPIQVVPKKGGITVVSNEKNELIPTRIVTGWRMCIDYRKLNEATRKDHFPLPFIDQMLERLAEHEYYCFLDGYSGYNQIVVDSKDQEITSFTSFKYLLTKKEYKPKLIRWVLLLQEFNIEIKDKNGAENKVADHLSRIPHEEGGAHQFKVNERFSDEQLMMIQESLWFADIANFKAIREFPTNINKHMRRKLLNEAKHYIWNEPYLFKKGVDGILRRCISQEKGQKVLWQCHRFAYGGHFSGERIVAKVLQCGFYWPTIFKDAKELVSRCNECQRASNLSKKNEMPQQFILELELFDVWGD_'