Exercise 006¶

In [1]:

# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/all_sequences.fasta

--2024-06-25 14:03:01--  https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/all_sequences.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2940315 (2.8M) [text/plain]
Saving to: ‘all_sequences.fasta’

all_sequences.fasta 100%[===================>]   2.80M  --.-KB/s    in 0.07s   

2024-06-25 14:03:01 (37.4 MB/s) - ‘all_sequences.fasta’ saved [2940315/2940315]

DNASequence class¶

Read the FASTA file all_sequences.fasta and store the header information and the sequence of each entry in a suitable class. Make sure that the following attributes are present once the object has been initialized:

id
organism
sequence
gc_content
length

Tips

Your __init__-method arguments do not have to contain all expected attributes if you can derive them from another attribute. The __init__-method is a function and you can execute any code you want upon initialization. Make sure to assign your calculation to the appropriate attribute via self.xyz.

Dataclasses are a convenient way to create classes that simply hold data. They simplify the process because an __init__-method is generated automatically. Keep in mind that the generated __init__-method only assigns the fields, so any additional calculation you would otherwise have put into a custom __init__-method has to go elsewhere, for example into a __post_init__-method.

In [2]:

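class DNASequence:
    """Holds a DNA sequence together with its header info and derived values"""

    def __init__(
        self, # Refers to the object to create
        sequence: str,
        organism: str,
        id: str
    ):
        # "External" attributes
        self.sequence = sequence.upper()
        self.organism = organism
        self.id = id

        # Calculated attributes
        self.length = len(self.sequence)

        # Fraction of G and C bases; 0.0 for an empty sequence to avoid dividing by zero
        gc_count = self.sequence.count("G") + self.sequence.count("C")
        self.gc_content = gc_count / self.length if self.length else 0.0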

In [9]:

# Using dataclasses
from dataclasses import dataclass

@dataclass
class DNASequenceDC:
    sequence: str
    organism: str
    id: str
    length: int
    gc_content: float

instance = DNASequenceDC(
    sequence="ATG",
    organism="ecoli",
    id="someID",
    length=3,
    gc_content=0.5,
)

print(instance)

DNASequenceDC(sequence='ATG', organism='ecoli', id='someID', length=3, gc_content=0.5)
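In the dataclass above, length and gc_content have to be passed in by hand, so nothing keeps them consistent with the sequence (the example passes gc_content=0.5 for "ATG", although only one of its three bases is G or C). A minimal sketch of one way around this, assuming the derived values should be computed automatically: exclude them from the generated __init__-method with field(init=False) and fill them in __post_init__. The class name DNASequenceDerived is hypothetical and not part of the exercise.

```python
from dataclasses import dataclass, field


@dataclass
class DNASequenceDerived:
    """Hypothetical variant whose derived attributes are computed automatically."""

    sequence: str
    organism: str
    id: str
    # init=False keeps these fields out of the generated __init__
    length: int = field(init=False)
    gc_content: float = field(init=False)

    def __post_init__(self):
        # Runs right after the generated __init__ has assigned the fields
        self.sequence = self.sequence.upper()
        self.length = len(self.sequence)
        gc_count = self.sequence.count("G") + self.sequence.count("C")
        # Guard against empty sequences to avoid a division by zero
        self.gc_content = gc_count / self.length if self.length else 0.0


instance = DNASequenceDerived(sequence="atg", organism="ecoli", id="someID")
print(instance)
```

The object is still created with three arguments only, and the printed representation includes the derived fields as well.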

In [3]:

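def read_fasta(path: str) -> list[DNASequence]:
    """Reads a FASTA file and parses all entries into DNASequence objects

    Assumes that every entry spans exactly two lines: a header of the
    form ">organism|id" followed by the complete sequence.

    Args:
        path: Path to the FASTA file to parse

    Returns:
        list[DNASequence]: Parsed sequences wrapped in DNASequence objects
    """

    sequences = []

    # The context manager closes the file once all lines have been read
    with open(path) as handle:
        data = handle.readlines()

    for i, line in enumerate(data):
        if i % 2 == 0:
            # Even lines hold the header: grab organism and ID, then continue
            organism, id = line.lstrip(">").split("|")
            continue

        # Odd lines hold the sequence that belongs to the previous header
        obj = DNASequence(
            sequence=line.strip(),
            organism=organism.strip(),
            id=id.strip(),
        )

        sequences.append(obj)

    return sequences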
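The read_fasta function above relies on headers and sequences strictly alternating line by line. FASTA files from other sources often wrap a sequence across several lines or contain blank lines. The following is a minimal sketch of a more tolerant reader for that case; the helper name iter_fasta_records and the example lines are hypothetical and not part of the exercise data.

```python
from typing import Iterable, Iterator


def iter_fasta_records(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yields (header, sequence) pairs; a sequence may span several lines."""
    header = None
    chunks: list[str] = []

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are collected until the next header appears
            chunks.append(line)

    # Emit the last record once the input is exhausted
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical input with a wrapped sequence and a blank line
example = [">ecoli|id1\n", "ATGC\n", "GGTA\n", "\n", ">yeast|id2\n", "TTAA\n"]
records = list(iter_fasta_records(example))
print(records)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, and each header can then be split at "|" in the same way as in read_fasta.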

Magic Methods - Alignment by `==`¶

This is an optional exercise

Can you extend the class so that the == comparison operator returns the identity between two sequences (each stored as an attribute of its object)? Apply your implementation to two sequences of your choice and use the supplied get_identity function.

Learn more about Magic methods

In [ ]:

# Execute this cell to install and import the required packages
!pip install biopython

from Bio import pairwise2

def get_identity(seq1: str, seq2: str) -> float:
    """Aligns two sequences using BioPython

    Args:
        seq1 (str): Query sequence to align to
        seq2 (str): Target sequence to align with

    Returns:
        float: Alignment score (number of matching positions) divided
            by the length of seq1, a value between 0 and 1

    """
    # globalxx scores every match with 1 and applies no mismatch or gap penalties
    return pairwise2.align.globalxx(seq1, seq2, score_only=True) / len(seq1)
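To get a feeling for what get_identity returns, it can help to reproduce the number without BioPython. With a match score of 1 and no penalties for mismatches or gaps, the best global alignment score should equal the length of the longest common subsequence of both sequences. The sketch below rests on that assumption; the names lcs_length and identity_sketch are hypothetical, and the code is meant for short sequences only, since it compares every position of one sequence with every position of the other.

```python
def lcs_length(seq1: str, seq2: str) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    # previous[j] holds the result for the processed prefix of seq1 and seq2[:j]
    previous = [0] * (len(seq2) + 1)

    for a in seq1:
        current = [0]
        for j, b in enumerate(seq2, start=1):
            if a == b:
                # Extend the best subsequence that ends before both characters
                current.append(previous[j - 1] + 1)
            else:
                # Skip a character in either seq1 or seq2
                current.append(max(previous[j], current[j - 1]))
        previous = current

    return previous[-1]


def identity_sketch(seq1: str, seq2: str) -> float:
    """Matches in the best alignment divided by the length of seq1."""
    return lcs_length(seq1, seq2) / len(seq1)


print(identity_sketch("ACGT", "AGT"))
```

Note that the result depends on the argument order, because only the length of the first sequence is used as the denominator.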

In [ ]:

class DNASequence:

    def __init__(
        self, # Refers to the object to create
        sequence: str,
        organism: str,
        id: str
    ):
        # "External" attributes
        self.sequence = sequence.upper()
        self.organism = organism
        self.id = id

        # Calculated attributes
        self.length = len(self.sequence)
        gc_count = self.sequence.count("G") + self.sequence.count("C")
        self.gc_content = gc_count / self.length if self.length else 0.0

    def __eq__(self, other):
        """Overrides the == operator and runs this method instead.

        We first check that the other object we want to align is of
        the same type. If so, we will use the 'get_identity' function
        to receive the identity of both sequences as a fraction
        between 0 and 1.
        """

        # Raise explicitly instead of using assert, which is skipped under "python -O"
        if not isinstance(other, DNASequence):
            raise TypeError(
                f"Only types of 'DNASequence' can be used for comparison. Got {type(other)}, which is invalid."
            )

        return get_identity(self.sequence, other.sequence)

In [ ]:

dna_sequences = read_fasta("./all_sequences.fasta")

# Let's align two sequences and retrieve their identity
dna_sequences[10] == dna_sequences[100]

Out[ ]:

0.780952380952381
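Returning a float from __eq__ works for this exercise, but it has side effects that are worth knowing. The sketch below uses a hypothetical stand-in class Seq with a toy identity (fraction of equal positions, no alignment) so that it runs without BioPython. It illustrates two points: a class that defines __eq__ without __hash__ is no longer hashable, and membership tests with "in" only look at the truthiness of the returned value.

```python
class Seq:
    """Hypothetical stand-in for DNASequence with a float-returning __eq__."""

    def __init__(self, sequence: str):
        self.sequence = sequence.upper()

    def __eq__(self, other):
        if not isinstance(other, Seq):
            return NotImplemented
        # Toy identity: fraction of equal positions, no alignment involved
        matches = sum(a == b for a, b in zip(self.sequence, other.sequence))
        return matches / max(len(self.sequence), len(other.sequence))


a, b = Seq("ATGC"), Seq("ATGG")
identity = a == b  # A float rather than a bool

# Defining __eq__ without __hash__ sets __hash__ to None
try:
    hash(a)
    hashable = True
except TypeError:
    hashable = False

# Membership tests call __eq__ and only check whether the result is truthy
found = Seq("TTTT") in [a]

print(identity, hashable, found)
```

Any non-zero identity therefore counts as "equal" in contexts that expect a boolean, and objects of such a class cannot be placed in sets or used as dictionary keys unless __hash__ is defined as well.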
