Exercise 006
In [1]:
# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/all_sequences.fasta
--2024-06-25 14:03:01-- https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/all_sequences.fasta Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 2940315 (2.8M) [text/plain] Saving to: ‘all_sequences.fasta’ all_sequences.fasta 100%[===================>] 2.80M --.-KB/s in 0.07s 2024-06-25 14:03:01 (37.4 MB/s) - ‘all_sequences.fasta’ saved [2940315/2940315]
DNASequence class
Read the FASTA file all_sequences.fasta and store the header info and sequence in a suitable class. Make sure that the following attributes are present once the object is initialized:
id
organism
sequence
gc_content
length
Tips
- Your __init__ method arguments do not have to contain all expected attributes if you can derive them from another attribute. The __init__ method is a function, and you can execute any code you want upon initialization. Make sure to assign your calculation to the appropriate attribute via self.xyz.
- Dataclasses are a convenient way to create classes that simply hold data. You can use them to simplify the process, because they generate an __init__ method automatically. Keep in mind, however, that this excludes additional calculations you would otherwise have put into your custom __init__ method.
In [2]:
class DNASequence:
    def __init__(
        self,  # Refers to the object to create
        sequence: str,
        organism: str,
        id: str,
    ):
        # "External" attributes
        self.sequence = sequence.upper()
        self.organism = organism
        self.id = id

        # Calculated attributes
        self.length = len(self.sequence)
        self.gc_content = (self.sequence.count("G") + self.sequence.count("C")) / self.length
In [9]:
# Using dataclasses
from dataclasses import dataclass


@dataclass
class DNASequenceDC:
    sequence: str
    organism: str
    id: str
    length: int
    gc_content: float


instance = DNASequenceDC(
    sequence="ATG",
    organism="ecoli",
    id="someID",
    length=3,
    gc_content=0.5,
)

print(instance)
DNASequenceDC(sequence='ATG', organism='ecoli', id='someID', length=3, gc_content=0.5)
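The dataclass above needs length and gc_content to be passed in by hand. One possible way to keep the generated __init__ and still derive these values, which is not part of the original solution, is the standard dataclass hook __post_init__ together with field(init=False). The following is a minimal sketch; the class name DNASequenceDerived and the zero-length guard are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DNASequenceDerived:  # hypothetical name, not from the exercise
    sequence: str
    organism: str
    id: str
    # init=False keeps these two out of the generated __init__ signature
    length: int = field(init=False)
    gc_content: float = field(init=False)

    def __post_init__(self):
        # Called by the generated __init__ after the fields above are assigned
        self.sequence = self.sequence.upper()
        self.length = len(self.sequence)
        gc_count = self.sequence.count("G") + self.sequence.count("C")
        # Assumption: report 0.0 for an empty sequence instead of dividing by zero
        self.gc_content = gc_count / self.length if self.length else 0.0


instance = DNASequenceDerived(sequence="atgc", organism="ecoli", id="someID")
print(instance)
```

With this variant the caller only supplies sequence, organism and id, so the derived values cannot drift out of sync with the sequence.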
In [3]:
def read_fasta(path: str) -> list[DNASequence]:
    """Reads a FASTA file and parses all entries into DNASequence objects

    Args:
        path: Path to the FASTA file to parse

    Returns:
        list[DNASequence]: Parsed sequences wrapped in DNASequence objects
    """

    sequences = []

    # The context manager closes the file once all lines are read
    with open(path) as file:
        data = file.readlines()

    for i, line in enumerate(data):
        if not i % 2:
            # Even lines hold the header: grab it and continue
            organism, id = line.lstrip(">").split("|")
            continue

        # Odd lines hold the sequence that belongs to the preceding header
        obj = DNASequence(
            sequence=line.strip(),
            organism=organism.strip(),
            id=id.strip(),
        )

        sequences.append(obj)

    return sequences
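The read_fasta function relies on every record taking exactly two lines, a header followed by one sequence line. FASTA files often wrap a sequence over several lines, so a more tolerant parser may be useful for other data sets. The sketch below is one possible approach and not the course solution: the function name iter_fasta_records is hypothetical, and the header layout ">organism|id" is assumed from the parsing code above.

```python
from typing import Iterable, Iterator


def iter_fasta_records(lines: Iterable[str]) -> Iterator[tuple[str, str, str]]:
    """Yields (organism, id, sequence) for each record, joining wrapped lines."""
    header = None
    chunks: list[str] = []

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if header is not None:
                yield (*header, "".join(chunks))
            # Assumed header layout: ">organism|id"
            organism, seq_id = line[1:].split("|", maxsplit=1)
            header = (organism.strip(), seq_id.strip())
            chunks = []
        else:
            chunks.append(line)

    # Emit the last record once the input is exhausted
    if header is not None:
        yield (*header, "".join(chunks))


# Made-up example data, not taken from all_sequences.fasta
example = [
    ">E. coli | seq1\n",
    "ATGC\n",
    "GGTA\n",
    "\n",
    ">B. subtilis | seq2\n",
    "TTAA\n",
]
records = list(iter_fasta_records(example))
print(records)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, and each yielded tuple can be unpacked into the DNASequence constructor.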
Magic Methods - Alignment by ==
This is an optional exercise
Can you extend the class so that the == comparison operator returns the identity between the two sequences (stored as an attribute)? Apply your implementation to two sequences of your choice and use the supplied get_identity function.
Learn more about Magic methods
In [ ]:
# Execute this cell to use all packages
!pip install biopython

from Bio import pairwise2


def get_identity(seq1: str, seq2: str) -> float:
    """Aligns two sequences using BioPython

    Args:
        seq1 (str): Query sequence to align to
        seq2 (str): Target sequence to align with

    Returns:
        float: Identity of the resulting alignment
    """

    return pairwise2.align.globalxx(seq1, seq2, score_only=True) / len(seq1)
In [ ]:
class DNASequence:
    def __init__(
        self,  # Refers to the object to create
        sequence: str,
        organism: str,
        id: str,
    ):
        # "External" attributes
        self.sequence = sequence.upper()
        self.organism = organism
        self.id = id

        # Calculated attributes
        self.length = len(sequence)
        self.gc_content = (self.sequence.count("G") + self.sequence.count("C")) / self.length

    def __eq__(self, other):
        """Overrides the == operator and runs this method instead.

        We first check that the other object we want to align is of
        the same type. If so, we will use the 'get_identity' function
        to receive the percent identity of both sequences.
        """

        assert type(other) == type(self), (
            f"Only types of 'DNASequence' can be used for comparison. Got {type(other)}, which is invalid."
        )

        return get_identity(self.sequence, other.sequence)
In [ ]:
dna_sequences = read_fasta("./all_sequences.fasta")

# Let's align two sequences
dna_sequences[10] == dna_sequences[100]
Out[ ]:
0.780952380952381
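The comparison above returns the identity but does not keep it anywhere. If the exercise is read as asking for the result to be stored on the object as well, a variant along the following lines might work. This is a sketch under stated assumptions: ComparableSequence and simple_identity are hypothetical names, and simple_identity is a position-by-position stand-in used only so the example runs without BioPython; it performs no alignment and is not equivalent to get_identity.

```python
def simple_identity(seq1: str, seq2: str) -> float:
    """Hypothetical stand-in for get_identity: fraction of matching positions."""
    if not seq1:
        return 0.0
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)


class ComparableSequence:  # hypothetical, reduced to the parts needed here
    def __init__(self, sequence: str):
        self.sequence = sequence.upper()
        self.identity = None  # Set by the most recent == comparison

    def __eq__(self, other):
        # Returning NotImplemented lets Python fall back to its default handling
        if not isinstance(other, ComparableSequence):
            return NotImplemented
        self.identity = simple_identity(self.sequence, other.sequence)
        return self.identity

    # Defining __eq__ alone would set __hash__ to None and make instances
    # unhashable; this restores the default identity-based hash
    __hash__ = object.__hash__


a = ComparableSequence("ATGC")
b = ComparableSequence("ATGA")
result = a == b
print(result, a.identity)
```

Note that returning a float from __eq__ is unusual: code that expects a boolean, such as membership tests on lists, will treat any non-zero identity as equal.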