Exercise 002¶
# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/archaea_sequences.fasta
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/ecoli_sequences.fasta
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/plant_sequences.fasta
Reading a FASTA file¶
Write a parser thats reads and processes a FASTA file called ecoli_sequences.fasta
into a list named ecoli
. Repeat the process for the files archaea_sequences.fasta
and plant_sequences.fasta
. Make sure to name the resulting variables appropriately to ensure clean code principles.
Tips
- FASTA files follow a stringent structure: Header first, followed by the actual sequence. Inspect the file using print to find a pattern.
- Try using functions to solve the problem. While not mandatory, it can save code lines and generalize a concept such as a FASTA parser. Click here for more on Python functions.
- When typing a file name, hit tab on your keyboard to auto-complete. Makes life easier ☺️
FASTA Schema
0 | >Header1
1 | Sequence1
2 | >Header2
3 | Sequence2
def read_fasta(path: str) -> list[tuple[str, str]]:
"""Reads a FASTA file to a list of tuples.
This is a docstring, which is not functional,
but a general comment and documentation of your
function. Docstrings are very important
Args:
path (str): The path to the FASTA file to parse
Returns:
list[tuple[str, str]]: The parsed sequence records.
Example:
>>> sequences = read_fasta("path/to/your.fasta")
>>> print(sequences)
>>> [
>>> ("Header1", "AGTGGTG.."),
>>> ("Header2", "AGTGGTG.."),
>>> ]
"""
lines = open(path, "r").readlines()
records = []
for line in lines:
if line.startswith(">"):
header = line[1::].strip()
else:
sequence = line.strip()
records.append(
(header, sequence)
)
return records
ecoli = read_fasta("./ecoli_sequences.fasta")
archaea = read_fasta("./archaea_sequences.fasta")
plant = read_fasta("./plant_sequences.fasta")
print(ecoli[0])
('ecoli|1', 'ATGCGTTCTCGCTATTTGTTACATCAATATTTTGTTCAGGTACAGTTTGCAGCGCCGTCGCCAGCGCCAACGGATTCCATGTCATATATTATTCCATATAGATTAAGTTTAAATATTAATAAAATGAATATTTGCAATACGTAATTATCTTACCAGCTATAGACAAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCAGAGGAACTAGCTGCGCTGACGAACCGCTTCAAATAAGCAAATTCCGGTTGCAACCGAAACGTTCAGGGAAGAAACACTTCCTGCCATTGGGATGCTGATCAACTCATCGCAATGTTCACGGGTCAGGCGACGCATACCTTCACCTTCCGCGCCCATCACCAGCGCCAGGCGTCCGGTCATTTTGCTTTGATAGAGCGTATGATCCGCCTCACCTGCCGTACCGACGATCCAGATATTCTCTTCCTGCAACATACGCATGGTGCGCGCAAGGTTAGTCACCCGAATCAGTGGAACGCTTTCTGCCGCGCCGCAGGCTACTTTTTTCGCCGTGGCGTTGAGCTGTGCGGAGCGATCTTTCGGCACAATCACCGCGTGAACGCCAGCAGCGTCCGCGCTACGCAGGCACGCGCCGAGGTTGTGCGGATCGGTTACACCGTCGAGGATCAGCAGGAACGGTTGATCGAGCGAAGCGATCAGATCCGGCAGATCGTTTTCCTGGTACTGACGTCCTGGCTTCACGCGGGCGATAATGCCCTGATGCACGGCACCGTCGCTTTTCTCGTCGAGATATTGGCGGTTTGCCAACTGGATAACCACGCCCTGGGACTCAAGGGCGTGGATCAGCGGTAACAGACGTTTATCTTCACGGCCTTTTAAAATAA')
Writing of a FASTA file¶
Write all three lists that you've just created using your parser to a new FASTA file called all_sequences.fasta
. When merging your data into a single file, make sure to assign a header for each sequence that includes information about the organism and an appropriate identifier. Be careful for duplicate headers, each one should be unique.
Finally, inspect your file manually. Did everything went as expected?
Tips
- Try using functions to solve the problem. While not mandatory, it can save code lines and generalize a concept such as a FASTA parser. Click here for more on Python functions.
- Close the file when everything has been done or make use of a context manager.
- Use string formatting to insert variables to your header! Click here for more information.
String formatting
variable1, variable2 = "Hello", "World"
my_string = f"{variable1}_{variable2}"
print(my_string)
# prints "Hello_World"
def write_fasta(
filepath: str,
records: list[tuple[str, str]]
) -> None:
"""Writes a list of sequence records to a FASTA file
Args:
filepath (str): The file path to which the sequences will be written
records (list[tuple[str, str]]): The records to write to the file
Example:
>>> write_fasta(
>>> filepath="./all_seqs.fasta",
>>> records=records
>>> )
"""
with open(filepath, "w") as file:
for header, sequence in records:
file.write(
f">{header}\n{sequence}\n"
)
# Combine all sequence into a single list
all_sequences = ecoli + archaea + plant
# Write the new collection to a file
write_fasta(
filepath="./all_sequences.fasta",
records=all_sequences,
)
Programmers are lazy!¶
Accoding to Larry Wall, creator of the programming language Perl, programmers are lazy. Python is one of the best supported programming languages in terms of packages and thats one of the major appeals of using Python. Thus, to know how and where to look for third-party software is essential to an efficient development process.
Remember, you dont have to re-invent the wheel to be good at programming.
Can you find a package that could have helped you in this exercise?
Answer
There are a couple of packages available at PyPI that enable you to read FASTA files:
- BioPython - Reading FASTA files and much more bio-related functions
- fasta-reader - Bare bones FASTA parser
How can I install packages?¶
You can use Python's standard package installer pip
to install whatever package you want. Typically, these are found at PyPI and thus are available for pip
. This is how you can install a package:
pip install package_name
Within a Jupyter notebook you can also install a package. Lets do that in the next cell!
# The exclamation mark tells Jupyter to run
# a command within the systems shell/cmd
!pip install biopython
Collecting biopython Obtaining dependency information for biopython from https://files.pythonhosted.org/packages/7e/40/0543d137e4a45b0e5d3a7c5ce9b607e687d9e5b715eadfdae34de8d6906a/biopython-1.83-cp311-cp311-macosx_11_0_arm64.whl.metadata Downloading biopython-1.83-cp311-cp311-macosx_11_0_arm64.whl.metadata (13 kB) Requirement already satisfied: numpy in /Users/janrange/anaconda3/lib/python3.11/site-packages (from biopython) (1.26.4) Downloading biopython-1.83-cp311-cp311-macosx_11_0_arm64.whl (2.7 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 10.8 MB/s eta 0:00:0000:0100:01 Installing collected packages: biopython Successfully installed biopython-1.83
How can I use packages?¶
In order to use an installed package, you need to import it into Python. This is done by using the import
statement in Python. You can also import specific parts of a package by using the from xy import a
syntax. Let's put that into practice by using the newly installed biopython
package to parse our newly created FASTA file.
from Bio import SeqIO # We only want SeqIO
# We are going to use the `parse` method of the SeqIO module.
for record in SeqIO.parse("all_sequences.fasta", "fasta"):
print(f"Record Header: {record.id}")
print(f"Record Sequence: {record.seq}")
# We will break the loop to display only the first
break
Record Header: ecoli|1 Record Sequence: ATGCGTTCTCGCTATTTGTTACATCAATATTTTGTTCAGGTACAGTTTGCAGCGCCGTCGCCAGCGCCAACGGATTCCATGTCATATATTATTCCATATAGATTAAGTTTAAATATTAATAAAATGAATATTTGCAATACGTAATTATCTTACCAGCTATAGACAAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCAGAGGAACTAGCTGCGCTGACGAACCGCTTCAAATAAGCAAATTCCGGTTGCAACCGAAACGTTCAGGGAAGAAACACTTCCTGCCATTGGGATGCTGATCAACTCATCGCAATGTTCACGGGTCAGGCGACGCATACCTTCACCTTCCGCGCCCATCACCAGCGCCAGGCGTCCGGTCATTTTGCTTTGATAGAGCGTATGATCCGCCTCACCTGCCGTACCGACGATCCAGATATTCTCTTCCTGCAACATACGCATGGTGCGCGCAAGGTTAGTCACCCGAATCAGTGGAACGCTTTCTGCCGCGCCGCAGGCTACTTTTTTCGCCGTGGCGTTGAGCTGTGCGGAGCGATCTTTCGGCACAATCACCGCGTGAACGCCAGCAGCGTCCGCGCTACGCAGGCACGCGCCGAGGTTGTGCGGATCGGTTACACCGTCGAGGATCAGCAGGAACGGTTGATCGAGCGAAGCGATCAGATCCGGCAGATCGTTTTCCTGGTACTGACGTCCTGGCTTCACGCGGGCGATAATGCCCTGATGCACGGCACCGTCGCTTTTCTCGTCGAGATATTGGCGGTTTGCCAACTGGATAACCACGCCCTGGGACTCAAGGGCGTGGATCAGCGGTAACAGACGTTTATCTTCACGGCCTTTTAAAATAA