Exercise 002¶

In [ ]:

Copied!





# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/archaea_sequences.fasta
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/ecoli_sequences.fasta
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/plant_sequences.fasta
# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/archaea_sequences.fasta
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/ecoli_sequences.fasta
!wget https://raw.githubusercontent.com/JR-1991/PythonProgrammingBio24/main/data/plant_sequences.fasta

Reading a FASTA file¶

Write a parser thats reads and processes a FASTA file called ecoli_sequences.fasta into a list named ecoli. Repeat the process for the files archaea_sequences.fasta and plant_sequences.fasta. Make sure to name the resulting variables appropriately to ensure clean code principles.

Tips

FASTA files follow a stringent structure: Header first, followed by the actual sequence. Inspect the file using print to find a pattern.

Try using functions to solve the problem. While not mandatory, it can save code lines and generalize a concept such as a FASTA parser. Click here for more on Python functions.

When typing a file name, hit tab on your keyboard to auto-complete. Makes life easier ☺️

FASTA Schema

0 | >Header1
1 | Sequence1
2 | >Header2
3 | Sequence2

In [2]:

Copied!





def read_fasta(path: str) -> list[tuple[str, str]]:
    """Reads a FASTA file to a list of tuples.
    
    This is a docstring, which is not functional,
    but a general comment and documentation of your
    function. Docstrings are very important
    
    Args:
        path (str): The path to the FASTA file to parse
    
    Returns:
        list[tuple[str, str]]: The parsed sequence records.
    
    Example:
        >>> sequences = read_fasta("path/to/your.fasta")
        >>> print(sequences)
        
        >>> [
        >>>  ("Header1", "AGTGGTG.."),
        >>>  ("Header2", "AGTGGTG.."),
        >>> ]
    
    """
    
    lines = open(path, "r").readlines()
    records = []
    
    for line in lines:
        if line.startswith(">"):
            header = line[1::].strip()
        else:
            sequence = line.strip()
            records.append(
                (header, sequence)
            )
            
    return records
def read_fasta(path: str) -> list[tuple[str, str]]:
    """Reads a FASTA file to a list of tuples.
    
    This is a docstring, which is not functional,
    but a general comment and documentation of your
    function. Docstrings are very important
    
    Args:
        path (str): The path to the FASTA file to parse
    
    Returns:
        list[tuple[str, str]]: The parsed sequence records.
    
    Example:
        >>> sequences = read_fasta("path/to/your.fasta")
        >>> print(sequences)
        
        >>> [
        >>>  ("Header1", "AGTGGTG.."),
        >>>  ("Header2", "AGTGGTG.."),
        >>> ]
    
    """
    
    lines = open(path, "r").readlines()
    records = []
    
    for line in lines:
        if line.startswith(">"):
            header = line[1::].strip()
        else:
            sequence = line.strip()
            records.append(
                (header, sequence)
            )
            
    return records
        

In [5]:

Copied!

ecoli = read_fasta("./ecoli_sequences.fasta")
archaea = read_fasta("./archaea_sequences.fasta")
plant = read_fasta("./plant_sequences.fasta")

print(ecoli[0])
ecoli = read_fasta("./ecoli_sequences.fasta")
archaea = read_fasta("./archaea_sequences.fasta")
plant = read_fasta("./plant_sequences.fasta")

print(ecoli[0])

('ecoli|1', 'ATGCGTTCTCGCTATTTGTTACATCAATATTTTGTTCAGGTACAGTTTGCAGCGCCGTCGCCAGCGCCAACGGATTCCATGTCATATATTATTCCATATAGATTAAGTTTAAATATTAATAAAATGAATATTTGCAATACGTAATTATCTTACCAGCTATAGACAAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCAGAGGAACTAGCTGCGCTGACGAACCGCTTCAAATAAGCAAATTCCGGTTGCAACCGAAACGTTCAGGGAAGAAACACTTCCTGCCATTGGGATGCTGATCAACTCATCGCAATGTTCACGGGTCAGGCGACGCATACCTTCACCTTCCGCGCCCATCACCAGCGCCAGGCGTCCGGTCATTTTGCTTTGATAGAGCGTATGATCCGCCTCACCTGCCGTACCGACGATCCAGATATTCTCTTCCTGCAACATACGCATGGTGCGCGCAAGGTTAGTCACCCGAATCAGTGGAACGCTTTCTGCCGCGCCGCAGGCTACTTTTTTCGCCGTGGCGTTGAGCTGTGCGGAGCGATCTTTCGGCACAATCACCGCGTGAACGCCAGCAGCGTCCGCGCTACGCAGGCACGCGCCGAGGTTGTGCGGATCGGTTACACCGTCGAGGATCAGCAGGAACGGTTGATCGAGCGAAGCGATCAGATCCGGCAGATCGTTTTCCTGGTACTGACGTCCTGGCTTCACGCGGGCGATAATGCCCTGATGCACGGCACCGTCGCTTTTCTCGTCGAGATATTGGCGGTTTGCCAACTGGATAACCACGCCCTGGGACTCAAGGGCGTGGATCAGCGGTAACAGACGTTTATCTTCACGGCCTTTTAAAATAA')

Writing of a FASTA file¶

Write all three lists that you've just created using your parser to a new FASTA file called all_sequences.fasta. When merging your data into a single file, make sure to assign a header for each sequence that includes information about the organism and an appropriate identifier. Be careful for duplicate headers, each one should be unique.

Finally, inspect your file manually. Did everything went as expected?

Tips

Try using functions to solve the problem. While not mandatory, it can save code lines and generalize a concept such as a FASTA parser. Click here for more on Python functions.

Close the file when everything has been done or make use of a context manager.

Use string formatting to insert variables to your header! Click here for more information.

String formatting

variable1, variable2 = "Hello", "World"
my_string = f"{variable1}_{variable2}"

print(my_string)  

# prints "Hello_World"

In [6]:

Copied!





def write_fasta(
    filepath: str,
    records: list[tuple[str, str]]
) -> None:
    """Writes a list of sequence records to a FASTA file
    
    Args:
        filepath (str): The file path to which the sequences will be written
        records (list[tuple[str, str]]): The records to write to the file
    
    Example:
        >>> write_fasta(
        >>>     filepath="./all_seqs.fasta",
        >>>     records=records
        >>> )
    
    """
    
    with open(filepath, "w") as file:
        for header, sequence in records:
            file.write(
                f">{header}\n{sequence}\n"
            )
def write_fasta(
    filepath: str,
    records: list[tuple[str, str]]
) -> None:
    """Writes a list of sequence records to a FASTA file
    
    Args:
        filepath (str): The file path to which the sequences will be written
        records (list[tuple[str, str]]): The records to write to the file
    
    Example:
        >>> write_fasta(
        >>>     filepath="./all_seqs.fasta",
        >>>     records=records
        >>> )
    
    """
    
    with open(filepath, "w") as file:
        for header, sequence in records:
            file.write(
                f">{header}\n{sequence}\n"
            )

In [7]:

Copied!





# Combine all sequence into a single list
all_sequences = ecoli + archaea + plant

# Write the new collection to a file
write_fasta(
    filepath="./all_sequences.fasta",
    records=all_sequences,
)
# Combine all sequence into a single list
all_sequences = ecoli + archaea + plant

# Write the new collection to a file
write_fasta(
    filepath="./all_sequences.fasta",
    records=all_sequences,
)

Programmers are lazy!¶

Accoding to Larry Wall, creator of the programming language Perl, programmers are lazy. Python is one of the best supported programming languages in terms of packages and thats one of the major appeals of using Python. Thus, to know how and where to look for third-party software is essential to an efficient development process.

Remember, you dont have to re-invent the wheel to be good at programming.

Can you find a package that could have helped you in this exercise?

Answer

There are a couple of packages available at PyPI that enable you to read FASTA files:

BioPython - Reading FASTA files and much more bio-related functions
fasta-reader - Bare bones FASTA parser

How can I install packages?¶

You can use Python's standard package installer pip to install whatever package you want. Typically, these are found at PyPI and thus are available for pip. This is how you can install a package:

pip install package_name

Within a Jupyter notebook you can also install a package. Lets do that in the next cell!

In [8]:

Copied!

# The exclamation mark tells Jupyter to run
# a command within the systems shell/cmd
!pip install biopython
# The exclamation mark tells Jupyter to run
# a command within the systems shell/cmd
!pip install biopython

Collecting biopython
  Obtaining dependency information for biopython from https://files.pythonhosted.org/packages/7e/40/0543d137e4a45b0e5d3a7c5ce9b607e687d9e5b715eadfdae34de8d6906a/biopython-1.83-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading biopython-1.83-cp311-cp311-macosx_11_0_arm64.whl.metadata (13 kB)
Requirement already satisfied: numpy in /Users/janrange/anaconda3/lib/python3.11/site-packages (from biopython) (1.26.4)
Downloading biopython-1.83-cp311-cp311-macosx_11_0_arm64.whl (2.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 10.8 MB/s eta 0:00:0000:0100:01
Installing collected packages: biopython
Successfully installed biopython-1.83

How can I use packages?¶

In order to use an installed package, you need to import it into Python. This is done by using the import statement in Python. You can also import specific parts of a package by using the from xy import a syntax. Let's put that into practice by using the newly installed biopython package to parse our newly created FASTA file.

In [10]:

Copied!

from Bio import SeqIO # We only want SeqIO

# We are going to use the `parse` method of the SeqIO module.

for record in SeqIO.parse("all_sequences.fasta", "fasta"):
    
    print(f"Record Header: {record.id}")
    print(f"Record Sequence: {record.seq}")
    
    # We will break the loop to display only the first
    break
from Bio import SeqIO # We only want SeqIO

# We are going to use the `parse` method of the SeqIO module.

for record in SeqIO.parse("all_sequences.fasta", "fasta"):
    
    print(f"Record Header: {record.id}")
    print(f"Record Sequence: {record.seq}")
    
    # We will break the loop to display only the first
    break

Record Header: ecoli|1
Record Sequence: ATGCGTTCTCGCTATTTGTTACATCAATATTTTGTTCAGGTACAGTTTGCAGCGCCGTCGCCAGCGCCAACGGATTCCATGTCATATATTATTCCATATAGATTAAGTTTAAATATTAATAAAATGAATATTTGCAATACGTAATTATCTTACCAGCTATAGACAAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCAGAGGAACTAGCTGCGCTGACGAACCGCTTCAAATAAGCAAATTCCGGTTGCAACCGAAACGTTCAGGGAAGAAACACTTCCTGCCATTGGGATGCTGATCAACTCATCGCAATGTTCACGGGTCAGGCGACGCATACCTTCACCTTCCGCGCCCATCACCAGCGCCAGGCGTCCGGTCATTTTGCTTTGATAGAGCGTATGATCCGCCTCACCTGCCGTACCGACGATCCAGATATTCTCTTCCTGCAACATACGCATGGTGCGCGCAAGGTTAGTCACCCGAATCAGTGGAACGCTTTCTGCCGCGCCGCAGGCTACTTTTTTCGCCGTGGCGTTGAGCTGTGCGGAGCGATCTTTCGGCACAATCACCGCGTGAACGCCAGCAGCGTCCGCGCTACGCAGGCACGCGCCGAGGTTGTGCGGATCGGTTACACCGTCGAGGATCAGCAGGAACGGTTGATCGAGCGAAGCGATCAGATCCGGCAGATCGTTTTCCTGGTACTGACGTCCTGGCTTCACGCGGGCGATAATGCCCTGATGCACGGCACCGTCGCTTTTCTCGTCGAGATATTGGCGGTTTGCCAACTGGATAACCACGCCCTGGGACTCAAGGGCGTGGATCAGCGGTAACAGACGTTTATCTTCACGGCCTTTTAAAATAA