Writing sequences and sequence alignments

Writing a sequence alignment to disk

An alignment can be written to disk with the write_seqs app. First, let’s load the alignment we want to write.

When creating the write_seqs app, we must provide the data store that the data will be written to. Optionally, we can also specify the format in which the sequences are written.
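The two steps above can be sketched as follows. This is a minimal sketch that assumes cogent3 is installed. The function name write_alignment and both of its arguments are hypothetical, and the "fa" suffix and DNA moltype are assumptions rather than requirements. The import sits inside the function only so that the definition can be loaded without cogent3 present.

```python
def write_alignment(aln_path, path_to_dir):
    """Load one DNA alignment and write it to a directory data store.

    Both arguments are placeholders: aln_path is a FASTA alignment file,
    path_to_dir is the directory backing the output data store.
    """
    from cogent3 import get_app, open_data_store

    # Load the alignment we want to write.
    load_aligned_app = get_app("load_aligned", format="fasta", moltype="dna")
    aln = load_aligned_app(aln_path)

    # Open a writable data store; the "fa" suffix is an assumed choice.
    out_dstore = open_data_store(path_to_dir, suffix="fa", mode="w")

    # The data store is required; the output format is optional.
    write_seqs_app = get_app("write_seqs", data_store=out_dstore, format="fasta")
    return write_seqs_app(aln)
```

Calling write_alignment with a real alignment file and an output directory should return whatever the writer app returns for that alignment, which can then be inspected.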

Writing many sequence collections to a data store

Typically, the final step of a data processing pipeline is writing out the filtered data. When write_seqs is the last step of a composed process and that process is applied to a data store, each resulting sequence collection is written to the output data store.

Using open_data_store, we can create an input data store containing all the files in the data directory that have the “.fasta” suffix.

Let’s define a process. In this example, the process loads the sequences, keeps only those that are translatable, translates them, and then writes the translated sequences to a data store.

Tip

When running this code on your machine, remember to replace path_to_dir with an actual directory path.
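A process of this shape might be composed as in the sketch below, assuming cogent3 is installed. The function name make_process is hypothetical, and the app names select_translatable and translate_seqs are my assumption about which apps perform the filtering and translation steps described above. As in the tip, path_to_dir stands for a real directory path.

```python
def make_process(path_to_dir):
    """Compose load -> filter -> translate -> write into one process."""
    from cogent3 import get_app, open_data_store

    # Writable data store that receives the translated sequences.
    out_dstore = open_data_store(path_to_dir, suffix="fa", mode="w")

    load_seqs_app = get_app("load_unaligned", format="fasta", moltype="dna")
    keep_translatable_app = get_app("select_translatable")  # assumed app name
    translate_app = get_app("translate_seqs")  # assumed app name
    write_seqs_app = get_app("write_seqs", data_store=out_dstore, format="fasta")

    # "+" composes apps: the output of each step is the input of the next.
    return load_seqs_app + keep_translatable_app + translate_app + write_seqs_app
```

Keeping the writer as the last step means only data that has passed through every earlier step reaches the output data store.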

We apply process to our input data store, and assign the resulting data store to result.
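Applying a composed process to an input data store could look like the sketch below, assuming cogent3 is installed. The function name apply_process and its data_dir argument are hypothetical. The process argument is any composed process whose final step is a writer.

```python
def apply_process(process, data_dir="data"):
    """Run a composed process over every ".fasta" file in data_dir."""
    from cogent3 import open_data_store

    # Read-only data store over the files in data_dir ending in ".fasta".
    in_dstore = open_data_store(data_dir, suffix="fasta", mode="r")

    # apply_to runs the process on each member of the input data store
    # and returns the data store that the writer step wrote to.
    return process.apply_to(in_dstore)
```

The returned object is the output data store, so it can be assigned to result and inspected afterwards.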

Accessing an overview of our process

We can interrogate result to see an overview of the process.

The process was successfully applied to 10 data files. However, there were 3 files for which the process did not complete. We can see a summary of the failures by accessing the summary_not_completed property.
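Both views are properties of the data store rather than methods, so they are accessed without parentheses. The helper below is a small sketch; the name show_overview is hypothetical, and result is assumed to be the data store returned by applying the process.

```python
def show_overview(result):
    """Print the overview and the failure summary of an output data store."""
    # Overview of the data store contents, including counts of members
    # that completed and members that did not.
    print(result.describe)
    # Summary of the failures, grouped by where and why they happened.
    print(result.summary_not_completed)
```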

It looks like the first two failed because they contain protein sequences, whereas load_unaligned expected DNA sequences.

Interestingly, another file failed in the keep_translatable step. By design, these failures did not stop the rest of the pipeline from running. Instead, the data store collects the NotCompleted objects, which store traceback information, allowing you to interrogate any failures.
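The idea of treating a failure as a value that is collected, rather than an exception that halts the run, can be illustrated in plain Python. The sketch below does not use cogent3 and is not how cogent3 implements it. The Failure class, both helper functions and the toy inputs are hypothetical stand-ins, and the record keeps only the error message rather than a full traceback.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical stand-in for a NotCompleted record."""

    origin: str  # name of the step that failed
    source: str  # identifier of the input being processed
    message: str  # why the step failed


def run_steps(steps, source, data):
    """Run named steps in order; return a Failure at the first error."""
    for name, func in steps:
        try:
            data = func(data)
        except Exception as err:
            return Failure(origin=name, source=source, message=str(err))
    return data


def apply_to_all(steps, inputs):
    """Apply the steps to every input, keeping failures instead of raising."""
    completed, not_completed = {}, []
    for source, data in inputs.items():
        outcome = run_steps(steps, source, data)
        if isinstance(outcome, Failure):
            not_completed.append(outcome)
        else:
            completed[source] = outcome
    return completed, not_completed


def check_dna(seq):
    # Reject anything containing characters other than A, C, G, T.
    if set(seq) - set("ACGT"):
        raise ValueError("not a DNA sequence")
    return seq


def check_length(seq):
    # Reject sequences that cannot be split into whole codons.
    if len(seq) % 3:
        raise ValueError("length is not a multiple of 3")
    return seq


steps = [("check_dna", check_dna), ("check_length", check_length)]
inputs = {"a.fasta": "ATGGCC", "b.fasta": "MKVL", "c.fasta": "ATGG"}
completed, not_completed = apply_to_all(steps, inputs)

for failure in not_completed:
    print(failure.source, failure.origin, failure.message)
```

Here the second and third inputs fail at different steps, yet every input is still processed and each failure records which step it came from.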