Processing DNA Strings With Unison
Use the Unison functional programming language to process DNA text strings
Whenever I learn a new programming language, I like to do simple exercises that don't require knowing a large API but that expose you to the basics of the language. Project Rosalind is a fun site for getting familiar with arrays, strings, characters, and common higher-order functions such as map, fold and filter.
The Rosalind problems are more text-oriented than the Project Euler problems, which are more math-oriented. The Euler problems take me a bit more time to figure out and solve. Another site I recently discovered is Exercism, which also has good programming problems. I will discuss one solution from that site as well, involving string manipulation.
Project Rosalind Problems
To understand Rosalind problems, it is useful to know a few things about how DNA and RNA are represented in computing. DNA is represented as text strings consisting only of the characters A, C, G, T. So ATGCTTCAGAAAGGTCTTACG would be an example of a DNA text string.
Counting DNA Nucleotides
The problem described here involves counting the number of A, C, G and T characters in a DNA string. The sample and output variables are there only so you can compare your result with the expected one.
There are a couple of things to notice in the code below which I have not covered in other Unison articles. Individual characters are written as ?A and ?C rather than 'A' and 'C', which is what you may be used to from C-like languages.
We tend to use natural numbers (unsigned integers) a lot more in Unison than in other languages. That is why you see the Nat type used for numbers rather than Int most of the time.
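To make the Nat versus Int distinction concrete, here is a minimal sketch. The names basesCounted, offset and shift are made up for illustration and do not appear in the solutions. A plain literal such as 20 is a Nat, while an Int literal always carries an explicit sign.

```unison
-- Hypothetical names, used only to show the two literal forms.

-- A Nat literal is written without a sign.
basesCounted : Nat
basesCounted = 20

-- An Int literal always carries a sign, even when it is positive.
offset : Int
offset = -3

shift : Int
shift = +3
```

Since a count of letters can never be negative, Nat is the natural fit for the nucleotide counts in this problem.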
In the code example, the dna variable is of type Text, which is a problem when calling countElement because there is only a List.countElement function and no Text.countElement which can work directly with characters contained within a Text object. For this reason, we need to use Text.toCharList to turn the DNA string into a list of characters.
sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
output = [20, 12, 17, 21]
{{ Count the number of A, C, G and T letters in `dna` string }}
countNucleotides : Text -> [Nat]
countNucleotides dna =
  count char = countElement char (toCharList dna)
  map count [?A, ?C, ?G, ?T]
The way we document a function is also a bit different from what you may be used to. We enclose documentation strings within double curly brackets {{ }}. The documentation gets stored in a variable called doc within the countNucleotides namespace. In Unison, types and functions also form their own namespace.
Here is the output from evaluating countNucleotides within the REPL-like ucm environment bundled with Unison:
> countNucleotides sample
⧩
[20, 12, 17, 21]
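If you want to practice filter as well, the same counts can be computed without countElement. The following is only a sketch of an alternative, not the solution used above. The name countNucleotidesFilter is made up for this example, and the block assumes the List.filter and List.size functions from the base library.

```unison
{{ Count the A, C, G and T letters in `dna` using filter (alternative sketch) }}
countNucleotidesFilter : Text -> [Nat]
countNucleotidesFilter dna =
  chars = Text.toCharList dna
  -- Keep only the characters equal to `c`, then measure the resulting list.
  count c = List.size (List.filter (x -> x == c) chars)
  List.map count [?A, ?C, ?G, ?T]

> countNucleotidesFilter "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
```

Like the version above, this walks the character list once per letter, so four passes in total. For short DNA strings that is fine; a single pass with a fold would be the next thing to try.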
Transcribing DNA into RNA
The next problem involves converting a DNA string to an RNA string. All the details are explained here. Basically, you have to replace every occurrence of the T character with a U.
sample = "GATGGAACTTGACTACGTAAATT"
output = "GAUGGAACUUGACUACGUAAAUU"
{{ Takes a DNA string `str` as input and returns the RNA string }}
toRNA : Text -> Text
toRNA str = Text.map (ch -> if ch == ?T then ?U else ch) str
Again, we test using the REPL-like functionality of ucm.
> toRNA "GATGGAACTTGACTACGTAAATT"
⧩
"GAUGGAACUUGACUACGUAAAUU"
Complementing a Strand of DNA
In DNA strings, symbols A and T are complements of each other, as are C and G. In this problem we want to complement and reverse a DNA string as described here.
There are a few different ways of solving this problem. If you would like to work with a List type, you need to use the Text.toCharList and Text.fromCharList functions to move between Text and List objects.
sample = "AAAACCCGGT"
output = "ACCGGGTTTT"
complement : Char -> Char
complement char =
  match char with
    ?A -> ?T
    ?T -> ?A
    ?C -> ?G
    ?G -> ?C
    -- Catch-all so the match covers every Char; other characters pass through unchanged.
    _ -> char
revComp : Text -> Text
revComp dna =
  map complement (toCharList dna) |> List.reverse |> fromCharList
You can, however, work directly with Text objects. In this case, we are using Text.map instead of List.map, but Unison can figure out which one to use because we have declared that dna is of type Text and the complement function deals with Char objects.
revComp : Text -> Text
revComp dna =
  map complement dna |> reverse
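A quick way to sanity-check a reverse complement is to apply it twice, which should give back the original string. The sketch below is self-contained and uses the made-up names complementBase and reverseComplement so it does not clash with the definitions above. It takes the list-based route and adds a catch-all case, which is my own assumption about how to handle characters other than A, C, G and T.

```unison
complementBase : Char -> Char
complementBase = cases
  ?A -> ?T
  ?T -> ?A
  ?C -> ?G
  ?G -> ?C
  -- Assumption: any other character is passed through unchanged.
  other -> other

reverseComplement : Text -> Text
reverseComplement dna =
  Text.toCharList dna |> List.map complementBase |> List.reverse |> Text.fromCharList

-- Should evaluate to "ACCGGGTTTT", the expected output for the sample.
> reverseComplement "AAAACCCGGT"

-- Applying the function twice should return the input, so this should be true.
> reverseComplement (reverseComplement "AAAACCCGGT") == "AAAACCCGGT"
```

The round-trip check only works because each base maps to exactly one partner and back; it would not catch a mistake that swaps both pairs consistently, so comparing against the known sample output is still worthwhile.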
Other Programming Problems
The exercism.org site provides a polished way of practicing your programming skills. I actually discovered the site after I wrote the first draft of this story. They have gamified the experience and provide plenty of unit tests to help you evolve your solutions to perfection. They have a special Unison track.
But before looking at my Exercism example, let's do a simpler string manipulation first: converting snake-cased function names to camel-cased function names.
Snake Case to Camel Case