Erik Explores

Processing DNA Strings With Unison

Use the Unison functional programming language to process DNA text strings

Erik Engheim
Jan 20, 2023

Whenever I learn a new programming language, I like to do simple exercises that don't require knowing a large API but that expose you to the basics of the language. Project Rosalind is a fun site for getting familiar with arrays, strings, characters, and common higher-order functions such as map, fold and filter.

The Rosalind problems are more text-oriented than the Project Euler problems, which are more math-oriented. The Euler problems take me a bit more time to figure out and solve. Another site, which I recently discovered, is Exercism, which also has good programming problems. I will discuss one solution from that site as well, involving string manipulation.

Project Rosalind Problems

To understand Rosalind problems, it is useful to know a few things about how DNA and RNA are represented in computing. DNA is represented as text strings consisting only of the characters A, C, G, T. So ATGCTTCAGAAAGGTCTTACG would be an example of a DNA text string.

Counting DNA Nucleotides

The problem described here involves counting the number of A, C, G and T characters in a DNA string. The sample and output variables are only there so you can compare your result with the expected one.

There are a couple of things to notice in the code below which I have not covered in other Unison articles. Individual characters are written as ?A and ?C rather than 'A' and 'C', which is what you may be used to from C-like languages.

We tend to use natural numbers (unsigned integers) a lot more in Unison than in other languages. That is why you see the Nat type used for numbers rather than Int most of the time.
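
As a small illustration of the literal syntax described above, the definitions below show a character, a natural number and a signed integer. The names firstBase, sampleCount and offset are made up for this sketch and do not appear in the solutions that follow.

```unison
-- Character literals start with a question mark.
firstBase : Char
firstBase = ?A

-- A literal without a sign is a Nat.
sampleCount : Nat
sampleCount = 20

-- Int literals carry an explicit sign, such as -3 or +3.
offset : Int
offset = -3
```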

In the code example, the dna variable is of type Text, which is a problem when calling countElement because there is only a List.countElement function and no Text.countElement which can work directly with characters contained within a Text object. For this reason, we need to use Text.toCharList to turn the DNA string into a list of characters.
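
To make the conversion step concrete, here is a sketch that counts a single character without countElement, using only Text.toCharList, List.filter and List.size. The helper name countChar is hypothetical and is not part of the solution below; it is just one way the same idea could be spelled out.

```unison
countChar : Char -> Text -> Nat
countChar char dna =
    -- Turn the Text into a list of Char, keep the matching ones, and count them
    List.size (List.filter (c -> c == char) (Text.toCharList dna))

-- Watch expressions to try in a scratch file
> Text.toCharList "ACGT"
> countChar ?A "AACGT"
```

The first watch expression should show the list [?A, ?C, ?G, ?T], and the second should show 2.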

sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
output = [20, 12, 17, 21]

{{ Count the number of A, C, G and T letters in `dna` string }}
countNucleotides : Text -> [Nat]
countNucleotides dna =
    count char = countElement char 
                              (toCharList dna)
    map count [?A, ?C, ?G, ?T]

The way we document a function is also a bit different from what you may be used to. We enclose documentation strings within double curly brackets {{ }}. The documentation gets stored in a variable called doc within the countNucleotides namespace. In Unison, types and functions also form their own namespace.

Here is the output from evaluating countNucleotides within the REPL-like ucm environment bundled with Unison:

> countNucleotides sample
  ⧩
  [20, 12, 17, 21]

Transcribing DNA into RNA

The next problem involves converting a DNA string to an RNA string. All the details are explained here. Basically, you have to replace every occurrence of the T character with a U.

sample = "GATGGAACTTGACTACGTAAATT"
output = "GAUGGAACUUGACUACGUAAAUU"

{{ Takes a DNA string `str` as input and returns the RNA string }}
toRNA : Text -> Text
toRNA str = Text.map (ch -> if ch == ?T 
                          then ?U 
                          else ch) 
                        str

Again, we test using the REPL-like functionality of ucm.

> toRNA "GATGGAACTTGACTACGTAAATT"
  ⧩
  "GAUGGAACUUGACUACGUAAAUU"
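
The same transcription can also be written with pattern matching instead of an if expression. This is only a sketch: transcribe and toRNAViaList are names invented here, and the code goes through a character list so that it relies on nothing beyond List.map, Text.toCharList and Text.fromCharList.

```unison
-- Map a single DNA character to its RNA counterpart
transcribe : Char -> Char
transcribe = cases
    ?T -> ?U
    ch -> ch

toRNAViaList : Text -> Text
toRNAViaList dna =
    Text.fromCharList (List.map transcribe (Text.toCharList dna))

> toRNAViaList "GATGGAACTTGACTACGTAAATT"
```

The watch expression should give the same RNA string as toRNA does above.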

Complementing a Strand of DNA

In DNA strings, symbols A and T are complements of each other, as are C and G. In this problem we want to complement and reverse a DNA string as described here.

There are some different ways of solving this problem. If you would like to work with a List type, you need to use the Text.toCharList and Text.fromCharList functions to move between Text and List objects.

sample = "AAAACCCGGT"
output = "ACCGGGTTTT"

complement : Char -> Char
complement char = 
    match char with
        ?A -> ?T
        ?T -> ?A
        ?C -> ?G
        ?G -> ?C

revComp : Text -> Text
revComp dna = 
    map complement 
        (toCharList dna) |> List.reverse |> fromCharList

You can, however, work directly with Text objects. In this case, we are using Text.map instead of List.map, but Unison can figure out which one to use because we have declared that dna is of type Text and the complement function deals with Char objects.

revComp : Text -> Text
revComp dna = 
    map complement dna |> reverse
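
One thing to be aware of is that the complement function above only has cases for the four DNA letters. Newer Unison releases, as far as I know, reject a match that does not cover every possible value, and where it does compile, any other character has no matching case. Below is a minimal sketch of a variant with a fallback case. Passing unknown characters through unchanged is my own assumption about the desired behavior, and the names complementOrSame and revCompSafe are hypothetical.

```unison
complementOrSame : Char -> Char
complementOrSame = cases
    ?A -> ?T
    ?T -> ?A
    ?C -> ?G
    ?G -> ?C
    -- Assumption: leave anything that is not A, C, G or T untouched
    other -> other

revCompSafe : Text -> Text
revCompSafe dna =
    Text.toCharList dna |> List.map complementOrSame |> List.reverse |> Text.fromCharList

> revCompSafe "AAAACCCGGT"
```

For the sample input, this should produce "ACCGGGTTTT", the same as the expected output listed earlier.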

Other Programming Problems

The exercism.org site provides a polished way of practicing your programming skills. I actually discovered the site after I wrote the first draft of this story. They have gamified the experience and provide plenty of unit tests to help you evolve your solutions to perfection. They have a dedicated Unison track.

But before looking at my Exercism example, let's first do a simpler string manipulation: converting snake-cased function names to camel-cased ones.

Snake Case to Camel Case

© 2025 Erik Engheim