Tender Is The Byte

Bioinformatics by hand: Neighbor-joining trees

2022-08-31T00:00:00+00:00

Bioinformatics by hand
Neighbor-joining trees
Pros and cons of neighbor-joining trees
How to neighbor-join
Formulas
Example 1
On Distance Matrices
Example 2
Wrapping up

Bioinformatics by hand

I’ve been teaching bioinformatics at the University of Delaware for roughly the last year now. I had never been in a bioinformatics class prior to teaching; my degrees are in ecology and marine science, so all of my bioinformatics knowledge came from research experience. It’s been really interesting to see bioinformatics taught in a formal setting. One thing I’ve noticed is the disconnect that can occur between students and instructors when students without programming experience are asked to perform “hands-on” exercises.

In an effort to de-mystify bioinformatics, instructors often have students manually perform a task that would normally be done computationally. While these exercises are valuable and often succeed in their goal, I have noticed that many students who are not used to being presented with code or equations tend to have difficulty implementing algorithms by hand, regardless of difficulty. This can cause students to shut down and question whether they are in the correct field, rather than empower them.

When this occurs, there seem to be two underlying issues: First, even at the collegiate level, many students are not confident in their ability to do math. This issue I will leave alone, as it cannot be solved in a single course or assignment at the graduate level. Second, the way that a computer would perform a procedure is not necessarily the same way a human would perform it. Sometimes, this can create a gap between students with little or no computing background and instructors who are highly familiar with algorithms.

In this post, I’ll walk you through the process of building neighbor-joining trees. Building phylogenetic trees by hand seems at first like a daunting task, but I promise it’s much easier than you think!

Neighbor-joining trees

Neighbor-joining (NJ) is one of many methods used for creating phylogenetic (evolutionary) and phenetic (trait-based similarity) trees. The method was first introduced in a 1987 paper and is still in use today.

Neighbor-joining uses a distance matrix to construct a tree by determining which leaves are “neighbors” (i.e., children of the same internal parent node) via an iterative clustering process. A neighbor joining tree aims to show the minimum amount of evolution needed to explain differences among objects, which makes it a minimum evolution method.

There has been some debate about the mathematical behavior of neighbor-joining trees. Originally, neighbor joining was thought to be most closely related to tree methods that use ordinary least squares to estimate branch lengths, but further investigation showed that they actually shared more properties with “balanced” minimum evolution methods. You don’t need to know anything about these different methods in order to perform neighbor joining, but if you would like to read more about them, there is an excellent explanation in this paper.

The type of tree produced depends on the input. If you provide a distance matrix based on evolutionary data (e.g., multiple sequence alignment), you will get a phylogenetic tree. If you input distances based on non-evolutionary data (e.g., phenotypic traits), then you will get a phenetic tree. Note that a NJ tree doesn’t have to contain only organisms. You can make NJ trees for anything you can represent/compare with a distance matrix.

NJ trees are simple to make and require only basic operations (addition, subtraction, division), but can seem daunting because of the number of steps required. Here, I will show you how to make two small neighbor-joining trees by hand (or, by spreadsheet).

Pros and cons of neighbor-joining trees

There are a lot of different ways to build phylogenetic and other trees, so how does neighbor-joining compare?

Advantages

It’s simple and easy to understand.
It’s fast and computationally inexpensive compared to other popular methods. Maximum-likelihood and Bayesian methods especially are much slower.
It works. Neighbor-joining has been found to be topologically accurate and to sometimes out-perform more complicated methods like maximum-likelihood and Bayesian inference.

Disadvantages

You lose data. When you squish down sequence alignment or other data into distances, you are performing data reduction. This isn’t necessarily a bad thing (ordination methods like PCA also do this), but you should keep it in mind.
You only get one possible tree. Other methods such as maximum-likelihood and Bayesian inference return multiple different trees, i.e. evolutionary hypotheses, which can be useful for some analyses.
Neighbor-joining can sometimes result in negative branch lengths. Note that this does not affect the topology of the tree, just branch lengths.

How to neighbor-join

To begin neighbor-joining, you need a distance matrix. A distance matrix is a square matrix containing pairwise distances between members of some group. It must be symmetric (i.e., the distance from A to B is the same as the distance from B to A) and the distance from an object to itself must be 0. The distance does not necessarily need to be metric, but in at least one instance a metric distance slightly outperformed a non-metric distance.
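If you'd rather check those two requirements with a few lines of code than by eye, here is a minimal sketch. The function name `is_valid_distance_matrix` and the tolerance `tol` are my own choices for illustration, and the matrix is the one used in Example 1 further down.

```python
def is_valid_distance_matrix(matrix, tol=1e-9):
    """Return True if `matrix` is square, symmetric, and has a zero diagonal."""
    n = len(matrix)
    # Square: every row has as many entries as there are rows.
    if any(len(row) != n for row in matrix):
        return False
    for i in range(n):
        # The distance from an object to itself must be 0.
        if abs(matrix[i][i]) > tol:
            return False
        for j in range(i + 1, n):
            # Symmetry: d(i, j) must equal d(j, i).
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    return True


example = [
    [0, 4, 5, 10],
    [4, 0, 7, 12],
    [5, 7, 0, 9],
    [10, 12, 9, 0],
]
print(is_valid_distance_matrix(example))  # True
```

Note that this only checks the two hard requirements; it does not test whether the distance is metric.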

Once you have a matrix, you can begin neighbor-joining.

The neighbor-joining process consists of three steps:

Initiation
Iteration
Termination

A quick note on the formulas (which can be found in the section below this one): You may notice a slight difference in the equations between this tutorial and another. Do not panic. These are only slight algebraic differences that do not affect the final answer, only the intermediate numbers.

Initiation

In the initiation step, we define a set of leaf nodes, T, and set L equal to the number of leaf nodes. These are the nodes at the “ends” of trees and therefore do not have any child nodes. You should have one leaf node for each item you want to compare. For example, if you are placing sequences on a tree, you will have one leaf node per sequence.

Iteration

The iteration step is where most of the action takes place. Virtually all of our calculations are made in this step, and, as the name implies, we will repeat these calculations over and over until some conclusion is reached.

First, we calculate the net divergence (r) of each leaf node. You can think of this as being essentially the distance from each leaf node to all of the others.

Next, we calculate the adjusted distance (D) between each pair of nodes, which is based on the pairwise distance in the starting matrix and the divergence of each node. The pair of nodes with the lowest adjusted distance are neighbors and share a parent node.

Next, we declare the parent node and calculate the distance from each of the neighbors to the shared parent. This is also the step where I like to add the siblings and parent to the tree.

At this point, our goal is to construct a new distance matrix. To do this, we remove the two nodes that we earlier determined to be neighbors from the distance matrix and replace them with the newly formed parent node. New pairwise distances (d) are calculated between the new parent node and other nodes in the matrix. Any other distances (i.e., pairwise comparisons present in the new matrix and the previous matrix) can simply be transferred to the new matrix.

Note: In the formulas and calculations below, adjusted distances use a capital D, whereas pairwise distances use a lowercase d. Try not to get them mixed up!

One thing to be aware of is that, after the first iteration, the neighbors are not restricted to being leaves, and may in fact be internal parent nodes.

Each iteration step ends with a new distance matrix that is one node smaller than the one in the previous step (e.g., (L-1) by (L-1) after the first iteration). Iteration continues until there are only two nodes remaining in the matrix.

Termination

The final step is termination.

The only task remaining is to join the two nodes that remain after iteration with a single edge to complete the tree!

Now that we’ve braved the written explanation, it’s time to look at some examples to make all of these steps clearer!

Formulas

These are the formulas for each of the calculations we will perform (you can find a more nicely formatted version in the Excel file containing the examples).

Net divergence

Net divergence r for a node i with 3 other nodes (j, k, and l):

r(i) = [1/(L-2)] * [d(ij) + d(ik) + d(il)]
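For readers who find code easier to follow than notation, here is a small sketch of the net divergence calculation. Storing the matrix as a dict of dicts and the function name `net_divergence` are assumptions I made for this illustration; the numbers are the starting matrix from Example 1.

```python
def net_divergence(dist):
    """Net divergence r(i): sum of distances from i to all others, over (L - 2).

    `dist` maps each node name to a dict of distances to every node.
    """
    n_nodes = len(dist)  # this is L in the formula
    return {
        i: sum(d for j, d in row.items() if j != i) / (n_nodes - 2)
        for i, row in dist.items()
    }


# Starting matrix from Example 1, written as a dict of dicts.
dist = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
r = net_divergence(dist)
print(r)  # {'A': 9.5, 'B': 11.5, 'C': 10.5, 'D': 15.5}
```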

Adjusted distance

Adjusted distance D for two nodes i and j:

D(ij) = d(ij) - [r(i) + r(j)]
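Here is a rough sketch of the same calculation in Python, using the pairwise distances and net divergences from the first iteration of Example 1. The variable names (`d`, `r`, `adjusted`, `neighbors`) are just my own labels. Ties are broken by taking the first pair in the list, which is the same convention used in the worked examples.

```python
from itertools import combinations

# Pairwise distances d and net divergences r for Example 1 (iteration 1).
d = {("A", "B"): 4, ("A", "C"): 5, ("A", "D"): 10,
     ("B", "C"): 7, ("B", "D"): 12, ("C", "D"): 9}
r = {"A": 9.5, "B": 11.5, "C": 10.5, "D": 15.5}

# Adjusted distance D(ij) = d(ij) - [r(i) + r(j)] for every pair of nodes.
adjusted = {
    (i, j): d[(i, j)] - (r[i] + r[j])
    for i, j in combinations(sorted(r), 2)
}

# min() returns the first minimal pair it meets, so ties go to the earliest pair.
neighbors = min(adjusted, key=adjusted.get)
print(adjusted[("A", "B")], adjusted[("C", "D")])  # -17.0 -17.0
print(neighbors)  # ('A', 'B')
```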

Distance from child to parent

Distance from child i to parent k, d(ik), where j is the neighbor of i:

d(ik) = [d(ij) + r(i) - r(j)] / 2

Distance from non-child to new node

Distance from any other (non-child) node m to the new node k, d(mk), where i and j are the children of k:

d(mk) = [d(im) + d(jm) - d(ij)] / 2
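The last two formulas together make up the "join" part of an iteration, so here is a sketch that does both at once. The function name `join_neighbors` and the dict-of-dicts layout are assumptions for illustration only. As in the worked examples, the child-to-parent distance subtracts the neighbor's net divergence: [d(ij) + r(i) - r(j)] / 2.

```python
def join_neighbors(dist, r, i, j, parent):
    """Replace neighbors i and j with `parent`; return (branches, new matrix)."""
    d_ij = dist[i][j]
    # Child-to-parent distances; note the minus sign on the neighbor's divergence.
    branches = {i: (d_ij + r[i] - r[j]) / 2, j: (d_ij + r[j] - r[i]) / 2}
    others = [m for m in dist if m not in (i, j)]
    # Distances between untouched nodes carry over unchanged.
    new = {m: {n: dist[m][n] for n in others} for m in others}
    new[parent] = {parent: 0}
    for m in others:
        # Distance from a non-child node m to the new parent node.
        d_mk = (dist[i][m] + dist[j][m] - d_ij) / 2
        new[parent][m] = d_mk
        new[m][parent] = d_mk
    return branches, new


# First iteration of Example 1: A and B are neighbors, and Z is their parent.
dist = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
r = {"A": 9.5, "B": 11.5, "C": 10.5, "D": 15.5}
branches, new = join_neighbors(dist, r, "A", "B", "Z")
print(branches)  # {'A': 1.0, 'B': 3.0}
print(new["Z"])  # {'Z': 0, 'C': 4.0, 'D': 9.0}
```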

Example 1

There’s a good chance that even if you read the description of neighbor-joining above, you still don’t have a great idea of how to do it. That should become clearer with some examples.

Here is our starting matrix:

	A	B	C	D
A	0	4	5	10
B	4	0	7	12
C	5	7	0	9
D	10	12	9	0

Step 1: Initiation

All we do here is define a set of leaf nodes, T, and set L equal to the number of leaf nodes.

T = { A, B, C, D }

L = 4

Step 2: Iteration

Now for the real action. Remember, this will consist of multiple iterations.

Iteration 1

First, we calculate the net divergence r of each node:

r(A) = [1/(L-2)] * [d(AB) + d(AC) + d(AD)] = (1/2) * (4 + 5 + 10) = 9.5

r(B) = [1/(L-2)] * [d(AB) + d(BC) + d(BD)] = (1/2) * (4 + 7 + 12) = 11.5

r(C) = [1/(L-2)] * [d(AC) + d(BC) + d(CD)] = (1/2) * (5 + 7 + 9) = 10.5

r(D) = [1/(L-2)] * [d(AD) + d(BD) + d(CD)] = (1/2) * (10 + 12 + 9) = 15.5

Next, the adjusted distance D for each node pair:

D(AB) = d(AB) - [r(A) + r(B)] = 4 - (9.5 + 11.5) = -17

D(AC) = d(AC) - [r(A) + r(C)] = 5 - (9.5 + 10.5) = -15

D(AD) = d(AD) - [r(A) + r(D)] = 10 - (9.5 + 15.5) = -15

D(BC) = d(BC) - [r(B) + r(C)] = 7 - (11.5 + 10.5) = -15

D(BD) = d(BD) - [r(B) + r(D)] = 12 - (11.5 + 15.5) = -15

D(CD) = d(CD) - [r(C) + r(D)] = 9 - (10.5 + 15.5) = -17

The pair of nodes with the smallest adjusted distance are neighbors. In this case, we have a tie between the pairs AB and CD. We can only move forward with one pair, so we’ll pick AB. We now define a new node that connects these neighbors; we’ll call this new node Z.

We’re close now to constructing our first bit of the tree. To do that, we need to calculate the distance from each neighbor (child) node to the connecting (parent) node.

d(AZ) = [d(AB) + r(A) - r(B)]/2 = (4 + 9.5 - 11.5)/2 = 1

d(BZ) = [d(AB) + r(B) - r(A)]/2 = (4 + 11.5 - 9.5)/2 = 3

With this information, we can draw the first two branches on our tree:

Example 1 tree first iteration

Lastly, we need to reconstruct the distance matrix, replacing A and B with Z. Some distances can be transferred, but others (represented by question marks), need to be calculated:

	Z	C	D
Z	0	?	?
C	?	0	9
D	?	9	0

Here are the formulas for calculating d(ZC) and d(ZD).

d(ZC) = [d(AC) + d(BC) - d(AB)]/2 = (5 + 7 - 4)/2 = 4

d(ZD) = [d(AD) + d(BD) - d(AB)]/2 = (10 + 12 - 4)/2 = 9

With these calculations done, we can replace the question marks in our distance matrix:

	Z	C	D
Z	0	4	9
C	4	0	9
D	9	9	0

And we’re done…with the first iteration. Remember, the iteration step ends when there are only two nodes left in the matrix, and we have three. On to the next iteration!

Iteration 2

For this iteration, we use the latest version of the distance matrix, constructed at the end of the previous iteration, and reset L (the number of nodes in the matrix).

L = 3

Calculate the net divergence r of each node:

r(Z) = [1/(L-2)] * [d(ZC) + d(ZD)] = 1 * (4 + 9) = 13

r(C) = [1/(L-2)] * [d(ZC) + d(CD)] = 1 * (4 + 9) = 13

r(D) = [1/(L-2)] * [d(ZD) + d(CD)] = 1 * (9 + 9) = 18

Next, the adjusted distance D for each node pair:

D(ZC) = d(ZC) - [r(Z) + r(C)] = 4 - (13 + 13) = -22

D(ZD) = d(ZD) - [r(Z) + r(D)] = 9 - (13 + 18) = -22

D(CD) = d(CD) - [r(C) + r(D)] = 9 - (13 + 18) = -22

All of the pairs are tied for lowest adjusted distance, so we’ll select ZC because it’s first in the list and define a new node Y that connects the neighbors.

Calculate the distances from the new parent node to its children:

d(ZY) = [d(ZC) + r(Z) - r(C)]/2 = (4 + 13 - 13)/2 = 2

d(CY) = [d(ZC) + r(C) - r(Z)]/2 = (4 + 13 - 13)/2 = 2

Add the new branches to the tree:

Example 1 tree second iteration

Calculate any other new distances and construct the new distance matrix:

d(YD) = [d(ZD) + d(CD) - d(ZC)]/2 = (9 + 9 - 4)/2 = 7

	Y	D
Y	0	7
D	7	0

Step 3: Termination

The matrix now contains only 2 nodes (Y and D), so L = 2 and we add the edge between them to finish the tree:

Example 1 tree termination

Summary

And with that, we’ve built our first neighbor-joining tree! Here is the tree coming together in each step:

Example 1 tree step-by-step
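If you want to double-check your hand calculations, the whole procedure fits in a short script. This is only a sketch under a few assumptions of mine: the matrix is stored as a dict of dicts, the caller supplies the names for new parent nodes, and ties are broken by taking the first pair in matrix order (the same choice made by hand above). It is not meant to stand in for a real phylogenetics package.

```python
from itertools import combinations


def neighbor_joining(dist, parent_names):
    """Run neighbor-joining; return a list of (node, node, branch length) edges."""
    dist = {i: dict(row) for i, row in dist.items()}  # work on a copy
    parent_names = iter(parent_names)
    edges = []
    while len(dist) > 2:
        n_nodes = len(dist)
        # Net divergence of every node (the diagonal contributes 0 to the sum).
        r = {i: sum(row.values()) / (n_nodes - 2) for i, row in dist.items()}
        # Lowest adjusted distance wins; ties go to the first pair in matrix order.
        i, j = min(combinations(dist, 2),
                   key=lambda p: dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = next(parent_names)
        d_ij = dist[i][j]
        edges.append((i, k, (d_ij + r[i] - r[j]) / 2))
        edges.append((j, k, (d_ij + r[j] - r[i]) / 2))
        others = [m for m in dist if m not in (i, j)]
        new_row = {m: (dist[i][m] + dist[j][m] - d_ij) / 2 for m in others}
        # The new parent goes first so later tie-breaks match the hand-worked order.
        new_dist = {k: {k: 0, **new_row}}
        for m in others:
            new_dist[m] = {k: new_row[m], **{n: dist[m][n] for n in others}}
        dist = new_dist
    # Termination: join the last two nodes with a single edge.
    a, b = dist
    edges.append((a, b, dist[a][b]))
    return edges


example_1 = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
edges = neighbor_joining(example_1, ["Z", "Y"])
for edge in edges:
    print(edge)
```

Run on the Example 1 matrix, this prints the same five branches we drew by hand: A to Z (1), B to Z (3), Z to Y (2), C to Y (2), and Y to D (7).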

On Distance Matrices

Now, you may have noticed that to build the tree in Example 1, we didn’t actually need all of those formulas. In iteration 1, for example, we can figure out the distance from A and B to their parent just by noticing that B is always 2 units further from other nodes than A. Therefore, d(BZ) must equal d(AZ) + 2. If their combined distance from Z is 4, then the only possible branch lengths are 1 and 3.

So, why did we go through the trouble of neighbor-joining? And when do we actually need neighbor-joining?

Additive matrices

The distance matrix that we used for example 1 is what’s called an additive matrix. Simply put, a matrix is additive if you are able to reproduce the starting matrix by adding together the branch lengths along the paths between nodes. To demonstrate this, let’s look back at example 1.

Reconstruct the example 1 distance matrix from the tree

In the figure above, I’ve deconstructed the tree so that you can see the individual paths between each pair of leaf nodes. Notice that we can reconstruct the starting matrix exactly using only the distances on the tree, which is the main trait of an additive matrix (for a more technical and thorough look at additive matrices, see this presentation).
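The same check can be sketched in code: add up the branch lengths along the path between every pair of leaves and compare the result to the starting matrix. The helper name `tree_distances` and the edge-list format are my own choices for this illustration; the branch lengths are the ones we calculated in Example 1.

```python
def tree_distances(edges, leaves):
    """Sum branch lengths along the unique path between every pair of leaves."""
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, {})[b] = length
        adjacency.setdefault(b, {})[a] = length

    def distances_from(start):
        # Depth-first walk; a tree has exactly one path to each node.
        seen = {start: 0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor, length in adjacency[node].items():
                if neighbor not in seen:
                    seen[neighbor] = seen[node] + length
                    stack.append(neighbor)
        return seen

    result = {}
    for a in leaves:
        from_a = distances_from(a)
        result[a] = {b: from_a[b] for b in leaves}
    return result


# Branches of the Example 1 tree and the matrix we started from.
edges = [("A", "Z", 1), ("B", "Z", 3), ("Z", "Y", 2), ("C", "Y", 2), ("Y", "D", 7)]
start = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
rebuilt = tree_distances(edges, ["A", "B", "C", "D"])
print(rebuilt == start)  # True, so the matrix is additive
```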

I like to use an additive matrix as the first neighbor-joining example because, 1) it gives me an excuse to discuss additive matrices, and 2) it’s very easy to check your work. If you are unable to reconstruct the starting matrix in example 1 using the tree, you know you have a problem in your calculations, which is harder to catch with non-additive matrices.

Alright, so if we don’t need neighbor-joining for additive distance matrices, then when do we need it? Neighbor-joining is said to work best for near-additive matrices, i.e., matrices for which the tree almost reconstructs the starting matrix, though it has been reported to be topologically accurate even when this is not the case. And I should note here that the vast majority of distance matrices based on biological data are not additive or even nearly additive.

Without further ado, here is another example using a nearly-additive matrix.

Example 2

Here is our starting matrix:

	A	B	C	D
A	0	2	2	2
B	2	0	3	2
C	2	3	0	2
D	2	2	2	0

Step 1: Initiation

Again, we define T and L. They are the same as example 1.

T = { A, B, C, D }

L = 4

Step 2: Iteration

Iteration 1

First, we calculate the net divergence r of each node:

r(A) = [1/(L-2)] * [d(AB) + d(AC) + d(AD)] = (1/2) * (2 + 2 + 2) = 3

r(B) = [1/(L-2)] * [d(AB) + d(BC) + d(BD)] = (1/2) * (2 + 3 + 2) = 3.5

r(C) = [1/(L-2)] * [d(AC) + d(BC) + d(CD)] = (1/2) * (2 + 3 + 2) = 3.5

r(D) = [1/(L-2)] * [d(AD) + d(BD) + d(CD)] = (1/2) * (2 + 2 + 2) = 3

Next, the adjusted distance D for each node pair:

D(AB) = d(AB) - [r(A) + r(B)] = 2 - (3 + 3.5) = -4.5

D(AC) = d(AC) - [r(A) + r(C)] = 2 - (3 + 3.5) = -4.5

D(AD) = d(AD) - [r(A) + r(D)] = 2 - (3 + 3) = -4

D(BC) = d(BC) - [r(B) + r(C)] = 3 - (3.5 + 3.5) = -4

D(BD) = d(BD) - [r(B) + r(D)] = 2 - (3.5 + 3) = -4.5

D(CD) = d(CD) - [r(C) + r(D)] = 2 - (3.5 + 3) = -4.5

A lot of ties here. Again, we’ll pick the tied pair that is closest to the top of the list, AB, and assign them a parent node, Z.

Now, calculate the distance from each neighbor (child) node to the connecting (parent) node.

d(AZ) = [d(AB) + r(A) - r(B)]/2 = (2 + 3 - 3.5)/2 = 0.75

d(BZ) = [d(AB) + r(B) - r(A)]/2 = (2 + 3.5 - 3)/2 = 1.25

And draw the first two branches on our tree:

Example 2 tree first iteration

Lastly, we calculate new distances and reconstruct the distance matrix:

d(ZC) = [d(AC) + d(BC) - d(AB)]/2 = (2 + 3 - 2)/2 = 1.5

d(ZD) = [d(AD) + d(BD) - d(AB)]/2 = (2 + 2 - 2)/2 = 1

	Z	C	D
Z	0	1.5	1
C	1.5	0	2
D	1	2	0

On to the next iteration!

Iteration 2

For this iteration, we use the latest version of the distance matrix, constructed at the end of the previous iteration, and reset L.

L = 3

Calculate the net divergence r of each node:

r(Z) = [1/(L-2)] * [d(ZC) + d(ZD)] = 1 * (1.5 + 1) = 2.5

r(C) = [1/(L-2)] * [d(ZC) + d(CD)] = 1 * (1.5 + 2) = 3.5

r(D) = [1/(L-2)] * [d(ZD) + d(CD)] = 1 * (1 + 2) = 3

Next, the adjusted distance D for each node pair:

D(ZC) = d(ZC) - [r(Z) + r(C)] = 1.5 - (2.5 + 3.5) = -4.5

D(ZD) = d(ZD) - [r(Z) + r(D)] = 1 - (2.5 + 3) = -4.5

D(CD) = d(CD) - [r(C) + r(D)] = 2 - (3.5 + 3) = -4.5

All of the pairs are tied for lowest adjusted distance, so we’ll select ZC because it’s first in the list and define a new node Y that connects the neighbors.

Calculate the distances from the new parent node to its children:

d(ZY) = [d(ZC) + r(Z) - r(C)]/2 = (1.5 + 2.5 - 3.5)/2 = 0.25

d(CY) = [d(ZC) + r(C) - r(Z)]/2 = (1.5 + 3.5 - 2.5)/2 = 1.25

Add the new branches to the tree:

Example 2 tree second iteration

Calculate any other new distances and construct the new distance matrix:

d(YD) = [d(ZD) + d(CD) - d(ZC)]/2 = (1 + 2 - 1.5)/2 = 0.75

	Y	D
Y	0	0.75
D	0.75	0

Step 3: Termination

The matrix now contains only 2 nodes (Y and D), so L = 2 and we add the edge between them to finish the tree:

Example 2 tree termination

Summary

Here is our second tree in completion:

Example 2 tree step-by-step

Lastly, let’s make a distance matrix using the tree to provide the distances. Notice that these distances are just a little bit off from the starting matrix. Hence, “near-additive”.

	A	B	C	D
A	0	2	2.25	1.75
B	2	0	2.75	2.25
C	2.25	2.75	0	2
D	1.75	2.25	2	0
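To put a number on "a little bit off", here is a quick sketch that subtracts the starting matrix from the tree-derived one for each pair of leaves. The variable names are my own; both matrices are copied from this example.

```python
leaves = ["A", "B", "C", "D"]
start = [
    [0, 2, 2, 2],
    [2, 0, 3, 2],
    [2, 3, 0, 2],
    [2, 2, 2, 0],
]
from_tree = [
    [0, 2, 2.25, 1.75],
    [2, 0, 2.75, 2.25],
    [2.25, 2.75, 0, 2],
    [1.75, 2.25, 2, 0],
]

# Difference (tree minus start) for each unordered pair of leaves.
residuals = {
    leaves[i] + leaves[j]: from_tree[i][j] - start[i][j]
    for i in range(len(leaves))
    for j in range(i + 1, len(leaves))
}
print(residuals)
# {'AB': 0, 'AC': 0.25, 'AD': -0.25, 'BC': -0.25, 'BD': 0.25, 'CD': 0}
print(max(abs(v) for v in residuals.values()))  # 0.25
```

No pair is off by more than 0.25, and two of the six pairs are reproduced exactly.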

Wrapping up

Having reached the end of this lesson, you should have learned how to construct neighbor-joining trees by hand from additive and nearly additive matrices. If you want to take a closer look at the examples (and access one additional example), you can check out this Excel file.

Generating Python bindings for OCaml with pyml_bindgen

2022-04-12T00:00:00+00:00

pyml_bindgen is a command line app that generates Python bindings via pyml directly from OCaml value specifications. While you could write pyml bindings by hand, it can get repetitive, especially if you are binding a decent sized Python library.

In this post, I will introduce pyml_bindgen and go through a couple of common tasks.

Install
A simple example
Controlling the bindings
Binding cyclic Python classes
Other stuff
Wrap-up

Install

To get started with pyml_bindgen, you will need to install it. It is available on opam (opam install pyml_bindgen).

A simple example

Let’s start with a simple example.

Python code

Here is the Python class that we want to bind (hobbit.py).

class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f'Hobbit -- {self.name}, {self.age}'

As you see, it’s pretty simple! It’s just the __init__ method to create the class and the __str__ method for converting it to a string with the Python str or print functions.

Here’s an example of using it in Python.

from hobbit import Hobbit
bilbo = Hobbit('Bilbo', 111)
print(bilbo)
#=> Hobbit -- Bilbo, 111

Write value specifications

To bind Python classes with pyml_bindgen, you first need to write value specifications to define the OCaml interface for the Python code we are binding.

To start, we will keep the functions and argument names the same.

val __init__ : name:string -> age:int -> unit -> t
val __str__ : t -> unit -> string
val name : t -> string
val age : t -> int

There are a couple of things to call your attention to here:

I haven’t defined type t anywhere yet. Depending on the command line arguments you pass to pyml_bindgen, it will take care of this for you.
For the __init__ function, I have used all named arguments plus the unit argument. The unit argument tells pyml_bindgen that you are binding a normal Python method or function call (as opposed to a Python attribute or property).
The __str__ function takes t as the first argument. Value specifications that start with t will bind to object method calls on the Python side.
name and age both take t as the first and only argument. If a value specification takes t and nothing else, it binds to the Python attribute of that name.

Save the above in a file called hobbit.txt.

Generate bindings

Now, we’re ready to generate the OCaml bindings.

Here’s how you would run pyml_bindgen for this example.

$ pyml_bindgen hobbit.txt hobbit Hobbit \
  --of-pyo-ret-type no_check \
  > hobbit.ml

Let’s unpack that.

The first three arguments are the path to the OCaml value specifications, the name of the Python module we are binding, and the Python class name.
- Since we named the Python file hobbit.py, its module name is hobbit.
- Depending on the directory structure you’re using, this may change.
--of-pyo-ret-type specifies the return type for functions that generate Python objects.
- Using no_check means the generated functions will assume the Python object is the correct type.
- You can also use option or or_error.
The output is redirected to a file called hobbit.ml. Thus, our generated code will be in a module called Hobbit.
We did not tell pyml_bindgen that it should generate a full module with a signature, so it will just write the implementation.
- In this example it is fine, but you will often want to generate the module and signature, so that your types will be abstract.
- For example, you could use --caml-module Hobbit --split-caml-module to generate both an ml and mli file.
If you look at the generated code, it will be kind of messy. I usually run the output through ocamlformat if I need to edit the output, or check the generated code into version control or something like that.

Test it out

Now we can make a program to test it out. Don’t forget to call initialize before running the rest of your code!

let () = Py.initialize ()

let bilbo = Hobbit.__init__ ~name:"Bilbo" ~age:111 ()

let () =
  assert ("Hobbit -- Bilbo, 111" = Hobbit.__str__ bilbo ());
  assert ("Bilbo" = Hobbit.name bilbo);
  assert (111 = Hobbit.age bilbo)

Since we didn’t generate a signature to go with our implementation, the type of the value returned by Hobbit.__init__ will be Pytypes.pyobject. In this way, we can pass any pyobject to the Hobbit.__str__ function. Let’s see.

let x = Py.Int.of_int 1234

let () = print_endline @@ Hobbit.__str__ x ()

If you run that, it will print 1234. Huh? Well, if you look at the generated code for the Hobbit.__str__ function, it looks something like this:

let __str__ t () =
  let callable = Py.Object.find_attr_string t "__str__" in
  let kwargs = filter_opt [] in
  Py.String.to_string
  @@ Py.Callable.to_function_with_keywords callable [||] kwargs

Without going into too much detail, essentially all it is doing is calling the __str__ method on the Python object passed in. While this is fine on the Python side, it doesn’t work the way we might want it to on the OCaml side. Ideally, we only want the Hobbit module functions to work on values of type Hobbit.t.

Generating abstract types

If we were writing the bindings by hand, we would make Hobbit.t abstract. With pyml_bindgen, we can do that using the --caml-module option.

$ pyml_bindgen hobbit.txt hobbit Hobbit \
  --of-pyo-ret-type no_check \
  --caml-module Hobbit \
  --split-caml-module . \
  > hobbit.ml

Notice that I also used --split-caml-module . which tells pyml_bindgen to split the implementation and signature into separate ml and mli files, and to put the output in the directory in which the command is run. You can pass in whatever directory you want to this option.

Now if we tried something like this:

let x = Py.Int.of_int 1234

let () = print_endline @@ Hobbit.__str__ x ()

It would be a compile-time error.

Controlling the bindings

Let’s clean up this example a little bit.

Using different function names

While __init__ and __str__ are fine for OCaml function names, they don’t feel quite right. pyml_bindgen lets you bind Python functions to different names on the OCaml side using attributes on the value specifications. To bind to a different function name, we use the py_fun_name attribute. Check it out.

val create : name:string -> age:int -> unit -> t
[@@py_fun_name __init__]

val to_string : t -> unit -> string
[@@py_fun_name __str__]

We bind the __init__ function to an OCaml function called create, and the Python function __str__ to the OCaml function to_string. That’s much more natural!

As you can see, the syntax is like this: [@@attr-id attr-payload]. In this case, the attribute id is py_fun_name and the payload is the name of the Python function that we want to bind. Put another way, the attribute payload should be the name of the function as it is defined in the Python library you are binding to (i.e., __init__ is the name of the function on the Python side, not create).

Putting it together, you get [@@py_fun_name __init__] for the Python __init__ function and [@@py_fun_name __str__] for the Python __str__ function.

Using different argument names

The other available attribute is py_arg_name. With this, we can bind arguments to different names on the OCaml and Python sides. This can be useful in situations in which Python argument names use reserved OCaml keywords, or simply to make the generated API feel more natural for use in OCaml.

For example, you may have a Python function that has an argument named method.

def cluster(method='ward'):
    ...

Since method is a reserved keyword in OCaml, we can’t use it directly. Instead, we want to name it method_ in our OCaml code.

val cluster : method_:string -> ...
[@@py_arg_name method_ method]

In this case, the payload is two items: the first is the argument name on the OCaml side, and the second is the argument name on the Python side.

Note that in cases in which you need multiple attributes per specification, they must be placed one per line. (This is a pyml_bindgen specific restriction.) E.g., something like this:

val run_clustering : method_:string -> ...
[@@py_fun_name cluster]
[@@py_arg_name method_ method]

This will bind the OCaml function run_clustering to the corresponding Python function cluster.

Binding cyclic Python classes

Often you will need to bind Python classes that refer to each other. One way to bind these is to use recursive modules. Let’s update our Hobbit example to show how you can do this in pyml_bindgen.

class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.house = None

    def __str__(self):
        return f'Hobbit -- {self.name}, age: {self.age}, house: {self.house.name}'

    def buy_house(self, house):
        self.house = house
        self.house.owner = self

class House:
    def __init__(self, name):
        self.name = name
        self.owner = None

    def __str__(self):
        return f'House -- {self.name}, owner: {self.owner.name}'

So this is a pretty silly example, but it’s just to illustrate the point. In this case, a Hobbit can own a House and a House can have a Hobbit for an owner.
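Before binding anything, it can help to see the cycle on the Python side. Here is a short sketch that repeats the two classes from above so it runs on its own; the variable names `bilbo` and `bag_end` are just for illustration.

```python
class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.house = None

    def __str__(self):
        return f'Hobbit -- {self.name}, age: {self.age}, house: {self.house.name}'

    def buy_house(self, house):
        # Link both directions: the hobbit gets a house, the house gets an owner.
        self.house = house
        self.house.owner = self


class House:
    def __init__(self, name):
        self.name = name
        self.owner = None

    def __str__(self):
        return f'House -- {self.name}, owner: {self.owner.name}'


bilbo = Hobbit('Bilbo', 111)
bag_end = House('Bag End')
bilbo.buy_house(bag_end)
print(bilbo)    #=> Hobbit -- Bilbo, age: 111, house: Bag End
print(bag_end)  #=> House -- Bag End, owner: Bilbo
```

One thing to watch for: both __str__ methods assume the link has already been made. Printing a Hobbit before buy_house has been called raises an AttributeError, because self.house is still None.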

To bind these classes, I will use the gen_multi and combine_rec_modules helper programs that come with pyml_bindgen.

gen_multi

gen_multi is a wrapper script that runs pyml_bindgen multiple times to generate multiple OCaml modules in one go. It takes a tsv file specifying the same set of options that you would pass in to pyml_bindgen if you used it directly.

Assume this is in a file called gen_multi_cli.tsv.

signatures	py_module	py_class	associated_with	caml_module	split_caml_module	embed_python_source	of_pyo_ret_type
hobbit.txt	hobbit	Hobbit	class	Hobbit	NA	hobbit.py	no_check
house.txt	house	House	class	House	NA	house.py	no_check

The order of the columns must be as shown above. (For more info on each of these options, run pyml_bindgen --help.)

You will see that we refer to hobbit.txt and house.txt. These are the value specifications for each of the Python classes. Here are their contents.

hobbit.txt

val create : name:string -> age:int -> unit -> t
[@@py_fun_name __init__]

val to_string : t -> unit -> string
[@@py_fun_name __str__]

val buy_house : t -> house:House.t -> unit -> unit

house.txt

val create : name:string -> unit -> t
[@@py_fun_name __init__]

val to_string : t -> unit -> string
[@@py_fun_name __str__]

combine_rec_modules

combine_rec_modules takes a file of OCaml modules and “converts” them into recursive modules. It does this using a simple text transformation.

Often you will want to pipe the output of gen_multi directly into combine_rec_modules.

Generate the modules & test it out

Now let’s see it in action.

$ gen_multi gen_multi_cli.tsv | combine_rec_modules /dev/stdin > lib.ml

We put that in a module called Lib. And here is how we might use that.

open Lib

let () = Py.initialize ()

let bilbo = Hobbit.create ~name:"Bilbo" ~age:111 ()

let bag_end = House.create ~name:"Bag End" ()

let () = Hobbit.buy_house bilbo ~house:bag_end ()

let () =
  assert (
    "Hobbit -- Bilbo, age: 111, house: Bag End" = Hobbit.to_string bilbo ())

Other stuff

Let me mention a couple of other things before we go…

In this post we ran pyml_bindgen (or its helper scripts) manually, but it’s not too hard to set up Dune rules to automatically generate bindings. See the dune files in the example directory on the pyml_bindgen GitHub for more information.
While I only showed how to bind to Python classes, you can also bind to functions associated with modules rather than with classes.
Another cool feature is that you can embed Python source code directly into your generated OCaml modules. See here for more details.

Wrap-up

pyml_bindgen is a command line app for generating Python bindings using pyml. It makes incorporating Python libraries into your OCaml projects as easy as writing regular OCaml value specifications.

To get more information on setting up and using pyml_bindgen, including ideas on how to structure your projects, check out the examples, tests, and docs.

An introduction to the re2 regular expression library for OCaml

2021-10-02T00:00:00+00:00

In this tutorial, we will talk about re2, an OCaml library providing bindings to RE2, Google’s regular expression library.

This post is intended for newer OCaml programmers, or those who want to use the re2 library, but could use a couple of examples to help get started. This is not a general introduction to regular expressions, however. If you have never used regular expressions before, read up a little bit on the syntax before tackling this post.

Overview
Creating regular expressions
Checking for a match
Finding matches
Finding submatches
Splitting strings
Replacing
Miscellaneous info
Wrap up

Overview

There are a few choices for regular expression libraries available for OCaml on Opam. Some of the most popular include

re, a pure OCaml library (installed 7667 times last month),
pcre, bindings to the Perl Compatible Regular Expressions library (PCRE) (installed 1115 times last month), and
re2, OCaml bindings for RE2, Google’s regular expression library (installed 114 times last month).

The first two are by far the most popular in terms of raw Opam install counts. However, re2 integrates nicely into the Jane Street Base/Core/Async ecosystem (it’s a Jane Street package after all!), and is covered under the MIT license rather than the LGPL with OCaml linking exception, which may be appealing depending on your situation.

Note: According to this blog post and this GitHub issue, Jane Street is phasing out its use of re2. The re2 GitHub does have recent commits, though, so your mileage may vary.

One issue that newcomers may face when getting started with the re2 library is the slightly terse API documentation. While it is detailed and thorough, it can be hard to get started with if you’re not already used to reading Jane Street mli files and source code.

Note: if you want to follow along, you can paste the examples into the toplevel (or utop). However, don’t paste in lines starting with - :. These lines show the type of the expression as reported by utop.

Creating regular expressions

You create regular expressions with Re2.create and Re2.create_exn. The former returns Re2.t Or_error.t and the latter Re2.t.

let re = Or_error.ok_exn @@ Re2.create "apple";;
let re = Re2.create_exn "apple";;

Matching options

You can control how regular expression matching works by passing the options argument to the create and create_exn functions. If you omit this argument, the default options will be passed. Here they are:

Re2.Options.default;;
- : Re2.Options.t =
{
  Re2.Options.case_sensitive = true;
  dot_nl = false;
  encoding = Re2.Options.Encoding.Utf8;
  literal = false;
  log_errors = false;
  longest_match = false;
  max_mem = 8388608;
  never_capture = false;
  never_nl = false;
  one_line = false;
  perl_classes = false;
  posix_syntax = false;
  word_boundary = false;
}

For a more detailed description of these options, see the re2.h header file.

By default, re2 uses case-sensitive matching. To create a case-insensitive regex, pass in an options record like so.

let re_i =
  let options = { Re2.Options.default with case_sensitive = false } in
  Re2.create_exn ~options "abc"

Checking for a match

Perhaps the most basic regex task is to check if a string matches a given regular expression. You can use Re2.matches for this.

(* Case sensitive *)
let re = Re2.create_exn "apple" in
assert (Re2.matches re "apple pie");
assert (not (Re2.matches re "Apple pie"));;

(* Case insensitive *)
let re =
  let options = { Re2.Options.default with case_sensitive = false } in
  Re2.create_exn ~options "apple" 
in
assert (Re2.matches re "apple pie");
assert (Re2.matches re "Apple pie");;

Finding matches

To find all matches of a regular expression in a string, you can use the find_* functions.

Find first match

To return the first match in the query string, use find_first or find_first_exn. These functions return the matched string rather than the underlying Re2.Match.t.

let re = Re2.create_exn "apple" in
Re2.find_first_exn re "apple pie is made from apples";;
- : string = "apple"

let re = Re2.create_exn "[ab]{2}" in
Re2.find_first_exn re "ababa";;
- : string = "ab"

Find all matches

While find_first returns the first match in a query string, find_all and find_all_exn return lists of all non-overlapping matches in the query string.

let re = Re2.create_exn "apple" in
Re2.find_all re "apple pie";;
- : string list Or_error.t = Result.Ok ["apple"]

let re = Re2.create_exn "apple" in
Re2.find_all_exn re "apple pie is made from apples";;
- : string list = ["apple"; "apple"]

Submatches and capturing groups

You can use the sub argument to return submatches defined by capturing groups rather than the whole match.

let re = Re2.create_exn "a([bc])" in
let s = "ab ac ab" in
Re2.find_all_exn ~sub:(`Index 1) re s;;
- : string list = ["b"; "c"; "b"]

Be aware that passing an index greater than the number of capturing groups will raise an error.

let re = Re2.create_exn "a([bc])" in
let s = "ab ac ab" in
Re2.find_all_exn ~sub:(`Index 10) re s;;
Exception: Re2__Regex.Exceptions.Regex_no_such_subpattern(10, 2).

Or_error returning vs. Exception raising

Like most of the functions in the Re2 module, the find functions come in both Or_error.t returning and exception raising versions. If the regular expression doesn’t match, find_all returns a Result.Error whereas find_all_exn raises an exception.

let re = Re2.create_exn "apple" in
Re2.find_all re "peach pie";;
- : string list Or_error.t =
Result.Error
 ("Re2__Regex.Exceptions.Regex_match_failed(\"apple\")")

let re = Re2.create_exn "apple" in
Re2.find_all_exn re "peach pie";;
Exception: Re2__Regex.Exceptions.Regex_match_failed("apple").
(* ...output omitted... *)

It is important to remember that the find_all functions return non-overlapping matches.

let re = Re2.create_exn "[ab]{2}" in
Re2.find_all_exn re "ababa";;
- : string list = ["ab"; "ab"]

Finding submatches

If you need a bit more control than provided by find_all with the sub argument (e.g., find_all ~sub:(`Index 1)), then you may need to use find_submatches or find_submatches_exn. These return the first match in the query string. The match is returned as a string option array, where the first element is the whole match, and subsequent elements are submatches as defined by any capturing groups.

let re = Re2.create_exn "a([bc])([de])" in
Re2.find_submatches_exn re "abdace";;
- : string option array = [|Some "abd"; Some "b"; Some "d"|]

You may wonder why find_submatches_exn returns a string option array and not simply a string array. find_submatches_exn uses Match.get under-the-hood. Basically, find_submatches_exn processes a Match.t Sequence.t of matches, calling get on each one. And the Match.get function returns a string option.

This little code snippet will hopefully give you an idea of what’s going on.

let re = Re2.create_exn "a([bc])([de])" in
let match_ = Re2.first_match_exn re "abdace" in
[|
  Re2.Match.get match_ ~sub:(`Index 0);
  Re2.Match.get match_ ~sub:(`Index 1);
  Re2.Match.get match_ ~sub:(`Index 2);
  Re2.Match.get match_ ~sub:(`Index 3);
|]
;;
- : string option array = [| Some "abd"; Some "b"; Some "d"; None |]

If the Index you pass to ~sub is greater than the number of capturing groups (that is, at least the number returned from Re2.num_submatches, which is the number of capturing groups plus one), None is returned.

More complicated submatch interface

If you want to work with the Re2.Match.t directly, you can use functions from the complicated interface like first_match and get_matches.

If you need to work with submatches of every match in a string rather than just the first, and you need direct access to the Match.t, you will want to use get_matches or get_matches_exn. Let’s try it out with a weird, little example.

Say we have a string made up of chunks. Each chunk is a number followed by an A (for add) or an S (for subtract) (e.g., 50A and 3S). The chunk describes an arithmetic operation: 12A means add 12 to the previous total; 3S means subtract 3 from the previous total.

A full string then might look something like this: 10A5S2S3A, which represents the following sequence of operations: 0 + 10 - 5 - 2 + 3.

One way to solve this little problem is to use regexes and the get_matches function. Let’s see how it might go.

let total =
  let s = "10A5S2S3A" in
  (* Make the regex *)
  let re = Re2.create_exn "([0-9]*)([AS])" in
  (* Get a Match.t list *)
  let matches = Re2.get_matches_exn re s in
  (* Fold over the matches to get the total. *)
  List.fold matches ~init:0 ~f:(fun total m ->
      (* The first capturing group is the "count". *)
      let number = Int.of_string @@ Re2.Match.get_exn m ~sub:(`Index 1) in
      (* The second capturing group represents the operation. *)
      match Re2.Match.get_exn m ~sub:(`Index 2) with
      | "A" -> total + number
      | "S" -> total - number
      | _ -> assert false)
;;

assert (total = 0 + 10 - 5 - 2 + 3);;

Note: This weird format is actually loosely based on the CIGAR strings found in SAM files describing biological sequence alignments.

Controlling submatches

In the last two examples, we used the sub argument along with a polymorphic variant to select capture groups. Let’s take a closer look at the type used for that.

To select submatches, we use id_t, which looks like this:

type id_t = [ `Index of int | `Name of string ]

This type is used to refer to submatches. E.g., `Index 1 would be the first capturing group, `Index 2 the second, etc. Remember that `Index 0 refers to the whole match.

In addition to referring to submatches/capturing groups by index, you can refer to them by name.

let re = Re2.create_exn "a(?P<second_letter>[bc])" in
let m = Re2.first_match_exn re "abc" in
let x = Re2.Match.get_exn m ~sub:(`Name "second_letter") in
let y = Re2.Match.get_exn m ~sub:(`Index 1) in
assert String.(x = y);;

When using a complicated regular expression with multiple capturing groups, it is often less error prone to use named submatches rather than numbered ones.

Note: It is not a compile error to try to access a capturing group that doesn’t exist in the regular expression. Depending on the function, you may get None or raise an exception.

Using `id_t` to control match efficiency

Many of the regex matching functions take a ?sub:id_t argument.

In some cases, you can increase the efficiency of matching by restricting the number of submatches. If you only care about whether a pattern matches, and not about submatches, you could pass in ~sub:(`Index (-1)) to many of the above functions.

You can get increasingly more information by increasing the n passed to Index.

(* Get only the whole match. *)
~sub:(`Index 0)

(* Get the whole match and first submatch. *)
~sub:(`Index 1)

This section of the documentation has more info on how specifying the sub argument can have an impact on regex performance, and which functions are affected by its usage.

Splitting strings

Another common regex task is splitting an input string based on a regular expression pattern. Re2 provides the split function for this purpose.

let re = Re2.create_exn "[.,! ]+" in
Re2.split re "Hello, world! I like pie.";;
- : string list = ["Hello"; "world"; "I"; "like"; "pie"; ""]

If you need to include the actual matches in the output, you can. Passing ~include_matches:true ensures the “separators” are in there with the rest of the output.

let re = Re2.create_exn "[.,! ]+" in
Re2.split ~include_matches:true re "Hello, world! I like pie.";;
- : string list =
["Hello"; ", "; "world"; "! "; "I"; " "; "like"; " "; "pie"; "."; ""]

Just be aware of that final empty string at the end!

You can also limit the number of matches with the max argument. You could use this to get the first value separated from the remaining values in a string of tab-separated values, for example.

let re = Re2.create_exn "\t" in
Re2.split ~max:1 re "apple\tpie\tis\tgood";;
- : string list = ["apple"; "pie\tis\tgood"]

If the regular expression has no matches in the query string, then a one element list is returned.

let re = Re2.create_exn "\t" in
Re2.split ~max:1 re "apple pie is good";;
- : string list = ["apple pie is good"]

Replacing

Using `rewrite`

The simpler interface for regex replacement consists of the rewrite and rewrite_exn functions. The template argument defines how you want to replace any matches in the query string. In this case, we replace any matches with a capital A.

let re = Re2.create_exn "a" in
Re2.rewrite_exn re ~template:"A" "apple peach";;
- : string = "Apple peAch"

You can reference the submatches in the template string using the syntax \\n. Check it out.

let re = Re2.create_exn "([ae])" in
Re2.rewrite_exn re ~template:"( \\1 )" "apple peach";;
- : string = "( a )ppl( e ) p( e )( a )ch"

If you have multiple submatches, just keep referring to them in the same way: \\1 ... \\2 ... etc.

If you need to check if your rewrite template is valid before running rewrite, use the valid_rewrite_template function.

let re = Re2.create_exn "([ae])([io])([uy])" in
let template = "\\3 - \\2 - \\1" in
Re2.valid_rewrite_template re ~template;;
- : bool = true

Using `replace`

The re2 library also provides more powerful replacing functions: replace and replace_exn. You can use them if you need direct access to the Match.t.

Here is a silly example that picks a different replacement value depending on the match.

let re = Re2.create_exn "([ae])" in
Re2.replace_exn re "apple peach" ~f:(fun m ->
  match Re2.Match.get_exn m ~sub:(`Index 1) with
  | "a" -> "u"
  | "e" -> "o"
  | _ -> assert false)
;;
- : string = "upplo pouch"

While the replace function is more complicated than rewrite, it gives you more control and has a few other options you may find useful.

Miscellaneous info

Escaping strings for regular expressions

Properly escaping regular expressions can sometimes be tricky, especially if you want to avoid illegal backslash characters in your strings.

Re2 provides a function escape that escapes its input in such a way that if you create a regex from the resulting escaped string, it would match the original string. Here’s how it works.

Re2.escape "Apple. (Pie)!!";;
- : string = "Apple\\.\\ \\(Pie\\)\\!\\!"

Re2.matches
  (Re2.create_exn @@ Re2.escape "Apple. (Pie)!!")
  "Apple. (Pie)!!";;
- : bool = true

Depending on how many special characters are in the string you use to build the regex, escaping can be pretty noisy! In these cases, escape is especially useful.

Infix matching operator

If you’re feeling nostalgic for Perl, feel free to use the =~ infix operator!

let re = Re2.create_exn "ab";;

Re2.Infix.("abc" =~ re);;
- : bool = true

(* Let's get crazy and open the module! *)
open Re2.Infix;;

"abc" =~ re;;
- : bool = true

“Precompiling” your regular expressions

Unless you have a good reason not to, you will probably want to create your regular expression outside of the function that will be using it.

To see why, let’s check out this little benchmark program that compares two functions. The first one reuses a regex that is created outside of the function, whereas the second one creates a new regex each time the function is called.

Note: This benchmark program uses Jane Street’s core_bench micro-benchmarking library.

open! Core
open! Core_bench

let re = Re2.create_exn "a([bc])"

let find re s = Re2.find_first_exn re s
let find' s = Re2.find_first_exn (Re2.create_exn "a([bc])") s

let () =
  Command.run
    (Bench.make_command
       [
         Bench.Test.create ~name:"outside" (fun () ->
             find re "abcabcabc");
         Bench.Test.create ~name:"inside" (fun () ->
             find' "abcabcabc");
       ])

Name	Time/Run	mWd/Run	Percentage
outside	272.60 ns	2.00 w	3.74%
inside	7_281.55 ns	91.00 w	100.00%

As you can see, reusing a regex rather than creating a new one each time a function is called makes a big difference in this benchmark. Keep in mind that this is a micro-benchmark, and that this difference may not be that important to the run time of your program as a whole. That said, if you had the slow version of the above function in a hot loop, it could really be wasting a lot of CPU cycles.

Wrap up

Hopefully this overview helps you get started with using re2!

To get more info about using re2, check out the API docs. Additionally, the re2 source code is quite readable. I encourage you to take a look at how the functions are defined–it may help clear up any additional questions you have!

Styling plots in base R graphics to match ggplot2 classic theme

2021-05-09T00:00:00+00:00

ggplot2 is an R package for creating graphics in a declarative way and is based on The Grammar of Graphics. If you have never used ggplot2, it’s a nice library for making publication ready figures with much less hassle than the base R graphics.

Something I think is pretty fun is to try and recreate ggplot2 style figures using base R graphics. Sometimes, I look at the actual plotting code in the ggplot2 package, but I think it is more fun to just make a figure with ggplot and then try and get a reasonable match with base R. Doing so, you really get an appreciation of the convenience of the ggplot2 package.

With that, let’s try and recreate a figure using the “classic” ggplot2 theme: theme_classic.

If you want to learn more about base R graphics, check out my deep dive into rotating axis labels in base R plots.

Set up
Fixing the axes
Fixing the points
Adding a legend
Some final touchups
Wrap up

Set up

First, here is some “set up” code where we create some data and set some variables to hold colors and stuff like that.

library(ggplot2)

k_purple <- "#875692"
k_orange <- "#F38400"

set.seed(12341234)

x <- 1:100
y <- (rnorm(100, sd = 15) + x + 100) / 10
group <- c(rep("A", 50), rep("B", 50))

With that out of the way, let’s see the ggplot2 classic theme that we will try and match. Here it is:

ggplot(data = data.frame(x, y, group),
       mapping = aes(x = x, y = y, color = group)) +
    geom_point(size = 2) +
    scale_color_manual(values = c(k_purple, k_orange)) +
    theme_classic()

ggplot2 classic theme

And finally, let’s compare the simplest possible base R graphics plot. I’m sure that you’re familiar with what it looks like!

plot(x, y)

Base R graphics plot

You can see that that plot is pretty far from where we want to be. Let’s go step-by-step getting closer to the theme_classic ggplot version each time.

Fixing the axes

The first thing you see is that box around the plot that isn’t present in the ggplot version. Let’s remove it by passing bty = "n" to the plot function.

plot(x, y,
     ## Remove the box around the plot.
     bty = "n")

Removing the box

You can see that the axes are a bit different than in the ggplot2 version. Here, the final ticks are the edges of the axis. The ggplot version has a nice, solid line for the x and y axes that connects at the bottom left corner. You can get that effect with the bty option to plot.

The bty parameter is an interesting one. Here is the section from the par help file describing bty:

‘bty’ A character string which determined the type of box which is drawn about plots. If ‘bty’ is one of ‘"o"’ (the default), ‘"l"’, ‘"7"’, ‘"c"’, ‘"u"’, or ‘"]"’ the resulting box resembles the corresponding upper case letter. A value of ‘"n"’ suppresses the box.

Those options look pretty weird, but they each show the “shape” of what the box will look like: l will look like an upper case L, or have a line on the left and the bottom only. The 7 will look sort of like a 7, or have the box lines on the top and right only. Since we want lines on the left and bottom, we can use bty = "l". I will also remove the default x and y axes (using xaxt and yaxt) since we don’t want them to overlap the lines of the box. Also we can increase the width a bit with lwd.

While you can control the box inside the plot function, I will use the box function instead. That way, it will be a little easier to customize. To do that, we will keep the bty = "n" in the plot function to turn the box off, then add it back in after with box.

plot(x, y,
     ## Remove box.
     bty = "n",
     ## Remove default x and y axis.
     xaxt = "n", yaxt = "n")
box("plot",
    ## Add 'box' lines to the bottom and left of the plot.
    bty = "l",
    ## Increase width of box lines.
    lwd = 2)

With nice axis lines
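
If you are curious what the other bty shapes look like, here is a small sketch that draws each one in its own panel. The variable names and the 2 x 3 grid are just my choices for illustration, and pdf(NULL) is only there so the code can run without opening a window or writing a file.

## Draw to a null PDF device. Drop this line (and the dev.off() at
## the end) if you want to see the panels interactively.
pdf(NULL)

## The six box shapes listed in the par help file.
box_types <- c("o", "l", "7", "c", "u", "]")

## Lay the panels out in a 2 x 3 grid, saving the old layout.
old_par <- par(mfrow = c(2, 3))
for (box_type in box_types) {
    ## Turn the default box off, then draw only the requested shape.
    plot(1:10, bty = "n", main = sprintf("bty = \"%s\"", box_type))
    box("plot", bty = box_type, lwd = 2)
}

## Restore the previous layout and close the device.
par(old_par)
invisible(dev.off())

The "l" panel is the one with lines on the left and bottom, which is the shape used in the rest of this post.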

Add the tick marks

Now let’s add the axis ticks and labels back in. For that we use the axis function. We will change a few of the options at once, so I will go over them first. The side parameter controls where the axis is drawn with respect to the plot: 1 = below, 2 = to the left, 3 = above, and 4 = to the right. Remember how the axis is drawn with the line by default? We turn that off with lwd = 0 and then we set the tick width to match the box width using lwd.ticks = 2. Finally, we want to rotate the tick labels of the y axis so they are perpendicular to the axis. Here it is.

plot(x, y, bty = "n", xaxt = "n", yaxt = "n")
box("plot", bty = "l", lwd = 2)
## X Axis
axis(side = 1,
     ## Don't draw the axis line.
     lwd = 0,
     ##  Match the width of the tick marks to the box lines.
     lwd.ticks = 2)
## Y axis
axis(side = 2, lwd = 0, lwd.ticks = 2,
     ## Rotate tick labels perpendicular to the axis.
     las = 2)

With ticks and tick labels

Adjusting ticks and tick labels

Next, we are going to make some adjustments to the length of the tick marks and to where the axis labels are drawn. This can get a little weird, and there are multiple ways to do it. Let’s go through some of the options we will need.

The mgp parameter is a little tricky. It is a three part vector that controls the margin for the axis title (mgp[1]), axis (tick) labels (mgp[2]), and the axis line (mgp[3]). The default value is c(3, 1, 0). The units are in lines of text.

We want to move the axis labels and tick labels closer to the axis, so we need to reduce the first two numbers in that vector. This time, I’m going to use the par function to set the parameter since I want it to apply to all the plotting functions.

## Move the axis label and tick labels closer to the axis line.
par(mgp = c(1.5, 0.4, 0))
plot(x, y, bty = "n", xaxt = "n", yaxt = "n")
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2)
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)

Adjusting the axis labels
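
One thing to keep in mind is that settings made with par stick around for every later plot on the same graphics device. Here is a minimal sketch of saving and restoring them, assuming you want the previous values back afterwards. The name old_par is just one I picked, and pdf(NULL) keeps the example from opening a window.

## Null device so nothing is drawn on screen or written to disk.
pdf(NULL)

## par() invisibly returns the previous values of whatever it changes.
old_par <- par(mgp = c(1.5, 0.4, 0))
plot(1:10)
changed_mgp <- par("mgp")

## Hand the saved list back to par() to undo the change.
par(old_par)
restored_mgp <- par("mgp")

print(changed_mgp)
print(restored_mgp)
invisible(dev.off())

The first vector printed is the adjusted mgp, and the second is the value that was in place before the change (the default c(3, 1, 0) on a fresh device).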

Adjusting tick mark length

Now that we’ve tweaked the label positions, we need to adjust the tick length. We do that with the tcl parameter to the par function, which specifies tick mark length as a fraction of the height of a line of text. So tcl = 1 will make tick marks as long as a line of text is tall, and tcl = -0.5 (the default) will make them 1/2 the line height. The sign of the argument controls the direction the ticks point: positive values point into the chart, negative values point away. Let’s make them half as long as they are now with tcl = -0.25.

par(mgp = c(1.5, 0.4, 0),
    ## Reduce the size of the tick marks.
    tcl = -0.25)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n")
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2)
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)

Shrinking the tick marks

Moving the x labels a bit more

That’s pretty good, but to my eye, the x axis tick labels are still a bit too far away from the ticks. To fix that, we can pass the mgp param directly to the axis function that we use to draw the axis. It will overwrite the global value set by the par function, but only for the function we pass it to. The 2nd element in the mgp vector controls the axis tick labels, so we will reduce it from 0.4 to 0.2.

par(mgp = c(1.5, 0.4, 0), tcl = -0.25)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n")
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2,
     ## Reducing the 2nd element from 0.4 to 0.2 moves the x axis
     ## tick labels closer to the axis line.
     mgp = c(1.5, 0.2, 0))
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)

Moving the x axis labels in

That’s better!

Fixing the points

Now that the axes are looking pretty good, let’s move on to the points. To change the type of point that is plotted, you use the pch parameter. I like pch = 20 for little dots, but pch = 16 could work as well. We can also change the size of the points with the cex parameter. The default size is cex = 1 and increasing the number will increase the size (e.g., cex = 2 will be twice as big). We will use cex = 1.4 to approximate the size of the ggplot points.

Finally, to change the color, we will use the col parameter to the plot function. For this parameter, we can pass in a vector the same length as the x and y data vectors to specify the color for each data point. The group vector we created at the beginning gives two groups, A and B, for the points. We want to associate each group with a color so we make a named color vector like this: colors <- c(A = k_purple, B = k_orange). Then we use the group vector to index the colors vector like this: colors[group].

If that doesn’t make sense, here is a simple example.

tastiness <- c(Cookie = "yummy", Cake = "yucky")
desserts <- c("Cookie", "Cake", "Cookie")
tastiness[desserts]
##  Cookie    Cake  Cookie
## "yummy" "yucky" "yummy"
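
A caveat with this kind of lookup, sketched here with a made-up group C that has no entry in the color vector: the missing name quietly becomes NA rather than raising an error, so it can be worth checking before plotting. The demo_ names are hypothetical and chosen so they don’t clobber the variables from the set up code.

demo_colors <- c(A = "#875692", B = "#F38400")
## "C" is a hypothetical group with no assigned color.
demo_group <- c("A", "B", "C")

## Look up a color for each group label; unknown labels give NA.
demo_point_colors <- demo_colors[demo_group]

## Collect the group labels that did not get a color.
missing_groups <- unique(demo_group[is.na(demo_point_colors)])
print(missing_groups)
## [1] "C"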

Let’s use that idea for our plot.

par(mgp = c(1.5, 0.4, 0), tcl = -0.25)
## Associate group A with purple and group B with orange.
colors <- c(A = k_purple, B = k_orange)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n",
     ## Draw filled in dots instead of open circles.
     pch = 20,
     ## Increase the size of the dots.
     cex = 1.4,
     ## Set the color of each dot based on its group.
     col = colors[group])
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2, mgp = c(1.5, 0.2, 0))
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)

Fixing the points

Now that’s looking pretty good!

Adding a legend

It’s time now to put in the legend. We will start with something basic and then adjust it to match the legend in the ggplot2 figure.

To make a legend in base R graphics, use the legend function. We set the legend location with the x parameter. To put the legend on the right side of the plot, we use x = "right". We use the legend param to actually tell the legend the names of the groups: legend = c("A", "B"). Now for the points, we specify the style we used (pch = 20) and the different colors for each group (col = colors). Here it is.

par(mgp = c(1.5, 0.4, 0), tcl = -0.25)
colors <- c(A = k_purple, B = k_orange)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n",
     pch = 20, cex = 1.4, col = colors[group])
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2, mgp = c(1.5, 0.2, 0))
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)
## Add a legend to the right side of the plot.
legend(x = "right",
       ## Specify the group names.
       legend = c("A", "B"),
       ## And the colors of the dots.
       col = colors,
       ## And the shape of the dots.
       pch = 20)

Adding a legend

That’s not bad, but not quite the look we are going for. We need to add a legend title, remove the box around the legend, and tweak the size and spacing of the elements.

Adjusting the legend

To set the title, we can do this: title = "group". Removing the box is done as in the main plot by setting bty = "n". I think it looks nice when the size of the points in a legend matches the size of the points in the plot. To do that, we can use the pt.cex option. We set it to 1.4 to match the cex parameter that we passed in to plot like so: pt.cex = 1.4.

It’s a subtle thing, but the legend elements in the ggplot figure are a bit more spaced out than in the base graphics figure. To adjust that, we use the x.intersp and y.intersp parameters, which adjust the character spacing in the horizontal and vertical directions (the units are line heights again). The default is 1 for both. Since we want a little more space, we increase them to something like this: x.intersp = 1.4, y.intersp = 1.15.

Here’s what those changes look like.

par(mgp = c(1.5, 0.4, 0), tcl = -0.25)
colors <- c(A = k_purple, B = k_orange)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n",
     pch = 20, cex = 1.4, col = colors[group])
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2, mgp = c(1.5, 0.2, 0))
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)
legend(x = "right", legend = c("A", "B"), col = colors, pch = 20,
       ## Add a title
       title = "group",
       ## Remove the box around the legend.
       bty = "n",
       ## Increase the size of the points to match those in the plot.
       pt.cex = 1.4,
       ## Increase the spacing in the x and y directions.
       x.intersp = 1.4, y.intersp = 1.15)

Adjusting the legend


Move the legend outside of the plotting area

Next we need to adjust the position of the whole legend. Do you see how it is actually inside the plot on the base graphics version, but outside of it in the ggplot version? We can move the legend around with the inset parameter. The default value is 0. If you pass in a positive number, the legend moves into the plot, whereas if you pass in a negative number the legend moves out away from the plot. We will pass in inset = -0.1 to bump it to the right to get it outside of the plot.

par(mgp = c(1.5, 0.4, 0), tcl = -0.25)
colors <- c(A = k_purple, B = k_orange)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n",
     pch = 20, cex = 1.4, col = colors[group])
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2, mgp = c(1.5, 0.2, 0))
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)
legend(x = "right", legend = c("A", "B"), col = colors, pch = 20,
       title = "group", bty = "n", pt.cex = 1.4,
       x.intersp = 1.4, y.intersp = 1.15,
       ## Nudge the legend to the right.
       inset = -0.1)

Moving the legend outside of the plot area

Whoops! Do you see how the legend went right off the chart? To make sure the legend doesn’t get clipped, we need to pass in xpd = TRUE to the legend function. The xpd parameter affects how the plot elements are clipped if they exceed the edges of the plot. Here is how you move the legend outside of the plotting area using the xpd parameter.

par(mgp = c(1.5, 0.4, 0), tcl = -0.25)
colors <- c(A = k_purple, B = k_orange)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n",
     pch = 20, cex = 1.4, col = colors[group])
box("plot", bty = "l", lwd = 2)
axis(side = 1, lwd = 0, lwd.ticks = 2, mgp = c(1.5, 0.2, 0))
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2)
legend(x = "right", legend = c("A", "B"), col = colors, pch = 20,
       title = "group", bty = "n", pt.cex = 1.4,
       x.intersp = 1.4, y.intersp = 1.15,
       inset = -0.1,
       ## Ensure the legend is not clipped even though it is
       ## outside of the plotting area.
       xpd = TRUE)

Do not clip the legend outside the plotting area

Some final touchups

We’re almost there now! Just a few more adjustments to make: tick label size, plot element colors, and plot margins.

Tick label size

Right now, the tick labels are a lot bigger than they are in the ggplot version. To fix it, we can pass in cex.axis = 0.85 to the par function. That way, it will be applied to both the x and y axes and we don’t have to specify it twice. Remember that the normal cex is 1 so any number less than that will be smaller than the default.

Plot element colors

Setting the plot element colors can be a little tricky because we have to specify them in a few different places. I should mention that there are quite a few ways to control the colors in plots made with base R graphics. It can get a little confusing as to what parameter is controlling what aspect of the plot, especially when you consider that the options passed in to the par function control lots of different plot elements. For example, par(fg = "green") will turn a lot of plot elements green, but not all of them. Rather than do that, we will adjust colors mostly inside the functions that they will affect.

We will first set a variable to hold the color and use that: base_color <- "#444444". The axes label colors are controlled with the col.lab parameter to the par function (col.lab = base_color). To change the axis (box) line color, we pass in col = base_color to the box function. For the axes ticks and tick labels, we use the col and col.axis parameters to the axis function to control the tick color and the tick label color, respectively (e.g., col = base_color, col.axis = base_color). To change the legend color, we pass text.col = base_color directly to the legend function.

Plot margins

As with many other things in base R graphics, there are a couple ways to control the plot margins. We are going to be using the mar parameter to the par function. To do so, you pass in a 4 part vector specifying the size of the margin (in lines of text) of the bottom, left, top, and right sides of the plot, in that order. The default is c(5, 4, 4, 2) + 0.1. We will shrink all the margins except for the right, which we need to increase to make enough room for our legend: mar = c(3, 3, 1, 3.5). Just to make it clear, that is three lines of text for the bottom and left margins, one line of text for the top margin, and 3.5 lines of text for the right margin.
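
To make the “lines of text” unit a bit more concrete, here is a sketch that sets the margins and then reads the same margins back in inches through the mai parameter. It assumes the default mex of 1 and a single plot per device. The variable names are mine, and pdf(NULL) just avoids opening a window.

## Null device so nothing is drawn on screen or written to disk.
pdf(NULL)

## Margins before we touch anything: bottom, left, top, right.
default_mar <- par("mar")

## Set the margins used in this post, saving the old ones.
old_par <- par(mar = c(3, 3, 1, 3.5))

## Height of one line of text in inches on this device.
line_height <- par("csi")
margins_in_lines <- par("mar")
margins_in_inches <- par("mai")

print(default_mar)
## Dividing the inch values by the line height recovers the line counts.
print(margins_in_inches / line_height)

## Put the old margins back and close the device.
par(old_par)
invisible(dev.off())

The mar and mai parameters describe the same margins in different units, so setting one updates the other.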

All the final adjustments

Let’s put all the final touchups in now.

base_color <- "#444444"
par(mgp = c(1.5, 0.4, 0), tcl = -0.25,
    ## Shrink the tick labels.
    cex.axis = 0.85,
    ## Set the axis label color
    col.lab = base_color,
    ## Adjust the margin:  bottom, left, top, right
    mar = c(3, 3, 1, 3.5))
colors <- c(A = k_purple, B = k_orange)
plot(x, y, bty = "n", xaxt = "n", yaxt = "n",
     pch = 20, cex = 1.4, col = colors[group])
box("plot", bty = "l", lwd = 2,
    ## Set the box color.
    col = base_color)
axis(side = 1, lwd = 0, lwd.ticks = 2, mgp = c(1.5, 0.2, 0),
     ## Set the axis tick and tick label colors.
     col = base_color, col.axis = base_color)
axis(side = 2, lwd = 0, lwd.ticks = 2, las = 2,
     ## Set the axis tick and tick label colors.
     col = base_color, col.axis = base_color)
legend(x = "right", legend = c("A", "B"), col = colors, pch = 20,
       title = "group", bty = "n", pt.cex = 1.4,
       x.intersp = 1.4, y.intersp = 1.15,
       inset = -0.1, xpd = TRUE,
       ## Set the legend text color.
       text.col = base_color)

Applying the final adjustments

Looking good! So that’s almost the same as the original “classic” theme ggplot2 plot. One thing you may notice is that there is a different number of tick marks on the axes. You can actually adjust this in base R graphics, but it can be a little bit tricky, so we will leave that for another post.

Wrap up

Whew, that was a lot of stuff! As we saw, copying the style of the ggplot theme_classic requires quite a lot of fiddling around with a lot of different parameters to a few different functions. If I was making a plot for a publication or blog post or something, I would definitely just use ggplot, but it can be fun and educational to try to reproduce something that an awesome library does with base R graphics. Hopefully, you enjoyed the process and learned a lot about base R graphics!

Computational lab notebooks using git and git-annex

2021-05-07T00:00:00+00:00

Disclaimer: if you need a lab notebook for legal records, copyright, patent rights, or anything like that, then this article probably isn’t for you. This post is not providing any recommendations for those cases.

Overview
Provenance tracking
A git-based lab notebook
A CLI app to help manage git-based lab notebooks
A super simple example

Too long; didn’t read: Check out the cln app on GitHub. It helps you manage a computational lab notebook using git and git-annex. You can find the documentation here.

Overview

Keeping a good lab notebook for your computational work is important, but it can be challenging. A quick Google search will show you lots of examples of people talking about it.

I have tried a lot of different methods, but they all more or less boil down to a workflow sort of like this:

Write down some summary of what I’m about to do and why.
Run some commands, programs, or bash stuff.
Copy what I did into a document. (e.g., Markdown notes files, TiddlyWiki, etc.)
Write a bit more about what happened.
Rinse and repeat.

Then, depending on my needs, I may clean up the analysis and put it into an R Markdown or Jupyter notebook so it will be easier to reproduce later.

One problem with this general workflow is that it requires tracking a lot of things manually (e.g., copying and pasting). Whenever you do a lot of that, you will inevitably forget to paste a command into your notebook. You might make a mistake or typo when running a command, and rather than noting it down in your notebook, you just rerun it and pretty soon your lab notebook is out of sync with the commands that you have actually run. Another issue is that you may be running a bunch of commands quickly, just testing some ideas out. When doing this, you end up needing to track a ton of things in an ad-hoc manner leading to a messy lab notebook that you need to come back to later and reorganize.

In other words, you need to manually track a lot of information, and it can be quite a challenge to keep track of everything!

Provenance tracking

One approach to dealing with this problem is by tracking the provenance of files. An example of this is how QIIME 2 includes metadata in their artifact files (.qza files) to track things that were done in an analysis.

I like the idea of provenance tracking, but even if you do use QIIME, there are a lot of things you need to do outside of QIIME that will need tracking. While not quite the same, this sort of provenance tracking reminds me a bit of using git or other version control software. Git is software used to track changes in a set of files, and is often used by programmers during software development.

git -- a distributed version control system

Note: If you have never used git before, the official docs have a lot of info that may be of use to you. I have also written a small git tutorial that you may find useful!

While I had used git while working on software, I had never tried using it to manage a computational lab notebook. One reason is that it doesn’t handle large files well. For computational work, whether bioinformatics or data science, you will be dealing with a lot of large files. Sequencing files easily get over 10 GB in size, so using git alone is going to be problematic. However, there are extensions to git like Git Large File Storage and git-annex that help to address this problem. (Essentially, git-annex tracks symbolic links in the git repository rather than the file itself. There is a lot more to it than that, so check out the git-annex walkthrough if you want to know more.)
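To make the symbolic link idea a bit more concrete, here is a toy sketch in Python. It is not how git-annex is actually implemented: the function name toy_annex_add, the store directory layout, and the use of a bare SHA-256 digest as the stored file name are all assumptions made for illustration. The point is only that the large content lives somewhere else, and the path you work with is a small pointer to it.

```python
import hashlib
import os
import tempfile


def toy_annex_add(path, store_dir):
    """Move a file into a content-addressed store and leave a symlink behind.

    Hypothetical helper for illustration only; git-annex uses its own key
    format and object directory layout.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Hash in chunks so a multi-gigabyte file is never fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, digest.hexdigest())
    os.replace(path, stored)  # the real bytes now live in the store
    os.symlink(stored, path)  # the working copy is only a small pointer
    return stored


with tempfile.TemporaryDirectory() as workdir:
    big_file = os.path.join(workdir, "reads.fastq")
    with open(big_file, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    toy_annex_add(big_file, os.path.join(workdir, "store"))
    print(os.path.islink(big_file))  # True
```

A version control system that only has to record the symlink (a short path) stays small no matter how large the file behind it is, which is the property that matters for sequencing data.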

A git-based lab notebook

Note: I’m not the first one to think of using git to help manage a computational lab notebook. In fact, you can find some interesting discussion on whether version control is even useful for lab notebooks here, here, and here.

Using git and git-annex, I figured that I could get a pretty decent workflow going for my computational lab notebook. After playing around with it for a while (and seeing that git-annex was a good solution to git’s large file problem), I settled into a pretty familiar workflow:

Run a program, script, whatever.
Track any new files or changes with git.
Commit the changes.
Repeat.

One key difference from my “typical” workflow is that instead of putting the commands that I ran and their explanations into some external document like a markdown file, I would put all the information into the commit message. That way, all the info about how and why I did something would be tracked in the git repository along with the actual files and changes.

That works pretty well, but you still run into the issue of having to remember what you ran, copy and paste it correctly into the commit message, blah blah blah. In other words, it’s still a bit of a pain. While you get the added benefits of git logs and history tracking, you have to do a lot of repetitive, annoying stuff to get things to work. So, of course, I wrote a little program to help automate some of the tedious stuff!

A CLI app to help manage git-based lab notebooks

While working with the above workflow, in addition to QIIME’s provenance tracking, I was also reminded of database migrations. Basically, the way they work is that you write some script that says how the database is supposed to change (e.g., add column first_name to table authors), and then some migration tool handles actually making any changes to the database. In theory, this gives you a simpler way to track how your database has changed over time–you can just follow the paper trail of your migration files.

The app I wrote works in a similar way, except that instead of making incremental changes to a database, you are formalizing making changes to the repository itself. The app is called cln (it stands for “computational lab notebooks”…clever, I know!). You can find it on GitHub. There is also some pretty extensive documentation available to help you get started using the software.

While I suggest you check out the docs for a more detailed explanation of its installation and usage, I want to show a quick, little example to give you a flavor of how the cln program can help you manage your git-based lab notebook.

A super simple example

The cln command provides a couple of subcommands to help you manage your lab notebook with git and git-annex. (For more details on individual subcommands, see here).

Create a project

To start, you make a new project.

$ mkdir -p ~/projects/cln_example && cd ~/projects/cln_example
$ cln init 'Example Project'
$ tree -a -I .git
.
├── .actions
│   ├── completed
│   ├── failed
│   ├── ignored
│   └── pending
└── README.md

The cln init command initializes a new project, creates a git repository, and generates some scaffolding for actions and git commit templates.

Prepare an action

Next, you prepare an action to run. (Again, this is just a silly example…for a more in depth tutorial, see the documentation).

$ cln prepare 'printf "I like apple pie\n" > msg.txt'

In this case the action is just running a printf command and saving the contents in a file. Of course, you can prepare an action containing anything that you would normally run at the command line. For example, you could prepare a crazy action like this:

$ cln prepare "$(cat <<'EOF'
cut -f2 seq_information.seq_id_eco.tsv \
  | cut -d';' -f5 \
  | ruby -e 'h = Hash.new 0; \
      ARGF.each {|l| h[l.chomp] += 1 }; \
      h.sort_by {|_, count| count }.reverse. \
      each {|eco, count| puts "#{eco}\t#{count}" }' \
  | column -t \
  > seq_eco_counts.txt
EOF
)"

Note: That’s actually an action I prepared and ran in a real project. Previously, I would have put that little ad-hoc Ruby script into a file and run it in a way that is easier to track, but with cln to help me manage things, everything will be nicely tracked automatically.

The cln prepare command creates an action file and a git commit template. The action file is simply a bash script with the command you want to run, but having it there in your repository as a standalone script helps you see what is going on if you’re running a complicated command or when you come back to the project a couple of months later.

Run the pending action

Next, you can check that everything is okay by doing a dry run. It will spit out some stuff to the terminal to let you know what’s going on and suggest what steps to take next. Note: I’ve edited the terminal output a bit.

$ cln run -dry-run
~~~
~~~
~~~ Hi!  I just previewed an action for you.
~~~
~~~ I plan to run this action file:
~~~   '.actions/pending/action__ ...'
~~~
~~~ It's contents are:
~~~
printf "I like apple pie\n" > msg.txt

~~~
~~~ If that looks good, you can run the action:
~~~   $ cln run
~~~
~~~

If it looks good, you can go ahead and run the action.

$ cln run
  ~~~
  ~~~
  ~~~ Hi!  I just ran an action for you.
  ~~~
  ~~~ * The pending action was '.actions/pending/action__REDACTED.sh'.
  ~~~ * The completed action is '.actions/completed/action__REDACTED.sh'.
  ~~~
  ~~~ Now, there are a couple of things you should do.
  ~~~
  ~~~ * Check which files have changed:
  ~~~     $ git status
  ~~~ * Add actions and commit templates:
  ~~~     $ git add .actions
  ~~~ * Unless they are small, add other new files with git annex:
  ~~~     $ git annex add blah blah blah...
  ~~~ * After adding files, commit changes using the template:
  ~~~     $ git commit -t '.actions/completed/action__REDACTED.gc_template.txt'
  ~~~
  ~~~ After that you are good to go!
  ~~~
  ~~~ * You can now check the logs with git log,
  ~~~   or use a GUI like gitk to view the history.
  ~~~
  ~~~

See how the cln run command gives you hints on what to do next? I tried to make all the cln commands spit out helpful info like that to the terminal.

Track and commit changes

Now, you will be able to see any files that were created or changed as the result of running the action using git status. Depending on the size(s) of the file(s) that were created or changed, you can add them to the git index with either git add or git-annex add. Finally, you commit the changes using the git commit template that was made when you prepared the action.

$ git commit -t '.actions/completed/action__REDACTED.gc_template.txt'

The template file will look something like this:

PUT COMMIT MSG HERE.

== Details ==
PUT DETAILS HERE.

== Command(s) ==
printf "I like apple pie\n" > msg.txt

== Action file ==
action__REDACTED.sh

When you run the git commit command, a text editor will pop up with the contents of the git template file ready for you to fill out. This is nice because you can avoid manually copying in the commands you ran. For such a small example it’s not really a big deal, but if you’re running some complicated bioinformatics software with a lot of flags and options, it’s pretty convenient!
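As a rough illustration of why a pre-filled template saves copying, here is a minimal Python sketch that assembles text in the same layout as the template shown above. This is not cln's actual implementation; the function name render_commit_template and the example action file name are hypothetical.

```python
def render_commit_template(command, action_file):
    """Build a commit template that already contains the command that was run.

    Hypothetical helper: the section headings mirror the template shown in
    the post, but the real tool may build its template differently.
    """
    lines = [
        "PUT COMMIT MSG HERE.",
        "",
        "== Details ==",
        "PUT DETAILS HERE.",
        "",
        "== Command(s) ==",
        command,
        "",
        "== Action file ==",
        action_file,
        "",
    ]
    return "\n".join(lines)


template = render_commit_template(
    'printf "I like apple pie\\n" > msg.txt',
    "action__example.sh",  # made-up file name for the example
)
print(template)
```

Because the command is written into the template by the tool rather than pasted by hand, the text in the commit message cannot drift from what was actually run.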

Browse the git history

After editing the message and saving the commit, you can browse through your nicely organized repository history and see something like this:

$ git log
commit ebf738 (HEAD -> master)
Author: Ryan Moore <moorer@udel.edu>
Date:   Mon Apr 5 18:44:54 2021 -0400

    Created the msg.txt file

    == Details ==
    I needed to create a file that describes something that I like.  I
    used the `printf` rather than `echo` because it is more portable.
    (See https://stackoverflow.com/a/11530298 for a discussion of this on
    stack overflow).

    == Command(s) ==
    /usr/bin/printf "I like apple pie\n" > msg.txt

    == Action file ==
    action__460986084__2021-04-05_18:02:37.sh

commit 1a2e90
Author: Ryan Moore <moorer@udel.edu>
Date:   Mon Apr 5 17:43:50 2021 -0400

    Initial commit

Notice how I put a short, descriptive commit message for the first line, and then added in any additional details that I think I will need later. The == Details == section would hold all the extra stuff I would put in my lab notebook anyway, but it is really convenient to have it right there in the git log.

Having the command that you ran, the details about that command, and the changes that command effected in your repository opens up some really powerful ways to track your analyses.

Get individual file provenance info

For example, you can use the git cli app (e.g., git whatchanged or git log) or a GUI like gitk to get detailed info about the provenance of any files in the repository. You could run something like this to see all the history for the msg.txt file.

$ git log --stat --follow -p -- msg.txt
commit ... (HEAD -> master)
Author: Ryan Moore <moorer@udel.edu>
Date:   ....

    Created the msg.txt file

    == Details ==
    I needed to create a file that describes something that I like.  I
    used the `printf` rather than `echo` because it is more portable.
    (See https://stackoverflow.com/a/11530298 for a discussion of this on
    stack overflow).

    == Command(s) ==
    printf "I like apple pie\n" > msg.txt

    == Action file ==
    action__467354640__.....sh
---
 msg.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/msg.txt b/msg.txt
new file mode 100644
index 0000000..135d9d6
--- /dev/null
+++ b/msg.txt
@@ -0,0 +1 @@
+I like apple pie

As you can imagine, having output like that for all the files in your project folder as well as the chronological logs is a very powerful way to track your analyses and makes managing a computational lab notebook so much easier.

Wrap up

Managing a computational lab notebook is tricky. I have found that using git and git-annex can be a good way to keep all the info you need right in the same directory as all your data files, scripts, and analysis code. To help you more easily manage lab notebooks using git and git-annex, I created a command line app called cln. You can find the code on GitHub. Installation instructions and usage examples can be found in the documentation.

divnet-rs: A Rust implementation for DivNet

2021-01-18T00:00:00+00:00

Update: divnet-rs now has a way to parallelize the bootstrapping procedure. With enough RAM, it can give approximately linear decreases in run time with increasing number of cores. Consider it an experimental feature for now.
Update 2022-04-06: On the Lee dataset, v0.3.0 is around 3x faster and uses ~60% of the memory as compared to v0.2.1.
Update 2021-01-22: v0.2.1 further decreases the run time and required memory
Update 2021-01-19: As of divnet-rs v0.2.0, users can manually set the random seed. Also, v0.2.0 uses only about 2/3 the memory that was used by v0.1.1.

Background

One reason for doing microbiome sequencing is to learn about the microbial diversity of the ecosystems of interest. Estimating the diversity of microbial communities is hard. Essentially every step of a sample-to-sequence pipeline introduces biases into your analyses, meaning the community composition you observe is likely quite different from the true community composition. Further, microbiome datasets are compositional, and must be treated with statistical and computational methods designed to handle such data.

Most communities are incredibly complex so you’re going to nearly always have issues with undersampling – there are just too many microbes to sequence them all, so you have to work with samples. Even though you cannot practically observe all the taxa in your environment, you still need to estimate the diversity of that environment. So why don’t we just “plug-in” our data into one of the common diversity indices borrowed from macroecology like Shannon or Simpson and be done with it? You will actually see this a lot in the literature: plugging in the observed relative abundances (sometimes after rarefying the data first) from our samples into standard “plug-in” diversity formulas.

There are a couple of problems with this. Undersampling is problematic because alpha diversity metrics are heavily biased when there are unobserved taxa. The random sampling variation combined with biases introduced in the sample-to-sequence pipeline means your observed relative abundances probably don’t faithfully represent the true community you want to study. Additionally, many commonly used methods for generating confidence intervals assume that taxa are independent (i.e., if taxon A is present in a community, it doesn’t provide any information about whether taxon B is there too).
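To see what "plugging in" looks like and why unobserved taxa matter, here is a small Python sketch of the plug-in Shannon and Simpson indices. The function names and the counts are made up for illustration, and the Simpson index is written using the sum of squared proportions convention (other conventions, such as one minus that sum, are also common).

```python
import math


def relative_abundances(counts):
    """Turn raw counts into observed proportions, dropping unobserved taxa."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observed read")
    return [c / total for c in counts if c > 0]


def shannon_plugin(counts):
    """Plug-in Shannon index: -sum(p_i * ln(p_i)) over observed taxa."""
    return -sum(p * math.log(p) for p in relative_abundances(counts))


def simpson_plugin(counts):
    """Plug-in Simpson index, using the sum(p_i^2) convention."""
    return sum(p * p for p in relative_abundances(counts))


# Made-up community with four equally abundant taxa.
true_counts = [250, 250, 250, 250]
# Made-up shallow sample in which the fourth taxon was never observed.
observed_counts = [4, 3, 3, 0]

print(shannon_plugin(true_counts))      # ln(4), about 1.386
print(shannon_plugin(observed_counts))  # lower, because one taxon is missing
```

The plug-in formula has no way of knowing that the fourth taxon exists, so the shallow sample underestimates the Shannon diversity of the community it came from.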

What is DivNet?

So how are you supposed to measure diversity of microbial communities then? One method that is designed to address a lot of these problems is DivNet, an R package for estimating diversity when taxa in the community occur in an ecological network (i.e., a pattern of microbial co-occurrence). DivNet leverages info from multiple samples and can estimate the relative abundance of a taxon in communities where it was unobserved. It also gives accurate estimates of variance in the measured diversity by taking into account sample metadata/covariates.

Probably the most interesting aspect of DivNet is that it allows you to account for ecological networks where taxa positively and negatively co-occur. DivNet estimates diversity using models from compositional data analysis that can handle co-occurrence networks. This is in contrast to most common diversity estimates that are based on the multinomial model that makes assumptions about sampling that prohibit ecological networks (i.e., situations in which taxa positively and negatively co-occur). (Note: you may know the multinomial model from your stats courses in modeling the probability of counts for dice rolls or as a generalization of the binomial distribution.)

You can find a lot more information about DivNet, including algorithmic details, validation, comparison to other methods of estimating diversity, and some important details to keep in mind when using DivNet on your data in the DivNet manuscript.

Why make divnet-rs?

In the getting started tutorial, there is a section called “What does DivNet do that I can’t do already?” (it is worth reading if you haven’t!). So I thought it would be good to answer the question, “What does divnet-rs do that the R implementation of DivNet can’t do already?” The answer is simple: divnet-rs gives you the ability to apply the DivNet algorithm to large datasets. For those without easy access to high performance computing facilities, you will be able to run divnet-rs on typically sized SSU rRNA microbiome datasets on your laptop. divnet-rs is both faster and much more memory efficient than the R implementation. Of course, bioinformatics software is all about tradeoffs and divnet-rs is no different. Compared to the R implementation, it’s harder to install, you have to write some R code specifically to get data in and out of divnet-rs, and not all network and bootstrapping options offered by the R implementation are available in the Rust implementation. That said, I think divnet-rs still fulfills a useful niche by allowing researchers to apply the DivNet algorithm to datasets that are currently too large for the R implementation to handle.

Comparing run time and memory usage

Set up

While developing divnet-rs, I spent a good amount of time profiling and optimizing the code. Rather than talk about that, I wanted to get a high level overview of how the performance of the R and Rust implementations compared on a real dataset. The data I used was the Lee dataset that is included with the DivNet R package. It has 1490 amplicon sequence variants (ASVs), 16 samples, and associated taxonomy and sample info.

So what did I do? First, I took the Lee data and sorted the ASV table in decreasing abundance order. Then I created new datasets from the top 10, 20, 40, 80, 160, 320, 640, and 1280 ASVs. In addition to the full 16 sample datasets, I also created test datasets with only eight samples by randomly picking samples from the ASV table, removing any ASVs that had zero count in the remaining samples, and then taking the top 10, 20, …, 1280 ASVs just like for the 16 sample datasets. I ran everything with the default algorithm tuning in DivNet (6 expectation maximization (EM) iterations (3 burn), 500 Monte-Carlo (MC) iterations (250 burn)) and 2 replicates. I would probably use the “careful” setting (10 EM iterations and 1000 MC iterations) as well as running more replicates if I was actually analyzing data, but this was good enough for this little profiling experiment.
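The dataset construction described above can be sketched in a few lines of Python. This is an illustration rather than the script used for the benchmarks: the table is a plain dictionary with made-up counts, the function names are hypothetical, and "abundance" is assumed to mean the total count of an ASV summed across samples.

```python
import random


def top_n_asvs(table, n):
    """Keep the n ASVs with the largest total count across samples."""
    ranked = sorted(table.items(), key=lambda item: sum(item[1]), reverse=True)
    return dict(ranked[:n])


def pick_samples(table, n_keep, seed=0):
    """Randomly keep n_keep sample columns, then drop ASVs left with no counts."""
    n_samples = len(next(iter(table.values())))
    keep = sorted(random.Random(seed).sample(range(n_samples), n_keep))
    subset = {asv: [counts[i] for i in keep] for asv, counts in table.items()}
    return {asv: counts for asv, counts in subset.items() if sum(counts) > 0}


# Made-up table: ASV id -> counts in four samples.
toy = {
    "asv1": [10, 0, 5, 1],
    "asv2": [0, 0, 0, 7],
    "asv3": [50, 40, 30, 20],
    "asv4": [2, 2, 2, 2],
}

print(list(top_n_asvs(toy, 2)))  # ['asv3', 'asv1']

# Reduced-sample version: pick samples first, then take the top ASVs.
half = pick_samples(toy, n_keep=2, seed=1)
print(list(top_n_asvs(half, 2)))
```

Dropping the zero-count ASVs before taking the top N matters for the reduced datasets, since an ASV that only appeared in a discarded sample would otherwise occupy a slot while contributing no data.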

This isn’t the most scientific profiling job ever, but it should give you a sense of how the run time and memory scale with the number of taxa and samples for both the R and Rust versions of DivNet. For the timing, I ran each dataset three times, and I used the time function to get the elapsed time and the max memory used for each run. Since loading all the R dependencies takes a large proportion of the total run time in the smaller DivNet-R runs, I got the elapsed time of just the divnet function using the tictoc R package. I still used time to get the max memory for these runs though.

One other thing to mention: I ran all of these on a compute cluster. I didn’t think about it until after I had already run everything, but I compiled both divnet-rs and OpenBLAS on a different node than the one that I used to actually run the tests. The compute cluster that I used has a bunch of different types of nodes, so the compiled output of both may not be ideal for the node I actually ran the timings on (e.g., different SIMD instructions, different CPU architectures, etc.). While the timing experiments were running, there were other jobs on the same node running at the same time, so that is another thing that may have influenced the results.

For the R tests, I used R v3.6.2 linked against OpenBLAS v0.3.7 and DivNet v0.3.6. I set DivNet to use only 1 core (ncores = 1) because in all my tests (and on multiple different machines), DivNet is actually slower when using more than one core. For divnet-rs I used v0.1.1 linked against OpenBLAS v0.3.13. I also forced OpenBLAS to use only 1 core (OPENBLAS_NUM_THREADS=1) as that is how R was using OpenBLAS. (As an aside, if you don’t have R linking against an optimized BLAS implementation, you should. It will give you a big performance increase.)

Just keep all this stuff in mind while taking a look at these results.

Results

Here are the run time and memory profiling results:

DivNet timing and memory requirements

Let’s break down a couple of things. The Rust version is faster and more memory efficient, but that’s not surprising – a Rust program should be faster than an R program, and I spent a good amount of time profiling and optimizing the code. In this test, the Rust version is about 20 times faster than the R version.

The other interesting thing to measure is max memory usage. For the largest dataset that I tested (16 samples, 1280 taxa), the Rust version used ~300 MB of RAM as compared to the ~6000 MB used by the R version. When implementing DivNet in Rust, I spent a good amount of time and effort optimizing the run time, and much less worrying about the memory, so it was nice to see it being relatively frugal with the memory.

As you might expect, the 16 sample datasets took longer and used more memory than the 8 sample datasets, but not twice as much time and memory. There was a weird thing in the 1280 taxa test set in the Rust implementation. The 8 sample set actually took a bit more time (but still used less memory) than the 16 sample set. I thought this was strange so I actually ran the 16x1280 and 8x1280 datasets many more times to see if there was some weird random variation in the timings, or if I made some mistake in the testing and mislabeled the datasets or something, but each run gave me roughly the same result as you see here. I’m honestly not sure why this is, but as I mentioned above, these benchmarks aren’t perfect and could be improved.

Differences in the implementations

Before wrapping up, I want to take a little time to highlight some of the more important differences in the R and Rust implementations of DivNet.

Estimating the network

While the original DivNet R code has multiple options for the network parameter, the only network option in divnet-rs is “diagonal”. To explain why this is, here is an excerpt from a GitHub issue where Amy Willis is talking about using DivNet on large datasets:

I would recommend network=”diagonal” for a dataset of this size. This means you’re allowing overdispersion (compared to a plugin aka multinomial model) but not a network structure. This isn’t just about computational expense – it’s about the reliability of the network estimates. Essentially estimating network structure on 20k variables (taxa) with 50 samples with any kind of reliability is going to be very challenging, and I don’t think that it’s worth doing here. In our simulations we basically found that overdispersion contributes the bulk of the variance to diversity estimation (i.e. overdispersion is more important than network structure), so I don’t think you are going to lose too much anyway.

Another benefit of the diagonal network is that it is fast: it’s a simple, vectorizable mathematical operation, as compared to the default method, which will need to do either a Cholesky decomposition or a generalized matrix inversion, or to the “stars” method, which does a whole lot more operations.

divnet-rs isn’t a replacement for DivNet. Its focus is on allowing the core algorithm to be applied to datasets that are too large for the R implementation to handle, and so, only the diagonal network is available in divnet-rs. If your data is small enough that the R implementation can handle it, then I recommend using the original!

Bootstrapping

Another difference from the original is that only the parametric bootstrap is available – you can’t do the nonparametric bootstrap. The parametric bootstrap is the default in the R implementation, and, if you check out the DivNet manuscript, you’ll see that the parametric and nonparametric bootstraps perform similarly.

Setting the random seed

At the time of writing, divnet-rs (v0.1.1) did not allow you to set the seed for the random number generator, which has an impact on reproducibility across runs. (As noted in the updates at the top of this post, v0.2.0 and later let you set the seed manually.) While the DivNet R implementation does allow you to set the random seed prior to the run (for example, just use set.seed(5623472) before running the divnet function), there is a caveat about setting the random seed when running DivNet on multiple cores that you should be aware of. In practice, if you are getting more variability across runs than desired, you can up the EM iterations, the MC iterations, and the replicates, and it should take care of things.

Wrap-up

In this post, I introduced divnet-rs, a Rust implementation of the DivNet R package. It is both faster and more memory efficient than the original, allowing you to run much larger data sets even on your laptop, but it has fewer features and isn’t as straightforward to use. Like any bioinformatics software, there are always tradeoffs, so I encourage you to pick the right tool for the right job: if you have small enough datasets, stick with the R implementation, but if R keeps crashing on you or DivNet is just too slow for whatever reason, give divnet-rs a try.

A simple dashboard for COVID-19 case counts

2020-12-30T00:00:00+00:00

I made a simple COVID-19 dashboard that lets you compare the confirmed case counts for multiple counties as well as viewing the raw counts and the counts per 100,000 people. It plots the case counts over time for as many counties as you want to compare and lets you download the resulting chart. Here is an example for Delaware’s three counties:

Confirmed COVID-19 Cases for Delaware Counties

Being a Delaware resident, I like to pretend everyone already knows everything about Delaware, but just in case you don’t, here you go: New Castle county is in the north and has Wilmington (our largest city) and Newark, home of the University of Delaware. Kent county is in the middle and has Dover (the state capital), and Sussex county is in the south with Lewes and all the beaches. It’s interesting to see the differences between New Castle and Kent counties, which look pretty similar to one another, and Sussex county. At some point, I would like to overlay some demographic or socio-economic data on this to look for any trends, but that’s for a different day.

The data

The COVID-19 case data is from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. Their data is aggregated from a ton of different sources and I encourage you to check out their GitHub page for more information about the data. If you’re interested, they have an article in the Lancet talking about the data and their dashboard. Of course, their dashboard has a lot more bells and whistles than mine!

For the county level population info, I used data from the Atlas of Rural and Small-Town America from the USDA Economic Research Service. It is a really cool and in-depth county level dataset. In addition to the population data, you can find info about jobs, income, veterans and more. They also have a nice interactive map to view everything county-by-county. If you want to download and remix the data yourself, it is all available in CSV and Excel format on their site.

One thing to note is that the county level population data is mostly from 2019 estimates. So, while weighting the case counts by the population data gives a nice way to compare COVID-19 cases across counties, just keep in mind that the population estimates are from last year.
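The per-100,000 weighting is simple arithmetic, sketched here in Python. The function name and the two counties with their numbers are made up for illustration and are not taken from the dashboard's data or source code.

```python
def cases_per_100k(cases, population):
    """Scale a raw case count to cases per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Made-up counties: (confirmed cases, population estimate).
counties = {
    "Big County": (5000, 500_000),
    "Small County": (2500, 100_000),
}

for name, (cases, population) in counties.items():
    print(name, cases, cases_per_100k(cases, population))
# Big County 5000 1000.0
# Small County 2500 2500.0
```

In this made-up example the larger county has twice the raw cases but well under half the rate, which is why the dashboard offers both views.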

The code

If you’re interested in the source code for the dashboard, you can find it on my GitHub page.

It is an Elm app. I haven’t used Elm much before this project, but it was very easy to get started with. The documentation was awesome and the Elm Slack channel is full of helpful people. I think having some experience in Rust and Clojure helped me feel right at home using Elm. Elm seems a bit like a gateway to PureScript or Haskell, so I’m thinking of checking those out as well.

The charts are made with Vega-Lite, a nice tool for data visualization based on Vega and the Grammar of Graphics. It’s declarative, in that you write JSON specifications, Vega-Lite compiles the spec to Vega, and Vega’s runtime handles rendering the chart. To generate the Vega-Lite specs, I used this Elm package in conjunction with Elm ports.

Virome Bytes: Microdiversity of Mediterranean Sea Viruses

2020-02-29T00:00:00+00:00

Virus microdiversity

Marine viruses are probably the most well-characterized group of environmental viruses. The oceans were one of the first ecosystems where the abundance and importance of environmental viruses was truly realized, and the relative ease of collecting viruses from seawater (as compared to, say, soils) has helped further their study in this environment. However, even within marine habitats, there’s still a lot that we don’t know about viruses and their ecology.

The microdiversity of viruses is a relatively new area of study in environmental viral ecology. Microdiversity, here, refers to mutation frequencies in genomes within the same population. It accompanies trends like the shift from OTUs to ASVs in focusing on smaller differences in environmental DNA sequences. In a paper entitled Trends of Microdiversity Reveal Depth-Dependent Evolutionary Strategies of Viruses in the Mediterranean, Felipe Coutinho and colleagues use microdiversity to study the selective pressures exerted on viral genomes at different depths in the ocean and the Mediterranean Sea.

Coutinho et al. examined four viral shotgun metagenomes (viromes) sampled from the surface, the deep chlorophyll maximum (DCM), and the bathypelagic. To increase their sample size, the researchers supplemented their own samples with viromes from the Tara Oceans expedition and Station ALOHA, which were also sampled over multiple depths. Microdiversity was measured using pN/pS ratios, which are analogous to dN/dS ratios and are calculated as the number of nonsynonymous polymorphisms per nonsynonymous site divided by the number of synonymous polymorphisms per synonymous site.
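The pN/pS definition above is easier to follow with numbers. Here is a minimal Python sketch of the arithmetic. It is not the pipeline used in the paper, and the function name and the counts are made up for illustration.

```python
def pn_ps(nonsyn_polymorphisms: int, nonsyn_sites: int,
          syn_polymorphisms: int, syn_sites: int) -> float:
    """Compute pN/pS from polymorphism counts and site counts."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_polymorphisms == 0:
        raise ValueError("pS is zero, so pN/pS is undefined")
    p_n = nonsyn_polymorphisms / nonsyn_sites  # nonsynonymous, per site
    p_s = syn_polymorphisms / syn_sites        # synonymous, per site
    return p_n / p_s


# Hypothetical gene: 6 nonsynonymous polymorphisms over 300 nonsynonymous
# sites, and 4 synonymous polymorphisms over 100 synonymous sites.
ratio = pn_ps(6, 300, 4, 100)
print(ratio)
```

As with dN/dS, a ratio below 1 is usually read as a sign of purifying selection, and a ratio above 1 as a sign of positive selection.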

Different depths, different selective pressures

The authors concluded that marine viruses at different depths show signs of being under different primary selection pressures.

The authors' model of the observed patterns of microdiversity

In the deep ocean, where cells and viruses are found in lower numbers, viral metabolism proteins are under the greatest selection pressure. This is presumably to help increase traits such as burst size that would maximize the number of viral progeny produced, thereby increasing the likelihood that one of those phages encounters a suitable host.

In the DCM, viruses accumulate mutations in genes used for host recognition, so that they can expand their host range to compete with other phages. This is necessary because while phage populations in the DCM are large, this study found them to be highly clonal (low diversity). Having lots of copies of the same phage would presumably make competition for hosts intense and encourage host switching.

Viruses from the surface samples had, on average, the greatest number of mutations, but the lowest rates of microdiversity. The high rate of mutation was attributed to high levels of UV radiation in surface waters. The low rate of microdiversity may be due to relatively high viral counts combined with intermediate diversity. This would result in lower rates of competition for host cells and less need to increase traits like burst size, which may be more important in low cell count environments.

Overall, this is an interesting study that used environmental gradients to examine specific factors driving viral ecology and evolution in the natural environment.

Citation: Coutinho, FH. et al. Trends of Microdiversity Reveal Depth-Dependent Evolutionary Strategies of Viruses in the Mediterranean. mSystems 4 (6) e00554-19 (2019). doi: 10.1128/mSystems.00554-19.

Beginning Bioinformatics: What’s a terminal? What’s the command line?

2019-12-15T00:00:00+00:00

Installing and running typical bioinformatics programs requires a lot of background knowledge. For beginners, terms like “command line,” “terminal,” “changing directories,” and “archive file” might be unfamiliar. Even instructions to type make can be confusing. There is a lot of prerequisite knowledge needed to get started with installing and using bioinformatics software.

So, let’s start with the basics: terminals and the command line.

Graphical vs. command line interfaces

You’re probably reading this blog in a web browser. Whether it’s on a phone or on a laptop, your web browser is a graphical user interface (GUI). We interact with GUIs by clicking around with the mouse, or if we’re using a mobile device, by tapping and swiping on the screen. Most of the programs we use on our phones and computers are GUIs. Finder on a Mac and Windows Explorer (or File Explorer) on a PC are GUIs that let you browse and manage files on your computer. Chrome, Safari, and Internet Explorer are GUIs for browsing the web. So a GUI is a program with a graphical user interface, and GUIs make up the majority of the programs you probably use on a daily basis.

Firefox is a GUI for web browsing

Compare this to programs with so-called command-line interfaces (CLI). Rather than pointing and clicking, you interact with these programs by typing things at the command line, generally through a terminal. One example of a program with a command-line interface is find, which finds files based on some user-specified criteria. Most bioinformatics programs don’t have graphical user interfaces. If you want to learn to do bioinformatics, you’re almost certainly going to have to get comfortable with the command line.

The find program's CLI

Some programs have both a graphical and a command line interface, like Cytoscape, a program for visualizing networks. Why have both? Well, there are some tasks that are easier to accomplish using a graphical user interface and some that are easier with a command line interface. For example, if you need to explore your network–color it, change the size of nodes and edges, make it look nice and pretty–you’re probably going to want to use the Cytoscape GUI for that. If you have an algorithm or process that you want to apply to hundreds of networks, then you’re definitely going to want to use the command line interface instead.

The terminal and the command line

A terminal is a text-based interface to your computer. Depending on who you’re talking to, you might hear the terminal called a couple of different things. The console, the shell, the command prompt–whatever they call it, people are generally talking about the same thing: a place where you enter commands and interact with command-line programs. Of course, all of these terms have more precise definitions (just ask a systems admin!). For now though, let’s just agree to call it the terminal and not worry too much about it.

As I mentioned earlier, you control a program with a command line interface by typing commands into a terminal. Here is an example of how you might use the find command:

find . -name '*.txt'

Let’s talk about this just a bit. The first thing there is the word find. find is the name of the command we’re running. (If you’re reading program documentation or a blog and you see a word in font that looks like this, then it generally means it’s either a command, something you’re typing at the terminal, or some snippet of code.) Next are the arguments that we pass to the find command/program. Arguments let us modify the behavior of a command or program. In this case, the . tells find to look in the current directory, and the -name '*.txt' bit tells find to look for files that end with .txt. Don’t worry too much if that doesn’t make sense right now. We’ll get into the details of actually running command line programs in a different post. For now, just know that command line programs are those that you control by typing commands and arguments into the terminal.
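If you happen to know a little Python already, it may help to see roughly the same search written as a script. This is only a sketch for comparison, not how find works internally. The function name find_by_name is made up for this example.

```python
from pathlib import Path


def find_by_name(start: str, pattern: str) -> list:
    """Return paths under `start` whose name matches `pattern`."""
    # rglob searches `start` and all of its subdirectories, much like
    # find does when it is given a starting directory.
    return sorted(str(path) for path in Path(start).rglob(pattern))


if __name__ == "__main__":
    # "." plays the role of the first argument to find,
    # and "*.txt" plays the role of the -name argument.
    for path in find_by_name(".", "*.txt"):
        print(path)
```

Notice how much shorter the command-line version is. That brevity is a big part of why command-line tools are so popular for everyday file wrangling.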

Let me just mention one more thing. If you’re reading program documentation or tutorials about the command line, you might see commands that look like they start with a $ character like this:

$ find . -name '*.txt'

The $ character isn’t actually part of the command. Some authors will put it in front of the actual command to represent the command prompt (the place where you’re actually typing in the terminal). It’s just there to make it clearer that what you see is a command that you should type into a terminal.

How do I get a terminal?

If you’re on a Mac, you should have a program called Terminal already installed. To open it, click on the Launchpad, type Terminal into the search box, and double click on its icon. iTerm2 is another popular terminal emulator for Macs. If you’re using Linux, then you’ve got tons of options for terminals as well. Windows is a bit different from the other two, but it does have a terminal. Check out this software repository and this guide for more information on the Windows command prompt. I personally don’t use a PC for work, but many people I know who do use PCs for bioinformatics use Cygwin, which lets you get a more Linux-y command line experience on your PC.

Wrap up

In this post, we talked about graphical user interfaces versus command line interfaces, what a terminal is, what the command line is, and how to actually get a terminal for your computer. This is all foundational stuff that you’ll be getting a lot more experience with as you learn more about bioinformatics. Hopefully, this guide helps clear up any confusion you may have had!

If you want some more hands-on info about command line basics, check out this nice tutorial from Django Girls.

Using Sass in Clojure Ring apps

2019-12-12T00:00:00+00:00

So you want to use Sass instead of plain CSS in your Clojure Ring web app, but you’re not sure how to get it set up? No problem! Let’s walk through it together.

Sass + Clojure

According to the official website, Sass is CSS with superpowers. It’s a stable and powerful CSS extension language with two different syntaxes: Sass, the original, and Sassy CSS (SCSS), a newer syntax that is a superset of CSS. If you’re not too familiar with Sass, check out this tutorial.

Install the sass binary
Set up a toy Clojure Ring app
Set up SCSS

Install the sass binary

First off, you’re going to need to install a Sass preprocessor. To use Sass, you write .sass (if you’re using the Sass syntax) or .scss (if you’re using the Sassy CSS syntax) files and then compile them to plain ol’ CSS using one of the many Sass compilers.

I’m using a Mac, so installing Sass is as easy as running this Homebrew command:

$ brew install sass/sass/sass

Now, one tricky thing is that Sass has a lot of different implementations. Sass was originally written in Ruby, so there’s the now deprecated Ruby Sass. Additionally, there is LibSass, a C/C++ port of the Sass engine, Dart Sass, which compiles to JavaScript, and many others. It really doesn’t matter which one you use as long as you’ve got one of them installed.

For the rest of the tutorial, I’m going to assume that you’ve got Dart Sass, as that is the primary Sass implementation. Its binary is called sass.

Set up a toy Clojure Ring app

To show you how to get Sassy with your CSS, let’s start by setting up an example Clojure Ring app. Assuming that you already have Leiningen installed, run this in your favorite terminal app:

$ lein new sassy-clj && cd sassy-clj

Fix project.clj

Alright, now we can make sure the project.clj file is set up nice and neat. To do that, we’re going to need to change a couple of different things in the defproject macro.

Add the Ring libraries to the :dependencies vector.
Set up the Ring handler.
Add in the lein-ring and lein-scss plugins.

All together, it should look something like this. (I’ve added comments to show the things that you need to add.)

(defproject sassy-clj "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "EPL-2.0 OR GPL-2.0-or-later WITH Classpath-exception-2.0"
            :url "https://www.eclipse.org/legal/epl-2.0/"}
  :dependencies [[org.clojure/clojure "1.10.0"]
                 ;; Include the Ring libraries.
                 [ring "1.8.0"]
                 ;; Include some nice app defaults.
                 [ring/ring-defaults "0.3.2"]]
  :repl-options {:init-ns sassy-clj.core}

  ;; Include the needed plugins.
  :plugins [[lein-ring "0.12.5"]
            [lein-scss "0.3.0"]]

  ;; Set up the Ring server handler.
  :ring {:handler sassy-clj.core/app})

After you’ve made those changes, don’t forget to run lein deps in your project’s source directory to download the needed dependencies.

Set up the assets directories

Now then, let’s make some folders to hold the HTML, SCSS, and generated CSS files.

$ mkdir -p resources/html resources/scss resources/public/css

The resources/scss directory is where we’ll keep the *.scss files that we’ll actually be editing, and the resources/public/css directory will hold all of the generated CSS files. If you guessed that resources/html is where we will keep our HTML files, you guessed right!

Set up a sweet home page

Now let’s make a tiny little homepage for our app. First, make a new file called resources/html/home.html and put this in it.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8"/>
    <link rel="stylesheet" href="css/main.css">
    <title>Sassy CSS for Clojure Ring Apps</title>
  </head>
  <body>
    <h1>Sassy Clj</h1>
    <p>Let's use Sassy CSS in a Clojure Ring app!</p>
  </body>
</html>

You can see that we’ve linked to the css/main.css stylesheet. We won’t be writing this by hand. Instead, we will set up Leiningen so that it is generated automatically!

Now, edit the sassy-clj.core namespace found in src/sassy_clj/core.clj like so:

(ns sassy-clj.core
  (:require [ring.middleware.defaults :refer [wrap-defaults site-defaults]]
            [ring.util.response :as response]))

This will let us use the site-defaults, which among other things, will allow serving static assets in the resources/public folder. Also, we want to use Ring’s response helpers.

Next, set up a basic handler function to respond to requests.

(defn handler [request]
  (-> (response/resource-response "/html/home.html")
      (response/content-type "text/html")))

This function will respond to all requests with our homepage.

Finally, we define an app var to be our app’s main handler. This matches what we specified in the project.clj file.

(def app
  (wrap-defaults handler site-defaults))

You’ll notice that I’ve used the wrap-defaults middleware function around the handler we wrote. This is to get those sweet site-defaults in the response.

Start up a development server

By now we should have a working app. Let’s check it out! To do so, start up the server like so:

$ lein ring server-headless

Browse to http://localhost:3000/, and you should see our beautiful home page!

A very basic homepage

Set up SCSS

Edit project.clj again

Now that we have our test project, it’s time to get sassy with some CSS. We don’t want to be compiling SCSS files by hand each time we edit them. Instead, we will be using the lein-scss plugin that we included in our project.clj file earlier. Before we can use it, we need a bit more set up.

We need to tell lein-scss how we want our SCSS files to be compiled. We do that by adding an :scss key with a hash map of options to the end of the defproject macro in project.clj.

  :scss {:builds
         {:development {:source-dir "resources/scss"
                        :dest-dir "resources/public/css"
                        :executable "sass"
                        :args ["--style" "expanded"]}
          :production {:source-dir "resources/scss"
                       :dest-dir "resources/public/css"
                       :executable "sass"
                       :args ["--style" "compressed"]}}}

In the options map, we specify the :builds key and then another map where we can specify multiple different builds. This is nice when you want different options for development and production. For example, we’ve specified the expanded style for development, but the compressed style for production.

There are a couple of other things to note here. We use the :source-dir key to specify that we will store our SCSS files in resources/scss, and the :dest-dir key to specify that we want the compiled CSS files to live in resources/public/css. Finally, we tell lein-scss to use the sass executable, and add some command line arguments to be passed in to the sass program.

Remember how I said there were a lot of different options for Sass compilers? Well the :executable "sass" option is for using sass. Of course, if you’re using sassc or scss instead, you can use :executable "sassc" or :executable "scss", and it’ll work just fine!

Make a main.scss file

Once that is set up, make a new file called main.scss in the resources/scss folder and add the following to it:

$font-color: #E47320;
$font-family: Courier;

body {
  font-family: $font-family;
  color: $font-color;
}

Compile SCSS to CSS

Those are some excellent styles, but if you reload the homepage now, you’ll see that they aren’t being applied. This is because we haven’t told lein-scss to actually compile main.scss to main.css yet. Here is how to do that.

$ lein scss :development once
[23:36:01] Running once
[23:36:02] ./sassy-clj/resources/scss/main.scss
       --> ./sassy-clj/resources/public/css/main.css
Elapsed time: 226.240492 msecs [Total time]

Note that we typed :development and not development. The latter will not work.

If you reload the homepage again, you’ll see our beautiful styles have been applied!

A quite stylish homepage!

Setting up auto-compilation

Now you probably don’t want to be manually running lein scss every time you edit your SCSS files. To avoid this, lein-scss comes with an auto mode that watches your SCSS source directory for changes and automatically recompiles the CSS as necessary. You can run it like this:

$ lein scss :development auto

Finally, if you’re ready for production, you can pass in :production instead of :development and you’ll be good to go!

$ lein scss :production once

And that’s it! Go forth and be sassy!
