<h1>Bioinformatics by hand: Neighbor-joining trees</h1>
<p>Tender Is The Byte (Ryan Moore), 2022-08-31. https://www.tenderisthebyte.com/blog/2022/08/31/neighbor-joining-trees</p>
<div class="post-toc">
<h4 class="post-toc--header" id="contents">Contents</h4>
<ul>
<li><a href="#bioinformatics-by-hand">Bioinformatics by hand</a></li>
<li><a href="#neighbor-joining-trees">Neighbor-joining trees</a></li>
<li><a href="#pros-and-cons-of-neighbor-joining-trees">Pros and cons of neighbor-joining trees</a></li>
<li><a href="#how-to-neighbor-join">How to neighbor-join</a></li>
<li><a href="#formulas">Formulas</a></li>
<li><a href="#example-1">Example 1</a></li>
<li><a href="#on-distance-matrices">On Distance Matrices</a></li>
<li><a href="#example-2">Example 2</a></li>
<li><a href="#wrapping-up">Wrapping up</a></li>
</ul>
</div>
<h2 id="bioinformatics-by-hand">Bioinformatics by hand</h2>
<p>I’ve been teaching bioinformatics at the University of Delaware for roughly the last year now. I had never been in a bioinformatics class prior to teaching; my degrees are in ecology and marine science, so all of my bioinformatics knowledge came from research experience. It’s been really interesting to see bioinformatics taught in a formal setting. One thing I’ve noticed is the disconnect that can occur between students and instructors when students without programming experience are asked to perform “hands-on” exercises.</p>
<p>In an effort to de-mystify bioinformatics, instructors often have students manually perform a task that would normally be done computationally. While these exercises are valuable and often succeed in their goal, I have noticed that many students who are not used to being presented with code or equations tend to have difficulty implementing algorithms by hand, regardless of difficulty. This can cause students to shut down and question whether they are in the correct field, rather than empower them.</p>
<p>When this occurs, there seem to be two underlying issues: First, even at the collegiate level, many students are not confident in their ability to do math. This issue I will leave alone, as it cannot be solved in a single course or assignment at the graduate level. Second, the way that a computer would perform a procedure is <a href="https://news.mit.edu/2009/brain-data-0825">not necessarily the same</a> way a human would perform it. Sometimes, this can create a gap between students with little or no computing background and instructors who are highly familiar with algorithms.</p>
<p>In this post, I’ll walk you through the process of building neighbor-joining trees. Building phylogenetic trees by hand seems at first like a daunting task, but I promise it’s much easier than you think!</p>
<h2 id="neighbor-joining-trees">Neighbor-joining trees</h2>
<p>Neighbor-joining (NJ) is one of many methods used for creating phylogenetic (evolutionary) and phenetic (trait-based similarity) trees. The method was first introduced in a <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040454">1987 paper</a> and is still in use today.</p>
<p>Neighbor-joining uses a distance matrix to construct a tree by determining which leaves are “neighbors” (i.e., children of the same internal parent node) via an iterative clustering process. A neighbor-joining tree aims to show the minimum amount of evolution needed to explain differences among objects, which makes it a <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040056">minimum evolution method</a>.</p>
<p>There has been <a href="https://doi.org/10.1093/molbev/msl072">some debate</a> about the mathematical behavior of neighbor-joining trees. Originally, neighbor-joining was thought to be most closely related to tree methods that use <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares</a> to estimate branch lengths, but <a href="https://doi.org/10.1093/molbev/msl072">further investigation</a> showed that it actually shares more properties with “balanced” minimum evolution methods. You don’t need to know anything about these different methods in order to perform neighbor-joining, but if you would like to read more about them, there is an excellent explanation in <a href="https://doi.org/10.1007/s11538-010-9510-y">this paper</a>.</p>
<p>The type of tree produced depends on the input. If you provide a distance matrix based on evolutionary data (e.g., a multiple sequence alignment), you will get a phylogenetic tree. If you input distances based on non-evolutionary data (e.g., phenotypic traits), then you will get a phenetic tree. Note that an NJ tree doesn’t have to contain only organisms. You can make NJ trees for anything you can represent and compare with a distance matrix.</p>
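<p>To make the “distance matrix from an alignment” idea concrete, here is a minimal Python sketch. It assumes the simplest possible distance, the proportion of aligned positions that differ (often called the p-distance), which is only one of many options and is not something this post prescribes. The sequences and the names <code class="language-plaintext highlighter-rouge">p_distance</code>, <code class="language-plaintext highlighter-rouge">distance_matrix</code>, and <code class="language-plaintext highlighter-rouge">toy_alignment</code> are made up for illustration.</p>

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Gap characters are treated like any other character in this sketch.
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(alignment):
    """Build a symmetric matrix (list of lists) from a dict of aligned sequences."""
    names = list(alignment)
    index = {name: pos for pos, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for a, b in combinations(names, 2):
        dist = p_distance(alignment[a], alignment[b])
        # Fill both halves so the matrix stays symmetric.
        matrix[index[a]][index[b]] = dist
        matrix[index[b]][index[a]] = dist
    return names, matrix


# Made-up toy alignment, for illustration only.
toy_alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACGAACTT",
}
names, matrix = distance_matrix(toy_alignment)
for name, row in zip(names, matrix):
    print(name, row)
```

<p>Real analyses usually use a substitution model rather than raw mismatch counts, so treat this only as a way to see where the numbers in a distance matrix can come from.</p>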
<p>NJ trees are simple to make and require only basic operations (addition, subtraction, division), but can seem daunting because of the number of steps required. Here, I will show you how to make two small neighbor-joining trees by hand (or, by spreadsheet).</p>
<h2 id="pros-and-cons-of-neighbor-joining-trees">Pros and cons of neighbor-joining trees</h2>
<p>There are a lot of different ways to build phylogenetic and other trees, so how does neighbor-joining compare?</p>
<h3 id="advantages">Advantages</h3>
<ul>
<li>It’s simple and easy to understand.</li>
<li>It’s <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040126">fast</a> and computationally inexpensive compared to other popular methods. Maximum-likelihood and Bayesian methods especially are <a href="https://doi.org/10.1093/molbev/msw042">much slower</a>.</li>
<li>It works. Neighbor-joining has been found to be <a href="https://doi.org/10.1007/s00453-007-9116-4">topologically accurate</a> and to sometimes <a href="https://doi.org/10.1093/molbev/msw042">out-perform more complicated methods</a> like maximum-likelihood and Bayesian inference.</li>
</ul>
<h3 id="disadvantages">Disadvantages</h3>
<ul>
<li>You lose data. When you squish down sequence alignment or other data into distances, you are performing <a href="https://en.wikipedia.org/wiki/Data_reduction">data reduction</a>. This isn’t necessarily a bad thing (ordination methods like <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> also do this), but you should keep it in mind.</li>
<li>You only get one possible tree. Other methods such as maximum-likelihood and Bayesian inference return multiple different trees, i.e. evolutionary hypotheses, which can be useful for some analyses.</li>
<li>Neighbor-joining can sometimes result in <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040126">negative branch lengths</a>. Note that this does not affect the topology of the tree, just branch lengths.</li>
</ul>
<h2 id="how-to-neighbor-join">How to neighbor-join</h2>
<p>To begin neighbor-joining, you need a distance matrix. A distance matrix is a square matrix containing pairwise distances between members of some group. It must be symmetric (i.e., the distance from A to B is the same as the distance from B to A) and the distance from an object to itself must be 0. The distance does not necessarily need to be <a href="https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition">metric</a>, but in at least one instance <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040454">a metric distance slightly outperformed a non-metric distance</a>.</p>
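<p>The two requirements above (symmetry and a zero diagonal) are easy to check before you start. The following is a small sketch, not part of any library; the function name <code class="language-plaintext highlighter-rouge">is_valid_distance_matrix</code> and the tolerance <code class="language-plaintext highlighter-rouge">tol</code> are assumptions, and the matrix is the one used in Example 1 further down.</p>

```python
def is_valid_distance_matrix(matrix, tol=1e-9):
    """Return True if `matrix` is square, symmetric, and has a zero diagonal."""
    n = len(matrix)
    # Square: every row has as many entries as there are rows.
    if any(len(row) != n for row in matrix):
        return False
    for i in range(n):
        # The distance from an object to itself must be 0.
        if abs(matrix[i][i]) > tol:
            return False
        for j in range(i + 1, n):
            # Symmetric: d(ij) must equal d(ji).
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    return True


example_1 = [
    [0, 4, 5, 10],
    [4, 0, 7, 12],
    [5, 7, 0, 9],
    [10, 12, 9, 0],
]
print(is_valid_distance_matrix(example_1))  # True
```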
<p>Once you have a matrix, you can begin neighbor-joining.</p>
<p>The neighbor-joining process consists of three steps:</p>
<ol>
<li>Initiation</li>
<li>Iteration</li>
<li>Termination</li>
</ol>
<p><em>A quick note on the formulas (which can be found in the section below this one): You may notice a slight difference in the equations between this tutorial and another. Do not panic. These are only slight algebraic differences that do not affect the final answer, only the intermediate numbers.</em></p>
<h3 id="initiation">Initiation</h3>
<p>In the <strong>initiation</strong> step, we define a set of leaf nodes, <code class="language-plaintext highlighter-rouge">T</code>, and set <code class="language-plaintext highlighter-rouge">L</code> equal to the number of leaf nodes. These are the nodes at the “ends” of trees and therefore do not have any child nodes. You should have one leaf node for each item you want to compare. For example, if you are placing sequences on a tree, you will have one leaf node per sequence.</p>
<h3 id="iteration">Iteration</h3>
<p>The <strong>iteration</strong> step is where most of the action takes place. Virtually all of our calculations are made in this step, and, as the name implies, we will repeat these calculations over and over until some conclusion is reached.</p>
<p>First, we calculate the <strong>net divergence (r)</strong> of each node in the matrix. You can think of this as the summed distance from that node to all of the others, scaled by <code class="language-plaintext highlighter-rouge">1/(L-2)</code>.</p>
<p>Next, we calculate the <strong>adjusted distance (D)</strong> between each pair of nodes, which is based on the pairwise distance in the starting matrix and the divergence of each node. The pair of nodes with the lowest adjusted distance are <strong>neighbors</strong> and share a parent node.</p>
<p>Next, we declare the parent node and calculate the distance from each of the neighbors to the shared parent. This is also the step where I like to add the siblings and parent to the tree.</p>
<p>At this point, our goal is to construct a new distance matrix. To do this, we remove the two nodes that we earlier determined to be neighbors from the distance matrix and replace them with the newly formed parent node. New <strong>pairwise distances (d)</strong> are calculated between the new parent node and other nodes in the matrix. Any other distances (i.e., pairwise comparisons present in the new matrix and the previous matrix) can simply be transferred to the new matrix.</p>
<p><em>Note: In the formulas and calculations below, adjusted distances use a capital <code class="language-plaintext highlighter-rouge">D</code>, whereas pairwise distances use a lowercase <code class="language-plaintext highlighter-rouge">d</code>. Try not to get them mixed up!</em></p>
<p>One thing to be aware of is that, after the first iteration, the neighbors are not restricted to being leaves, and may in fact be internal parent nodes.</p>
<p>Each iteration step ends with a new distance matrix that is one node smaller than the one in the previous step (e.g., <code class="language-plaintext highlighter-rouge">(L-1) by (L-1)</code> after the first iteration). Iteration continues until there are only two nodes remaining in the matrix.</p>
<h3 id="termination">Termination</h3>
<p>The final step is <strong>termination</strong>.</p>
<p>The only task remaining is to join the two nodes that remain after iteration with a single edge to complete the tree!</p>
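<p>The three steps above can be sketched in Python as follows. This is one possible arrangement under stated assumptions, not a reference implementation: the function name <code class="language-plaintext highlighter-rouge">neighbor_join</code>, the use of <code class="language-plaintext highlighter-rouge">frozenset</code> keys for pairwise distances, and the rule of breaking ties by taking the first pair in list order are all choices made for this illustration. The formulas are the ones used in the worked examples in this post.</p>

```python
from itertools import combinations


def neighbor_join(names, distances, parent_labels):
    """Return the tree as a list of (node, node, branch_length) edges.

    names: node labels, in matrix order
    distances: dict mapping frozenset({a, b}) to the pairwise distance d(ab)
    parent_labels: labels to hand out to new internal nodes, in order
    """
    nodes = list(names)          # Initiation: the set of nodes T
    d = dict(distances)          # working copy of the distance matrix
    labels = iter(parent_labels)
    edges = []

    # Iteration: repeat until only two nodes remain.
    while len(nodes) > 2:
        L = len(nodes)
        # Net divergence r of every node.
        r = {
            i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (L - 2)
            for i in nodes
        }
        # Pair with the lowest adjusted distance D; min() keeps the first
        # pair it sees on a tie, i.e. the one highest in the list.
        i, j = min(
            combinations(nodes, 2),
            key=lambda pair: d[frozenset(pair)] - (r[pair[0]] + r[pair[1]]),
        )
        k = next(labels)
        d_ij = d[frozenset((i, j))]
        # Branch lengths from each child to the new parent.
        edges.append((i, k, (d_ij + r[i] - r[j]) / 2))
        edges.append((j, k, (d_ij + r[j] - r[i]) / 2))
        # Distances from the new parent to every remaining node.
        rest = [m for m in nodes if m not in (i, j)]
        for m in rest:
            d[frozenset((k, m))] = (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
            ) / 2
        nodes = [k] + rest

    # Termination: join the last two nodes with a single edge.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Pairwise distances from the Example 1 matrix in this post.
example_1 = {
    frozenset("AB"): 4, frozenset("AC"): 5, frozenset("AD"): 10,
    frozenset("BC"): 7, frozenset("BD"): 12, frozenset("CD"): 9,
}
tree = neighbor_join("ABCD", example_1, "ZYXW")
for edge in tree:
    print(edge)
```

<p>Run on the Example 1 matrix, this produces the edges A-Z (1), B-Z (3), Z-Y (2), C-Y (2), and Y-D (7), which match the hand calculation in that example. The caller must supply enough parent labels (at least the number of nodes minus two).</p>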
<p>Now that we’ve braved the written explanation, it’s time to look at some examples to make all of these steps clearer!</p>
<h2 id="formulas">Formulas</h2>
<p>These are the formulas for each of the calculations we will perform (you can find a more nicely formatted version in the <a href="/assets/data/posts/nj_trees/neighbor-joining_examples_spreadsheet.xlsx">Excel file</a> containing the examples).</p>
<h3 id="net-divergence">Net divergence</h3>
<p>Net divergence <code class="language-plaintext highlighter-rouge">r</code> for a node <code class="language-plaintext highlighter-rouge">i</code> with 3 other nodes <code class="language-plaintext highlighter-rouge">(j, k, and l)</code>:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">r(i) = [1/(L-2)] * [d(ij) + d(ik) + d(il)]</code></pre></figure>
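<p>As a rough sketch, the net divergence formula can be written as a short Python function. The name <code class="language-plaintext highlighter-rouge">net_divergence</code> and the list-of-lists matrix layout are assumptions for illustration; the matrix is the one from Example 1.</p>

```python
def net_divergence(matrix, i):
    """r(i): summed distance from node i to all other nodes, divided by (L - 2)."""
    L = len(matrix)
    if L < 3:
        raise ValueError("net divergence needs at least 3 nodes")
    # The diagonal entry d(ii) is 0, so summing the whole row is safe.
    return sum(matrix[i]) / (L - 2)


example_1 = [
    [0, 4, 5, 10],
    [4, 0, 7, 12],
    [5, 7, 0, 9],
    [10, 12, 9, 0],
]
r = [net_divergence(example_1, i) for i in range(len(example_1))]
print(r)  # [9.5, 11.5, 10.5, 15.5]
```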
<h3 id="adjusted-distance">Adjusted distance</h3>
<p>Adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for two nodes <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">j</code>:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">D(ij) = d(ij) - [r(i) + r(j)]</code></pre></figure>
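<p>Applied to every pair at once, the adjusted distance formula might look like the sketch below. The function name <code class="language-plaintext highlighter-rouge">adjusted_distances</code> and the dictionary layout are assumptions; the numbers are the pairwise distances and net divergences from the first iteration of Example 1.</p>

```python
from itertools import combinations


def adjusted_distances(d, r):
    """Return {(i, j): D(ij)} for every pair of nodes listed in r."""
    return {
        (i, j): d[frozenset((i, j))] - (r[i] + r[j])
        for i, j in combinations(r, 2)
    }


# Pairwise distances d and net divergences r from Example 1, iteration 1.
d = {
    frozenset("AB"): 4, frozenset("AC"): 5, frozenset("AD"): 10,
    frozenset("BC"): 7, frozenset("BD"): 12, frozenset("CD"): 9,
}
r = {"A": 9.5, "B": 11.5, "C": 10.5, "D": 15.5}

D = adjusted_distances(d, r)
for pair, value in D.items():
    print(pair, value)
```

<p>The pair (or pairs) with the lowest value in <code class="language-plaintext highlighter-rouge">D</code> are the candidate neighbors.</p>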
<h3 id="distance-from-child-to-parent">Distance from child to parent</h3>
<p>Distance from child <code class="language-plaintext highlighter-rouge">i</code> to parent <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">d(ik)</code>, where <code class="language-plaintext highlighter-rouge">j</code> is the neighbor of <code class="language-plaintext highlighter-rouge">i</code>:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ik) = [d(ij) + r(i) - r(j)] / 2</code></pre></figure>
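<p>Here is a small sketch of the child-to-parent calculation, using the sign convention that the worked examples in this post use (<code class="language-plaintext highlighter-rouge">r(i) - r(j)</code> for child <code class="language-plaintext highlighter-rouge">i</code>). The function name <code class="language-plaintext highlighter-rouge">branch_lengths</code> is an assumption. It also relies on the fact that the two branch lengths always add up to <code class="language-plaintext highlighter-rouge">d(ij)</code>, which follows from adding the two formulas together.</p>

```python
def branch_lengths(d_ij, r_i, r_j):
    """Return (d_ik, d_jk): distances from neighbors i and j to their parent k."""
    d_ik = (d_ij + r_i - r_j) / 2
    # Equivalent to (d_ij + r_j - r_i) / 2, since d_ik + d_jk = d_ij.
    d_jk = d_ij - d_ik
    return d_ik, d_jk


# Example 1, iteration 1: d(AB) = 4, r(A) = 9.5, r(B) = 11.5
print(branch_lengths(4, 9.5, 11.5))  # (1.0, 3.0)
```

<p>If the net divergences differ by more than <code class="language-plaintext highlighter-rouge">d(ij)</code>, one of the two results is negative, which is how the negative branch lengths mentioned earlier can arise.</p>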
<h3 id="distance-from-non-child-to-new-node">Distance from non-child to new node</h3>
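<p>Distance from any other (non-child) node <code class="language-plaintext highlighter-rouge">m</code> to the new parent node <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">d(mk)</code>, where <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">j</code> are the children of <code class="language-plaintext highlighter-rouge">k</code>:</p>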
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(mk) = [d(im) + d(jm) - d(ij)] / 2</code></pre></figure>
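<p>This last formula is what drives the rebuilding of the distance matrix. A minimal sketch, assuming pairwise distances are stored in a dictionary keyed by <code class="language-plaintext highlighter-rouge">frozenset</code> pairs (the function name <code class="language-plaintext highlighter-rouge">join_neighbors</code> is hypothetical):</p>

```python
def join_neighbors(nodes, d, i, j, k):
    """Replace neighbors i and j with their parent k.

    Returns the new node list and the new pairwise-distance dictionary.
    """
    rest = [m for m in nodes if m not in (i, j)]
    # Distances that involve neither i nor j carry over unchanged.
    new_d = {
        pair: dist for pair, dist in d.items()
        if i not in pair and j not in pair
    }
    # Distances from the new parent k to every remaining node m.
    d_ij = d[frozenset((i, j))]
    for m in rest:
        new_d[frozenset((k, m))] = (
            d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
        ) / 2
    return [k] + rest, new_d


# Example 1 starting matrix; A and B are joined under the new node Z.
d = {
    frozenset("AB"): 4, frozenset("AC"): 5, frozenset("AD"): 10,
    frozenset("BC"): 7, frozenset("BD"): 12, frozenset("CD"): 9,
}
new_nodes, new_d = join_neighbors(list("ABCD"), d, "A", "B", "Z")
print(new_nodes)                # ['Z', 'C', 'D']
print(new_d[frozenset("ZC")])   # 4.0
print(new_d[frozenset("ZD")])   # 9.0
```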
<h2 id="example-1">Example 1</h2>
<p>There’s a good chance that even if you read the description of neighbor-joining above, you still don’t have a great idea of how to do it. That should become clearer with some examples.</p>
<p>Here is our starting matrix:</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>A</strong></th>
<th style="text-align: center"><strong>B</strong></th>
<th style="text-align: center"><strong>C</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>A</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">4</td>
<td style="text-align: center">5</td>
<td style="text-align: center">10</td>
</tr>
<tr>
<td><strong>B</strong></td>
<td style="text-align: center">4</td>
<td style="text-align: center">0</td>
<td style="text-align: center">7</td>
<td style="text-align: center">12</td>
</tr>
<tr>
<td><strong>C</strong></td>
<td style="text-align: center">5</td>
<td style="text-align: center">7</td>
<td style="text-align: center">0</td>
<td style="text-align: center">9</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">10</td>
<td style="text-align: center">12</td>
<td style="text-align: center">9</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
<h4 id="step-1-initiation">Step 1: Initiation</h4>
<p>All we do here is define a set of leaf nodes, <code class="language-plaintext highlighter-rouge">T</code>, and set <code class="language-plaintext highlighter-rouge">L</code> equal to the number of leaf nodes.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">T = { A, B, C, D }
L = 4</code></pre></figure>
<h4 id="step-2-iteration">Step 2: Iteration</h4>
<p>Now for the real action. Remember, this will consist of multiple iterations.</p>
<h5 id="iteration-1">Iteration 1</h5>
<p>First, we calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">r(A) = [1/(L-2)] * [d(AB) + d(AC) + d(AD)] = (1/2) * (4 + 5 + 10) = 9.5
r(B) = [1/(L-2)] * [d(AB) + d(BC) + d(BD)] = (1/2) * (4 + 7 + 12) = 11.5
r(C) = [1/(L-2)] * [d(AC) + d(BC) + d(CD)] = (1/2) * (5 + 7 + 9) = 10.5
r(D) = [1/(L-2)] * [d(AD) + d(BD) + d(CD)] = (1/2) * (10 + 12 + 9) = 15.5</code></pre></figure>
<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">D(AB) = d(AB) - [r(A) + r(B)] = 4 - (9.5 + 11.5) = -17
D(AC) = d(AC) - [r(A) + r(C)] = 5 - (9.5 + 10.5) = -15
D(AD) = d(AD) - [r(A) + r(D)] = 10 - (9.5 + 15.5) = -15
D(BC) = d(BC) - [r(B) + r(C)] = 7 - (11.5 + 10.5) = -15
D(BD) = d(BD) - [r(B) + r(D)] = 12 - (11.5 + 15.5) = -15
D(CD) = d(CD) - [r(C) + r(D)] = 9 - (10.5 + 15.5) = -17</code></pre></figure>
<p>The pair of nodes with the smallest adjusted distance are neighbors. In this case, we have a tie between the pairs <code class="language-plaintext highlighter-rouge">AB</code> and <code class="language-plaintext highlighter-rouge">CD</code>. We can only move forward with one pair, so we’ll pick <code class="language-plaintext highlighter-rouge">AB</code>. We now define a new node that connects these neighbors; we’ll call this new node <code class="language-plaintext highlighter-rouge">Z</code>.</p>
<p>We’re close now to constructing our first bit of the tree. To do that, we need to calculate the distance from each neighbor (child) node to the connecting (parent) node.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(AZ) = [d(AB) + r(A) - r(B)]/2 = (4 + 9.5 - 11.5)/2 = 1
d(BZ) = [d(AB) + r(B) - r(A)]/2 = (4 + 11.5 - 9.5)/2 = 3</code></pre></figure>
<p>With this information, we can draw the first two branches on our tree:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example1_iteration1.png" alt="Example 1 tree first iteration" />
<figcaption>Example 1 tree first iteration</figcaption>
</figure>
<p>Lastly, we need to reconstruct the distance matrix, replacing <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> with <code class="language-plaintext highlighter-rouge">Z</code>. Some distances can be transferred, but others (represented by question marks) need to be calculated:</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>Z</strong></th>
<th style="text-align: center"><strong>C</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Z</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">?</td>
<td style="text-align: center">?</td>
</tr>
<tr>
<td><strong>C</strong></td>
<td style="text-align: center">?</td>
<td style="text-align: center">0</td>
<td style="text-align: center">9</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">?</td>
<td style="text-align: center">9</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
<p>Here are the formulas for calculating <code class="language-plaintext highlighter-rouge">d(ZC)</code> and <code class="language-plaintext highlighter-rouge">d(ZD)</code>.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZC) = [d(AC) + d(BC) - d(AB)]/2 = (5 + 7 - 4)/2 = 4
d(ZD) = [d(AD) + d(BD) - d(AB)]/2 = (10 + 12 - 4)/2 = 9</code></pre></figure>
<p>With these calculations done, we can replace the question marks in our distance matrix:</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>Z</strong></th>
<th style="text-align: center"><strong>C</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Z</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">4</td>
<td style="text-align: center">9</td>
</tr>
<tr>
<td><strong>C</strong></td>
<td style="text-align: center">4</td>
<td style="text-align: center">0</td>
<td style="text-align: center">9</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">9</td>
<td style="text-align: center">9</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
<p>And we’re done…with the first iteration. Remember, the iteration step ends when there are only two nodes left in the matrix, and we have three. On to the next iteration!</p>
<h5 id="iteration-2">Iteration 2</h5>
<p>For this iteration, we use the latest version of the distance matrix, constructed at the end of the previous iteration, and reset <code class="language-plaintext highlighter-rouge">L</code> to the number of nodes now in the matrix.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">L = 3</code></pre></figure>
<p>Calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">r(Z) = [1/(L-2)] * [d(ZC) + d(ZD)] = 1 * (4 + 9) = 13
r(C) = [1/(L-2)] * [d(ZC) + d(CD)] = 1 * (4 + 9) = 13
r(D) = [1/(L-2)] * [d(ZD) + d(CD)] = 1 * (9 + 9) = 18</code></pre></figure>
<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">D(ZC) = d(ZC) - [r(Z) + r(C)] = 4 - (13 + 13) = -22
D(ZD) = d(ZD) - [r(Z) + r(D)] = 9 - (13 + 18) = -22
D(CD) = d(CD) - [r(C) + r(D)] = 9 - (13 + 18) = -22</code></pre></figure>
<p>All of the pairs are tied for lowest adjusted distance, so we’ll select <code class="language-plaintext highlighter-rouge">ZC</code> because it’s first in the list and define a new node <code class="language-plaintext highlighter-rouge">Y</code> that connects the neighbors.</p>
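<p>This three-way tie is not a coincidence. With three nodes, <code class="language-plaintext highlighter-rouge">L - 2 = 1</code>, so each net divergence is just the sum of a node’s two distances, and every adjusted distance reduces to the negative of the sum of all three pairwise distances. The sketch below checks this numerically; the function name <code class="language-plaintext highlighter-rouge">three_node_adjusted</code> and the generic labels <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">y</code>, <code class="language-plaintext highlighter-rouge">z</code> are assumptions for illustration.</p>

```python
def three_node_adjusted(d_xy, d_xz, d_yz):
    """Adjusted distances D for a 3-node matrix with nodes x, y, z."""
    d = {("x", "y"): d_xy, ("x", "z"): d_xz, ("y", "z"): d_yz}
    # With L = 3 the scaling factor 1/(L-2) is 1.
    r = {"x": d_xy + d_xz, "y": d_xy + d_yz, "z": d_xz + d_yz}
    return {pair: dist - (r[pair[0]] + r[pair[1]]) for pair, dist in d.items()}


# Example 1, iteration 2: d(ZC) = 4, d(ZD) = 9, d(CD) = 9
print(three_node_adjusted(4, 9, 9))
# Every value equals -(4 + 9 + 9) = -22.
```

<p>Because the branch-length formulas are symmetric in the same way, whichever pair you pick at this stage leads to the same final tree, so the choice of <code class="language-plaintext highlighter-rouge">ZC</code> here only affects the labels.</p>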
<p>Calculate the distances from the new parent node to its children:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZY) = [d(ZC) + r(Z) - r(C)]/2 = (4 + 13 - 13)/2 = 2
d(CY) = [d(ZC) + r(C) - r(Z)]/2 = (4 + 13 - 13)/2 = 2</code></pre></figure>
<p>Add the new branches to the tree:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example1_iteration2.png" alt="Example 1 tree second iteration" />
<figcaption>Example 1 tree second iteration</figcaption>
</figure>
<p>Calculate any other new distances and construct the new distance matrix:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(YD) = [d(ZD) + d(CD) - d(ZC)]/2 = (9 + 9 - 4)/2 = 7</code></pre></figure>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>Y</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Y</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">7</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">7</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
<h4 id="step-3-termination">Step 3: Termination</h4>
<p>Only 2 nodes (<code class="language-plaintext highlighter-rouge">Y</code> and <code class="language-plaintext highlighter-rouge">D</code>) remain in the matrix, so we add the edge between them to finish the tree:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example1_termination.png" alt="Example 1 tree termination" />
<figcaption>Example 1 tree termination</figcaption>
</figure>
<h4 id="summary">Summary</h4>
<p>And with that, we’ve built our first neighbor-joining tree! Here is the tree coming together in each step:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example1_tree_step-by-step.png" alt="Example 1 tree step-by-step" />
<figcaption>Example 1 tree step-by-step</figcaption>
</figure>
<h2 id="on-distance-matrices">On Distance Matrices</h2>
<p>Now, you may have noticed that to build the tree in Example 1, we didn’t actually need all of those formulas. In iteration 1, for example, we can figure out the distance from <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> to their parent just by noticing that <code class="language-plaintext highlighter-rouge">B</code> is always 2 units further from other nodes than <code class="language-plaintext highlighter-rouge">A</code>. Therefore, <code class="language-plaintext highlighter-rouge">d(BZ)</code> must equal <code class="language-plaintext highlighter-rouge">d(AZ) + 2</code>. If their combined distance from <code class="language-plaintext highlighter-rouge">Z</code> is 4, then the only possible branch lengths are 1 and 3.</p>
<p>So, why did we go through the trouble of neighbor-joining? And when do we actually need neighbor-joining?</p>
<h3 id="additive-matrices">Additive matrices</h3>
<p>The distance matrix that we used for example 1 is what’s called an <strong>additive</strong> matrix. Simply put, a matrix is additive if you are able to reproduce the starting matrix by adding together the branch lengths along the paths between nodes. To demonstrate this, let’s look back at example 1.</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example1_reconstruct.png" alt="Reconstruct the example 1 distance matrix from the tree" />
<figcaption>Reconstruct the example 1 distance matrix from the tree</figcaption>
</figure>
<p>In the figure above, I’ve deconstructed the tree so that you can see the individual paths between each pair of leaf nodes. Notice that we can reconstruct the starting matrix exactly using only the distances on the tree, which is the main trait of an additive matrix (for a more technical and thorough look at additive matrices, <a href="http://people.cs.uchicago.edu/~ridg/digbio08/talkaddree.pdf">see this presentation</a>).</p>
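<p>The same check can be done in code by summing branch lengths along the path between each pair of leaves. The sketch below hard-codes the branch lengths calculated in Example 1; the function name <code class="language-plaintext highlighter-rouge">path_length</code> and the edge-list layout are assumptions for illustration.</p>

```python
from itertools import combinations

# Branch lengths from the finished Example 1 tree.
edges = [("A", "Z", 1), ("B", "Z", 3), ("Z", "Y", 2), ("C", "Y", 2), ("Y", "D", 7)]


def path_length(edges, start, end):
    """Sum the branch lengths along the unique path from start to end."""
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))
    # Depth-first walk; a tree has no cycles, so it is enough to avoid
    # stepping straight back to the node we just came from.
    stack = [(start, None, 0)]
    while stack:
        node, came_from, total = stack.pop()
        if node == end:
            return total
        for neighbor, length in adjacency[node]:
            if neighbor != came_from:
                stack.append((neighbor, node, total + length))
    raise ValueError(f"no path from {start} to {end}")


for a, b in combinations("ABCD", 2):
    print(a, b, path_length(edges, a, b))
```

<p>The printed path lengths (4, 5, 10, 7, 12, 9) are exactly the off-diagonal entries of the Example 1 starting matrix.</p>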
<p>I like to use an additive matrix as the first neighbor-joining example because 1) it gives me an excuse to discuss additive matrices, and 2) it’s very easy to check your work. If you are unable to reconstruct the starting matrix in example 1 using the tree, you know you have a problem in your calculations, which is harder to catch with non-additive matrices.</p>
<p>Alright, so if we don’t need neighbor-joining for additive distance matrices, then when do we need it? Neighbor-joining is said to work best for near-additive matrices, i.e. matrices for which the tree <em>almost</em> reconstructs the starting matrix, though it has been reported to be <a href="https://doi.org/10.1007/s00453-007-9116-4">topologically accurate</a> even when this is not the case. And I should note here that the vast majority of distance matrices based on biological data are <a href="https://doi.org/10.1016/j.tcs.2008.12.040">not additive or even nearly additive</a>.</p>
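<p>If you want to test a matrix for additivity without building a tree first, one standard criterion (not covered in this post, so treat it as an outside assumption) is the four-point condition: for every set of four nodes, of the three ways to split them into two pairs, the two largest summed distances must be equal. A minimal sketch, with the hypothetical function name <code class="language-plaintext highlighter-rouge">is_additive</code>, using the Example 1 matrix and the Example 2 matrix from the next section:</p>

```python
from itertools import combinations


def is_additive(matrix, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix."""
    n = len(matrix)
    for i, j, k, l in combinations(range(n), 4):
        # The three ways to split four nodes into two pairs.
        sums = sorted([
            matrix[i][j] + matrix[k][l],
            matrix[i][k] + matrix[j][l],
            matrix[i][l] + matrix[j][k],
        ])
        # The two largest sums must be equal.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


example_1 = [[0, 4, 5, 10], [4, 0, 7, 12], [5, 7, 0, 9], [10, 12, 9, 0]]
example_2 = [[0, 2, 2, 2], [2, 0, 3, 2], [2, 3, 0, 2], [2, 2, 2, 0]]
print(is_additive(example_1))  # True
print(is_additive(example_2))  # False
```

<p>For Example 1 the three sums are 13, 17, and 17, so the condition holds. For Example 2 they are 4, 4, and 5, so it does not.</p>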
<p>Without further ado, here is another example using a nearly-additive matrix.</p>
<h2 id="example-2">Example 2</h2>
<p>Here is our starting matrix:</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>A</strong></th>
<th style="text-align: center"><strong>B</strong></th>
<th style="text-align: center"><strong>C</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>A</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">2</td>
<td style="text-align: center">2</td>
<td style="text-align: center">2</td>
</tr>
<tr>
<td><strong>B</strong></td>
<td style="text-align: center">2</td>
<td style="text-align: center">0</td>
<td style="text-align: center">3</td>
<td style="text-align: center">2</td>
</tr>
<tr>
<td><strong>C</strong></td>
<td style="text-align: center">2</td>
<td style="text-align: center">3</td>
<td style="text-align: center">0</td>
<td style="text-align: center">2</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">2</td>
<td style="text-align: center">2</td>
<td style="text-align: center">2</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
<h3 id="step-1-initiation-1">Step 1: Initiation</h3>
<p>Again, we define <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">L</code>. They are the same as in example 1.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">T = { A, B, C, D }
L = 4</code></pre></figure>
<h3 id="step-2-iteration-1">Step 2: Iteration</h3>
<h4 id="iteration-1-1">Iteration 1</h4>
<p>First, we calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">r(A) = [1/(L-2)] * [d(AB) + d(AC) + d(AD)] = (1/2) * (2 + 2 + 2) = 3
r(B) = [1/(L-2)] * [d(AB) + d(BC) + d(BD)] = (1/2) * (2 + 3 + 2) = 3.5
r(C) = [1/(L-2)] * [d(AC) + d(BC) + d(CD)] = (1/2) * (2 + 3 + 2) = 3.5
r(D) = [1/(L-2)] * [d(AD) + d(BD) + d(CD)] = (1/2) * (2 + 2 + 2) = 3</code></pre></figure>
<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">D(AB) = d(AB) - [r(A) + r(B)] = 2 - (3 + 3.5) = -4.5
D(AC) = d(AC) - [r(A) + r(C)] = 2 - (3 + 3.5) = -4.5
D(AD) = d(AD) - [r(A) + r(D)] = 2 - (3 + 3) = -4
D(BC) = d(BC) - [r(B) + r(C)] = 3 - (3.5 + 3.5) = -4
D(BD) = d(BD) - [r(B) + r(D)] = 2 - (3.5 + 3) = -4.5
D(CD) = d(CD) - [r(C) + r(D)] = 2 - (3.5 + 3) = -4.5</code></pre></figure>
<p>A lot of ties here. Again, we’ll pick the tied pair that is closest to the top of the list, <code class="language-plaintext highlighter-rouge">AB</code>, and assign them a parent node, <code class="language-plaintext highlighter-rouge">Z</code>.</p>
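<p>To see exactly which pairs are tied, the adjusted distances can be tabulated and filtered. This is a sketch only; the function name <code class="language-plaintext highlighter-rouge">tied_neighbor_candidates</code> and the tolerance <code class="language-plaintext highlighter-rouge">tol</code> are assumptions for illustration.</p>

```python
from itertools import combinations


def tied_neighbor_candidates(names, matrix, tol=1e-9):
    """Return every pair whose adjusted distance D equals the minimum."""
    L = len(names)
    # Net divergence of each node (the diagonal is 0, so row sums work).
    r = [sum(row) / (L - 2) for row in matrix]
    D = {
        names[i] + names[j]: matrix[i][j] - (r[i] + r[j])
        for i, j in combinations(range(L), 2)
    }
    lowest = min(D.values())
    return [pair for pair, value in D.items() if abs(value - lowest) <= tol]


example_2 = [
    [0, 2, 2, 2],
    [2, 0, 3, 2],
    [2, 3, 0, 2],
    [2, 2, 2, 0],
]
print(tied_neighbor_candidates("ABCD", example_2))  # ['AB', 'AC', 'BD', 'CD']
```

<p>Note that the choice matters here: picking <code class="language-plaintext highlighter-rouge">AB</code> or <code class="language-plaintext highlighter-rouge">CD</code> groups A with B, whereas picking <code class="language-plaintext highlighter-rouge">AC</code> or <code class="language-plaintext highlighter-rouge">BD</code> would group A with C. Taking the first pair in the list is simply a convention that makes the result reproducible.</p>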
<p>Now, calculate the distance from each neighbor (child) node to the connecting (parent) node.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(AZ) = [d(AB) + r(A) - r(B)]/2 = (2 + 3 - 3.5)/2 = 0.75
d(BZ) = [d(AB) + r(B) - r(A)]/2 = (2 + 3.5 - 3)/2 = 1.25</code></pre></figure>
<p>And draw the first two branches on our tree:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example2_iteration1.png" alt="Example 2 tree first iteration" />
<figcaption>Example 2 tree first iteration</figcaption>
</figure>
<p>Lastly, we calculate new distances and reconstruct the distance matrix:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZC) = [d(AC) + d(BC) - d(AB)]/2 = (2 + 3 - 2)/2 = 1.5
d(ZD) = [d(AD) + d(BD) - d(AB)]/2 = (2 + 2 - 2)/2 = 1</code></pre></figure>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>Z</strong></th>
<th style="text-align: center"><strong>C</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Z</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">1.5</td>
<td style="text-align: center">1</td>
</tr>
<tr>
<td><strong>C</strong></td>
<td style="text-align: center">1.5</td>
<td style="text-align: center">0</td>
<td style="text-align: center">2</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">1</td>
<td style="text-align: center">2</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
<p>On to the next iteration!</p>
<h4 id="iteration-2-1">Iteration 2</h4>
<p>For this iteration, we use the latest version of the distance matrix, constructed at the end of the previous iteration, and reset <code class="language-plaintext highlighter-rouge">L</code>.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">L = 3</code></pre></figure>
<p>Calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">r(Z) = [1/(L-2)] * [d(ZC) + d(ZD)] = 1 * (1.5 + 1) = 2.5
r(C) = [1/(L-2)] * [d(ZC) + d(CD)] = 1 * (1.5 + 2) = 3.5
r(D) = [1/(L-2)] * [d(ZD) + d(CD)] = 1 * (1 + 2) = 3</code></pre></figure>
<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">D(ZC) = d(ZC) - [r(Z) + r(C)] = 1.5 - (2.5 + 3.5) = -4.5
D(ZD) = d(ZD) - [r(Z) + r(D)] = 1 - (2.5 + 3) = -4.5
D(CD) = d(CD) - [r(C) + r(D)] = 2 - (3.5 + 3) = -4.5</code></pre></figure>
<p>All of the pairs are tied for lowest adjusted distance, so we’ll select <code class="language-plaintext highlighter-rouge">ZC</code> because it’s first in the list and define a new node <code class="language-plaintext highlighter-rouge">Y</code> that connects the neighbors.</p>
<p>Calculate the distances from the new parent node to its children:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZY) = [d(ZC) + r(Z) - r(C)]/2 = (1.5 + 2.5 - 3.5)/2 = 0.25
d(CY) = [d(ZC) + r(C) - r(Z)]/2 = (1.5 + 3.5 - 2.5)/2 = 1.25</code></pre></figure>
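<p>To double-check the arithmetic in this iteration, here is a minimal Python sketch of the same steps on the 3-node matrix. The names <code class="language-plaintext highlighter-rouge">dist</code>, <code class="language-plaintext highlighter-rouge">net_divergence</code>, and <code class="language-plaintext highlighter-rouge">new_dists</code> are made up for this illustration and don’t come from any library. It only mirrors the formulas used here and is not a full neighbor-joining implementation.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from itertools import combinations

# Distance matrix from the end of iteration 1 (nodes Z, C, D).
d = {
    frozenset("ZC"): 1.5,
    frozenset("ZD"): 1.0,
    frozenset("CD"): 2.0,
}
nodes = ["Z", "C", "D"]


def dist(a, b):
    """Look up the distance between two nodes, in either order."""
    return d[frozenset((a, b))]


def net_divergence(node):
    """r(i) = [1/(L-2)] * (sum of distances from i to every other node)."""
    L = len(nodes)
    return sum(dist(node, other) for other in nodes if other != node) / (L - 2)


r = {n: net_divergence(n) for n in nodes}

# Adjusted distance D(ij) = d(ij) - [r(i) + r(j)] for each pair.
D = {(a, b): dist(a, b) - (r[a] + r[b]) for a, b in combinations(nodes, 2)}

# min() returns the first pair with the lowest value, so ties go to
# the pair listed first, matching the choice of ZC in the text.
i, j = min(D, key=D.get)

# Branch lengths from the new parent node to its two children.
d_i = (dist(i, j) + r[i] - r[j]) / 2
d_j = (dist(i, j) + r[j] - r[i]) / 2

# Distance from the new parent node to each remaining node k.
new_dists = {
    k: (dist(i, k) + dist(j, k) - dist(i, j)) / 2
    for k in nodes
    if k not in (i, j)
}

print(r)
print(D)
print((i, j), d_i, d_j, new_dists)</code></pre></figure>
<p>Running it should pick <code class="language-plaintext highlighter-rouge">Z</code> and <code class="language-plaintext highlighter-rouge">C</code> as neighbors, with branch lengths 0.25 and 1.25 and a distance of 0.75 from the new node to <code class="language-plaintext highlighter-rouge">D</code>, the same values we get by hand.</p>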
<p>Add the new branches to the tree:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example2_iteration2.png" alt="Example 2 tree second iteration" />
<figcaption>Example 2 tree second iteration</figcaption>
</figure>
<p>Calculate any other new distances and construct the new distance matrix:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">d(YD) = [d(ZD) + d(CD) - d(ZC)]/2 = (1 + 2 - 1.5)/2 = 0.75</code></pre></figure>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>Y</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Y</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">0.75</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">0.75</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
<h4 id="step-3-termination-1">Step 3: Termination</h4>
<p><code class="language-plaintext highlighter-rouge">L</code> now consists of only 2 nodes (<code class="language-plaintext highlighter-rouge">Y</code> and <code class="language-plaintext highlighter-rouge">D</code>), so we add the edge between them to finish the tree:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example2_termination.png" alt="Example 2 tree termination" />
<figcaption>Example 2 tree termination</figcaption>
</figure>
<h4 id="summary-1">Summary</h4>
<p>Here is our second tree in completion:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//nj_trees/Example2_step-by-step.png" alt="Example 2 tree step-by-step" />
<figcaption>Example 2 tree step-by-step</figcaption>
</figure>
<p>Lastly, let’s make a distance matrix using the tree to provide the distances. Notice that these distances are just a little bit off from the starting matrix. Hence, “near-additive”.</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>A</strong></th>
<th style="text-align: center"><strong>B</strong></th>
<th style="text-align: center"><strong>C</strong></th>
<th style="text-align: center"><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>A</strong></td>
<td style="text-align: center">0</td>
<td style="text-align: center">2</td>
<td style="text-align: center">2.25</td>
<td style="text-align: center">1.75</td>
</tr>
<tr>
<td><strong>B</strong></td>
<td style="text-align: center">2</td>
<td style="text-align: center">0</td>
<td style="text-align: center">2.75</td>
<td style="text-align: center">2.25</td>
</tr>
<tr>
<td><strong>C</strong></td>
<td style="text-align: center">2.25</td>
<td style="text-align: center">2.75</td>
<td style="text-align: center">0</td>
<td style="text-align: center">2</td>
</tr>
<tr>
<td><strong>D</strong></td>
<td style="text-align: center">1.75</td>
<td style="text-align: center">2.25</td>
<td style="text-align: center">2</td>
<td style="text-align: center">0</td>
</tr>
</tbody>
</table>
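<p>If you want to see where the numbers in the matrix above come from, here is a small Python sketch that sums branch lengths along the path between each pair of leaves. The <code class="language-plaintext highlighter-rouge">Z</code>–<code class="language-plaintext highlighter-rouge">Y</code>, <code class="language-plaintext highlighter-rouge">C</code>–<code class="language-plaintext highlighter-rouge">Y</code>, and <code class="language-plaintext highlighter-rouge">Y</code>–<code class="language-plaintext highlighter-rouge">D</code> lengths are the ones calculated in iteration 2 and the termination step. The <code class="language-plaintext highlighter-rouge">A</code>–<code class="language-plaintext highlighter-rouge">Z</code> and <code class="language-plaintext highlighter-rouge">B</code>–<code class="language-plaintext highlighter-rouge">Z</code> lengths are an assumption: they are inferred from the matrix above rather than recomputed here. The helper name <code class="language-plaintext highlighter-rouge">path_length</code> is hypothetical.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from itertools import combinations

# Branch lengths of the finished tree. The Z-Y, C-Y and Y-D lengths come
# from iteration 2 and the termination step; the A-Z and B-Z lengths are
# ASSUMED here (inferred from the matrix above rather than recomputed).
edges = {
    ("A", "Z"): 0.75,
    ("B", "Z"): 1.25,
    ("Z", "Y"): 0.25,
    ("C", "Y"): 1.25,
    ("D", "Y"): 0.75,
}

# Build an adjacency map so the tree can be walked in both directions.
adjacent = {}
for (a, b), length in edges.items():
    adjacent.setdefault(a, {})[b] = length
    adjacent.setdefault(b, {})[a] = length


def path_length(start, goal, came_from=None):
    """Sum branch lengths along the unique path from start to goal."""
    if start == goal:
        return 0.0
    for neighbor, length in adjacent[start].items():
        if neighbor == came_from:
            continue
        rest = path_length(neighbor, goal, came_from=start)
        if rest is not None:
            return length + rest
    return None  # goal is not reachable through this branch


leaves = ["A", "B", "C", "D"]
tree_dist = {(a, b): path_length(a, b) for a, b in combinations(leaves, 2)}
print(tree_dist)</code></pre></figure>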
<h2 id="wrapping-up">Wrapping up</h2>
<p>Having reached the end of this lesson, you should have learned how to construct neighbor-joining trees by hand from additive and nearly additive matrices. If you want to take a closer look at the examples (and access one additional example), you can check out <a href="/assets/data/posts/nj_trees/neighbor-joining_examples_spreadsheet.xlsx">this Excel file</a>.</p>Amelia HarrisionGenerating Python bindings for OCaml with pyml_bindgen2022-04-12T00:00:00+00:002022-04-12T00:00:00+00:00https://www.tenderisthebyte.com/blog/2022/04/12/ocaml-python-bindgen<p><code class="language-plaintext highlighter-rouge">pyml_bindgen</code> is a command-line app that generates Python bindings via <a href="https://github.com/thierry-martinez/pyml">pyml</a> directly from OCaml value specifications. While you could write <code class="language-plaintext highlighter-rouge">pyml</code> bindings by hand, it can get repetitive, especially if you are binding a decent-sized Python library.</p>
<p>In this post, I will introduce <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> and go through a couple of common tasks.</p>
<div class="post-toc">
<h4 class="post-toc--header" id="contents">Contents</h4>
<ul>
<li><a href="#install">Install</a></li>
<li><a href="#a-simple-example">A simple example</a></li>
<li><a href="#controlling-the-bindings">Controlling the bindings</a></li>
<li><a href="#binding-cyclic-python-classes">Binding cyclic Python classes</a></li>
<li><a href="#other-stuff">Other stuff</a></li>
<li><a href="#wrap-up">Wrap-up</a></li>
</ul>
</div>
<h2 id="install">Install</h2>
<p>To get started with <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, you will need to install it. It is available on <a href="https://opam.ocaml.org/packages/pyml_bindgen/">opam</a> (<code class="language-plaintext highlighter-rouge">opam install pyml_bindgen</code>).</p>
<h2 id="a-simple-example">A simple example</h2>
<p>Let’s start with a simple example.</p>
<h3 id="python-code">Python code</h3>
<p>Here is the Python class that we want to bind (<code class="language-plaintext highlighter-rouge">hobbit.py</code>).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Hobbit</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
<span class="bp">self</span><span class="p">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
<span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="sa">f</span><span class="s">'Hobbit -- </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">age</span><span class="si">}</span><span class="s">'</span></code></pre></figure>
<p>As you see, it’s pretty simple! It’s just the <code class="language-plaintext highlighter-rouge">__init__</code> method to create the class and the <code class="language-plaintext highlighter-rouge">__str__</code> method for converting it to a string with the Python <code class="language-plaintext highlighter-rouge">str</code> or <code class="language-plaintext highlighter-rouge">print</code> functions.</p>
<p>Here’s an example of using it in Python.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">hobbit</span> <span class="kn">import</span> <span class="n">Hobbit</span>
<span class="n">bilbo</span> <span class="o">=</span> <span class="n">Hobbit</span><span class="p">(</span><span class="s">'Bilbo'</span><span class="p">,</span> <span class="mi">111</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">bilbo</span><span class="p">)</span>
<span class="c1">#=> Hobbit -- Bilbo, 111</span></code></pre></figure>
<h3 id="write-value-specifications">Write value specifications</h3>
<p>To bind Python classes with <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, you first need to write value specifications to define the OCaml interface for the Python code we are binding.</p>
<p>To start, we will keep the functions and argument names the same.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">__init__</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-></span> <span class="n">age</span><span class="o">:</span><span class="kt">int</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="n">t</span>
<span class="k">val</span> <span class="n">__str__</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="kt">string</span>
<span class="k">val</span> <span class="n">name</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">string</span>
<span class="k">val</span> <span class="n">age</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">int</span></code></pre></figure>
<p>There are a couple of things to call your attention to here:</p>
<ul>
<li>I haven’t defined <code class="language-plaintext highlighter-rouge">type t</code> anywhere yet. Depending on the command line arguments you pass to <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, it will take care of this for you.</li>
<li>For the <code class="language-plaintext highlighter-rouge">__init__</code> function, I have used all named arguments plus the <code class="language-plaintext highlighter-rouge">unit</code> argument. The <code class="language-plaintext highlighter-rouge">unit</code> argument tells <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> that you are binding a normal Python method or function call (as opposed to a Python attribute or property).</li>
<li>The <code class="language-plaintext highlighter-rouge">__str__</code> function takes <code class="language-plaintext highlighter-rouge">t</code> as the first argument. Value specifications that start with <code class="language-plaintext highlighter-rouge">t</code> will bind to object method calls on the Python side.</li>
<li><code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">age</code> both take <code class="language-plaintext highlighter-rouge">t</code> as the first and only argument. If a value specification takes <code class="language-plaintext highlighter-rouge">t</code> and nothing else, it binds to the Python attribute of that name.</li>
</ul>
<p>Save the above in a file called <code class="language-plaintext highlighter-rouge">hobbit.txt</code>.</p>
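<p>To make those rules a bit more concrete, here is roughly what each kind of value specification corresponds to on the Python side. This is only an analogy written in plain Python, assuming the named OCaml arguments are passed as keyword arguments; it is not the code that <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> generates.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"Hobbit -- {self.name}, {self.age}"


# val __init__ : name:string -> age:int -> unit -> t
# Named arguments plus unit: a call with keyword arguments.
bilbo = Hobbit(name="Bilbo", age=111)

# val __str__ : t -> unit -> string
# t first, then unit: a method call on the object.
as_string = bilbo.__str__()

# val name : t -> string   and   val age : t -> int
# t and nothing else: a plain attribute lookup.
name = bilbo.name
age = bilbo.age

print(as_string, name, age)</code></pre></figure>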
<h3 id="generate-bindings">Generate bindings</h3>
<p>Now, we’re ready to generate the OCaml bindings.</p>
<p>Here’s how you would run <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> for this example.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>pyml_bindgen hobbit.txt hobbit Hobbit <span class="se">\</span>
<span class="nt">--of-pyo-ret-type</span> no_check <span class="se">\</span>
<span class="o">></span> hobbit.ml</code></pre></figure>
<p>Let’s unpack that.</p>
<ul>
<li>The first three arguments are the path to the OCaml value specifications, the name of the Python module we are binding, and the Python class name.
<ul>
<li>Since we named the Python file <code class="language-plaintext highlighter-rouge">hobbit.py</code>, its module name is <code class="language-plaintext highlighter-rouge">hobbit</code>.</li>
<li>Depending on the directory structure you’re using, this may change.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--of-pyo-ret-type</code> specifies the return type for functions that generate Python objects.
<ul>
<li>Using <code class="language-plaintext highlighter-rouge">no_check</code> means the generated functions will assume the Python object is the correct type.</li>
<li>You can also use <code class="language-plaintext highlighter-rouge">option</code> or <code class="language-plaintext highlighter-rouge">or_error</code>.</li>
</ul>
</li>
<li>The output is redirected to a file called <code class="language-plaintext highlighter-rouge">hobbit.ml</code>. Thus, our generated code will be in a module called <code class="language-plaintext highlighter-rouge">Hobbit</code>.</li>
<li>We did not tell <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> that it should generate a full module with a signature, so it will just write the implementation.
<ul>
<li>In this example it is fine, but you will often want to generate the module and signature, so that your types will be abstract.</li>
<li>For example, you could use <code class="language-plaintext highlighter-rouge">--caml-module Hobbit --split-caml-module</code> to generate both an <code class="language-plaintext highlighter-rouge">ml</code> and <code class="language-plaintext highlighter-rouge">mli</code> file.</li>
</ul>
</li>
<li>If you look at the generated code, it will be kind of messy. I usually run it through <code class="language-plaintext highlighter-rouge">ocamlformat</code> if I need to edit it or check it into version control.</li>
</ul>
<h3 id="test-it-out">Test it out</h3>
<p>Now we can make a program to test it out. Don’t forget to call <a href="https://github.com/thierry-martinez/pyml#getting-started">initialize</a> before running the rest of your code!</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="n">initialize</span> <span class="bp">()</span>
<span class="k">let</span> <span class="n">bilbo</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__init__</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"Bilbo"</span> <span class="o">~</span><span class="n">age</span><span class="o">:</span><span class="mi">111</span> <span class="bp">()</span>
<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
<span class="k">assert</span> <span class="p">(</span><span class="s2">"Hobbit -- Bilbo, 111"</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__str__</span> <span class="n">bilbo</span> <span class="bp">()</span><span class="p">);</span>
<span class="k">assert</span> <span class="p">(</span><span class="s2">"Bilbo"</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">name</span> <span class="n">bilbo</span><span class="p">);</span>
<span class="k">assert</span> <span class="p">(</span><span class="mi">111</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">age</span> <span class="n">bilbo</span><span class="p">)</span></code></pre></figure>
<p>Since we didn’t generate a signature to go with our implementation, the type of the value returned by <code class="language-plaintext highlighter-rouge">Hobbit.__init__</code> will be <code class="language-plaintext highlighter-rouge">Pytypes.pyobject</code>. In this way, we can pass any <code class="language-plaintext highlighter-rouge">pyobject</code> to the <code class="language-plaintext highlighter-rouge">Hobbit.__str__</code> function. Let’s see.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Int</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">1234</span>
<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="n">print_endline</span> <span class="o">@@</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__str__</span> <span class="n">x</span> <span class="bp">()</span></code></pre></figure>
<p>If you run that, it will print <code class="language-plaintext highlighter-rouge">1234</code>. Huh? Well, if you look at the generated code for the <code class="language-plaintext highlighter-rouge">Hobbit.__str__</code> function, it looks something like this:</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">__str__</span> <span class="n">t</span> <span class="bp">()</span> <span class="o">=</span>
<span class="k">let</span> <span class="n">callable</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Object</span><span class="p">.</span><span class="n">find_attr_string</span> <span class="n">t</span> <span class="s2">"__str__"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">kwargs</span> <span class="o">=</span> <span class="n">filter_opt</span> <span class="bp">[]</span> <span class="k">in</span>
<span class="nn">Py</span><span class="p">.</span><span class="nn">String</span><span class="p">.</span><span class="n">to_string</span>
<span class="o">@@</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Callable</span><span class="p">.</span><span class="n">to_function_with_keywords</span> <span class="n">callable</span> <span class="p">[</span><span class="o">||</span><span class="p">]</span> <span class="n">kwargs</span></code></pre></figure>
<p>Without going into too much detail, essentially all it is doing is calling the <code class="language-plaintext highlighter-rouge">__str__</code> method on the Python object passed in. While this is fine on the Python side, it doesn’t work the way we might want it to on the OCaml side. Ideally, we only want the <code class="language-plaintext highlighter-rouge">Hobbit</code> module functions to work on values of type <code class="language-plaintext highlighter-rouge">Hobbit.t</code>.</p>
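<p>The same thing can be seen in plain Python. The sketch below uses a hypothetical helper, <code class="language-plaintext highlighter-rouge">call_str</code>, as a rough analogue of the generated binding: it looks up <code class="language-plaintext highlighter-rouge">__str__</code> by name and calls it, without ever checking that the object is a <code class="language-plaintext highlighter-rouge">Hobbit</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def call_str(obj):
    """Hypothetical Python analogue of the generated __str__ binding:
    look up the attribute by name, then call it with no arguments."""
    method = getattr(obj, "__str__")
    return method()


class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"Hobbit -- {self.name}, {self.age}"


# Works on a Hobbit, as intended...
print(call_str(Hobbit("Bilbo", 111)))  # Hobbit -- Bilbo, 111

# ...but also on an int, because every Python object has __str__.
print(call_str(1234))  # 1234</code></pre></figure>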
<h3 id="generating-abstract-types">Generating abstract types</h3>
<p>If we were writing the bindings by hand, we would make <code class="language-plaintext highlighter-rouge">Hobbit.t</code> abstract. With <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, we can do that using the <code class="language-plaintext highlighter-rouge">--caml-module</code> option.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>pyml_bindgen hobbit.txt hobbit Hobbit <span class="se">\</span>
<span class="nt">--of-pyo-ret-type</span> no_check <span class="se">\</span>
<span class="nt">--caml-module</span> Hobbit <span class="se">\</span>
<span class="nt">--split-caml-module</span> <span class="nb">.</span> <span class="se">\</span>
<span class="o">></span> hobbit.ml</code></pre></figure>
<p>Notice that I also used <code class="language-plaintext highlighter-rouge">--split-caml-module .</code> which tells <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> to split the implementation and signature into separate <code class="language-plaintext highlighter-rouge">ml</code> and <code class="language-plaintext highlighter-rouge">mli</code> files, and to put the output in the directory in which the command is run. You can pass in whatever directory you want to this option.</p>
<p>Now if we tried something like this:</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Int</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">1234</span>
<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="n">print_endline</span> <span class="o">@@</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__str__</span> <span class="n">x</span> <span class="bp">()</span></code></pre></figure>
<p>It would be a compile-time error.</p>
<h2 id="controlling-the-bindings">Controlling the bindings</h2>
<p>Let’s clean up this example a little bit.</p>
<h3 id="using-different-function-names">Using different function names</h3>
<p>While <code class="language-plaintext highlighter-rouge">__init__</code> and <code class="language-plaintext highlighter-rouge">__str__</code> are fine for OCaml function names, they don’t feel quite right. <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> lets you bind Python functions to different names on the OCaml side using <a href="https://ocaml.org/manual/attributes.html">attributes</a> on the value specifications. To bind to a different function name, we use the <code class="language-plaintext highlighter-rouge">py_fun_name</code> attribute. Check it out.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">create</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-></span> <span class="n">age</span><span class="o">:</span><span class="kt">int</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="n">t</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__init__</span><span class="p">]</span>
<span class="k">val</span> <span class="n">to_string</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="kt">string</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__str__</span><span class="p">]</span></code></pre></figure>
<p>We bind the <code class="language-plaintext highlighter-rouge">__init__</code> function to an OCaml function called <code class="language-plaintext highlighter-rouge">create</code>, and the Python function <code class="language-plaintext highlighter-rouge">__str__</code> to the OCaml function <code class="language-plaintext highlighter-rouge">to_string</code>. That’s much more natural!</p>
<p>As you can see, the syntax is like this: <code class="language-plaintext highlighter-rouge">[@@attr-id attr-payload]</code>. In this case, the attribute id is <code class="language-plaintext highlighter-rouge">py_fun_name</code> and the payload is the name of the Python function that we want to bind. Put another way, the attribute payload should be the name of the function as it is defined in the Python library you are binding to (i.e., <code class="language-plaintext highlighter-rouge">__init__</code> is the name of the function on the Python side, not <code class="language-plaintext highlighter-rouge">create</code>).</p>
<p>Putting it together, you get <code class="language-plaintext highlighter-rouge">[@@py_fun_name __init__]</code> for the Python <code class="language-plaintext highlighter-rouge">__init__</code> function and <code class="language-plaintext highlighter-rouge">[@@py_fun_name __str__]</code> for the Python <code class="language-plaintext highlighter-rouge">__str__</code> function.</p>
<h3 id="using-different-argument-names">Using different argument names</h3>
<p>The other available attribute is <code class="language-plaintext highlighter-rouge">py_arg_name</code>. With this, we can bind arguments to different names on the OCaml and Python sides. This can be useful in situations in which Python argument names use reserved OCaml keywords, or simply to make the generated API feel more natural for use in OCaml.</p>
<p>For example, you may have a Python function that has an argument named <code class="language-plaintext highlighter-rouge">method</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">cluster</span><span class="p">(</span><span class="n">method</span><span class="o">=</span><span class="s">'ward'</span><span class="p">):</span>
<span class="p">...</span></code></pre></figure>
<p>Since <code class="language-plaintext highlighter-rouge">method</code> is a <a href="https://ocaml.org/manual/lex.html#sss:keywords">reserved keyword</a> in OCaml, we can’t use it directly. Instead, we want to name it <code class="language-plaintext highlighter-rouge">method_</code> in our OCaml code.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">cluster</span> <span class="o">:</span> <span class="n">method_</span><span class="o">:</span><span class="kt">string</span> <span class="o">-></span> <span class="o">...</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_arg_name</span> <span class="n">method_</span> <span class="n">method</span><span class="p">]</span></code></pre></figure>
<p>In this case, the payload is two items: the first is the argument name on the OCaml side, and the second is the argument name on the Python side.</p>
<p>Note that in cases in which you need <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples/attributes#multiple-attributes">multiple attributes</a> per specification, they must be placed one per line. (This is a <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>-specific restriction.) E.g., something like this:</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">run_clustering</span> <span class="o">:</span> <span class="n">method_</span><span class="o">:</span><span class="kt">string</span> <span class="o">-></span> <span class="o">...</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">cluster</span><span class="p">]</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_arg_name</span> <span class="n">method_</span> <span class="n">method</span><span class="p">]</span></code></pre></figure>
<p>This will bind the OCaml function <code class="language-plaintext highlighter-rouge">run_clustering</code> to the corresponding Python function <code class="language-plaintext highlighter-rouge">cluster</code>.</p>
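<p>If it helps to picture what the two attributes do together, here is a plain Python analogy. The rename table <code class="language-plaintext highlighter-rouge">ARG_NAMES</code> and the Python-side <code class="language-plaintext highlighter-rouge">run_clustering</code> wrapper are hypothetical and only mimic the renaming; the body of <code class="language-plaintext highlighter-rouge">cluster</code> is a stand-in, since the original elides it.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def cluster(method="ward"):
    """Stand-in for the Python function being bound."""
    return f"clustering with {method}"


# Hypothetical rename table: OCaml-side label to Python-side keyword,
# mirroring [@@py_arg_name method_ method].
ARG_NAMES = {"method_": "method"}


def run_clustering(**labelled_args):
    """Analogue of the OCaml function run_clustering: rename each
    label to its Python keyword, then call cluster."""
    kwargs = {ARG_NAMES.get(k, k): v for k, v in labelled_args.items()}
    return cluster(**kwargs)


print(run_clustering(method_="average"))  # clustering with average</code></pre></figure>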
<h2 id="binding-cyclic-python-classes">Binding cyclic Python classes</h2>
<p>Often you will need to bind Python classes that refer to each other. One way to bind these is to use <a href="https://ocaml.org/manual/recursivemodules.html">recursive modules</a>. Let’s update our Hobbit example to show how you can do this in <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Hobbit</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
<span class="bp">self</span><span class="p">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
<span class="bp">self</span><span class="p">.</span><span class="n">house</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="sa">f</span><span class="s">'Hobbit -- </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, age: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">age</span><span class="si">}</span><span class="s">, house: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">house</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">'</span>
<span class="k">def</span> <span class="nf">buy_house</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">house</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">house</span> <span class="o">=</span> <span class="n">house</span>
<span class="bp">self</span><span class="p">.</span><span class="n">house</span><span class="p">.</span><span class="n">owner</span> <span class="o">=</span> <span class="bp">self</span>
<span class="k">class</span> <span class="nc">House</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
<span class="bp">self</span><span class="p">.</span><span class="n">owner</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="sa">f</span><span class="s">'House -- </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, owner: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">owner</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">'</span></code></pre></figure>
<p>So this is a pretty silly example, but it’s just to illustrate the point. In this case, a <code class="language-plaintext highlighter-rouge">Hobbit</code> can own a <code class="language-plaintext highlighter-rouge">House</code> and a <code class="language-plaintext highlighter-rouge">House</code> can have a <code class="language-plaintext highlighter-rouge">Hobbit</code> for an owner.</p>
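<p>Here is a short usage sketch of those two classes in Python, showing the cycle. The house name <code class="language-plaintext highlighter-rouge">Bag End</code> is just an illustrative value. Note that calling <code class="language-plaintext highlighter-rouge">str</code> on either object before <code class="language-plaintext highlighter-rouge">buy_house</code> has run would fail, since <code class="language-plaintext highlighter-rouge">house</code> and <code class="language-plaintext highlighter-rouge">owner</code> start out as <code class="language-plaintext highlighter-rouge">None</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.house = None

    def __str__(self):
        return f"Hobbit -- {self.name}, age: {self.age}, house: {self.house.name}"

    def buy_house(self, house):
        self.house = house
        self.house.owner = self


class House:
    def __init__(self, name):
        self.name = name
        self.owner = None

    def __str__(self):
        return f"House -- {self.name}, owner: {self.owner.name}"


bilbo = Hobbit("Bilbo", 111)
bag_end = House("Bag End")
bilbo.buy_house(bag_end)

print(bilbo)    # Hobbit -- Bilbo, age: 111, house: Bag End
print(bag_end)  # House -- Bag End, owner: Bilbo

# Each object now holds a reference to the other.
assert bag_end.owner is bilbo
assert bilbo.house is bag_end</code></pre></figure>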
<p>To bind these classes, I will use the <code class="language-plaintext highlighter-rouge">gen_multi</code> and <code class="language-plaintext highlighter-rouge">combine_rec_modules</code> helper programs that come with <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>.</p>
<h3 id="gen_multi">gen_multi</h3>
<p><code class="language-plaintext highlighter-rouge">gen_multi</code> is a wrapper script that runs <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> multiple times to generate multiple OCaml modules in one go. It takes a tsv file specifying the same set of options that you would pass in to <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> if you used it directly.</p>
<p>Assume this is in a file called <code class="language-plaintext highlighter-rouge">gen_multi_cli.tsv</code>.</p>
<table class="scroll">
<thead>
<tr>
<th>signatures</th>
<th>py_module</th>
<th>py_class</th>
<th>associated_with</th>
<th>caml_module</th>
<th>split_caml_module</th>
<th>embed_python_source</th>
<th>of_pyo_ret_type</th>
</tr>
</thead>
<tbody>
<tr>
<td>hobbit.txt</td>
<td>hobbit</td>
<td>Hobbit</td>
<td>class</td>
<td>Hobbit</td>
<td>NA</td>
<td>hobbit.py</td>
<td>no_check</td>
</tr>
<tr>
<td>house.txt</td>
<td>house</td>
<td>House</td>
<td>class</td>
<td>House</td>
<td>NA</td>
<td>house.py</td>
<td>no_check</td>
</tr>
</tbody>
</table>
<p>The order of the columns must be as shown above. <em>(For more info on each of these options, run <code class="language-plaintext highlighter-rouge">pyml_bindgen --help</code>.)</em></p>
<p>You will see that we refer to <code class="language-plaintext highlighter-rouge">hobbit.txt</code> and <code class="language-plaintext highlighter-rouge">house.txt</code>. These are the value specifications for each of the Python classes. Here are there contents.</p>
<p><code class="language-plaintext highlighter-rouge">hobbit.txt</code></p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">create</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-></span> <span class="n">age</span><span class="o">:</span><span class="kt">int</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="n">t</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__init__</span><span class="p">]</span>
<span class="k">val</span> <span class="n">to_string</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="kt">string</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__str__</span><span class="p">]</span>
<span class="k">val</span> <span class="n">buy_house</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-></span> <span class="n">house</span><span class="o">:</span><span class="nn">House</span><span class="p">.</span><span class="n">t</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="kt">unit</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">house.txt</code></p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">create</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="n">t</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__init__</span><span class="p">]</span>
<span class="k">val</span> <span class="n">to_string</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">unit</span> <span class="o">-></span> <span class="kt">string</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__str__</span><span class="p">]</span></code></pre></figure>
<h3 id="combine_rec_modules">combine_rec_modules</h3>
<p><code class="language-plaintext highlighter-rouge">combine_rec_modules</code> takes a file of OCaml modules and “converts” them into recursive modules. It does this using a simple text transformation.</p>
<p>Often you will want to pipe the output of <code class="language-plaintext highlighter-rouge">gen_multi</code> directly into <code class="language-plaintext highlighter-rouge">combine_rec_modules</code>.</p>
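<p>To give a rough picture of what “recursive modules” means here, below is a hand-written sketch in plain OCaml. It is not the output of <code class="language-plaintext highlighter-rouge">combine_rec_modules</code>: the real generated modules wrap Python objects via pyml, whereas the record types, the <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">set_owner</code> helpers, and the function bodies are stand-ins invented for illustration. The point is only the <code class="language-plaintext highlighter-rouge">module rec ... and ...</code> shape, which lets <code class="language-plaintext highlighter-rouge">Hobbit</code> mention <code class="language-plaintext highlighter-rouge">House.t</code> while <code class="language-plaintext highlighter-rouge">House</code> mentions <code class="language-plaintext highlighter-rouge">Hobbit.t</code>.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">(* Hand-written stand-in, NOT generated by pyml_bindgen. *)
module rec Hobbit : sig
  type t
  val create : name:string -> age:int -> unit -> t
  val name : t -> string
  val buy_house : t -> house:House.t -> unit -> unit
  val to_string : t -> unit -> string
end = struct
  type t = { name : string; age : int; house : House.t option ref }

  let create ~name ~age () = { name; age; house = ref None }
  let name t = t.name

  (* Record the house on the hobbit and the hobbit on the house. *)
  let buy_house t ~house () =
    t.house := Some house;
    House.set_owner house ~owner:t

  let to_string t () =
    let house =
      match !(t.house) with
      | None -> "none"
      | Some h -> House.name h
    in
    Printf.sprintf "Hobbit -- %s, age: %d, house: %s" t.name t.age house
end

and House : sig
  type t
  val create : name:string -> unit -> t
  val name : t -> string
  val set_owner : t -> owner:Hobbit.t -> unit
  val to_string : t -> unit -> string
end = struct
  type t = { name : string; owner : Hobbit.t option ref }

  let create ~name () = { name; owner = ref None }
  let name t = t.name
  let set_owner t ~owner = t.owner := Some owner

  let to_string t () =
    let owner =
      match !(t.owner) with
      | None -> "none"
      | Some o -> Hobbit.name o
    in
    Printf.sprintf "House -- %s, owner: %s" t.name owner
end</code></pre></figure>
<p>Without <code class="language-plaintext highlighter-rouge">rec</code>, the first module could not refer to the second, because OCaml modules are otherwise defined from top to bottom.</p>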
<h3 id="generate-the-modules--test-it-out">Generate the modules & test it out</h3>
<p>Now let’s see it in action.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>gen_multi gen_multi_cli.tsv | combine_rec_modules /dev/stdin <span class="o">></span> lib.ml</code></pre></figure>
<p>We put that in a module called <code class="language-plaintext highlighter-rouge">Lib</code>. And here is how we might use that.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">open</span> <span class="nc">Lib</span>
<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="n">initialize</span> <span class="bp">()</span>
<span class="k">let</span> <span class="n">bilbo</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"Bilbo"</span> <span class="o">~</span><span class="n">age</span><span class="o">:</span><span class="mi">111</span> <span class="bp">()</span>
<span class="k">let</span> <span class="n">bag_end</span> <span class="o">=</span> <span class="nn">House</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"Bag End"</span> <span class="bp">()</span>
<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">buy_house</span> <span class="n">bilbo</span> <span class="o">~</span><span class="n">house</span><span class="o">:</span><span class="n">bag_end</span> <span class="bp">()</span>
<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
<span class="k">assert</span> <span class="p">(</span>
<span class="s2">"Hobbit -- Bilbo, age: 111, house: Bag End"</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">to_string</span> <span class="n">bilbo</span> <span class="bp">()</span><span class="p">)</span></code></pre></figure>
<h2 id="other-stuff">Other stuff</h2>
<p>Let me mention a couple of other things before we go…</p>
<ul>
<li>In this post we ran <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> (or its helper scripts) manually, but it’s not too hard to set up Dune <a href="https://dune.readthedocs.io/en/stable/dune-files.html#rule">rules</a> to automatically generate bindings. See the <code class="language-plaintext highlighter-rouge">dune</code> files in the <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples">example</a> directory on the <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> GitHub for more information.</li>
<li>While I only showed how to bind to Python classes, you can also bind to functions associated with modules rather than with classes.</li>
<li>Another cool feature is that you can embed Python source code directly into your generated OCaml modules. See <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples/embedding_python_source">here</a> for more details.</li>
</ul>
<h2 id="wrap-up">Wrap-up</h2>
<p><code class="language-plaintext highlighter-rouge">pyml_bindgen</code> is a command line app for generating Python bindings using pyml. It makes incorporating Python libraries into your OCaml projects as easy as writing regular OCaml value specifications.</p>
<p>To get more information on setting up and using <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, including ideas on how to structure your projects, check out the <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples">examples</a>, <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/test">tests</a>, and <a href="https://mooreryan.github.io/ocaml_python_bindgen/">docs</a>.</p>
<h1>An introduction to the re2 regular expression library for OCaml</h1>
<p><em>Ryan Moore, 2021-10-02. <a href="https://www.tenderisthebyte.com/blog/2021/10/02/ocaml-re2-tutorial">https://www.tenderisthebyte.com/blog/2021/10/02/ocaml-re2-tutorial</a></em></p>
<p>In this tutorial, we will talk about <a href="https://github.com/janestreet/re2">re2</a>, an OCaml library providing bindings to <a href="https://github.com/google/re2">RE2</a>, Google’s regular expression library.</p>
<p>This post is intended for newer OCaml programmers, or those who want to use the <code class="language-plaintext highlighter-rouge">re2</code> library, but could use a couple of examples to help get started. This is not a general introduction to regular expressions, however. If you have never used regular expressions before, read up a little bit on the syntax before tackling this post.</p>
<div class="post-toc">
<h4 class="post-toc--header" id="contents">Contents</h4>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#creating-regular-expressions">Creating regular expressions</a></li>
<li><a href="#checking-for-a-match">Checking for a match</a></li>
<li><a href="#finding-matches">Finding matches</a></li>
<li><a href="#finding-submatches">Finding submatches</a></li>
<li><a href="#splitting-strings">Splitting strings</a></li>
<li><a href="#replacing">Replacing</a></li>
<li><a href="#miscellaneous-info">Miscellaneous info</a></li>
<li><a href="#wrap-up">Wrap up</a></li>
</ul>
</div>
<h2 id="overview">Overview</h2>
<p>There are a few choices of regular expression libraries available for OCaml on <a href="https://opam.ocaml.org/">Opam</a>. Some of the most popular include</p>
<ul>
<li><a href="https://opam.ocaml.org/packages/re">re</a>, a pure OCaml library (installed 7667 times last month),</li>
<li><a href="https://opam.ocaml.org/packages/pcre">pcre</a>, bindings to the Perl Compatible Regular Expressions library (<a href="https://www.pcre.org/">PCRE</a>) (installed 1115 times last month), and</li>
<li><a href="https://opam.ocaml.org/packages/re2">re2</a>, OCaml bindings for RE2, Google’s regular expression library (installed 114 times last month).</li>
</ul>
<p>The first two are by far the most popular in terms of raw Opam install counts. However, <code class="language-plaintext highlighter-rouge">re2</code> integrates nicely into the Jane Street Base/Core/Async ecosystem (it’s a Jane Street package after all!), and is covered under the MIT license rather than the <a href="https://spdx.org/licenses/OCaml-LGPL-linking-exception.html">LGPL with OCaml linking exception</a>, which may be appealing depending on your situation.</p>
<p><em>Note: According to this <a href="https://blog.janestreet.com/what-the-interns-have-wrought-2020/">blog post</a> and this <a href="https://github.com/janestreet/re2/issues/26#issuecomment-395870146">GitHub issue</a>, Jane Street is phasing out its use of re2. The <a href="https://github.com/janestreet/re2">re2 GitHub</a> does have recent commits, though, so your mileage may vary.</em></p>
<p>One issue that newcomers may face when getting started with the <code class="language-plaintext highlighter-rouge">re2</code> library is the slightly terse <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html">API documentation</a>. While it is detailed and thorough, it can be hard to get started with if you’re not already used to reading Jane Street <code class="language-plaintext highlighter-rouge">mli</code> files and source code.</p>
<p><em>Note: if you want to follow along, you can paste the examples into the toplevel (or <a href="https://opam.ocaml.org/blog/about-utop/">utop</a>). However, don’t paste in lines starting with <code class="language-plaintext highlighter-rouge">- :</code>. These lines show the type of the expression as reported by <code class="language-plaintext highlighter-rouge">utop</code>.</em></p>
<h2 id="creating-regular-expressions">Creating regular expressions</h2>
<p>You create regular expressions with <code class="language-plaintext highlighter-rouge">Re2.create</code> and <code class="language-plaintext highlighter-rouge">Re2.create_exn</code>. The former returns <code class="language-plaintext highlighter-rouge">Re2.t Or_error.t</code> and the latter <code class="language-plaintext highlighter-rouge">Re2.t</code>.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Or_error</span><span class="p">.</span><span class="n">ok_exn</span> <span class="o">@@</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create</span> <span class="s2">"apple"</span><span class="p">;;</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span><span class="p">;;</span></code></pre></figure>
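<p>Because <code class="language-plaintext highlighter-rouge">Re2.create</code> returns an <code class="language-plaintext highlighter-rouge">Or_error.t</code>, you can also deal with a bad pattern without exceptions by matching on the result. Here is a minimal sketch, assuming the library is already loaded (e.g., <code class="language-plaintext highlighter-rouge">#require "re2";;</code> in utop). The helper name <code class="language-plaintext highlighter-rouge">is_valid_regex</code> is made up for this example and is not part of <code class="language-plaintext highlighter-rouge">re2</code>.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">(* Hypothetical helper: does [pattern] compile? Never raises. *)
let is_valid_regex pattern =
  match Re2.create pattern with
  | Ok _ -> true
  | Error _ -> false
;;</code></pre></figure>
<p>With this, <code class="language-plaintext highlighter-rouge">is_valid_regex "apple"</code> is <code class="language-plaintext highlighter-rouge">true</code>, while a pattern with an unbalanced parenthesis such as <code class="language-plaintext highlighter-rouge">"a(b"</code> gives <code class="language-plaintext highlighter-rouge">false</code>.</p>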
<h3 id="matching-options">Matching options</h3>
<p>You can control how regular expression matching works by passing the <code class="language-plaintext highlighter-rouge">options</code> argument to the <code class="language-plaintext highlighter-rouge">create</code> and <code class="language-plaintext highlighter-rouge">create_exn</code> functions. If you omit this argument, the default options will be passed. Here they are:</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">default</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
<span class="p">{</span>
<span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">case_sensitive</span> <span class="o">=</span> <span class="bp">true</span><span class="p">;</span>
<span class="n">dot_nl</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">encoding</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="nn">Encoding</span><span class="p">.</span><span class="nc">Utf8</span><span class="p">;</span>
<span class="n">literal</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">log_errors</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">longest_match</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">max_mem</span> <span class="o">=</span> <span class="mi">8388608</span><span class="p">;</span>
<span class="n">never_capture</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">never_nl</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">one_line</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">perl_classes</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">posix_syntax</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="n">word_boundary</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>For a more detailed description of these options, see the <a href="https://github.com/janestreet/re2/blob/89373a48bc786be9b2a7f530dd5954222515c048/src/re2_c/libre2/re2/re2.h#L509">re2.h</a> header file.</p>
<p>By default, <code class="language-plaintext highlighter-rouge">re2</code> uses case-sensitive matching. To create a case-insensitive regex, pass in an options record like so.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re_i</span> <span class="o">=</span>
<span class="k">let</span> <span class="n">options</span> <span class="o">=</span> <span class="p">{</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">default</span> <span class="k">with</span> <span class="n">case_sensitive</span> <span class="o">=</span> <span class="bp">false</span> <span class="p">}</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="o">~</span><span class="n">options</span> <span class="s2">"abc"</span></code></pre></figure>
<h2 id="checking-for-a-match">Checking for a match</h2>
<p>Perhaps the most basic regex task is to check if a string matches a given regular expression. You can use <code class="language-plaintext highlighter-rouge">Re2.matches</code> for this.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="c">(* Case sensitive *)</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"apple pie"</span><span class="p">);</span>
<span class="k">assert</span> <span class="p">(</span><span class="n">not</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"Apple pie"</span><span class="p">));;</span>
<span class="c">(* Case insensitive *)</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span>
<span class="k">let</span> <span class="n">options</span> <span class="o">=</span> <span class="p">{</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">default</span> <span class="k">with</span> <span class="n">case_sensitive</span> <span class="o">=</span> <span class="bp">false</span> <span class="p">}</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="o">~</span><span class="n">options</span> <span class="s2">"apple"</span>
<span class="k">in</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"apple pie"</span><span class="p">);</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"Apple pie"</span><span class="p">);;</span></code></pre></figure>
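<p>Note that in the example above, <code class="language-plaintext highlighter-rouge">"apple"</code> matches <code class="language-plaintext highlighter-rouge">"apple pie"</code>: <code class="language-plaintext highlighter-rouge">Re2.matches</code> looks for the pattern anywhere in the string. If you need the whole string to match, one option is to anchor the pattern with <code class="language-plaintext highlighter-rouge">^</code> and <code class="language-plaintext highlighter-rouge">$</code>. A small sketch follows; <code class="language-plaintext highlighter-rouge">is_exactly_apple</code> is a name invented for illustration.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">(* Hypothetical helper: true only when the entire string is "apple". *)
let is_exactly_apple s =
  let re = Re2.create_exn "^apple$" in
  Re2.matches re s
;;
assert (is_exactly_apple "apple");
assert (not (is_exactly_apple "apple pie"));;</code></pre></figure>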
<h2 id="finding-matches">Finding matches</h2>
<p>To retrieve the matched text itself, rather than just test for a match, you can use the <code class="language-plaintext highlighter-rouge">find_*</code> functions.</p>
<h3 id="find-first-match">Find first match</h3>
<p>To return the first match in the query string, use <code class="language-plaintext highlighter-rouge">find_first</code> or <code class="language-plaintext highlighter-rouge">find_first_exn</code>. These functions return the matched string rather than the underlying <code class="language-plaintext highlighter-rouge">Re2.Match.t</code>.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="n">re</span> <span class="s2">"apple pie is made from apples"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"apple"</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[ab]{2}"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="n">re</span> <span class="s2">"ababa"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"ab"</span></code></pre></figure>
<h3 id="find-all-matches">Find all matches</h3>
<p>While <code class="language-plaintext highlighter-rouge">find_first</code> returns the first match in a query string, <code class="language-plaintext highlighter-rouge">find_all</code> and <code class="language-plaintext highlighter-rouge">find_all_exn</code> return lists of all non-overlapping matches in the query string.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all</span> <span class="n">re</span> <span class="s2">"apple pie"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="nn">Or_error</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span> <span class="nn">Result</span><span class="p">.</span><span class="nc">Ok</span> <span class="p">[</span><span class="s2">"apple"</span><span class="p">]</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="n">re</span> <span class="s2">"apple pie is made from apples"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"apple"</span><span class="p">;</span> <span class="s2">"apple"</span><span class="p">]</span></code></pre></figure>
<h4 id="submatches-and-capturing-groups">Submatches and capturing groups</h4>
<p>You can use the <code class="language-plaintext highlighter-rouge">sub</code> argument to return submatches defined by capturing groups rather than the whole match.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">s</span> <span class="o">=</span> <span class="s2">"ab ac ab"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="n">re</span> <span class="n">s</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"b"</span><span class="p">;</span> <span class="s2">"c"</span><span class="p">;</span> <span class="s2">"b"</span><span class="p">]</span></code></pre></figure>
<p>Be aware that passing an index greater than the number of capturing groups will raise an error.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">s</span> <span class="o">=</span> <span class="s2">"ab ac ab"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">10</span><span class="p">)</span> <span class="n">re</span> <span class="n">s</span><span class="p">;;</span>
<span class="nc">Exception</span><span class="o">:</span> <span class="nn">Re2__Regex</span><span class="p">.</span><span class="nn">Exceptions</span><span class="p">.</span><span class="nc">Regex_no_such_subpattern</span><span class="p">(</span><span class="mi">10</span><span class="o">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span></code></pre></figure>
<h4 id="or_error-returning-vs-exception-raising">Or_error returning vs. Exception raising</h4>
<p>Like most of the functions in the <code class="language-plaintext highlighter-rouge">Re2</code> module, the <code class="language-plaintext highlighter-rouge">find</code> functions come in both <code class="language-plaintext highlighter-rouge">Or_error.t</code>-returning and exception-raising versions. If the regular expression doesn’t match, <code class="language-plaintext highlighter-rouge">find_all</code> returns a <code class="language-plaintext highlighter-rouge">Result.Error</code>, whereas <code class="language-plaintext highlighter-rouge">find_all_exn</code> raises an exception.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all</span> <span class="n">re</span> <span class="s2">"peach pie"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="nn">Or_error</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
<span class="nn">Result</span><span class="p">.</span><span class="nc">Error</span>
<span class="p">(</span><span class="s2">"Re2__Regex.Exceptions.Regex_match_failed(</span><span class="se">\"</span><span class="s2">apple</span><span class="se">\"</span><span class="s2">)"</span><span class="p">)</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="n">re</span> <span class="s2">"peach pie"</span><span class="p">;;</span>
<span class="nc">Exception</span><span class="o">:</span> <span class="nn">Re2__Regex</span><span class="p">.</span><span class="nn">Exceptions</span><span class="p">.</span><span class="nc">Regex_match_failed</span><span class="p">(</span><span class="s2">"apple"</span><span class="p">)</span><span class="o">.</span>
<span class="c">(* ...output omitted... *)</span></code></pre></figure>
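<p>If “no match” is an ordinary outcome in your program rather than an error, you can pattern match on the <code class="language-plaintext highlighter-rouge">Or_error.t</code> result and fall back to an empty list. The sketch below assumes that is what you want; <code class="language-plaintext highlighter-rouge">find_all_or_empty</code> is a hypothetical helper, not something provided by <code class="language-plaintext highlighter-rouge">re2</code>.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">(* Hypothetical helper: treat any error from find_all as "no matches". *)
let find_all_or_empty re s =
  match Re2.find_all re s with
  | Ok matches -> matches
  | Error _ -> []
;;</code></pre></figure>
<p>Keep in mind that this swallows every error, not only a failed match, so it is best kept for cases where the pattern and arguments are known to be valid.</p>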
<p>It is important to remember that the <code class="language-plaintext highlighter-rouge">find_all</code> functions return <em>non-overlapping</em> matches.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[ab]{2}"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="n">re</span> <span class="s2">"ababa"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"ab"</span><span class="p">;</span> <span class="s2">"ab"</span><span class="p">]</span></code></pre></figure>
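<p>To see where the two results above come from, here is a toy scanner in plain OCaml for the fixed pattern <code class="language-plaintext highlighter-rouge">[ab]{2}</code>. It is only an illustration of “resume searching after the previous match” and says nothing about how RE2 is implemented; the function names are made up.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">(* Toy scanner for the fixed pattern [ab]{2}. NOT how RE2 works internally. *)
let is_ab c = c = 'a' || c = 'b'

let find_all_ab_pairs s =
  let n = String.length s in
  let rec go i acc =
    if i + 2 > n then List.rev acc
    else if List.for_all is_ab [ s.[i]; s.[i + 1] ] then
      (* Match at i: jump past it, so the next match cannot overlap. *)
      go (i + 2) (String.sub s i 2 :: acc)
    else go (i + 1) acc
  in
  go 0 []
;;</code></pre></figure>
<p>Here <code class="language-plaintext highlighter-rouge">find_all_ab_pairs "ababa"</code> gives <code class="language-plaintext highlighter-rouge">["ab"; "ab"]</code>. The <code class="language-plaintext highlighter-rouge">"ba"</code> starting at index 1 is never reported, because the scan resumes at index 2 after the first match.</p>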
<h2 id="finding-submatches">Finding submatches</h2>
<p>If you need a bit more control than provided by <code class="language-plaintext highlighter-rouge">find_all</code> with the <code class="language-plaintext highlighter-rouge">sub</code> argument (e.g., <code class="language-plaintext highlighter-rouge">find_all ~sub:(` Index 1)</code>), then you may need to use <code class="language-plaintext highlighter-rouge">find_submatches</code> or <code class="language-plaintext highlighter-rouge">find_submatches_exn</code>. These return the first match in the query string. The match is returned as a <code class="language-plaintext highlighter-rouge">string option array</code>, where the first element is the whole match, and subsequent elements are submatches as defined by any capturing groups.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])([de])"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_submatches_exn</span> <span class="n">re</span> <span class="s2">"abdace"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="n">option</span> <span class="kt">array</span> <span class="o">=</span> <span class="p">[</span><span class="o">|</span><span class="nc">Some</span> <span class="s2">"abd"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"b"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"d"</span><span class="o">|</span><span class="p">]</span></code></pre></figure>
<p>You may wonder why <code class="language-plaintext highlighter-rouge">find_submatches_exn</code> returns a <code class="language-plaintext highlighter-rouge">string option array</code> and not simply a <code class="language-plaintext highlighter-rouge">string array</code>. <code class="language-plaintext highlighter-rouge">find_submatches_exn</code> uses <code class="language-plaintext highlighter-rouge">Match.get</code> <a href="https://github.com/janestreet/re2/blob/72e01a088b48791aa6387dc3a093d3806122e2bd/src/regex.ml#L307">under the hood</a>. Basically, <code class="language-plaintext highlighter-rouge">find_submatches_exn</code> processes a <code class="language-plaintext highlighter-rouge">Match.t Sequence.t</code> of matches, calling <code class="language-plaintext highlighter-rouge">get</code> on each one. And the <code class="language-plaintext highlighter-rouge">Match.get</code> function <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/Match/index.html#val-get">returns</a> a <code class="language-plaintext highlighter-rouge">string option</code>.</p>
<p>This little code snippet will hopefully give you an idea of what’s going on.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])([de])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">match_</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">first_match_exn</span> <span class="n">re</span> <span class="s2">"abdace"</span> <span class="k">in</span>
<span class="p">[</span><span class="o">|</span>
<span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">0</span><span class="p">);</span>
<span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">);</span>
<span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">2</span><span class="p">);</span>
<span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">3</span><span class="p">);</span>
<span class="o">|</span><span class="p">]</span>
<span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="n">option</span> <span class="kt">array</span> <span class="o">=</span> <span class="p">[</span><span class="o">|</span> <span class="nc">Some</span> <span class="s2">"abd"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"b"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"d"</span><span class="p">;</span> <span class="nc">None</span> <span class="o">|</span><span class="p">]</span></code></pre></figure>
<p>If the <code class="language-plaintext highlighter-rouge">Index</code> you pass to <code class="language-plaintext highlighter-rouge">~sub</code> is greater than the number of capturing groups (i.e., greater than or equal to the value returned by <code class="language-plaintext highlighter-rouge">Re2.num_submatches</code>, which is the number of capturing groups plus one), <code class="language-plaintext highlighter-rouge">None</code> is returned.</p>
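<p>To make the index bound concrete, here is a minimal sketch that checks it with assertions. It assumes <code class="language-plaintext highlighter-rouge">open Core</code> (as in the benchmark program later in this post) and reuses the regex from the snippet above; the name <code class="language-plaintext highlighter-rouge">n</code> is just for illustration.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "a([bc])([de])"

(* Two capturing groups plus the whole match. *)
let n = Re2.num_submatches re
let () = assert (n = 3)

let m = Re2.first_match_exn re "abdace"

(* Valid indices run from 0 to n - 1; index n is out of range. *)
let () =
  assert (Option.is_some (Re2.Match.get m ~sub:(`Index (n - 1))));
  assert (Option.is_none (Re2.Match.get m ~sub:(`Index n)))</code></pre></figure>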
<h3 id="more-complicated-submatch-interface">More complicated submatch interface</h3>
<p>If you want to work with the <code class="language-plaintext highlighter-rouge">Re2.Match.t</code> directly, you can use functions from the <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#complicated-interface">complicated interface</a> like <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#val-first_match">first_match</a> and <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#val-get_matches">get_matches</a>.</p>
<p>If you need to work with submatches of every match in a string rather than just the first, and you need direct access to the <code class="language-plaintext highlighter-rouge">Match.t</code>, you will want to use <code class="language-plaintext highlighter-rouge">get_matches</code> or <code class="language-plaintext highlighter-rouge">get_matches_exn</code>. Let’s try it out with a weird, little example.</p>
<p>Say we have a string made up of chunks. Each chunk is a number followed by an <code class="language-plaintext highlighter-rouge">A</code> (for add) or an <code class="language-plaintext highlighter-rouge">S</code> (for subtract) (e.g., <code class="language-plaintext highlighter-rouge">50A</code> and <code class="language-plaintext highlighter-rouge">3S</code>). The chunk describes an arithmetic operation: <code class="language-plaintext highlighter-rouge">12A</code> means add 12 to the previous total; <code class="language-plaintext highlighter-rouge">3S</code> means subtract 3 from the previous total.</p>
<p>A full string then might look something like this: <code class="language-plaintext highlighter-rouge">10A5S2S3A</code>, which represents the following sequence of operations: <code class="language-plaintext highlighter-rouge">0 + 10 - 5 - 2 + 3</code>.</p>
<p>One way to solve this little problem is with regexes and the <code class="language-plaintext highlighter-rouge">get_matches</code> function. Let’s see how it might go.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">total</span> <span class="o">=</span>
<span class="k">let</span> <span class="n">s</span> <span class="o">=</span> <span class="s2">"10A5S2S3A"</span> <span class="k">in</span>
<span class="c">(* Make the regex *)</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([0-9]+)([AS])"</span> <span class="k">in</span>
<span class="c">(* Get a Match.t list *)</span>
<span class="k">let</span> <span class="n">matches</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">get_matches_exn</span> <span class="n">re</span> <span class="n">s</span> <span class="k">in</span>
<span class="c">(* Fold over the matches to get the total. *)</span>
<span class="nn">List</span><span class="p">.</span><span class="n">fold</span> <span class="n">matches</span> <span class="o">~</span><span class="n">init</span><span class="o">:</span><span class="mi">0</span> <span class="o">~</span><span class="n">f</span><span class="o">:</span><span class="p">(</span><span class="k">fun</span> <span class="n">total</span> <span class="n">m</span> <span class="o">-></span>
<span class="c">(* The first capturing group is the "count". *)</span>
<span class="k">let</span> <span class="n">number</span> <span class="o">=</span> <span class="nn">Int</span><span class="p">.</span><span class="n">of_string</span> <span class="o">@@</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="k">in</span>
<span class="c">(* The second capturing group represents the operation. *)</span>
<span class="k">match</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">2</span><span class="p">)</span> <span class="k">with</span>
<span class="o">|</span> <span class="s2">"A"</span> <span class="o">-></span> <span class="n">total</span> <span class="o">+</span> <span class="n">number</span>
<span class="o">|</span> <span class="s2">"S"</span> <span class="o">-></span> <span class="n">total</span> <span class="o">-</span> <span class="n">number</span>
<span class="o">|</span> <span class="n">_</span> <span class="o">-></span> <span class="k">assert</span> <span class="bp">false</span><span class="p">)</span>
<span class="p">;;</span>
<span class="k">assert</span> <span class="p">(</span><span class="n">total</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">+</span> <span class="mi">10</span> <span class="o">-</span> <span class="mi">5</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">3</span><span class="p">);;</span></code></pre></figure>
<p><em>Note: This weird format is actually loosely based on the <a href="https://en.wikipedia.org/wiki/Sequence_alignment#Representations">CIGAR</a> strings found in <a href="http://samtools.github.io/hts-specs/SAMv1.pdf">SAM files</a> describing <a href="https://en.wikipedia.org/wiki/Sequence_alignment">biological sequence alignments</a>.</em></p>
<h3 id="controlling-submatches">Controlling submatches</h3>
<p>In the last two examples, we used the <code class="language-plaintext highlighter-rouge">sub</code> argument along with a polymorphic variant to select capture groups. Let’s take a closer look at the type used for that.</p>
<p>To select submatches, we use <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#type-id_t">id_t</a>, which looks like this:</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">type</span> <span class="n">id_t</span> <span class="o">=</span> <span class="p">[</span> <span class="err">`</span> <span class="nc">Index</span> <span class="k">of</span> <span class="kt">int</span> <span class="o">|</span> <span class="err">`</span> <span class="nc">Name</span> <span class="k">of</span> <span class="kt">string</span> <span class="p">]</span></code></pre></figure>
<p>This type is used to refer to submatches. E.g., <code class="language-plaintext highlighter-rouge">` Index 1</code> refers to the first capturing group, <code class="language-plaintext highlighter-rouge">` Index 2</code> to the second, and so on. Remember that <code class="language-plaintext highlighter-rouge">` Index 0</code> refers to the whole match.</p>
<p>In addition to referring to submatches/capturing groups by index, you can refer to them by name.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a(?P&lt;second_letter&gt;[bc])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">m</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">first_match_exn</span> <span class="n">re</span> <span class="s2">"abc"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Name</span> <span class="s2">"second_letter"</span><span class="p">)</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">y</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="k">in</span>
<span class="k">assert</span> <span class="nn">String</span><span class="p">.(</span><span class="n">x</span> <span class="o">=</span> <span class="n">y</span><span class="p">);;</span></code></pre></figure>
<p>When using a complicated regular expression with multiple capturing groups, it is often less error prone to use named submatches rather than numbered ones.</p>
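<p>As a sketch of how that might look, here is the add/subtract example from earlier rewritten with named groups. It assumes <code class="language-plaintext highlighter-rouge">open Core</code>; the function name <code class="language-plaintext highlighter-rouge">eval</code> and the group names <code class="language-plaintext highlighter-rouge">number</code> and <code class="language-plaintext highlighter-rouge">op</code> are made up for this illustration.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

(* Same pattern as before, but each group has a (hypothetical) name. *)
let re = Re2.create_exn "(?P&lt;number&gt;[0-9]+)(?P&lt;op&gt;[AS])"

(* eval is an illustrative helper, not part of re2. *)
let eval s =
  Re2.get_matches_exn re s
  |> List.fold ~init:0 ~f:(fun total m ->
    let number = Int.of_string (Re2.Match.get_exn m ~sub:(`Name "number")) in
    match Re2.Match.get_exn m ~sub:(`Name "op") with
    | "A" -> total + number
    | "S" -> total - number
    | _ -> assert false)

let () = assert (eval "10A5S2S3A" = 6)</code></pre></figure>
<p>If the groups are later reordered or a new one is added in front, the lookups by name keep working, whereas lookups by index would silently point at the wrong group.</p>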
<p><em>Note: It is not a compile-time error to try to access a capturing group that doesn’t exist in the regular expression. Depending on the function, you may get <code class="language-plaintext highlighter-rouge">None</code> back or an exception may be raised.</em></p>
<h3 id="using-id_t-to-control-match-efficiency">Using <code class="language-plaintext highlighter-rouge">id_t</code> to control match efficiency</h3>
<p>Many of the regex matching functions take a <code class="language-plaintext highlighter-rouge">?sub:id_t</code> argument.</p>
<p>In some cases, you can increase the efficiency of matching by restricting the number of submatches. If you only care about whether a pattern matches, and not about submatches, you could pass in <code class="language-plaintext highlighter-rouge">~sub:(` Index (-1))</code> to many of the above functions. The parentheses around <code class="language-plaintext highlighter-rouge">-1</code> are required; without them, OCaml parses the <code class="language-plaintext highlighter-rouge">-</code> as binary subtraction.</p>
<p>You get progressively more information as you increase the index <code class="language-plaintext highlighter-rouge">n</code>.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="c">(* Get only the whole match. *)</span>
<span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">0</span><span class="p">)</span>
<span class="c">(* Get the whole match and first submatch. *)</span>
<span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span></code></pre></figure>
<p><a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#type-id_t">This section</a> of the documentation has more info on how specifying the <code class="language-plaintext highlighter-rouge">sub</code> argument can have an impact on regex performance, and which functions are affected by its usage.</p>
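<p>Here is a small sketch of passing <code class="language-plaintext highlighter-rouge">sub</code> to <code class="language-plaintext highlighter-rouge">get_matches_exn</code> so that only the whole match is requested for each hit. It assumes <code class="language-plaintext highlighter-rouge">open Core</code> and that <code class="language-plaintext highlighter-rouge">get_matches_exn</code> accepts the optional <code class="language-plaintext highlighter-rouge">?sub</code> argument as the linked documentation describes; the name <code class="language-plaintext highlighter-rouge">whole_matches</code> is illustrative. Whether this is measurably faster will depend on your pattern and input.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "a([bc])([de])"

(* Ask only for submatch 0 (the whole match) of every match. *)
let whole_matches =
  Re2.get_matches_exn ~sub:(`Index 0) re "abdace"
  |> List.map ~f:(fun m -> Re2.Match.get_exn m ~sub:(`Index 0))

let () = assert (List.equal String.equal whole_matches [ "abd"; "ace" ])</code></pre></figure>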
<h2 id="splitting-strings">Splitting strings</h2>
<p>Another common regex task is splitting an input string based on a regular expression pattern. <code class="language-plaintext highlighter-rouge">Re2</code> provides the <code class="language-plaintext highlighter-rouge">split</code> function for this purpose.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[.,! ]+"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="n">re</span> <span class="s2">"Hello, world! I like pie."</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"Hello"</span><span class="p">;</span> <span class="s2">"world"</span><span class="p">;</span> <span class="s2">"I"</span><span class="p">;</span> <span class="s2">"like"</span><span class="p">;</span> <span class="s2">"pie"</span><span class="p">;</span> <span class="s2">""</span><span class="p">]</span></code></pre></figure>
<p>If you need to include the actual matches in the output, you can. Passing <code class="language-plaintext highlighter-rouge">~include_matches:true</code> ensures the “separators” are in there with the rest of the output.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[.,! ]+"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="o">~</span><span class="n">include_matches</span><span class="o">:</span><span class="bp">true</span> <span class="n">re</span> <span class="s2">"Hello, world! I like pie."</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span>
<span class="p">[</span><span class="s2">"Hello"</span><span class="p">;</span> <span class="s2">", "</span><span class="p">;</span> <span class="s2">"world"</span><span class="p">;</span> <span class="s2">"! "</span><span class="p">;</span> <span class="s2">"I"</span><span class="p">;</span> <span class="s2">" "</span><span class="p">;</span> <span class="s2">"like"</span><span class="p">;</span> <span class="s2">" "</span><span class="p">;</span> <span class="s2">"pie"</span><span class="p">;</span> <span class="s2">"."</span><span class="p">;</span> <span class="s2">""</span><span class="p">]</span></code></pre></figure>
<p>Just be aware of that empty string at the end! It is there because the input ends with a separator.</p>
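<p>If you would rather not have empty pieces in the result, one option is to filter them out afterwards. This is a minimal sketch assuming <code class="language-plaintext highlighter-rouge">open Core</code>; the helper name <code class="language-plaintext highlighter-rouge">words</code> is made up for this example.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "[.,! ]+"

(* words is an illustrative helper: split, then drop any empty pieces. *)
let words s =
  Re2.split re s |> List.filter ~f:(fun piece -> not (String.is_empty piece))

let () =
  assert (
    List.equal
      String.equal
      (words "Hello, world! I like pie.")
      [ "Hello"; "world"; "I"; "like"; "pie" ])</code></pre></figure>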
<p>You can also limit the number of splits with the <code class="language-plaintext highlighter-rouge">max</code> argument. You could use this to get the first value separated from the remaining values in a string of tab-separated values, for example.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"</span><span class="se">\t</span><span class="s2">"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="o">~</span><span class="n">max</span><span class="o">:</span><span class="mi">1</span> <span class="n">re</span> <span class="s2">"apple</span><span class="se">\t</span><span class="s2">pie</span><span class="se">\t</span><span class="s2">is</span><span class="se">\t</span><span class="s2">good"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"apple"</span><span class="p">;</span> <span class="s2">"pie</span><span class="se">\t</span><span class="s2">is</span><span class="se">\t</span><span class="s2">good"</span><span class="p">]</span></code></pre></figure>
<p>If the regular expression has no matches in the query string, then a one-element list containing the whole query string is returned.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"</span><span class="se">\t</span><span class="s2">"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="o">~</span><span class="n">max</span><span class="o">:</span><span class="mi">1</span> <span class="n">re</span> <span class="s2">"apple pie is good"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"apple pie is good"</span><span class="p">]</span></code></pre></figure>
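<p>Since the result is a plain list, the two cases above can be told apart with a pattern match. The sketch below assumes <code class="language-plaintext highlighter-rouge">open Core</code>; <code class="language-plaintext highlighter-rouge">split_first_field</code> is a hypothetical helper name, not something <code class="language-plaintext highlighter-rouge">re2</code> provides.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let tab = Re2.create_exn "\t"

(* Illustrative helper: Some (first, rest) if the line contains a tab,
   None if there was nothing to split on. *)
let split_first_field line =
  match Re2.split ~max:1 tab line with
  | [ first; rest ] -> Some (first, rest)
  | _ -> None

let () =
  (match split_first_field "apple\tpie\tis\tgood" with
   | Some ("apple", "pie\tis\tgood") -> ()
   | _ -> assert false);
  assert (Option.is_none (split_first_field "apple pie is good"))</code></pre></figure>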
<h2 id="replacing">Replacing</h2>
<h3 id="using-rewrite">Using <code class="language-plaintext highlighter-rouge">rewrite</code></h3>
<p>The simpler interface for regex replacement consists of the <code class="language-plaintext highlighter-rouge">rewrite</code> and <code class="language-plaintext highlighter-rouge">rewrite_exn</code> functions. The <code class="language-plaintext highlighter-rouge">template</code> argument defines how you want to replace any matches in the query string. In this case, we replace any matches with a capital A.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">rewrite_exn</span> <span class="n">re</span> <span class="o">~</span><span class="n">template</span><span class="o">:</span><span class="s2">"A"</span> <span class="s2">"apple peach"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"Apple peAch"</span></code></pre></figure>
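<p>You can reference submatches in the template string using the syntax <code class="language-plaintext highlighter-rouge">\\n</code>, where <code class="language-plaintext highlighter-rouge">n</code> is the index of the submatch. The backslash is doubled because it has to be escaped inside an OCaml string literal. Check it out.</p>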
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([ae])"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">rewrite_exn</span> <span class="n">re</span> <span class="o">~</span><span class="n">template</span><span class="o">:</span><span class="s2">"( </span><span class="se">\\</span><span class="s2">1 )"</span> <span class="s2">"apple peach"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"( a )ppl( e ) p( e )( a )ch"</span></code></pre></figure>
<p>If you have multiple submatches, just keep referring to them in the same way: <code class="language-plaintext highlighter-rouge">\\1 ... \\2 ...</code> etc.</p>
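<p>For instance, here is a small sketch with two submatches that swaps the pieces on either side of an equals sign. It assumes <code class="language-plaintext highlighter-rouge">open Core</code>; the pattern and input are made up for illustration.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "([a-z]+)=([0-9]+)"

(* Put the second submatch first and the first submatch second. *)
let swapped = Re2.rewrite_exn re ~template:"\\2=\\1" "a=1 b=22"

let () = assert (String.equal swapped "1=a 22=b")</code></pre></figure>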
<p>If you need to check whether your rewrite template is valid before running <code class="language-plaintext highlighter-rouge">rewrite</code>, use the <code class="language-plaintext highlighter-rouge">valid_rewrite_template</code> function.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([ae])([io])([uy])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">template</span> <span class="o">=</span> <span class="s2">"</span><span class="se">\\</span><span class="s2">3 - </span><span class="se">\\</span><span class="s2">2 - </span><span class="se">\\</span><span class="s2">1"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">valid_rewrite_template</span> <span class="n">re</span> <span class="o">~</span><span class="n">template</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span></code></pre></figure>
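<p>For contrast, a template that refers to a submatch the pattern does not have should be reported as invalid. The sketch below assumes <code class="language-plaintext highlighter-rouge">open Core</code>; <code class="language-plaintext highlighter-rouge">safe_rewrite</code> is a hypothetical wrapper showing one way you might use the check, and leaving the input unchanged on an invalid template is just one possible policy.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "([ae])([io])([uy])"

(* The template refers to a fourth submatch, but the pattern only has three. *)
let () = assert (not (Re2.valid_rewrite_template re ~template:"\\4"))

(* Illustrative wrapper: only rewrite when the template is valid,
   otherwise return the input unchanged. *)
let safe_rewrite re ~template s =
  if Re2.valid_rewrite_template re ~template
  then Re2.rewrite_exn re ~template s
  else s

let () =
  assert (String.equal (safe_rewrite re ~template:"\\3\\2\\1" "aiu") "uia");
  assert (String.equal (safe_rewrite re ~template:"\\4" "aiu") "aiu")</code></pre></figure>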
<h3 id="using-replace">Using <code class="language-plaintext highlighter-rouge">replace</code></h3>
<p>The <code class="language-plaintext highlighter-rouge">re2</code> library also provides more powerful replacing functions: <code class="language-plaintext highlighter-rouge">replace</code> and <code class="language-plaintext highlighter-rouge">replace_exn</code>. You can use them if you need direct access to the <code class="language-plaintext highlighter-rouge">Match.t</code>.</p>
<p>Here is a silly example that picks a different replacement value depending on the match.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([ae])"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">replace_exn</span> <span class="n">re</span> <span class="s2">"apple peach"</span> <span class="o">~</span><span class="n">f</span><span class="o">:</span><span class="p">(</span><span class="k">fun</span> <span class="n">m</span> <span class="o">-></span>
<span class="k">match</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="k">with</span>
<span class="o">|</span> <span class="s2">"a"</span> <span class="o">-></span> <span class="s2">"u"</span>
<span class="o">|</span> <span class="s2">"e"</span> <span class="o">-></span> <span class="s2">"o"</span>
<span class="o">|</span> <span class="n">_</span> <span class="o">-></span> <span class="k">assert</span> <span class="bp">false</span><span class="p">)</span>
<span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"upplo pouch"</span></code></pre></figure>
<p>While the <code class="language-plaintext highlighter-rouge">replace</code> function is more complicated than <code class="language-plaintext highlighter-rouge">rewrite</code>, it gives you more control and has a few <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#val-replace">other options</a> you may find useful.</p>
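<p>One thing <code class="language-plaintext highlighter-rouge">replace</code> can do that a template cannot is compute the replacement from the matched text. Here is a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">open Core</code>, that doubles every number in a string; the input and the name <code class="language-plaintext highlighter-rouge">doubled</code> are made up for illustration.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "[0-9]+"

(* Parse each matched number, double it, and splice it back in. *)
let doubled =
  Re2.replace_exn re "3 apples and 10 pies" ~f:(fun m ->
    let n = Int.of_string (Re2.Match.get_exn m ~sub:(`Index 0)) in
    Int.to_string (2 * n))

let () = assert (String.equal doubled "6 apples and 20 pies")</code></pre></figure>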
<h2 id="miscellaneous-info">Miscellaneous info</h2>
<h3 id="escaping-strings-for-regular-expressions">Escaping strings for regular expressions</h3>
<p>Properly escaping regular expressions can sometimes be tricky, especially if you want to avoid illegal backslash escapes in your OCaml string literals.</p>
<p><code class="language-plaintext highlighter-rouge">Re2</code> provides a function <code class="language-plaintext highlighter-rouge">escape</code> that escapes its input in such a way that if you create a regex from the resulting escaped string, it would match the original string. Here’s how it works.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="nn">Re2</span><span class="p">.</span><span class="n">escape</span> <span class="s2">"Apple. (Pie)!!"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"Apple</span><span class="se">\\</span><span class="s2">.</span><span class="se">\\</span><span class="s2"> </span><span class="se">\\</span><span class="s2">(Pie</span><span class="se">\\</span><span class="s2">)</span><span class="se">\\</span><span class="s2">!</span><span class="se">\\</span><span class="s2">!"</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span>
<span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="o">@@</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">escape</span> <span class="s2">"Apple. (Pie)!!"</span><span class="p">)</span>
<span class="s2">"Apple. (Pie)!!"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span></code></pre></figure>
<p>Depending on how many special characters are in the string you use to build the regex, escaping can be pretty noisy! In these cases, <code class="language-plaintext highlighter-rouge">escape</code> is especially useful.</p>
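<p>To see what goes wrong without escaping, compare a pattern containing <code class="language-plaintext highlighter-rouge">+</code> before and after <code class="language-plaintext highlighter-rouge">escape</code>. This is a small sketch assuming <code class="language-plaintext highlighter-rouge">open Core</code>; the example text is made up.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let text = "what is 1+1?"

(* Unescaped, + is a repetition operator, so the pattern needs at least
   two consecutive 1 characters, which the text does not contain. *)
let () = assert (not (Re2.matches (Re2.create_exn "1+1") text))

(* Escaped, every character of the pattern is matched literally. *)
let () = assert (Re2.matches (Re2.create_exn (Re2.escape "1+1")) text)</code></pre></figure>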
<h3 id="infix-matching-operator">Infix matching operator</h3>
<p>If you’re feeling nostalgic for Perl, feel free to use the <code class="language-plaintext highlighter-rouge">=~</code> infix operator!</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"ab"</span><span class="p">;;</span>
<span class="nn">Re2</span><span class="p">.</span><span class="nn">Infix</span><span class="p">.(</span><span class="s2">"abc"</span> <span class="o">=~</span> <span class="n">re</span><span class="p">);;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span>
<span class="c">(* Let's get crazy and open the module! *)</span>
<span class="k">open</span> <span class="nn">Re2</span><span class="p">.</span><span class="nc">Infix</span><span class="p">;;</span>
<span class="s2">"abc"</span> <span class="o">=~</span> <span class="n">re</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span></code></pre></figure>
<h3 id="precompiling-your-regular-expressions">“Precompiling” your regular expressions</h3>
<p>Unless you have a good reason not to, you will probably want to create your regular expression outside of the function that will be using it.</p>
<p>To see why, let’s check out this little benchmark program that compares two functions. The first one reuses a regex that is created outside of the function, whereas the second one creates a new regex each time the function is called.</p>
<p><em>Note: This benchmark program uses Jane Street’s <a href="https://github.com/janestreet/core_bench">core_bench</a> micro-benchmarking library.</em></p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">open</span><span class="o">!</span> <span class="nc">Core</span>
<span class="k">open</span><span class="o">!</span> <span class="nc">Core_bench</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span>
<span class="k">let</span> <span class="n">find</span> <span class="n">re</span> <span class="n">s</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="n">re</span> <span class="n">s</span>
<span class="k">let</span> <span class="n">find'</span> <span class="n">s</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span><span class="p">)</span> <span class="n">s</span>
<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
<span class="nn">Command</span><span class="p">.</span><span class="n">run</span>
<span class="p">(</span><span class="nn">Bench</span><span class="p">.</span><span class="n">make_command</span>
<span class="p">[</span>
<span class="nn">Bench</span><span class="p">.</span><span class="nn">Test</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"outside"</span> <span class="p">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-></span>
<span class="n">find</span> <span class="n">re</span> <span class="s2">"abcabcabc"</span><span class="p">);</span>
<span class="nn">Bench</span><span class="p">.</span><span class="nn">Test</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"inside"</span> <span class="p">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-></span>
<span class="n">find'</span> <span class="s2">"abcabcabc"</span><span class="p">);</span>
<span class="p">])</span></code></pre></figure>
<table>
<thead>
<tr>
<th>Name</th>
<th style="text-align: right">Time/Run</th>
<th style="text-align: right">mWd/Run</th>
<th style="text-align: right">Percentage</th>
</tr>
</thead>
<tbody>
<tr>
<td>outside</td>
<td style="text-align: right">272.60 ns</td>
<td style="text-align: right">2.00 w</td>
<td style="text-align: right">3.74%</td>
</tr>
<tr>
<td>inside</td>
<td style="text-align: right">7_281.55 ns</td>
<td style="text-align: right">91.00 w</td>
<td style="text-align: right">100.00%</td>
</tr>
</tbody>
</table>
<p>As you can see, reusing a regex rather than creating a new one each time a function is called makes a big difference in this benchmark. Keep in mind that this is a micro-benchmark, and that this difference may not be that important to the run time of your program as a whole. That said, if you had the slow version of the above function in a hot loop, it could really be wasting a lot of CPU cycles.</p>
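<p>If you would rather not expose the regex as a separate top-level value, one common pattern is to build it once and capture it in a closure. The sketch below assumes <code class="language-plaintext highlighter-rouge">open Core</code>; <code class="language-plaintext highlighter-rouge">is_chunk</code> and its pattern are hypothetical, borrowed from the add/subtract example earlier.</p>
<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

(* The regex is created once, when is_chunk is defined, and is then
   reused by the returned closure on every call. *)
let is_chunk =
  let re = Re2.create_exn "^[0-9]+[AS]$" in
  fun s -> Re2.matches re s

let () =
  assert (is_chunk "50A");
  assert (not (is_chunk "50X"))</code></pre></figure>
<p>Note that writing <code class="language-plaintext highlighter-rouge">let is_chunk s = ...</code> with the <code class="language-plaintext highlighter-rouge">create_exn</code> call inside the body would bring back the slow behavior, because the body runs on every call.</p>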
<h2 id="wrap-up">Wrap up</h2>
<p>Hopefully this overview helps you get started with using <code class="language-plaintext highlighter-rouge">re2</code>!</p>
<p>To get more info about using <code class="language-plaintext highlighter-rouge">re2</code>, check out the <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html">API docs</a>. Additionally, the <code class="language-plaintext highlighter-rouge">re2</code> <a href="https://github.com/janestreet/re2/tree/master/src">source code</a> is quite readable. I encourage you to take a look at how the functions are defined–it may help clear up any additional questions you have!</p>
<p>Ryan Moore. In this tutorial, we will talk about re2, an OCaml library providing bindings to RE2, Google’s regular expression library.</p>
<p>Styling plots in base R graphics to match ggplot2 classic theme. 2021-05-09T00:00:00+00:00. https://www.tenderisthebyte.com/blog/2021/05/09/pretty-plots-with-base-r-grahpics</p>
<p><a href="https://ggplot2.tidyverse.org/">ggplot2</a> is an R package for creating graphics in a declarative way and is based on <a href="https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html">The Grammar of Graphics</a>. If you have never used ggplot2, it’s a nice library for making publication ready figures with much less hassle than the base R graphics.</p>
<p>Something I think is pretty fun is to try and recreate ggplot2 style figures using base R graphics. Sometimes, I look at the actual plotting code in the ggplot2 package, but I think it is more fun to just make a figure with ggplot and then try and get a reasonable match with base R. Doing so, you really get an appreciation of the convenience of the ggplot2 package.</p>
<p>With that, let’s try and recreate a figure using the “classic” ggplot2 theme: <a href="https://ggplot2.tidyverse.org/reference/ggtheme.html">theme_classic</a>.</p>
<p><em>If you want to learn more about base R graphics, check out my <a href="https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/">deep dive into rotating axis labels in base R plots</a>.</em></p>
<div class="post-toc">
<h4 class="post-toc--header" id="contents">Contents</h4>
<ul>
<li><a href="#set-up">Set up</a></li>
<li><a href="#fixing-the-axes">Fixing the axes</a></li>
<li><a href="#fixing-the-points">Fixing the points</a></li>
<li><a href="#adding-a-legend">Adding a legend</a></li>
<li><a href="#some-final-touchups">Some final touchups</a></li>
<li><a href="#wrap-up">Wrap up</a></li>
</ul>
</div>
<h2 id="set-up">Set up</h2>
<p>First, here is some “set up” code where we create some data and set some variables to hold colors and stuff like that.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">k_purple</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"#875692"</span><span class="w">
</span><span class="n">k_orange</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"#F38400"</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">12341234</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">group</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s2">"B"</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">))</span></code></pre></figure>
<p>With that out of the way, let’s see the ggplot2 classic theme that we will try and match. Here it is:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="p">),</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">k_orange</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_classic</span><span class="p">()</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/ggplot_theme_classic.png" alt="ggplot2 classic theme" />
<figcaption>ggplot2 classic theme</figcaption>
</figure>
<p>And finally, let’s compare that with the simplest possible base R graphics plot. I’m sure that you’re familiar with what it looks like!</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base.png" alt="Base R graphics plot" />
<figcaption>Base R graphics plot</figcaption>
</figure>
<p>You can see that that plot is pretty far from where we want to be. Let’s go step-by-step getting closer to the <code class="language-plaintext highlighter-rouge">theme_classic</code> ggplot version each time.</p>
<h2 id="fixing-the-axes">Fixing the axes</h2>
<p>The first thing you see is that box around the plot that isn’t present in the ggplot version. Let’s remove it by passing <code class="language-plaintext highlighter-rouge">bty = "n"</code> to the plot function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w">
</span><span class="c1">## Remove the box around the plot.</span><span class="w">
</span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_no_box.png" alt="Removing the box" />
<figcaption>Removing the box</figcaption>
</figure>
<p>You can see that the axes are a bit different than in the ggplot2 version. Here, the final ticks are the edges of the axis. The ggplot version has a nice, solid line for the x and y axes that connects at the bottom left corner. You can get that effect with the <code class="language-plaintext highlighter-rouge">bty</code> option to <code class="language-plaintext highlighter-rouge">plot</code>.</p>
<p>The <code class="language-plaintext highlighter-rouge">bty</code> parameter is an interesting one. Here is the section from the <code class="language-plaintext highlighter-rouge">par</code> help file describing <code class="language-plaintext highlighter-rouge">bty</code>:</p>
<blockquote>
<p>‘bty’ A character string which determined the type of box which
is drawn about plots. If ‘bty’ is one of ‘”o”’ (the
default), ‘”l”’, ‘”7”’, ‘”c”’, ‘”u”’, or ‘”]”’ the resulting
box resembles the corresponding upper case letter. A value
of ‘”n”’ suppresses the box.</p>
</blockquote>
<p>Those options look pretty weird, but they each show the “shape” of what the box will look like: <code class="language-plaintext highlighter-rouge">l</code> will look like an upper case <code class="language-plaintext highlighter-rouge">L</code>, with lines on the left and the bottom only. The <code class="language-plaintext highlighter-rouge">7</code> will look sort of like a <code class="language-plaintext highlighter-rouge">7</code>, with box lines on the top and right only. Since we want lines on the left and bottom, we can use <code class="language-plaintext highlighter-rouge">bty = "l"</code>. I will also remove the default x and y axes (using <code class="language-plaintext highlighter-rouge">xaxt</code> and <code class="language-plaintext highlighter-rouge">yaxt</code>) since we don’t want them to overlap the lines of the box. We can also increase the line width a bit with <code class="language-plaintext highlighter-rouge">lwd</code>.</p>
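<p>To see those shapes for yourself, here is a minimal sketch that draws the same toy data once per box type. The names <code class="language-plaintext highlighter-rouge">demo_x</code>, <code class="language-plaintext highlighter-rouge">demo_y</code>, and <code class="language-plaintext highlighter-rouge">box_types</code> are made up for this illustration and are not part of the set up code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">## Toy data, kept separate from the x and y used in the rest of the post.
demo_x <- 1:10
demo_y <- demo_x^2
## Every box type listed in the par help file, plus "n" for no box.
box_types <- c("o", "l", "7", "c", "u", "]", "n")
## Draw a 2 x 4 grid of panels. par() returns the old settings.
old_par <- par(mfrow = c(2, 4))
for (box_type in box_types) {
  plot(demo_x, demo_y, bty = box_type,
       main = paste0('bty = "', box_type, '"'))
}
## Put the layout back the way it was.
par(old_par)</code></pre></figure>
<p>If your plotting window is small, R may complain that the figure margins are too large. Making the window bigger should take care of that.</p>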
<p>While you can control the box inside the plot function, I will use the <code class="language-plaintext highlighter-rouge">box</code> function instead. That way, it will be a little easier to customize. To do that, we will keep the <code class="language-plaintext highlighter-rouge">bty = "n"</code> in the plot function to turn the box off, then add it back in after with <code class="language-plaintext highlighter-rouge">box</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w">
</span><span class="c1">## Remove box.</span><span class="w">
</span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="c1">## Remove default x and y axis.</span><span class="w">
</span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w">
</span><span class="c1">## Add 'box' lines to the bottom and left of the plot.</span><span class="w">
</span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w">
</span><span class="c1">## Increase width of box lines.</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb.png" alt="With nice axis lines" />
<figcaption>With nice axis lines</figcaption>
</figure>
<h3 id="add-the-tick-marks">Add the tick marks</h3>
<p>Now let’s add the axis ticks and labels back in. For that we use the
<code class="language-plaintext highlighter-rouge">axis</code> function. We will change a few of the options at once, so I
will go over them first. The <code class="language-plaintext highlighter-rouge">side</code> parameter controls where the axis
is drawn with respect to the plot: 1 = below, 2 = to the left, 3 =
above, and 4 = to the right. Remember how the axis is drawn with the
line by default? We turn that off with <code class="language-plaintext highlighter-rouge">lwd = 0</code> and then we set the
tick width to match the box width using <code class="language-plaintext highlighter-rouge">lwd.ticks = 2</code>. Finally, we
want to <a href="https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/">rotate the tick labels of the y
axis</a>
so they are perpendicular to the axis. Here it is.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">## X Axis</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="c1">## Don't draw the axis line.</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="c1">## Match the width of the tick marks to the box lines.</span><span class="w">
</span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">## Y axis</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="c1">## Rotate tick labels perpendicular to the axis.</span><span class="w">
</span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes.png" alt="With ticks and tick labels" />
<figcaption>With ticks and tick labels</figcaption>
</figure>
<h3 id="adjusting-ticks-and-tick-labels">Adjusting ticks and tick labels</h3>
<p>Next, we are going to make some adjustments to the length of the tick
marks and to where the axis labels are drawn. This can get a little
weird, and there are multiple ways to do it. Let’s go through some of
the options we will need.</p>
<p>The <code class="language-plaintext highlighter-rouge">mgp</code> parameter is <a href="https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/#the-las-and-mgp-parameters">a little
tricky</a>.
It is a three-part vector that controls the margin for the axis title
(<code class="language-plaintext highlighter-rouge">mgp[1]</code>), axis (tick) labels (<code class="language-plaintext highlighter-rouge">mgp[2]</code>), and the axis line
(<code class="language-plaintext highlighter-rouge">mgp[3]</code>). The default value is <code class="language-plaintext highlighter-rouge">c(3, 1, 0)</code>. The units are in
lines of text.</p>
<p>We want to move the axis titles and tick labels closer to the axis, so
we need to reduce the first two numbers in that vector. This time,
I’m going to use the
<a href="https://stat.ethz.ch/R-manual/R-patched/library/graphics/html/par.html">par</a>
function to set the parameter since I want it to apply to all the
plotting functions.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">## Move the axis label and tick labels closer to the axis line.</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted.png" alt="Adjusting the axis labels" />
<figcaption>Adjusting the axis labels</figcaption>
</figure>
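<p>One thing to keep in mind: settings made through <code class="language-plaintext highlighter-rouge">par</code> stick around for every later plot on the same device. If that is not what you want, here is a minimal sketch of saving and restoring the old value. The name <code class="language-plaintext highlighter-rouge">old_par</code> is just one I picked for the example.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">## par() invisibly returns the old values of whatever it changes.
old_par <- par(mgp = c(1.5, 0.4, 0))
plot(1:10, xlab = "Index", ylab = "Value")
## Hand the saved list back to par() to undo the change.
par(old_par)</code></pre></figure>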
<h3 id="adjusting-tick-label-length">Adjusting tick mark length</h3>
<p>Now that we’ve tweaked the label positions, we need to adjust the
tick length. We do that with the <code class="language-plaintext highlighter-rouge">tcl</code> parameter to the <code class="language-plaintext highlighter-rouge">par</code> function,
which specifies tick mark length as a fraction of the height of a line
of text. So <code class="language-plaintext highlighter-rouge">tcl = 1</code> will make the tick marks as long as a line
of text is tall, and <code class="language-plaintext highlighter-rouge">tcl = -0.5</code> (the default) will make them 1/2 the line
height. The sign of the argument controls the direction the ticks
point: positive values point into the chart, negative values point
away. Let’s make them half as long as they are now with <code class="language-plaintext highlighter-rouge">tcl = -0.25</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="c1">## Reduce the size of the tick marks.</span><span class="w">
</span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_2.png" alt="Shrinking the tick marks" />
<figcaption>Shrinking the tick marks</figcaption>
</figure>
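<p>If you want to see what the sign of <code class="language-plaintext highlighter-rouge">tcl</code> does, here is a small sketch that puts outward and inward ticks side by side. It uses throwaway data (<code class="language-plaintext highlighter-rouge">1:10</code>) and a made-up variable name, <code class="language-plaintext highlighter-rouge">old_par</code>, rather than anything from the set up code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">## Save the old settings while switching to two side-by-side panels
## with short ticks that point away from the plot.
old_par <- par(mfrow = c(1, 2), tcl = -0.25)
plot(1:10, main = "tcl = -0.25")
## Same length, but now the ticks point into the plot.
par(tcl = 0.25)
plot(1:10, main = "tcl = 0.25")
## Restore the previous layout and tick length.
par(old_par)</code></pre></figure>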
<h3 id="moving-the-x-labels-a-bit-more">Moving the x labels a bit more</h3>
<p>That’s pretty good, but to my eye, the x axis tick labels are still a
bit too far away from the ticks. To fix that, we can pass the <code class="language-plaintext highlighter-rouge">mgp</code>
param directly to the <code class="language-plaintext highlighter-rouge">axis</code> function that we use to draw the axis.
It will override the global value set by the <code class="language-plaintext highlighter-rouge">par</code> function, but only
for the function we pass it to. The 2nd element in the <code class="language-plaintext highlighter-rouge">mgp</code> vector
controls the axis tick labels, so we will reduce it from <code class="language-plaintext highlighter-rouge">0.4</code> to
<code class="language-plaintext highlighter-rouge">0.2</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="c1">## Reducing the 2nd element from 0.4 to 0.2 moves the x axis</span><span class="w">
</span><span class="c1">## tick labels closer to the axis line.</span><span class="w">
</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3.png" alt="Moving the x axis labels in" />
<figcaption>Moving the x axis labels in</figcaption>
</figure>
<p>That’s better!</p>
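<p>As a quick sanity check that passing <code class="language-plaintext highlighter-rouge">mgp</code> to <code class="language-plaintext highlighter-rouge">axis</code> really is local to that one call, here is a minimal sketch with throwaway data. It asks <code class="language-plaintext highlighter-rouge">par</code> for the global value after the axis has been drawn.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">## Set the global value, just like in the plots above.
par(mgp = c(1.5, 0.4, 0))
plot(1:10, xaxt = "n")
## The mgp passed here only affects this one call to axis.
axis(side = 1, mgp = c(1.5, 0.2, 0))
## The global setting still holds the value we gave to par.
par("mgp")</code></pre></figure>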
<h2 id="fixing-the-points">Fixing the points</h2>
<p>Now that the axes are looking pretty good, let’s move on to the
points. To change the type of point that is plotted, you use the
<code class="language-plaintext highlighter-rouge">pch</code> parameter. I like <code class="language-plaintext highlighter-rouge">pch = 20</code> for little dots, but <code class="language-plaintext highlighter-rouge">pch = 16</code>
could work as well. We can also change the size of the points with
the <code class="language-plaintext highlighter-rouge">cex</code> parameter. The default size is <code class="language-plaintext highlighter-rouge">cex = 1</code> and increasing the
number will increase the size (e.g., <code class="language-plaintext highlighter-rouge">cex = 2</code> will be twice as big).
We will use <code class="language-plaintext highlighter-rouge">cex = 1.4</code> to approximate the size of the ggplot points.</p>
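<p>If you want to compare the two point types and a few sizes before picking, here is a rough sketch that draws them in a little grid. The <code class="language-plaintext highlighter-rouge">sizes</code> vector is something I made up for the comparison. Only <code class="language-plaintext highlighter-rouge">1.4</code> is used in the actual figure.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">## One row per point type, one column per size.
sizes <- c(1, 1.4, 2)
## Start with an empty plot and turn off the default axes.
plot(NULL, xlim = c(0.5, 3.5), ylim = c(0.5, 2.5),
     xaxt = "n", yaxt = "n", xlab = "cex", ylab = "pch")
points(1:3, rep(1, 3), pch = 16, cex = sizes)
points(1:3, rep(2, 3), pch = 20, cex = sizes)
## Label the columns with the sizes and the rows with the point types.
axis(side = 1, at = 1:3, labels = sizes)
axis(side = 2, at = 1:2, labels = c(16, 20), las = 2)</code></pre></figure>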
<p>Finally, to change the color, we will use the <code class="language-plaintext highlighter-rouge">col</code> parameter to the
<code class="language-plaintext highlighter-rouge">plot</code> function. For this parameter, we can pass in a vector the same
length as the <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> data vectors to specify the color for each
data point. The <code class="language-plaintext highlighter-rouge">group</code> vector we created at the beginning gives two
groups, <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, for the points. We want to associate each group
with a color, so we make a named color vector like this: <code class="language-plaintext highlighter-rouge">colors <- c(A
= k_purple, B = k_orange)</code>. Then we use the <code class="language-plaintext highlighter-rouge">group</code> vector to index
the <code class="language-plaintext highlighter-rouge">colors</code> vector like this: <code class="language-plaintext highlighter-rouge">colors[group]</code>.</p>
<p>If that doesn’t make sense, here is a simple example.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tastiness</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">Cookie</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"yummy"</span><span class="p">,</span><span class="w"> </span><span class="n">Cake</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"yucky"</span><span class="p">)</span><span class="w">
</span><span class="n">desserts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Cookie"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cake"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cookie"</span><span class="p">)</span><span class="w">
</span><span class="n">tastiness</span><span class="p">[</span><span class="n">desserts</span><span class="p">]</span><span class="w">
</span><span class="c1">## Cookie Cake Cookie</span><span class="w">
</span><span class="c1">## "yummy" "yucky" "yummy"</span></code></pre></figure>
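<p>One caveat, in case <code class="language-plaintext highlighter-rouge">group</code> is stored as a factor rather than a character vector (how it was created isn’t shown in this section, so that is an assumption): R indexes with a factor’s integer codes, not its labels. The sketch below uses made-up color names and a hypothetical <code class="language-plaintext highlighter-rouge">group_fct</code> vector, and it deliberately lists <code class="language-plaintext highlighter-rouge">B</code> first so the difference is visible.</p>

```r
## Hypothetical colors, listed in a different order than the factor levels.
colors <- c(B = "orange", A = "purple")
group_fct <- factor(c("A", "B", "A"))

## A factor indexes by its integer codes (1, 2, 1), i.e., by position.
by_code <- unname(colors[group_fct])

## Converting to character indexes by name, which is what we want.
by_name <- unname(colors[as.character(group_fct)])

by_code
## [1] "orange" "purple" "orange"
by_name
## [1] "purple" "orange" "purple"
```

<p>When the names are in the same order as the factor levels, as with <code class="language-plaintext highlighter-rouge">c(A = ..., B = ...)</code>, both forms give the same result.</p>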
<p>Let’s use that idea for our plot.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">## Associate group A with purple and group B with orange.</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="c1">## Draw filled in dots instead of open circles.</span><span class="w">
</span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="c1">## Increase the size of the dots.</span><span class="w">
</span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
</span><span class="c1">## Set the color of each dot based on its group.</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points.png" alt="Fixing the points" />
<figcaption>Fixing the points</figcaption>
</figure>
<p>Now that’s looking pretty good!</p>
<h2 id="adding-a-legend">Adding a legend</h2>
<p>It’s time now to put in the legend. We will start with something
basic and then adjust it to match the legend in the ggplot2 figure.</p>
<p>To make a legend in base R graphics, use the
<a href="https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/legend.html">legend</a>
function. We set the legend location with the <code class="language-plaintext highlighter-rouge">x</code> parameter. To put
the legend on the right side of the plot, we use <code class="language-plaintext highlighter-rouge">x = "right"</code>. We
use the <code class="language-plaintext highlighter-rouge">legend</code> parameter to tell the legend the names of the
groups: <code class="language-plaintext highlighter-rouge">legend = c("A", "B")</code>. Now for the points, we specify the
style we used (<code class="language-plaintext highlighter-rouge">pch = 20</code>) and the different colors for each group
(<code class="language-plaintext highlighter-rouge">col = colors</code>). Here it is.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">## Add a legend to the right side of the plot.</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w">
</span><span class="c1">## Specify the group names.</span><span class="w">
</span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w">
</span><span class="c1">## And the colors of the dots.</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w">
</span><span class="c1">## And the shape of the dots.</span><span class="w">
</span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend.png" alt="Adding a legend" />
<figcaption>Adding a legend</figcaption>
</figure>
<p>That’s not bad, but not quite the look we are going for. We need to
add a legend title, remove the box around the legend, and tweak the
size and spacing of the elements.</p>
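<p>As an optional variation on the basic legend call, the labels can be taken from <code class="language-plaintext highlighter-rouge">names(colors)</code> instead of being typed out, so the labels and colors can’t drift out of sync. This is only a sketch: the hex values are placeholders for the <code class="language-plaintext highlighter-rouge">k_purple</code> and <code class="language-plaintext highlighter-rouge">k_orange</code> colors, the two plotted points are dummy data, and <code class="language-plaintext highlighter-rouge">pdf(NULL)</code> is used just so it runs without opening a window.</p>

```r
## Placeholder hex colors standing in for the post's k_purple and k_orange.
k_purple <- "#762A83"
k_orange <- "#E08214"
colors <- c(A = k_purple, B = k_orange)

## Null device so the sketch runs without opening a window or writing a file.
pdf(NULL)
plot(1:2, 1:2, pch = 20, col = colors)

## Take the labels from the names so they always line up with the colors.
## legend() invisibly returns the box and text coordinates it used.
info <- legend(x = "right", legend = names(colors), col = colors, pch = 20)
invisible(dev.off())
```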
<h3 id="adjusting-the-legend">Adjusting the legend</h3>
<p>To set the title, we can do this: <code class="language-plaintext highlighter-rouge">title = "group"</code>. Removing the box
is done as in the main plot by setting <code class="language-plaintext highlighter-rouge">bty = "n"</code>. I think it looks
nice when the points in the legend match the size of the
points in the plot. To do that, we can use the <code class="language-plaintext highlighter-rouge">pt.cex</code> option. We
set it to <code class="language-plaintext highlighter-rouge">1.4</code> to match the <code class="language-plaintext highlighter-rouge">cex</code> parameter that we passed in to
<code class="language-plaintext highlighter-rouge">plot</code> like so: <code class="language-plaintext highlighter-rouge">pt.cex = 1.4</code>.</p>
<p>It’s a subtle thing, but the legend elements in
the ggplot figure are a bit more spaced out than in the base graphics
figure. To adjust that, we use the <code class="language-plaintext highlighter-rouge">x.intersp</code> and <code class="language-plaintext highlighter-rouge">y.intersp</code>
parameters, which adjust the character spacing in the horizontal and
vertical directions (the units are line heights again). The default
is <code class="language-plaintext highlighter-rouge">1</code> for both. Since we want a little more space, we increase them
to something like this: <code class="language-plaintext highlighter-rouge">x.intersp = 1.4, y.intersp = 1.15</code>.</p>
<p>Here’s what those changes look like.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="c1">## Add a title</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w">
</span><span class="c1">## Remove the box around the legend.</span><span class="w">
</span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="c1">## Increase the size of the points to match those in the plot.</span><span class="w">
</span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
</span><span class="c1">## Increase the spacing in the x and y directions.</span><span class="w">
</span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_2.png" alt="Adjusting the legend" />
<figcaption>Adjusting the legend</figcaption>
</figure>
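<p>If you want to check the effect of the spacing parameters numerically rather than by eye, <code class="language-plaintext highlighter-rouge">legend</code> invisibly returns the size and position of the box it lays out, and with <code class="language-plaintext highlighter-rouge">plot = FALSE</code> it computes them without drawing anything. Here is a small sketch; the helper name <code class="language-plaintext highlighter-rouge">legend_height</code> and the dummy data are my own.</p>

```r
## Null device and dummy data, just so there is an open plot to measure against.
pdf(NULL)
plot(1:10, 1:10)

## Hypothetical helper: height (in user coordinates) of the legend box
## for a given y.intersp, computed without drawing the legend.
legend_height <- function(y.intersp) {
  legend(x = "right", legend = c("A", "B"), pch = 20, title = "group",
         y.intersp = y.intersp, plot = FALSE)$rect$h
}

h_default <- legend_height(1)
h_spaced <- legend_height(1.15)
invisible(dev.off())

## The extra vertical spacing makes the legend box taller.
h_spaced > h_default
## [1] TRUE
```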
<h3 id="move-the-legend-outside-of-the-plotting-area">Move the legend outside of the plotting area</h3>
<p>Next we need to adjust the position of the whole legend. Do you see
how it is actually inside the plot on the base graphics version, but
outside of it in the ggplot version? We can move the legend around
with the <code class="language-plaintext highlighter-rouge">inset</code> parameter. The default value is <code class="language-plaintext highlighter-rouge">0</code>, and the value is
read as a fraction of the plot region. If you pass in
a positive number, the legend moves into the plot, whereas if you pass
in a negative number, the legend moves out away from the plot. We will
pass in <code class="language-plaintext highlighter-rouge">inset = -0.1</code> to bump it to the right to get it outside of
the plot.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
</span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">,</span><span class="w">
</span><span class="c1">## Nudge the legend to the right.</span><span class="w">
</span><span class="n">inset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.1</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_3.png" alt="Moving the legend outside of the plot area" />
<figcaption>Moving the legend outside of the plot area</figcaption>
</figure>
<p>Whoops! Do you see how the legend went right off the chart? To make
sure the legend doesn’t get clipped, we need to pass in <code class="language-plaintext highlighter-rouge">xpd = TRUE</code>
to the <code class="language-plaintext highlighter-rouge">legend</code> function. The <code class="language-plaintext highlighter-rouge">xpd</code> parameter controls where plot
elements are clipped: with <code class="language-plaintext highlighter-rouge">xpd = FALSE</code>, drawing is clipped to the plot
region, while <code class="language-plaintext highlighter-rouge">xpd = TRUE</code> clips to the larger figure region. Here is
how you move the legend outside of the plotting area using the <code class="language-plaintext highlighter-rouge">xpd</code>
parameter.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
</span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">,</span><span class="w">
</span><span class="n">inset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.1</span><span class="p">,</span><span class="w">
</span><span class="c1">## Ensure the legend is not clipped even though it is</span><span class="w">
</span><span class="c1">## outside of the plotting area.</span><span class="w">
</span><span class="n">xpd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_4.png" alt="Do not clip the legend outside the plotting area" />
<figcaption>Do not clip the legend outside the plotting area</figcaption>
</figure>
<h2 id="some-final-touchups">Some final touchups</h2>
<p>We’re almost there now! Just a few more adjustments to make: tick
label size, plot element colors, and plot margins.</p>
<h3 id="tick-label-size">Tick label size</h3>
<p>Right now, the tick labels are a lot bigger than they are in the
ggplot version. To fix it, we can pass in <code class="language-plaintext highlighter-rouge">cex.axis = 0.85</code> to the
<code class="language-plaintext highlighter-rouge">par</code> function. That way, it will be applied to both the x and y axes
and we don’t have to specify it twice. Remember that the normal <code class="language-plaintext highlighter-rouge">cex</code>
is 1 so any number less than that will be smaller than the default.</p>
<h3 id="plot-element-colors">Plot element colors</h3>
<p>Setting the plot element colors can be a little tricky because we have
to specify them in a few different places. I should mention that
there are quite a few ways to control the colors in plots made with
base R graphics. It can get a little confusing as to what parameter
is controlling what aspect of the plot, especially when you consider
that the options passed in to the <code class="language-plaintext highlighter-rouge">par</code> function control lots of
different plot elements. For example, <code class="language-plaintext highlighter-rouge">par(fg = "green")</code> will turn a
lot of plot elements green, but not all of them. Rather than do that,
we will adjust colors mostly inside the functions that they will
affect.</p>
<p>We will first set a variable to hold the color and use that:
<code class="language-plaintext highlighter-rouge">base_color <- "#444444"</code>. The axes label colors are controlled with
the <code class="language-plaintext highlighter-rouge">col.lab</code> parameter to the <code class="language-plaintext highlighter-rouge">par</code> function (<code class="language-plaintext highlighter-rouge">col.lab =
base_color</code>). To change the axis (box) line color, we pass in <code class="language-plaintext highlighter-rouge">col =
base_color</code> to the <code class="language-plaintext highlighter-rouge">box</code> function. For the axis ticks and tick
labels, we use the <code class="language-plaintext highlighter-rouge">col</code> and <code class="language-plaintext highlighter-rouge">col.axis</code> parameters to the <code class="language-plaintext highlighter-rouge">axis</code> function
to control the tick color and the tick label color, respectively
(e.g., <code class="language-plaintext highlighter-rouge">col = base_color, col.axis = base_color</code>). To change the
legend text color, we pass <code class="language-plaintext highlighter-rouge">text.col = base_color</code> directly to the <code class="language-plaintext highlighter-rouge">legend</code>
function.</p>
<h3 id="plot-margins">Plot margins</h3>
<p>As with many other things in base R graphics, there are a couple of ways
to control the plot margins. We are going to be using the <code class="language-plaintext highlighter-rouge">mar</code>
parameter to the <code class="language-plaintext highlighter-rouge">par</code> function. To do so, you pass in a four-element
vector specifying the size of the margin (in lines of text) of the
bottom, left, top, and right sides of the plot, in that order. The
default is <code class="language-plaintext highlighter-rouge">c(5, 4, 4, 2) + 0.1</code>. We will shrink all the margins
except for the right, which we need to increase to make enough room
for our legend: <code class="language-plaintext highlighter-rouge">mar = c(3, 3, 1, 3.5)</code>. Just to make it clear, that
is three lines of text for the bottom and left margins, one line of
text for the top margin, and 3.5 lines of text for the right margin.</p>
<h3 id="all-the-final-adjustments">All the final adjustments</h3>
<p>Let’s put all the final touchups in now.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">base_color</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"#444444"</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">,</span><span class="w">
</span><span class="c1">## Shrink the tick labels.</span><span class="w">
</span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.85</span><span class="p">,</span><span class="w">
</span><span class="c1">## Set the axis label color</span><span class="w">
</span><span class="n">col.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">,</span><span class="w">
</span><span class="c1">## Adjust the margin: bottom, left, top, right</span><span class="w">
</span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3.5</span><span class="p">))</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="c1">## Set the box color.</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="c1">## Set the axis tick and tick label colors.</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">,</span><span class="w"> </span><span class="n">col.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="c1">## Set the axis tick and tick label colors.</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">,</span><span class="w"> </span><span class="n">col.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
</span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">,</span><span class="w">
</span><span class="n">inset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="n">xpd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
</span><span class="c1">## Set the legend text color.</span><span class="w">
</span><span class="n">text.col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span></code></pre></figure>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_5.png" alt="Applying the final adjustments" />
<figcaption>Applying the final adjustments</figcaption>
</figure>
<p>Looking good! So that’s almost the same as the original “classic”
theme ggplot2 plot. One thing you may notice is that there is a
different number of tick marks on the axes. You can actually adjust
this in base R graphics, but it can be a little bit tricky, so we will
leave that for another post.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>Whew, that was a lot of stuff! As we saw, copying the style of the
ggplot <code class="language-plaintext highlighter-rouge">theme_classic</code> requires quite a lot of fiddling around with a
lot of different parameters to a few different functions. If I was
making a plot for a publication or blog post or something, I would
definitely just use ggplot, but it can be fun and educational to try
to reproduce something that an awesome library does with base R
graphics. Hopefully, you enjoyed the process and learned a lot about
base R graphics!</p>
<p>Ryan Moore</p>
<p>ggplot2 is an R package for creating graphics in a declarative way and is based on The Grammar of Graphics. If you have never used ggplot2, it’s a nice library for making publication ready figures with much less hassle than the base R graphics.</p>
<h2>Computational lab notebooks using git and git-annex</h2>
<p>2021-05-07, <a href="https://www.tenderisthebyte.com/blog/2021/05/07/computational-lab-notebooks">https://www.tenderisthebyte.com/blog/2021/05/07/computational-lab-notebooks</a></p>
<p><em>Disclaimer: if you need a lab notebook for legal records, copyright,
patent rights, or anything like that, then this article probably isn’t
for you. This post is <strong>not</strong> providing any recommendations for those
cases.</em></p>
<div class="post-toc">
<h4 class="post-toc--header" id="contents">Contents</h4>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#provenance-tracking">Provenance tracking</a></li>
<li><a href="#a-git-based-lab-notebook">A git-based lab notebook</a></li>
<li><a href="#a-cli-app-to-help-manage-git-based-lab-notebooks">A CLI app to help manage git-based lab notebooks</a></li>
<li><a href="#a-super-simple-example">A super simple example</a></li>
</ul>
</div>
<p><em><strong>Too long; didn’t read</strong>: Check out the <a href="https://github.com/mooreryan/computational_lab_notebooks">cln
app</a> on
GitHub. It helps you manage a computational lab notebook using git
and git-annex. You can find the documentation
<a href="https://mooreryan.github.io/computational_lab_notebooks/">here</a>.</em></p>
<h2 id="overview">Overview</h2>
<p>Keeping a good lab notebook for your computational work is important,
but it can be challenging. A quick Google search will show you lots
of examples of people talking about it:</p>
<ul>
<li><a href="https://doi.org/10.1371/journal.pcbi.1004385">Ten Simple Rules for a Computational Biologist’s Laboratory Notebook</a></li>
<li><a href="https://ori.hhs.gov/education/products/wsu/data.html">Notebook & Data Management</a></li>
<li><a href="https://scicomp.stackexchange.com/questions/35854/lab-notebooks-for-computational-science">Lab Notebooks for Computational Science</a></li>
<li><a href="https://blog.addgene.org/how-to-keep-a-lab-notebook-for-bioinformatic-analyses">How to Keep a Lab Notebook for Bioinformatic Analyses</a></li>
<li><a href="https://www.reddit.com/r/labrats/comments/66dlgq/keeping_a_good_lab_notebook_in_a_computational/">Keeping a good lab notebook in a computational field?</a></li>
</ul>
<p>I have tried a lot of different methods, but they all more or less
boil down to a workflow sort of like this:</p>
<ul>
<li>Write down some summary of what I’m about to do and why.</li>
<li>Run some commands, programs, or bash stuff.</li>
<li>Copy what I did into a document (e.g., <a href="https://www.markdownguide.org/getting-started/">Markdown
notes</a> files,
<a href="https://tiddlywiki.com/">TiddlyWiki</a>, etc.).</li>
<li>Write a bit more about what happened.</li>
<li>Rinse and repeat.</li>
</ul>
<p>Then, depending on my needs, I may clean up the analysis and put it
into an <a href="https://rmarkdown.rstudio.com/">R Markdown</a> or <a href="https://jupyter.org/">Jupyter</a>
notebook so it will be easier to reproduce later.</p>
<p>One problem with this general workflow is that it requires tracking a
lot of things manually (e.g., copying and pasting). Whenever you do a
lot of that, you will inevitably forget to paste a command into your
notebook. You might make a mistake or typo when running a command
and, rather than noting it down in your notebook, just rerun it.
Pretty soon your lab notebook is out of sync with the commands that
you have actually run. Another issue is that you may be running a
bunch of commands quickly, just testing some ideas out. When doing
this, you end up needing to track a ton of things in an ad-hoc manner
leading to a messy lab notebook that you need to come back to later
and reorganize.</p>
<p>In other words, you need to manually track a lot of information, and
it can be quite a challenge to keep track of everything!</p>
<h2 id="provenance-tracking">Provenance tracking</h2>
<p>One approach to dealing with this problem is by tracking the
provenance of files. An example of this is how <a href="https://doi.org/10.1038/s41587-019-0209-9">QIIME
2</a> includes metadata in
its artifact files (<code class="language-plaintext highlighter-rouge">.qza</code> files) to <a href="https://docs.qiime2.org/2021.2/concepts/#data-files-qiime-2-artifacts">track things that were done in
an
analysis</a>.</p>
<p>I like the idea of provenance tracking, but even if you do use QIIME,
there are a lot of things you need to do outside of QIIME that will
need tracking. While not quite the same, this sort of provenance
tracking reminds me a bit of using git or other version control
software. <a href="https://git-scm.com/">Git</a> is software used to track
changes in a set of files, and is often used by programmers during
software development.</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//computational_lab_notebooks/git_logo.png" alt="git -- a distributed version control system" />
<figcaption>git -- a distributed version control system</figcaption>
</figure>
<p><em>Note: If you have never used git before, the <a href="https://git-scm.com/doc">official
docs</a> have a lot of info that may be of use
to you. I have also written a <a href="https://mooreryan.github.io/computational_lab_notebooks/git/">small git
tutorial</a>
that you may find useful!</em></p>
<p>While I had used git while working on software, I had never tried
using it to manage a computational lab notebook. One reason is that
it <a href="https://stackoverflow.com/questions/3055506/git-is-very-very-slow-when-tracking-large-binary-files">doesn’t handle large files
well</a>.
For computational work, whether bioinformatics or data science, you
will be dealing with a lot of large files. Sequencing files easily
get over 10 GB in size, so using git alone is going to be problematic.
However, there are extensions to git like <a href="https://git-lfs.github.com/">Git Large File
Storage</a> and
<a href="https://git-annex.branchable.com/">git-annex</a> that help to address
this problem. (Essentially, git-annex tracks <a href="https://en.wikipedia.org/wiki/Symbolic_link">symbolic
links</a> in the git
repository rather than the file itself. There is a lot more to it
than that, so check out the <a href="https://git-annex.branchable.com/walkthrough/">git-annex
walkthrough</a> if you
want to know more.)</p>
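<p>To get a feel for the symlink idea, here is a small shell sketch. It
only illustrates “track a link instead of the file itself”: the
<code class="language-plaintext highlighter-rouge">stash_behind_symlink</code> function and the <code class="language-plaintext highlighter-rouge">store</code> directory are
made up for this example, and this is not how git-annex is actually
implemented.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Hypothetical helper (illustration only, not git-annex itself):
# move FILE into STORE_DIR and leave a symbolic link in its place.
stash_behind_symlink() {
  local file="$1" store_dir="$2" abs_store
  mkdir -p "$store_dir"
  # Resolve to an absolute path so the link works from any directory.
  abs_store="$(cd "$store_dir" || exit 1; pwd)"
  mv "$file" "$abs_store/"
  ln -s "$abs_store/$(basename "$file")" "$file"
}</code></pre></figure>
<p>After running <code class="language-plaintext highlighter-rouge">stash_behind_symlink reads.fastq store</code>, the name
<code class="language-plaintext highlighter-rouge">reads.fastq</code> is just a tiny link that points at the real file in
<code class="language-plaintext highlighter-rouge">store</code>, so a repository only has to record the link.</p>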
<h2 id="a-git-based-lab-notebook">A git-based lab notebook</h2>
<p><em>Note: I’m not the first one to think of using git to help manage a
computational lab notebook. In fact, you can find some interesting
discussion on whether version control is even useful for lab notebooks
<a href="http://ivory.idyll.org/blog/is-version-control-an-electronic-lab-notebook.html">here</a>,
<a href="https://kbroman.org/blog/2013/08/20/electronic-lab-notebook/">here</a>,
and
<a href="https://yossadh.github.io/posts/2018/12/lab-notebook-part-2/">here</a>.</em></p>
<p>Using git and git-annex, I figured that I could get a pretty decent
workflow going for my computational lab notebook. After playing
around with it for a while (and seeing that git-annex was a good
solution to git’s large file problem), I settled into a pretty
familiar workflow:</p>
<ul>
<li>Run a program, script, whatever.</li>
<li>Track any new files or changes with git.</li>
<li>Commit the changes.</li>
<li>Repeat.</li>
</ul>
<p>One key difference from my “typical” workflow is that instead of
putting the commands that I ran and their explanations into some
external document like a markdown file, I would put all the
information into the commit message. That way, all the info about how
and why I did something would be tracked in the git repository along
with the actual files and changes.</p>
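<p>As a rough sketch of that workflow, assuming you are already inside a
git repository, a tiny shell helper might look like the following. The
<code class="language-plaintext highlighter-rouge">commit_with_command</code> name is hypothetical (it is not part of git), and
you still have to paste in the command text yourself.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Hypothetical helper, not part of git: stage everything and store the
# reason plus the (hand-copied) command text in the commit message.
commit_with_command() {
  local why="$1" cmd_text="$2"
  git add --all
  # Each -m becomes its own paragraph in the commit message.
  git commit --quiet -m "$why" -m "Command: $cmd_text"
}</code></pre></figure>
<p>For example, after running <code class="language-plaintext highlighter-rouge">gzip reads.fastq</code> by hand, you could call
<code class="language-plaintext highlighter-rouge">commit_with_command 'Compress the reads' 'gzip reads.fastq'</code>, and
<code class="language-plaintext highlighter-rouge">git log</code> would then show both the reason and the command. For large
files, you would probably want to run <code class="language-plaintext highlighter-rouge">git annex add</code> on them first.</p>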
<p>That works pretty well, but you still run into the issue of having to
remember what you ran, copy and paste it correctly into the commit
message, blah blah blah. In other words, it’s still a bit of a pain.
While you get the added benefits of git logs and history tracking, you
have to do a lot of repetitive, annoying stuff to get things to work.
So, of course, I wrote a little program to help automate some of the
tedious stuff!</p>
<h2 id="a-cli-app-to-help-manage-git-based-lab-notebooks">A CLI app to help manage git-based lab notebooks</h2>
<p>While working with the above workflow, in addition to QIIME’s
provenance tracking, I was also reminded of <a href="https://en.wikipedia.org/wiki/Schema_migration">database
migrations</a>.
Basically, the way they work is that you write some script that says
how the database is supposed to change (e.g., add column <code class="language-plaintext highlighter-rouge">first_name</code>
to table <code class="language-plaintext highlighter-rouge">authors</code>), and then <a href="https://guides.rubyonrails.org/active_record_migrations.html#running-migrations">some migration
tool</a>
handles actually making any changes to the database. In theory, this
gives you a simpler way to track how your database has changed over
time–you can just follow the paper trail of your migration files.</p>
<p>The app I wrote works in a similar way, except that instead of making
incremental changes to a database, you are formalizing the changes you
make to the repository itself. The app is called <code class="language-plaintext highlighter-rouge">cln</code> (it stands for
“computational lab notebooks”…clever, I know!). You can find it on
<a href="https://github.com/mooreryan/computational_lab_notebooks">GitHub</a>.
There is also some pretty extensive <a href="https://mooreryan.github.io/computational_lab_notebooks/">documentation
available</a>
to help you get started using the software.</p>
<p>While I suggest you check out the docs for a more detailed explanation
of its installation and usage, I want to show a quick little
example to give you a flavor of how the <code class="language-plaintext highlighter-rouge">cln</code> program can help you
manage your git-based lab notebook.</p>
<h2 id="a-super-simple-example">A super simple example</h2>
<p>The <code class="language-plaintext highlighter-rouge">cln</code> command provides a couple of subcommands to help you manage
your lab notebook with git and git-annex. (For more details on
individual subcommands, see
<a href="https://mooreryan.github.io/computational_lab_notebooks/usage/">here</a>).</p>
<h3 id="create-a-project">Create a project</h3>
<p>To start, you make a new project.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ mkdir -p ~/projects/cln_example && cd ~/projects/cln_example
$ cln init 'Example Project'
$ tree -a -I .git
.
├── .actions
│ ├── completed
│ ├── failed
│ ├── ignored
│ └── pending
└── README.md</code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">cln init</code> command initializes a new project, creates a git
repository, and generates some scaffolding for actions and git commit
templates.</p>
<h3 id="prepare-an-action">Prepare an action</h3>
<p>Next, you prepare an action to run. (Again, this is just a silly
example…for a more in-depth tutorial, see the
<a href="https://mooreryan.github.io/computational_lab_notebooks/">documentation</a>).</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln prepare 'printf "I like apple pie\n" > msg.txt'</code></pre></figure>
<p>In this case the action is just running a <code class="language-plaintext highlighter-rouge">printf</code> command and saving
its output to a file. Of course, you can prepare an action
containing anything that you would normally run at the command line.
For example, you could prepare a crazy action like this:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln prepare "$(cat <<'EOF'
cut -f2 seq_information.seq_id_eco.tsv \
| cut -d';' -f5 \
| ruby -e 'h = Hash.new 0; \
ARGF.each {|l| h[l.chomp] += 1 }; \
h.sort_by {|_, count| count }.reverse. \
each {|eco, count| puts "#{eco}\t#{count}" }' \
| column -t \
> seq_eco_counts.txt
EOF
)"</code></pre></figure>
<p><em>Note: That’s actually an action I prepared and ran in a real project.
Previously, I would have put that little ad-hoc
<a href="https://www.ruby-lang.org/en/">Ruby</a> script into a file and run it in
a way that is easier to track, but with <code class="language-plaintext highlighter-rouge">cln</code> to help me manage
things, everything is tracked automatically.</em></p>
<p>The <code class="language-plaintext highlighter-rouge">cln prepare</code> command creates an action file and a <a href="https://git-scm.com/docs/git-commit/2.10.5#Documentation/git-commit.txt---templateltfilegt">git commit
template</a>.
The action file is simply a bash script with the command you want to
run, but having it there in your repository as a standalone script
helps you see what is going on if you’re running a complicated command
or when you come back to the project a couple of months later.</p>
<h3 id="run-the-pending-action">Run the pending action</h3>
<p>Next, you can check that everything is okay by doing a <a href="https://en.wikipedia.org/wiki/Dry_run_(testing)">dry
run</a>. It prints some information
to the terminal to let you know what’s going on and
suggests what steps to take next. <em>Note: I’ve edited the terminal
output a bit.</em></p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln run -dry-run
~~~
~~~
~~~ Hi! I just previewed an action for you.
~~~
~~~ I plan to run this action file:
~~~ '.actions/pending/action__ ...'
~~~
~~~ It's contents are:
~~~
printf "I like apple pie\n" > msg.txt
~~~
~~~ If that looks good, you can run the action:
~~~ $ cln run
~~~
~~~</code></pre></figure>
<p>If it looks good, you can go ahead and run the action.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln run
~~~
~~~
~~~ Hi! I just ran an action for you.
~~~
~~~ * The pending action was '.actions/pending/action__REDACTED.sh'.
~~~ * The completed action is '.actions/completed/action__REDACTED.sh'.
~~~
~~~ Now, there are a couple of things you should do.
~~~
~~~ * Check which files have changed:
~~~ $ git status
~~~ * Add actions and commit templates:
~~~ $ git add .actions
~~~ * Unless they are small, add other new files with git annex:
~~~ $ git annex add blah blah blah...
~~~ * After adding files, commit changes using the template:
~~~ $ git commit -t '.actions/completed/action__REDACTED.gc_template.txt'
~~~
~~~ After that you are good to go!
~~~
~~~ * You can now check the logs with git log,
~~~ or use a GUI like gitk to view the history.
~~~
~~~</code></pre></figure>
<p>See how the <code class="language-plaintext highlighter-rouge">cln run</code> command gives you hints on what to do next? I
tried to make all the <code class="language-plaintext highlighter-rouge">cln</code> commands spit out helpful info like that
to the terminal.</p>
<h3 id="track-and-commit-changes">Track and commit changes</h3>
<p>Now, you will be able to see any files that were created or changed as
the result of running the action using <code class="language-plaintext highlighter-rouge">git status</code>. Depending on the
size(s) of the file(s) that were created or changed, you can add them
to the <a href="https://mooreryan.github.io/computational_lab_notebooks/git/#what-is-an-index">git
index</a>
with either <code class="language-plaintext highlighter-rouge">git add</code> or <code class="language-plaintext highlighter-rouge">git-annex add</code>. Finally, you commit the
changes using the git commit template that was made when you prepared
the action.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ git commit -t '.actions/completed/action__REDACTED.gc_template.txt'</code></pre></figure>
<p>The template file will look something like this:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">PUT COMMIT MSG HERE.
== Details ==
PUT DETAILS HERE.
== Command(s) ==
printf "I like apple pie\n" > msg.txt
== Action file ==
action__REDACTED.sh</code></pre></figure>
<p>When you run the <code class="language-plaintext highlighter-rouge">git commit</code> command, a text editor will pop up with
the contents of the git template file ready for you to fill out. This
is nice because you can avoid manually copying in the commands you
ran. For such a small example it’s not really a big deal, but if
you’re running some complicated bioinformatics software with a lot of
flags and options, it’s pretty convenient!</p>
<h3 id="browse-the-git-history">Browse the git history</h3>
<p>After editing the message and saving the commit, you can browse
through your nicely organized repository history and see something
like this:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ git log
commit ebf738 (HEAD -> master)
Author: Ryan Moore <moorer@udel.edu>
Date: Mon Apr 5 18:44:54 2021 -0400
Created the msg.txt file
== Details ==
I needed to create a file that describes something that I like. I
used the `printf` rather than `echo` because it is more portable.
(See https://stackoverflow.com/a/11530298 for a discussion of this on
stack overflow).
== Command(s) ==
/usr/bin/printf "I like apple pie\n" > msg.txt
== Action file ==
action__460986084__2021-04-05_18:02:37.sh
commit 1a2e90
Author: Ryan Moore <moorer@udel.edu>
Date: Mon Apr 5 17:43:50 2021 -0400
Initial commit</code></pre></figure>
<p>Notice how I put a short, descriptive commit message for the first
line, and then added in any additional details that I think I will
need later. The <code class="language-plaintext highlighter-rouge">== Details ==</code> section would hold all the extra
stuff I would put in my lab notebook anyway, but it is really
convenient to have it right there in the git log.</p>
<p>Having the command that you ran, the details about that command, and
the changes that command effected in your repository opens up some
really powerful ways to track your analyses.</p>
<h3 id="get-individual-file-provenance-info">Get individual file provenance info</h3>
<p>For example, you can use the <code class="language-plaintext highlighter-rouge">git</code> CLI (e.g., <code class="language-plaintext highlighter-rouge">git whatchanged</code> or
<code class="language-plaintext highlighter-rouge">git log</code>) or a GUI like <a href="https://git-scm.com/docs/gitk/">gitk</a> to get
detailed info about the provenance of any files in the repository.
You could run something like this to see all the history for the
<code class="language-plaintext highlighter-rouge">msg.txt</code> file.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">$ git log --stat --follow -p -- msg.txt
commit ... (HEAD -> master)
Author: Ryan Moore <moorer@udel.edu>
Date: ....
Created the msg.txt file
== Details ==
I needed to create a file that describes something that I like. I
used the `printf` rather than `echo` because it is more portable.
(See https://stackoverflow.com/a/11530298 for a discussion of this on
stack overflow).
== Command(s) ==
printf "I like apple pie\n" > msg.txt
== Action file ==
action__467354640__.....sh
---
msg.txt | 1 +
1 file changed, 1 insertion(+)
diff --git a/msg.txt b/msg.txt
new file mode 100644
index 0000000..135d9d6
--- /dev/null
+++ b/msg.txt
@@ -0,0 +1 @@
+I like apple pie</code></pre></figure>
<p>As you can imagine, having output like that for all the files in your
project folder as well as the chronological logs is a very powerful
way to track your analyses and makes managing a computational lab
notebook so much easier.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>Managing a computational lab notebook is tricky. I have found that
using git and git-annex can be a good way to keep all the info you
need right in the same directory as all your data files, scripts, and
analysis code. To help you more easily manage lab notebooks using git
and git-annex, I created a command line app called <code class="language-plaintext highlighter-rouge">cln</code>. You can
find the code on
<a href="https://github.com/mooreryan/computational_lab_notebooks">GitHub</a>.
Installation instructions and usage examples can be found in the
<a href="https://mooreryan.github.io/computational_lab_notebooks/">documentation</a>.</p>
Ryan Moore
Disclaimer: if you need a lab notebook for legal records, copyright, patent rights, or anything like that, then this article probably isn’t for you. This post is not providing any recommendations for those cases.
divnet-rs: A Rust implementation for DivNet
2021-01-18T00:00:00+00:00
https://www.tenderisthebyte.com/blog/2021/01/18/divnet-rust-implementation
<ul>
<li><em>Update: divnet-rs now has a way to parallelize the bootstrapping procedure. With enough RAM, it can give <a href="https://github.com/mooreryan/divnet-rs/issues/4#issuecomment-955592257">approximately linear decreases</a> in run time with increasing number of cores. Consider it an <a href="https://github.com/mooreryan/divnet-rs/blob/main/CHANGELOG.md#unreleased">experimental</a> feature for now.</em></li>
<li><em>Update 2022-04-06: On the <a href="https://doi.org/10.3389/fmicb.2015.01470">Lee dataset</a>, <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.3.0">v0.3.0</a> is around 3x faster and uses ~60% of the memory as compared to <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.2.1">v0.2.1</a>.</em></li>
<li><em>Update 2021-01-22: <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.2.1">v0.2.1</a> further decreases the run time and required memory</em></li>
<li><em>Update 2021-01-19: As of <code class="language-plaintext highlighter-rouge">divnet-rs</code> <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.2.0">v0.2.0</a>, users can manually set the random seed. Also, <code class="language-plaintext highlighter-rouge">v0.2.0</code> uses only about 2/3 the memory that was used by <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.1.1">v0.1.1</a>.</em></li>
</ul>
<h2 id="background">Background</h2>
<p>One reason for doing microbiome sequencing is to learn about the microbial diversity of the ecosystems of interest. Estimating the diversity of microbial communities is hard. Essentially every step of a sample-to-sequence pipeline <a href="https://doi.org/10.7554/eLife.46923">introduces biases</a> into your analyses, meaning the community composition you observe is likely quite different from the true community composition. Further, <a href="https://doi.org/10.3389/fmicb.2017.02224">microbiome datasets are compositional</a>, and must be treated with <a href="https://doi.org/10.1093/gigascience/giz107">statistical and computational methods</a> designed to handle such data.</p>
<p>Most communities are incredibly complex, so you will nearly always have issues with undersampling – there are just too many microbes to sequence them all, so you have to work with samples. Even though you cannot practically observe all the taxa in your environment, you still need to estimate the diversity of that environment. So why don’t we just plug our data into one of the common diversity indices borrowed from macroecology, like Shannon or Simpson, and be done with it? You will actually see this a lot in the literature: the observed relative abundances from the samples (sometimes after <a href="https://doi.org/10.1371/journal.pcbi.1003531">rarefying</a> the data first) are plugged into standard “plug-in” diversity formulas.</p>
<p>There are a couple of problems with this. Undersampling is problematic because alpha diversity metrics are <a href="https://doi.org/10.3389/fmicb.2019.02407">heavily biased when there are unobserved taxa</a>. The random sampling variation combined with biases introduced in the sample-to-sequence pipeline means your observed relative abundances probably don’t faithfully represent the true community you want to study. Additionally, many commonly used methods for generating confidence intervals assume that taxa are independent (i.e., if taxon A is present in a community, it doesn’t provide any information about whether taxon B is there too), an assumption that is hard to justify when microbes co-occur in ecological networks.</p>
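<p>To make the “plug-in” idea concrete, here is a minimal Python sketch of the plug-in Shannon and Simpson indices. The function names and the toy counts are made up for illustration, and I’m assuming the Simpson index is defined as the sum of squared relative abundances (some texts report one minus that value instead).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import math

def relative_abundances(counts):
    """Convert raw taxon counts to observed relative abundances (zeros dropped)."""
    total = sum(counts)
    return [c / total for c in counts if c != 0]

def shannon_plugin(counts):
    """Plug-in Shannon index: -sum(p_i * ln(p_i)) over the observed taxa."""
    return -sum(p * math.log(p) for p in relative_abundances(counts))

def simpson_plugin(counts):
    """Plug-in Simpson index: sum(p_i ** 2) over the observed taxa."""
    return sum(p * p for p in relative_abundances(counts))

# Toy sample: three observed taxa plus one taxon that was missed (count 0).
counts = [50, 30, 20, 0]
print(shannon_plugin(counts))
print(simpson_plugin(counts))</code></pre></figure>
<p>Notice that the unobserved taxon contributes nothing to either index: the plug-in estimate treats a taxon you failed to sample exactly like a taxon that isn’t there, which is one way to see where the undersampling bias comes from.</p>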
<h3 id="what-is-divnet">What is DivNet?</h3>
<p>So how are you supposed to measure the diversity of microbial communities then? One method that is designed to address a lot of these problems is <a href="https://github.com/adw96/DivNet">DivNet</a>, an R package for estimating diversity when taxa in the community occur in an ecological network (i.e., a pattern of microbial co-occurrence). DivNet leverages info from multiple samples and can estimate the relative abundance of a taxon in communities where it was unobserved. It also gives accurate estimates of variance in the measured diversity by taking into account sample metadata/covariates.</p>
<p>Probably the most interesting aspect of DivNet is that it allows you to account for ecological networks where taxa positively and negatively co-occur. DivNet estimates diversity using models from <a href="https://en.wikipedia.org/wiki/Compositional_data">compositional data analysis</a> that can handle co-occurrence networks. This is in contrast to most common diversity estimates, which are based on the <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">multinomial model</a>, whose sampling assumptions prohibit ecological networks (i.e., situations in which taxa positively and negatively co-occur). (<em>Note: you may know the multinomial model from your stats courses in modeling the probability of counts for dice rolls or as a generalization of the <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a>.</em>)</p>
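<p>One way to see why the multinomial model can’t represent positive co-occurrence is to simulate it. The Python sketch below is only an illustration (the read depth, taxon probabilities, and helper names are made up, and it has nothing to do with DivNet’s own code): because the reads in a sample have to add up to a fixed total, the counts of any two taxa are always negatively correlated.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import random

def multinomial_draw(n, probs, rng):
    """Draw one multinomial sample: n reads assigned to taxa with the given probabilities."""
    counts = [0] * len(probs)
    for taxon in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[taxon] += 1
    return counts

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)       # fixed seed so the simulation is repeatable
probs = [0.5, 0.3, 0.2]      # made-up true relative abundances
draws = [multinomial_draw(100, probs, rng) for _ in range(2000)]
taxon_a = [d[0] for d in draws]
taxon_b = [d[1] for d in draws]

# Multinomial theory says cov(A, B) = -n * p_a * p_b = -100 * 0.5 * 0.3 = -15,
# so the simulated value should be negative and in that neighborhood.
print(covariance(taxon_a, taxon_b))</code></pre></figure>
<p>No choice of probabilities can make that covariance positive, which is why a model that allows taxa to co-occur has to go beyond the multinomial.</p>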
<p>You can find a lot more information about DivNet, including algorithmic details, validation, comparison to other methods of estimating diversity, and some important details to keep in mind when using DivNet on your data in the <a href="https://doi.org/10.1093/biostatistics/kxaa015">DivNet manuscript</a>.</p>
<h3 id="why-make-divnet-rs">Why make divnet-rs?</h3>
<p>In the <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/vignettes/getting-started.Rmd">getting started tutorial</a>, there is a section called “What does DivNet do that I can’t do already?” (it is worth reading if you haven’t!). So I thought it would be good to answer the question, “What does <code class="language-plaintext highlighter-rouge">divnet-rs</code> do that the R implementation of DivNet can’t do already?” The answer is simple: <code class="language-plaintext highlighter-rouge">divnet-rs</code> gives you the ability to apply the DivNet algorithm to large datasets. Even if you don’t have easy access to high performance computing facilities, you will be able to run <code class="language-plaintext highlighter-rouge">divnet-rs</code> on typically sized SSU rRNA microbiome datasets on your laptop. <code class="language-plaintext highlighter-rouge">divnet-rs</code> is both faster and much more memory efficient than the R implementation. Of course, bioinformatics software is all about tradeoffs and <code class="language-plaintext highlighter-rouge">divnet-rs</code> is no different. <a href="#differences-in-the-implementations">Compared to the R implementation</a>, it’s harder to install, you have to write some R code specifically to get data in and out of <code class="language-plaintext highlighter-rouge">divnet-rs</code>, and not all network and bootstrapping options offered by the R implementation are available in the Rust implementation. That said, I think <code class="language-plaintext highlighter-rouge">divnet-rs</code> still fills a useful niche by allowing researchers to apply the DivNet algorithm to datasets that are currently too large for the R implementation to handle.</p>
<h2 id="comparing-run-time-and-memory-usage">Comparing run time and memory usage</h2>
<h3 id="set-up">Set up</h3>
<p>While developing <code class="language-plaintext highlighter-rouge">divnet-rs</code>, I spent a good amount of time profiling and optimizing the code. Rather than talk about that, I wanted to get a high-level overview of how the performance of the R and Rust implementations compared on a real dataset. The data I used was the <a href="https://doi.org/10.3389/fmicb.2015.01470">Lee dataset</a> that is included with the DivNet R package. It has 1490 <a href="https://doi.org/10.1038/ismej.2017.119">amplicon sequence variants</a> (ASVs), 16 samples, and associated taxonomy and sample info.</p>
<p>So what did I do? First, I took the Lee data and sorted the ASV table in decreasing abundance order. Then I created new datasets from the top 10, 20, 40, 80, 160, 320, 640, and 1280 ASVs. In addition to the full 16 sample datasets, I also created test datasets with only eight samples by randomly picking samples from the ASV table, removing any ASVs that had zero count in the remaining samples, and then taking the top 10, 20, …, 1280 ASVs just like for the 16 sample datasets. I ran everything with the default algorithm tuning in DivNet (6 expectation maximization (EM) iterations (3 burn), 500 Monte-Carlo (MC) iterations (250 burn)) and 2 replicates. I would probably use the “careful” setting (10 EM iterations and 1000 MC iterations) as well as running more replicates if I were actually analyzing data, but this was good enough for this little profiling experiment.</p>
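<p>The subsetting steps above can be sketched in a few lines of Python. This is a rough illustration rather than the script I actually used: the table layout (ASV name mapped to a list of per-sample counts), the helper names, and the toy counts are all assumptions, and I’m taking “abundance” to mean the total count across samples.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def top_n_asvs(asv_table, n):
    """Return the n ASVs with the largest total count across samples.

    asv_table maps an ASV name to its list of per-sample counts.
    """
    ranked = sorted(asv_table, key=lambda asv: sum(asv_table[asv]), reverse=True)
    return {asv: asv_table[asv] for asv in ranked[:n]}

def keep_samples(asv_table, sample_idx):
    """Keep only the chosen sample columns, then drop ASVs left with zero counts."""
    kept = {asv: [counts[i] for i in sample_idx] for asv, counts in asv_table.items()}
    return {asv: counts for asv, counts in kept.items() if sum(counts) != 0}

# Toy table: 4 ASVs by 4 samples (made-up counts).
table = {
    "asv1": [10, 0, 5, 0],
    "asv2": [0, 7, 0, 9],
    "asv3": [1, 1, 1, 1],
    "asv4": [0, 0, 0, 30],
}

subset = keep_samples(table, [0, 2])  # stand-in for a random draw of samples
print(top_n_asvs(table, 2))           # top ASVs using every sample
print(top_n_asvs(subset, 2))          # top ASVs after subsetting the samples</code></pre></figure>
<p>The order of operations matters here: dropping the zero-count ASVs has to happen after the samples are picked and before the top ASVs are taken, otherwise the smaller datasets could end up padded with ASVs that never appear in them.</p>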
<p>This isn’t the most scientific profiling job ever, but it should give you a sense of how the run time and memory scale with the number of taxa and samples for both the R and Rust versions of DivNet. For the timing, I ran each dataset three times, and I used the <code class="language-plaintext highlighter-rouge">time</code> command to get the elapsed time and the max memory used for each run. Since loading all the R dependencies takes a large proportion of the total run time in the smaller DivNet-R runs, I got the elapsed time of just the <code class="language-plaintext highlighter-rouge">divnet</code> function using the <a href="https://cran.r-project.org/web/packages/tictoc/index.html">tictoc</a> R package. I still used <code class="language-plaintext highlighter-rouge">time</code> to get the max memory for these runs though.</p>
<p>One other thing to mention: I ran all of these on a compute cluster. I didn’t think about it until after I had already run everything, but I compiled both <code class="language-plaintext highlighter-rouge">divnet-rs</code> and <code class="language-plaintext highlighter-rouge">OpenBLAS</code> on a different node than the one that I used to actually run the tests. The compute cluster that I used has a bunch of different types of nodes, so the compiled output of both may not be ideal for the node I actually ran the timings on (e.g., different <a href="https://en.wikipedia.org/wiki/SIMD">SIMD instructions</a>, different CPU architectures, etc.). While the timing experiments were running, there were other jobs running on the same node, so that is another thing that may have influenced the results.</p>
<p>For the R tests, I used R v3.6.2 linked against <a href="https://www.openblas.net/">OpenBLAS</a> v0.3.7 and DivNet v0.3.6. I set DivNet to use only 1 core (<code class="language-plaintext highlighter-rouge">ncores = 1</code>) because in all my tests (and on multiple different machines), DivNet is actually slower when using more than one core. For <code class="language-plaintext highlighter-rouge">divnet-rs</code> I used v0.1.1 linked against OpenBLAS v0.3.13. I also forced OpenBLAS to use only 1 core (<code class="language-plaintext highlighter-rouge">OPENBLAS_NUM_THREADS=1</code>) as that is how R was using OpenBLAS. (<em>As an aside, if you don’t have <a href="https://csantill.github.io/RPerformanceWBLAS/">R linking against an optimized BLAS implementation</a>, you should. It will give you a big performance increase.</em>)</p>
<p>Just keep all this stuff in mind while taking a look at these results.</p>
<h3 id="results">Results</h3>
<p>Here are the run time and memory profiling results:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//divnet_rs_intro/timing_425_350.svg" alt="DivNet timing and memory requirements" />
<figcaption>DivNet timing and memory requirements</figcaption>
</figure>
<p>Let’s break down a couple of things. The Rust version is faster and more memory efficient, but that’s not surprising – a Rust program should be faster than an R program, and I spent a good amount of time profiling and optimizing the code. In this test, the Rust version is about 20 times faster than the R version.</p>
<p>The other interesting thing to measure is max memory usage. For the largest dataset that I tested (16 samples, 1280 taxa), the Rust version used ~300 MB of RAM as compared to the ~6000 MB used by the R version. When implementing DivNet in Rust, I spent a good amount of time and effort optimizing the run time, and much less worrying about the memory, so it was nice to see it being relatively frugal with the memory.</p>
<p>As you might expect, the 16 sample datasets took longer and used more memory than the 8 sample datasets, but not twice as much time and memory. There was a weird thing in the 1280 taxa test set in the Rust implementation. The 8 sample set actually took a bit more time (but still used less memory) than the 16 sample set. I thought this was strange, so I ran the 16x1280 and 8x1280 datasets many more times to see if there was some weird random variation in the timings, or if I made some mistake in the testing and mislabeled the datasets or something, but each run gave me roughly the same result as you see here. I’m honestly not sure why this is, but as I mentioned above, these benchmarks aren’t perfect and could be improved.</p>
<h2 id="differences-in-the-implementations">Differences in the implementations</h2>
<p>Before wrapping up, I want to take a little time to highlight some of the more important differences in the R and Rust implementations of DivNet.</p>
<h3 id="estimating-the-network">Estimating the network</h3>
<p>While the original DivNet R code has multiple options for the <code class="language-plaintext highlighter-rouge">network</code> parameter, the only network option in <code class="language-plaintext highlighter-rouge">divnet-rs</code> is “diagonal”. To explain why this is, here is an excerpt from a <a href="https://github.com/adw96/DivNet/issues/32">GitHub issue</a> where <a href="https://github.com/adw96/DivNet/issues/32#issuecomment-521727997">Amy Willis is talking</a> about using DivNet on large datasets:</p>
<blockquote>
<p>I would recommend network=”diagonal” for a dataset of this size. This means you’re allowing overdispersion (compared to a plugin aka multinomial model) but not a network structure. This isn’t just about computational expense – it’s about the reliability of the network estimates. Essentially estimating network structure on 20k variables (taxa) with 50 samples with any kind of reliability is going to be very challenging, and I don’t think that it’s worth doing here. In our simulations we basically found that overdispersion contributes the bulk of the variance to diversity estimation (i.e. overdispersion is more important than network structure), so I don’t think you are going to lose too much anyway.</p>
</blockquote>
<p>Another benefit of the diagonal network is that it is fast: it’s a simple, vectorizable mathematical operation, as compared to the <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/R/MCmat.R#L83">default method</a>, which will need to do either a Cholesky decomposition or a generalized matrix inversion, or to the <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/R/MCmat.R#L114">“stars”</a> method, which does a whole lot more operations.</p>
<p><code class="language-plaintext highlighter-rouge">divnet-rs</code> isn’t a replacement for DivNet. Its focus is on allowing the core algorithm to be applied to datasets that are too large for the R implementation to handle, and so, only the diagonal network is available in <code class="language-plaintext highlighter-rouge">divnet-rs</code>. If your data is small enough that the R implementation can handle it, then I recommend using the original!</p>
<h3 id="bootstrapping">Bootstrapping</h3>
<p>Another difference from the original is that only the parametric bootstrap is available – you can’t do the nonparametric bootstrap. The parametric bootstrap is the default in the R implementation, and, if you check out the <a href="https://doi.org/10.1093/biostatistics/kxaa015">DivNet manuscript</a>, you’ll see that the parametric and nonparametric bootstraps perform similarly.</p>
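<p>If the distinction between the two bootstraps is new to you, here is a generic Python sketch. It is not DivNet’s procedure: I’m assuming a simple normal model and made-up measurements purely to show the difference. The parametric bootstrap simulates new data from a fitted model, while the nonparametric bootstrap resamples the observed data directly.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import random
import statistics

def parametric_bootstrap(data, n_boot, rng):
    """Fit a normal model to data, then simulate new datasets from the fitted model."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    estimates = []
    for _ in range(n_boot):
        simulated = [rng.gauss(mu, sigma) for _ in data]
        estimates.append(statistics.mean(simulated))
    return estimates

def nonparametric_bootstrap(data, n_boot, rng):
    """Resample the observed values with replacement; no model is fitted."""
    estimates = []
    for _ in range(n_boot):
        resampled = rng.choices(data, k=len(data))
        estimates.append(statistics.mean(resampled))
    return estimates

data = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.2, 4.6]  # made-up measurements
rng = random.Random(42)                            # fixed seed for repeatable output

# The spread of the bootstrap estimates approximates the standard error of the mean.
print(statistics.stdev(parametric_bootstrap(data, 500, rng)))
print(statistics.stdev(nonparametric_bootstrap(data, 500, rng)))</code></pre></figure>
<p>With a fixed seed the two functions return the same estimates on every run, and on a toy example like this the two standard errors should come out similar.</p>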
<h3 id="setting-the-random-seed">Setting the random seed</h3>
<p>When this post was first published, <code class="language-plaintext highlighter-rouge">divnet-rs</code> did not allow you to set the seed for the random number generator, which has an impact on reproducibility across runs (as of v0.2.0 you can set it manually; see the updates at the top of this post). While the DivNet R implementation does allow you to set the random seed prior to the run (for example, just use <code class="language-plaintext highlighter-rouge">set.seed(5623472)</code> before running the <code class="language-plaintext highlighter-rouge">divnet</code> function), there is a <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/vignettes/getting-started.Rmd#L64">caveat about setting the random seed when running DivNet on multiple cores</a> that you should be aware of. In practice, if you are getting more variability across runs than desired, you can increase the EM iterations, the MC iterations, and the replicates, and it <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/vignettes/getting-started.Rmd#L64">should take care of things</a>.</p>
<h2 id="wrap-up">Wrap-up</h2>
<p>In this post, I introduced <code class="language-plaintext highlighter-rouge">divnet-rs</code>, a Rust implementation of the <a href="https://github.com/adw96/DivNet">DivNet R package</a>. It is both faster and more memory efficient than the original, allowing you to run much larger data sets even on your laptop, but it has fewer features and isn’t as straightforward to use. As with any bioinformatics software, there are always tradeoffs, so I encourage you to pick the right tool for the job: if you have small enough datasets, stick with the R implementation, but if R keeps crashing on you or DivNet is just too slow for whatever reason, give <a href="https://github.com/mooreryan/divnet-rs">divnet-rs</a> a try.</p>
Ryan Moore
A simple dashboard for COVID-19 case counts
2020-12-30T00:00:00+00:00
https://www.tenderisthebyte.com/blog/2020/12/30/covid-19-dashboard
<p>I made a simple <a href="https://www.tenderisthebyte.com/apps/covid19dashboard">COVID-19 dashboard</a> that lets you compare the confirmed case counts for multiple counties and view both the raw counts and the counts per 100,000 people. It plots the case counts over time for as many counties as you want to compare and lets you download the resulting chart. Here is an example for Delaware’s three counties:</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//covid_19_dashboard/delaware_covid_chart.svg" alt="Confirmed COVID-19 Cases for Delaware Counties" />
<figcaption>Confirmed COVID-19 Cases for Delaware Counties</figcaption>
</figure>
<p>Being a Delaware resident, I like to pretend everyone already knows everything about Delaware, but <em>just in case</em> you don’t, here you go: New Castle county is in the north and has Wilmington (our largest city) and Newark, home of the University of Delaware. Kent county is in the middle and has Dover (the state capital), and Sussex county is in the south with Lewes and all the beaches. It’s interesting to see the differences between New Castle and Kent counties, which look pretty similar to one another, and Sussex county. At some point, I would like to overlay some demographic or socio-economic data on this to look for any trends, but that’s for a different day.</p>
<h2 id="the-data">The data</h2>
<p>The COVID-19 case data is from the <a href="https://github.com/CSSEGISandData/COVID-19">Center for Systems Science and Engineering (CSSE) at Johns Hopkins University</a>. Their data is aggregated from a ton of different sources and I encourage you to check out <a href="https://github.com/CSSEGISandData/COVID-19">their GitHub page</a> for more information about the data. If you’re interested, they have <a href="https://doi.org/10.1016/S1473-3099(20)30120-1">an article</a> in the Lancet talking about the data and <a href="https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6">their dashboard</a>. Of course, their dashboard has a lot more bells and whistles than mine!</p>
<p>For the county level population info, I used data from the <a href="https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/">Atlas of Rural and Small-Town America</a> from the <a href="https://www.ers.usda.gov/">USDA Economic Research Service</a>. It is a really cool and in-depth county level dataset. In addition to the population data, you can find info about jobs, income, veterans and more. They also have a <a href="https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/go-to-the-atlas/">nice interactive map</a> to view everything county-by-county. If you want to download and remix the data yourself, it is all available in CSV and Excel format <a href="https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/download-the-data/">on their site</a>.</p>
<p>One thing to note is that the county level population data is mostly from 2019 estimates. So, while weighting the case counts by the population data gives a nice way to compare COVID-19 cases across counties, just keep in mind that the population estimates are from last year.</p>
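<p>The per-100,000 scaling itself is simple arithmetic. Here is a minimal Python sketch of it; the county names and numbers are made up for illustration and are not taken from the CSSE or USDA data, and the dashboard itself is written in Elm, not Python.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def cases_per_100k(cases, population):
    """Scale a raw case count to cases per 100,000 residents."""
    return cases * 100_000 / population

# Hypothetical counties with made-up numbers, for illustration only.
counties = {
    "County A": {"cases": 2500, "population": 500_000},
    "County B": {"cases": 1200, "population": 150_000},
}

for name, row in counties.items():
    print(name, round(cases_per_100k(row["cases"], row["population"]), 1))</code></pre></figure>
<p>In this toy example, County A has more than twice the raw cases of County B, but once you account for population, County B has the higher rate. That is the kind of comparison the per-100,000 view is meant to make easier.</p>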
<h2 id="the-code">The code</h2>
<p>If you’re interested in the source code for the dashboard, you can find it on my <a href="https://github.com/mooreryan/Covid19Dashboard">GitHub page</a>.</p>
<p>It is an <a href="https://elm-lang.org/">Elm app</a>. I haven’t used Elm much before this project, but it was very easy to get started with. The <a href="https://guide.elm-lang.org/">documentation</a> was awesome and the <a href="https://elmlang.herokuapp.com/">Elm Slack channel</a> is full of helpful people. I think having some experience in <a href="https://www.rust-lang.org/">Rust</a> and <a href="https://clojure.org/">Clojure</a> helped me feel right at home using Elm. Elm seems a bit like a gateway to <a href="https://github.com/alpacaaa/elm-to-purescript-cheatsheet">PureScript</a> or <a href="https://www.reddit.com/r/haskell/comments/6wbzer/elm_as_a_gateway_to_learn_haskell/">Haskell</a>, so I’m thinking of checking those out as well.</p>
<p>The charts are made with <a href="https://vega.github.io/vega-lite/">Vega-Lite</a>, a nice tool for data visualization based on <a href="https://vega.github.io/vega/">Vega</a> and the <a href="https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html">Grammar of Graphics</a>. It’s <a href="https://en.wikipedia.org/wiki/Declarative_programming">declarative</a>, in that you write <a href="https://www.json.org/json-en.html">JSON</a> specifications, Vega-Lite compiles them to Vega, and Vega’s runtime handles rendering the chart. To generate the Vega-Lite specs, I used <a href="https://package.elm-lang.org/packages/gicentre/elm-vegalite/latest/VegaLite">this Elm package</a> in conjunction with Elm <a href="https://guide.elm-lang.org/interop/ports.html">ports</a>.</p>Ryan MooreI made a simple COVID-19 dashboard that lets you compare the confirmed case counts for multiple counties as well as viewing the raw counts and the counts per 100,000 people. It plots the case counts over time for as many counties as you want to compare and lets you download the resulting chart. Here is an example for Delaware’s three counties:Virome Bytes: Microdiversity of Mediterranean Sea Viruses2020-02-29T00:00:00+00:002020-02-29T00:00:00+00:00https://www.tenderisthebyte.com/blog/2020/02/29/virome-bytes-mediterranean-sea-virus-microdiversity<h2 id="virus-microdiversity">Virus microdiversity</h2>
<p>Marine viruses are probably the most well-characterized group of environmental viruses. <a href="https://doi.org/10.1038/nature04160">The oceans were one of the first ecosystems where the abundance and importance of environmental viruses was truly realized</a>, and the relative ease of collecting viruses from seawater (as compared to, say, soils) has helped further their study in this environment. However, even within marine habitats, there’s still a lot that we don’t know about viruses and their ecology.</p>
<p>The microdiversity of viruses is a relatively new area of study in environmental viral ecology. Microdiversity, here, refers to mutation frequencies in genomes within the same population. It accompanies trends like the <a href="https://doi.org/10.1038/ismej.2017.119">shift from OTUs to ASVs</a> in focusing on smaller differences in environmental DNA sequences. In a paper entitled <a href="https://doi.org/10.1128/mSystems.00554-19">Trends of Microdiversity Reveal Depth-Dependent Evolutionary Strategies of Viruses in the Mediterranean</a>, Felipe Coutinho and colleagues use microdiversity to study the selective pressures exerted on viral genomes at different depths in the ocean and Mediterranean Sea.</p>
<p>Coutinho et al. examined four viral shotgun metagenomes (viromes) sampled from the surface, the <a href="https://en.wikipedia.org/wiki/Deep_chlorophyll_maximum">deep chlorophyll maximum</a> (DCM), and the <a href="https://en.wikipedia.org/wiki/Bathyal_zone">bathypelagic</a>. To increase their sample size, the researchers supplemented their own samples with viromes from the <a href="https://oceans.taraexpeditions.org/en/m/about-tara/les-expeditions/tara-oceans/"><em>Tara</em> Oceans expedition</a> and <a href="http://aco-ssds.soest.hawaii.edu/ALOHA/">Station ALOHA</a>, which were also sampled over multiple depths. Microdiversity was measured using pN/pS ratios, which are similar to dN/dS ratios and are calculated as the ratio of the number of nonsynonymous polymorphisms per nonsynonymous site to the number of synonymous polymorphisms per synonymous site.</p>
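<p>Written out as a formula, the ratio described above might look like the following. The symbols are illustrative labels rather than notation taken from the paper: \(P_N\) and \(P_S\) are the counts of nonsynonymous and synonymous polymorphisms observed in a gene, and \(N\) and \(S\) are the numbers of nonsynonymous and synonymous sites in that gene.</p>

```latex
\[
\frac{p_N}{p_S} = \frac{P_N / N}{P_S / S}
\]
```

<p>As with dN/dS, a ratio below 1 is usually read as a sign of purifying selection, and a ratio above 1 as a sign of positive selection.</p>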
<h2 id="different-depths-different-selective-pressures">Different depths, different selective pressures</h2>
<p>The authors concluded that marine viruses at different depths show signs of being under different primary selection pressures.</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//mediterranean_virus_microdiversity/microdiversity_cartoon.jpg" alt="The authors' model of the observed patterns of microdiversity" />
<figcaption>The authors' model of the observed patterns of microdiversity</figcaption>
</figure>
<p>In the deep ocean, <a href="https://doi.org/10.1126/sciadv.1602565">where cells and viruses are found in lower numbers</a>, viral metabolism proteins are under the greatest selection pressure. This is presumably to help increase traits such as burst size that would maximize the number of viral progeny produced, thereby increasing the likelihood that one of those phages encounters a suitable host.</p>
<p>In the DCM, viruses accumulate mutations in genes used for host recognition, so that they can expand their host range to compete with other phages. This is necessary because while phage populations in the DCM are large, this study found them to be highly clonal (low diversity). Having lots of copies of the same phage would presumably make competition for hosts intense and encourage host switching.</p>
<p>Viruses from the surface samples had, on average, the greatest number of mutations, but the lowest rates of microdiversity. The high rate of mutation was attributed to high levels of UV radiation in surface waters. The low rate of microdiversity may be due to the combination of relatively high viral counts and intermediate diversity. This would result in lower rates of competition for host cells and less need to increase traits like burst size, which may be more important in low cell count environments.</p>
<p>Overall, this is an interesting study that used environmental gradients to examine specific factors driving viral ecology and evolution in the natural environment.</p>
<p class="gray"><em>Citation: Coutinho, FH. et al. Trends of Microdiversity Reveal Depth-Dependent Evolutionary Strategies of Viruses in the Mediterranean. mSystems 4 (6) e00554-19 (2019). <a href="https://doi.org/10.1128/mSystems.00554-19">doi: 10.1128/mSystems.00554-19</a>.</em></p>Amelia HarrisionVirus microdiversityBeginning Bioinformatics: What’s a terminal? What’s the command line?2019-12-15T00:00:00+00:002019-12-15T00:00:00+00:00https://www.tenderisthebyte.com/blog/2019/12/15/beginning-bioinformatics-command-line-terminal<p>Installing and running typical bioinformatics programs requires a lot of background knowledge. For beginners, terms like “command line,” “terminal,” “changing directories,” and “archive file” might be unfamiliar. Even instructions to type <code class="language-plaintext highlighter-rouge">make</code> can be confusing. There is a lot of prerequisite knowledge needed to get started with installing and using bioinformatics software.</p>
<p>So, let’s start with the basics: terminals and the command line.</p>
<h2 id="graphical-vs-command-line-interfaces">Graphical vs. command line interfaces</h2>
<p>You’re probably reading this blog in a web-browser. Whether it’s on a phone or on a laptop, your web-browser is a <a href="https://en.wikipedia.org/wiki/Graphical_user_interface">graphical user interface</a> (GUI). We interact with GUIs by clicking around with the mouse, or if we’re using a mobile device, by tapping and swiping on the screen. Most of the programs we use on our phones and computers are GUIs. Finder on a Mac and Windows Explorer (or File Explorer) on a PC are GUIs that let you browse and manage files on your computer. Chrome, Safari, and Internet Explorer are GUIs for browsing the web. So a GUI is a program with a graphical user interface, and GUIs make up the majority of the programs you probably use on a daily basis.</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//command_line_terminal/firefox-gui.png" alt="Firefox is a GUI for web browsing" />
<figcaption>Firefox is a GUI for web browsing</figcaption>
</figure>
<p>Compare this to programs with so-called <a href="https://en.wikipedia.org/wiki/Command-line_interface">command-line interfaces</a> (CLI). Rather than pointing and clicking, you interact with these programs by typing things at the command line, generally through a <a href="https://askubuntu.com/questions/38162/what-is-a-terminal-and-how-do-i-open-and-use-it">terminal</a>. One example of a program with a command-line interface is <a href="https://en.wikipedia.org/wiki/Find_(Unix)">find</a>, which <em>find</em>s files based on some user-specified criteria. Most bioinformatics programs don’t have graphical user interfaces. If you want to learn to do bioinformatics, you’re almost certainly going to have to get comfortable with the command line.</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//command_line_terminal/find-cli.png" alt="The find program's CLI" />
<figcaption>The find program's CLI</figcaption>
</figure>
<p>Some programs have both a graphical and a command line interface, like <a href="https://manual.cytoscape.org/en/3.5.0/Programmatic_Access_to_Cytoscape_Features_Scripting.html">Cytoscape</a>, a program for visualizing networks. Why have both? Well, there are some tasks that are easier to accomplish using a graphical user interface and some that are easier with a command line interface. For example, if you need to explore your network–color it, change the size of nodes and edges, make it look nice and pretty–you’re probably going to want to use the Cytoscape GUI for that. If you have an algorithm or process that you want to apply to hundreds of networks, then you’re definitely going to want to use the command line interface instead.</p>
<h2 id="the-terminal-and-the-command-line">The terminal and the command line</h2>
<p>A terminal is a text-based interface to your computer. Depending on who you’re talking to, you might hear the terminal called a couple of different things. The console, the shell, the command prompt–whatever they call it, people are generally talking about the same thing: a place where you enter commands and interact with command-line programs. Of course, all of these terms have <a href="https://askubuntu.com/questions/506510/what-is-the-difference-between-terminal-console-shell-and-command-line">more precise definitions</a> (just ask a <a href="https://en.wikipedia.org/wiki/System_administrator">systems admin</a>!). For now though, let’s just agree to call it the terminal and not worry too much about it.</p>
<p>As I mentioned earlier, you control a program with a command line interface by typing commands into a terminal. Here is an example of how you might use the find command:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">find <span class="nb">.</span> <span class="nt">-name</span> <span class="s1">'*.txt'</span></code></pre></figure>
<p>Let’s talk about this just a bit. The first thing there is the word <code class="language-plaintext highlighter-rouge">find</code>. <code class="language-plaintext highlighter-rouge">find</code> is the name of the command we’re running. <em>(If you’re reading program documentation or a blog and you see a word in a font that looks <code class="language-plaintext highlighter-rouge">like this</code>, then it generally means it’s either a command, something you’re typing at the terminal, or some snippet of code.)</em> Next are the <a href="https://www.computerhope.com/jargon/a/argument.htm">arguments</a> that we pass to the <code class="language-plaintext highlighter-rouge">find</code> command/program. Arguments let us modify the behavior of a command or program. In this case, the <code class="language-plaintext highlighter-rouge">.</code> tells <code class="language-plaintext highlighter-rouge">find</code> to look in the current directory, and the <code class="language-plaintext highlighter-rouge">-name '*.txt'</code> bit tells <code class="language-plaintext highlighter-rouge">find</code> to look for files that end with <code class="language-plaintext highlighter-rouge">.txt</code>. Don’t worry too much if that doesn’t make sense right now. We’ll get into the details of actually running command line programs in a different post. For now, just know that command line programs are those that you control by typing commands and arguments into the terminal.</p>
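<p>If you’d like to see those two arguments in action without touching any of your real files, here is a small sketch you could paste into a terminal. The directory and file names (<code class="language-plaintext highlighter-rouge">notes</code>, <code class="language-plaintext highlighter-rouge">readme.txt</code>, and so on) are made up for the example, and it assumes a Mac or Linux style terminal.</p>

```shell
# Make a throwaway directory so the example doesn't touch real files.
# (All of the file and directory names below are invented for this demo.)
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Create three empty files: two end in .txt, one does not.
mkdir notes
touch readme.txt notes/todo.txt data.csv

# "." means "start looking in the current directory", and
# -name '*.txt' keeps only the names that end in .txt.
# Piping through sort just makes the order of the results predictable.
found="$(find . -name '*.txt' | sort)"
echo "$found"
```

<p>The two <code class="language-plaintext highlighter-rouge">.txt</code> files should be listed, including the one inside the <code class="language-plaintext highlighter-rouge">notes</code> folder, while <code class="language-plaintext highlighter-rouge">data.csv</code> is left out.</p>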
<p>Let me just mention one more thing. If you’re reading program documentation or tutorials about the command line, you might see commands that look like they start with a <code class="language-plaintext highlighter-rouge">$</code> character like this:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>find <span class="nb">.</span> <span class="nt">-name</span> <span class="s1">'*.txt'</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">$</code> character isn’t actually part of the command. Some authors will put it in front of the actual command to represent the command prompt (the place where you’re actually typing in the terminal). It’s just there to make it clearer that what you see is a command that you should type into a terminal.</p>
<h2 id="how-do-i-get-a-terminal">How do I get a terminal?</h2>
<p>If you’re on a Mac, you should have a program called <code class="language-plaintext highlighter-rouge">Terminal</code> already installed. To open it, click on the Launchpad and type <code class="language-plaintext highlighter-rouge">Terminal</code> into the search box and double click on its icon. <a href="https://iterm2.com">iTerm2</a> is another popular <a href="https://en.wikipedia.org/wiki/Terminal_emulator">terminal emulator</a> for Macs. If you’re using <a href="https://opensource.com/resources/linux">Linux</a>, then you’ve got <a href="https://www.tecmint.com/linux-terminal-emulators/">tons of options</a> for terminals as well. Windows is a bit different from the other two, but it does have a terminal. Check out <a href="https://github.com/microsoft/terminal">this software repository</a> and <a href="https://www.lifewire.com/command-prompt-2625840">this guide</a> for more information on the Windows command prompt. I personally don’t use a PC for work, but many people I know who do use PCs for bioinformatics use <a href="https://www.cygwin.com">Cygwin</a>, which lets you get a more Linux-y command line experience on your PC.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>In this post, we talked about graphical user interfaces versus command line interfaces, what a terminal is, what the command line is, and how to actually get a terminal for your computer. This is all foundational stuff that you’ll be getting a lot more experience with as you learn more about bioinformatics. Hopefully, this guide helps clear up any confusion you may have had!</p>
<p><em>If you want some more hands-on info about command line basics, check out <a href="https://tutorial.djangogirls.org/en/intro_to_command_line/">this nice tutorial</a> from Django Girls.</em></p>Ryan MooreInstalling and running typical bioinformatics programs requires a lot of background knowledge. For beginners, terms like “command line,” “terminal,” “changing directories,” and “archive file” might be unfamiliar. Even instructions to type make can be confusing. There is a lot of prerequisite knowledge needed to get started with installing and using bioinformatics software.Using Sass in Clojure Ring apps2019-12-12T00:00:00+00:002019-12-12T00:00:00+00:00https://www.tenderisthebyte.com/blog/2019/12/12/sass-in-clojure-ring-apps<p>So you want to use Sass instead of plain CSS in your Clojure Ring web app, but you’re not sure how to get it set up? No problem! Let’s walk through it together.</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//sass_clojure_ring_apps/sass-plus-clojure.svg" alt="Sass + Clojure" />
<figcaption>Sass + Clojure</figcaption>
</figure>
<p>According to the <a href="https://sass-lang.com">official website</a>, Sass is CSS with superpowers. It’s a stable and powerful CSS extension language with two different syntaxes: Sass, the original, and Sassy CSS (SCSS), a newer syntax that is a superset of CSS. If you’re not too familiar with Sass, check out <a href="https://sass-lang.com/guide">this tutorial</a>.</p>
<div class="post-toc">
<h4 class="post-toc--header" id="contents">Contents</h4>
<ul>
<li><a href="#install-the-sass-binary">Install the sass binary</a></li>
<li><a href="#set-up-a-toy-clojure-ring-app">Set up a toy Clojure Ring app</a></li>
<li><a href="#set-up-scss">Set up SCSS</a></li>
</ul>
</div>
<h2 id="install-the-sass-binary">Install the sass binary</h2>
<p>First off, you’re going to need to <a href="https://sass-lang.com/install">install</a> a Sass preprocessor. To use Sass, you write <code class="language-plaintext highlighter-rouge">.sass</code> (if you’re using the Sass syntax) or <code class="language-plaintext highlighter-rouge">.scss</code> (if you’re using the Sassy CSS syntax) files and then compile them to plain ol’ CSS using one of the many Sass compilers.</p>
<p>I’m using a Mac, so installing Sass is as easy as running this <a href="https://brew.sh/">Homebrew</a> command:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>brew <span class="nb">install </span>sass/sass/sass</code></pre></figure>
<p>Now, one tricky thing is that Sass has a lot of <a href="https://sass-lang.com/implementation">different implementations</a>. Sass was originally written in Ruby, so there’s the now deprecated <a href="https://sass-lang.com/ruby-sass">Ruby Sass</a>. Additionally, there is <a href="https://sass-lang.com/libsass">LibSass</a>, a C/C++ port of the Sass engine, <a href="https://sass-lang.com/dart-sass">Dart Sass</a>, which compiles to JavaScript, and many others. It really doesn’t matter which one you use as long as you’ve got one of them installed.</p>
<p><em>For the rest of the tutorial, I’m going to assume that you’ve got Dart Sass, as that is the primary Sass implementation. Its binary is called <code class="language-plaintext highlighter-rouge">sass</code>.</em></p>
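<p>Before moving on, it can be worth checking that a <code class="language-plaintext highlighter-rouge">sass</code> executable is actually on your <code class="language-plaintext highlighter-rouge">PATH</code>. Here is a minimal sketch of such a check. The <code class="language-plaintext highlighter-rouge">sass_status</code> variable is just a name made up for the example, and the version that gets printed will depend on which implementation you installed.</p>

```shell
# Check whether an executable named "sass" can be found on the PATH.
if command -v sass >/dev/null 2>&1; then
  sass_status="found"
  # Print the installed version so you know what you're working with.
  sass --version
else
  sass_status="missing"
  echo "sass was not found; revisit the install step above"
fi
echo "sass: $sass_status"
```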
<h2 id="set-up-a-toy-clojure-ring-app">Set up a toy Clojure Ring app</h2>
<p>To show you how to get Sassy with your CSS, let’s start by setting up an example Clojure Ring app. Assuming that you already have <a href="https://leiningen.org">Leiningen</a> installed, run this in your favorite terminal app:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>lein new sassy-clj <span class="o">&&</span> <span class="nb">cd </span>sassy-clj</code></pre></figure>
<h3 id="fix-projectclj">Fix project.clj</h3>
<p>Alright, now we can make sure the <code class="language-plaintext highlighter-rouge">project.clj</code> file is set up nice and neat. To do that, we’re going to need to change a couple of different things in the <code class="language-plaintext highlighter-rouge">defproject</code> macro.</p>
<ul>
<li>Add the Ring libraries to the <code class="language-plaintext highlighter-rouge">:dependencies</code> vector.</li>
<li>Set up the Ring handler.</li>
<li>Add in the <a href="https://github.com/weavejester/lein-ring">lein-ring</a> and <a href="https://github.com/bluegray/lein-scss">lein-scss</a> plugins.</li>
</ul>
<p>All together, it should look something like this. (I’ve added comments to show the things that you need to add.)</p>
<figure class="highlight"><pre><code class="language-clj" data-lang="clj"><span class="p">(</span><span class="nf">defproject</span><span class="w"> </span><span class="n">sassy-clj</span><span class="w"> </span><span class="s">"0.1.0-SNAPSHOT"</span><span class="w">
</span><span class="no">:description</span><span class="w"> </span><span class="s">"FIXME: write description"</span><span class="w">
</span><span class="no">:url</span><span class="w"> </span><span class="s">"http://example.com/FIXME"</span><span class="w">
</span><span class="no">:license</span><span class="w"> </span><span class="p">{</span><span class="no">:name</span><span class="w"> </span><span class="s">"EPL-2.0 OR GPL-2.0-or-later WITH Classpath-exception-2.0"</span><span class="w">
</span><span class="no">:url</span><span class="w"> </span><span class="s">"https://www.eclipse.org/legal/epl-2.0/"</span><span class="p">}</span><span class="w">
</span><span class="no">:dependencies</span><span class="w"> </span><span class="p">[[</span><span class="n">org.clojure/clojure</span><span class="w"> </span><span class="s">"1.10.0"</span><span class="p">]</span><span class="w">
</span><span class="c1">;; Include the Ring libraries.</span><span class="w">
</span><span class="p">[</span><span class="n">ring</span><span class="w"> </span><span class="s">"1.8.0"</span><span class="p">]</span><span class="w">
</span><span class="c1">;; Include some nice app defaults.</span><span class="w">
</span><span class="p">[</span><span class="n">ring/ring-defaults</span><span class="w"> </span><span class="s">"0.3.2"</span><span class="p">]]</span><span class="w">
</span><span class="no">:repl-options</span><span class="w"> </span><span class="p">{</span><span class="no">:init-ns</span><span class="w"> </span><span class="n">sassy-clj.core</span><span class="p">}</span><span class="w">
</span><span class="c1">;; Include the needed plugins.</span><span class="w">
</span><span class="no">:plugins</span><span class="w"> </span><span class="p">[[</span><span class="n">lein-ring</span><span class="w"> </span><span class="s">"0.12.5"</span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="n">lein-scss</span><span class="w"> </span><span class="s">"0.3.0"</span><span class="p">]]</span><span class="w">
</span><span class="c1">;; Set up the Ring server handler.</span><span class="w">
</span><span class="no">:ring</span><span class="w"> </span><span class="p">{</span><span class="no">:handler</span><span class="w"> </span><span class="n">sassy-clj.core/app</span><span class="p">})</span></code></pre></figure>
<p>After you’ve made those changes, don’t forget to run <code class="language-plaintext highlighter-rouge">lein deps</code> in your project’s source directory to download the needed dependencies.</p>
<h3 id="set-up-the-assets-directories">Set up the assets directories</h3>
<p>Now then, let’s make some folders to hold the HTML, SCSS, and generated CSS files.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> resources/html resources/scss resources/public/css</code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">resources/scss</code> directory is where we’ll keep the <code class="language-plaintext highlighter-rouge">*.scss</code> files that we’ll actually be editing, and the <code class="language-plaintext highlighter-rouge">resources/public/css</code> directory will hold all of the generated CSS files. If you guessed that <code class="language-plaintext highlighter-rouge">resources/html</code> is where we will keep our HTML files, you guessed right!</p>
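<p>If the <code class="language-plaintext highlighter-rouge">-p</code> flag is new to you, here is a small sketch of what that <code class="language-plaintext highlighter-rouge">mkdir</code> command does. It runs in a throwaway directory standing in for the project root, and the <code class="language-plaintext highlighter-rouge">project_dir</code> and <code class="language-plaintext highlighter-rouge">created</code> variable names are made up for the example.</p>

```shell
# Work in a throwaway directory (a stand-in for your project root).
project_dir="$(mktemp -d)"
cd "$project_dir"

# -p creates any missing parent directories (here, resources/ and
# resources/public/) and does not fail if a directory already exists.
mkdir -p resources/html resources/scss resources/public/css

# Running it a second time is harmless, which is handy in setup scripts.
mkdir -p resources/html resources/scss resources/public/css

# List every directory that now exists under resources/.
created="$(find resources -type d | sort)"
echo "$created"
```

<p>You should see five directories listed: <code class="language-plaintext highlighter-rouge">resources</code> itself, the three we asked for, and the intermediate <code class="language-plaintext highlighter-rouge">resources/public</code>.</p>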
<h3 id="set-up-a-sweet-home-page">Set up a sweet home page</h3>
<p>Now let’s make a tiny little homepage for our app. First, make a new file called <code class="language-plaintext highlighter-rouge">resources/html/home.html</code> and put this in it.</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="cp"><!DOCTYPE html></span>
<span class="nt"><html></span>
<span class="nt"><head></span>
<span class="nt"><meta</span> <span class="na">charset=</span><span class="s">"UTF-8"</span><span class="nt">/></span>
<span class="nt"><link</span> <span class="na">rel=</span><span class="s">"stylesheet"</span> <span class="na">href=</span><span class="s">"css/main.css"</span><span class="nt">></span>
<span class="nt"><title></span>Sassy CSS for Clojure Ring Apps<span class="nt"></title></span>
<span class="nt"></head></span>
<span class="nt"><body></span>
<span class="nt"><h1></span>Sassy Clj<span class="nt"></h1></span>
<span class="nt"><p></span>Let's use Sassy CSS in a Clojure Ring app!<span class="nt"></p></span>
<span class="nt"></body></span>
<span class="nt"></html></span></code></pre></figure>
<p>You can see that we’ve linked to the <code class="language-plaintext highlighter-rouge">css/main.css</code> stylesheet. We won’t be writing this by hand; rather, we will set up Leiningen so that it will be generated automatically!</p>
<p>Now, edit the <code class="language-plaintext highlighter-rouge">sassy-clj.core</code> namespace found in <code class="language-plaintext highlighter-rouge">src/sassy_clj/core.clj</code> like so:</p>
<figure class="highlight"><pre><code class="language-clj" data-lang="clj"><span class="p">(</span><span class="nf">ns</span><span class="w"> </span><span class="n">sassy-clj.core</span><span class="w">
</span><span class="p">(</span><span class="no">:require</span><span class="w"> </span><span class="p">[</span><span class="n">ring.middleware.defaults</span><span class="w"> </span><span class="no">:refer</span><span class="w"> </span><span class="p">[</span><span class="n">wrap-defaults</span><span class="w"> </span><span class="n">site-defaults</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="n">ring.util.response</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">response</span><span class="p">]))</span></code></pre></figure>
<p>This will let us use the <code class="language-plaintext highlighter-rouge">site-defaults</code>, which, <a href="https://github.com/ring-clojure/ring-defaults#customizing">among other things</a>, will allow serving static assets in the <code class="language-plaintext highlighter-rouge">resources/public</code> folder. Also, we want to use Ring’s response helpers.</p>
<p>Next, set up a basic <code class="language-plaintext highlighter-rouge">handler</code> function to <a href="https://github.com/ring-clojure/ring/wiki/Concepts#responses">respond</a> to <a href="https://github.com/ring-clojure/ring/wiki/Concepts#requests">requests</a>.</p>
<figure class="highlight"><pre><code class="language-clj" data-lang="clj"><span class="p">(</span><span class="k">defn</span><span class="w"> </span><span class="n">handler</span><span class="w"> </span><span class="p">[</span><span class="n">request</span><span class="p">]</span><span class="w">
</span><span class="p">(</span><span class="nb">-></span><span class="w"> </span><span class="p">(</span><span class="nf">response/resource-response</span><span class="w"> </span><span class="s">"/html/home.html"</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">response/content-type</span><span class="w"> </span><span class="s">"text/html"</span><span class="p">)))</span></code></pre></figure>
<p>This function will respond to all requests with our homepage.</p>
<p>Finally, we define an <code class="language-plaintext highlighter-rouge">app</code> var to be our app’s main handler. This matches what we specified in the <code class="language-plaintext highlighter-rouge">project.clj</code> file.</p>
<figure class="highlight"><pre><code class="language-clj" data-lang="clj"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">app</span><span class="w">
</span><span class="p">(</span><span class="nf">wrap-defaults</span><span class="w"> </span><span class="n">handler</span><span class="w"> </span><span class="n">site-defaults</span><span class="p">))</span></code></pre></figure>
<p>You’ll notice that I’ve used the <code class="language-plaintext highlighter-rouge">wrap-defaults</code> <a href="https://github.com/ring-clojure/ring/wiki/Concepts#middleware">middleware</a> function around the handler we wrote. This is to get those sweet <code class="language-plaintext highlighter-rouge">site-defaults</code> in the response.</p>
<h3 id="start-up-a-development-server">Start up a development server</h3>
<p>By now we should have a working app. Let’s check it out! To do so, start up the server like so:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>lein ring server-headless</code></pre></figure>
<p>Browse to <a href="http://localhost:3000/">http://localhost:3000/</a>, and you should see our beautiful home page!</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//sass_clojure_ring_apps/1-basic-home-page.png" alt="A very basic homepage" />
<figcaption>A very basic homepage</figcaption>
</figure>
<h2 id="set-up-scss">Set up SCSS</h2>
<h3 id="edit-projectclj-again">Edit project.clj again</h3>
<p>Now that we have our test project, it’s time to get sassy with some CSS. We don’t want to be compiling SCSS files by hand each time we edit them. Instead, we will be using the <code class="language-plaintext highlighter-rouge">lein-scss</code> plugin that we included in our <code class="language-plaintext highlighter-rouge">project.clj</code> file earlier. Before we can use it, we need a bit more set up.</p>
<p>We need to tell <code class="language-plaintext highlighter-rouge">lein-scss</code> how we want our SCSS files to be compiled. We do that by adding an <code class="language-plaintext highlighter-rouge">:scss</code> key with a hash map of options to the end of the <code class="language-plaintext highlighter-rouge">defproject</code> macro in <code class="language-plaintext highlighter-rouge">project.clj</code>.</p>
<figure class="highlight"><pre><code class="language-clj" data-lang="clj"><span class="w"> </span><span class="no">:scss</span><span class="w"> </span><span class="p">{</span><span class="no">:builds</span><span class="w">
</span><span class="p">{</span><span class="no">:development</span><span class="w"> </span><span class="p">{</span><span class="no">:source-dir</span><span class="w"> </span><span class="s">"resources/scss"</span><span class="w">
</span><span class="no">:dest-dir</span><span class="w"> </span><span class="s">"resources/public/css"</span><span class="w">
</span><span class="no">:executable</span><span class="w"> </span><span class="s">"sass"</span><span class="w">
</span><span class="no">:args</span><span class="w"> </span><span class="p">[</span><span class="s">"--style"</span><span class="w"> </span><span class="s">"expanded"</span><span class="p">]}</span><span class="w">
</span><span class="no">:production</span><span class="w"> </span><span class="p">{</span><span class="no">:source-dir</span><span class="w"> </span><span class="s">"resources/scss"</span><span class="w">
</span><span class="no">:dest-dir</span><span class="w"> </span><span class="s">"resources/public/css"</span><span class="w">
</span><span class="no">:executable</span><span class="w"> </span><span class="s">"sass"</span><span class="w">
</span><span class="no">:args</span><span class="w"> </span><span class="p">[</span><span class="s">"--style"</span><span class="w"> </span><span class="s">"compressed"</span><span class="p">]}}}</span></code></pre></figure>
<p>In the options map, we specify the <code class="language-plaintext highlighter-rouge">:builds</code> key and then another map where we can specify multiple different builds. This is nice when you want different options for development and production. For example, we’ve specified the <code class="language-plaintext highlighter-rouge">expanded</code> style for development, but the <code class="language-plaintext highlighter-rouge">compressed</code> style for production.</p>
<p>There are a couple of other things to note here. We use the <code class="language-plaintext highlighter-rouge">:source-dir</code> key to specify that we will store our SCSS files in <code class="language-plaintext highlighter-rouge">resources/scss</code>, and the <code class="language-plaintext highlighter-rouge">:dest-dir</code> key to specify that we want the compiled CSS files to live in <code class="language-plaintext highlighter-rouge">resources/public/css</code>. Finally, we tell <code class="language-plaintext highlighter-rouge">lein-scss</code> to use the <code class="language-plaintext highlighter-rouge">sass</code> executable, and add some command line arguments to be passed in to the <code class="language-plaintext highlighter-rouge">sass</code> program.</p>
<p><em>Remember how I said there were a lot of different options for Sass compilers? Well, the <code class="language-plaintext highlighter-rouge">:executable "sass"</code> option is for using <code class="language-plaintext highlighter-rouge">sass</code>. Of course, if you’re using <code class="language-plaintext highlighter-rouge">sassc</code> or <code class="language-plaintext highlighter-rouge">scss</code> instead, you can use <code class="language-plaintext highlighter-rouge">:executable "sassc"</code> or <code class="language-plaintext highlighter-rouge">:executable "scss"</code>, and it’ll work just fine!</em></p>
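<p>If you aren’t sure which of those compilers is installed on your machine, a quick check along these lines can tell you what to put in <code class="language-plaintext highlighter-rouge">:executable</code>. This is just a sketch: the <code class="language-plaintext highlighter-rouge">pick_sass</code> function name is made up for this example and is not part of <code class="language-plaintext highlighter-rouge">lein-scss</code>. It only relies on the shell’s standard <code class="language-plaintext highlighter-rouge">command -v</code> lookup.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Hypothetical helper: print the first Sass compiler found on the PATH.
pick_sass() {
  for exe in sass sassc scss; do
    # command -v prints the location of the executable, or nothing if it is missing.
    if [ -n "$(command -v "$exe")" ]; then
      echo "$exe"
      return 0
    fi
  done
  # None of the three candidates were found.
  return 1
}</code></pre></figure>
<p>Paste that into your shell and run <code class="language-plaintext highlighter-rouge">pick_sass</code>. Whatever name it prints is the one to use for <code class="language-plaintext highlighter-rouge">:executable</code>. If it prints nothing, you’ll need to install a Sass compiler first.</p>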
<h3 id="make-a-mainscss-file">Make a main.scss file</h3>
<p>Once that is set up, make a new file called <code class="language-plaintext highlighter-rouge">main.scss</code> in the <code class="language-plaintext highlighter-rouge">resources/scss</code> folder and add the following to it:</p>
<figure class="highlight"><pre><code class="language-scss" data-lang="scss"><span class="nv">$font-color</span><span class="p">:</span> <span class="mh">#E47320</span><span class="p">;</span>
<span class="nv">$font-family</span><span class="p">:</span> <span class="n">Courier</span><span class="p">;</span>
<span class="nt">body</span> <span class="p">{</span>
<span class="nl">font-family</span><span class="p">:</span> <span class="nv">$font-family</span><span class="p">;</span>
<span class="nl">color</span><span class="p">:</span> <span class="nv">$font-color</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<h3 id="compile-scss-to-css">Compile SCSS to CSS</h3>
<p>Those are some excellent styles, but if you reload the homepage now, you’ll see that they aren’t being applied. This is because we haven’t told <code class="language-plaintext highlighter-rouge">lein-scss</code> to actually compile <code class="language-plaintext highlighter-rouge">main.scss</code> to <code class="language-plaintext highlighter-rouge">main.css</code> yet. Here is how to do that.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>lein scss :development once
<span class="o">[</span>23:36:01] Running once
<span class="o">[</span>23:36:02] ./sassy-clj/resources/scss/main.scss
<span class="nt">--</span><span class="o">></span> ./sassy-clj/resources/public/css/main.css
Elapsed <span class="nb">time</span>: 226.240492 msecs <span class="o">[</span>Total <span class="nb">time</span><span class="o">]</span></code></pre></figure>
<p><em>Note that we typed <code class="language-plaintext highlighter-rouge">:development</code> and not <code class="language-plaintext highlighter-rouge">development</code>. The latter will not work.</em></p>
<p>If you reload the homepage again, you’ll see our beautiful styles have been applied!</p>
<figure class="figure figure--center figure--border">
<img src="/assets/img/posts//sass_clojure_ring_apps/2-styled-home-page.png" alt="A quite stylish homepage!" />
<figcaption>A quite stylish homepage!</figcaption>
</figure>
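<p>If the styles don’t show up, it’s worth checking that the compiled file actually landed where we expect. Here is a small sketch of such a check. The <code class="language-plaintext highlighter-rouge">check_css</code> function is a made-up name for this example, and its default path simply follows from the <code class="language-plaintext highlighter-rouge">:dest-dir</code> we set in the config.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Hypothetical helper: succeed only if the compiled CSS exists and is non-empty.
check_css() {
  # Default to the location implied by :dest-dir in our config.
  css_file="${1:-resources/public/css/main.css}"
  if [ -s "$css_file" ]; then
    echo "found $css_file"
  else
    echo "missing or empty: $css_file"
    return 1
  fi
}</code></pre></figure>
<p>Run <code class="language-plaintext highlighter-rouge">check_css</code> from the project root after compiling. If it reports a missing file, double check the <code class="language-plaintext highlighter-rouge">:source-dir</code> and <code class="language-plaintext highlighter-rouge">:dest-dir</code> values and make sure you passed <code class="language-plaintext highlighter-rouge">:development</code> with the leading colon.</p>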
<h3 id="setting-up-auto-compilation">Setting up auto-compilation</h3>
<p>Now you probably don’t want to be manually running <code class="language-plaintext highlighter-rouge">lein scss</code> every time you edit your SCSS files. To avoid this, <code class="language-plaintext highlighter-rouge">lein-scss</code> comes with an <code class="language-plaintext highlighter-rouge">auto</code> mode that watches your SCSS source directory for changes and automatically recompiles the CSS as necessary. You can run it like this:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>lein scss :development auto</code></pre></figure>
<p>Finally, if you’re ready for production, you can pass in <code class="language-plaintext highlighter-rouge">:production</code> instead of <code class="language-plaintext highlighter-rouge">:development</code> and you’ll be good to go!</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>lein scss :production once</code></pre></figure>
<p>And that’s it! Go forth and be sassy!</p>