<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.tenderisthebyte.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.tenderisthebyte.com/" rel="alternate" type="text/html" /><updated>2026-02-01T05:55:14+00:00</updated><id>https://www.tenderisthebyte.com/feed.xml</id><title type="html">Tender Is The Byte</title><subtitle>Hi!  I&apos;m Ryan Moore, NBA fan &amp; PhD candidate in Eric Wommack&apos;s viral ecology lab @ UD.</subtitle><author><name>Ryan Moore</name><email>moorer@udel.edu</email></author><entry><title type="html">How to Clear Ghost Notifications on GitHub</title><link href="https://www.tenderisthebyte.com/blog/2025/09/29/github-ghost-notifications/" rel="alternate" type="text/html" title="How to Clear Ghost Notifications on GitHub" /><published>2025-09-29T00:00:00+00:00</published><updated>2025-09-29T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2025/09/29/github-ghost-notifications</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2025/09/29/github-ghost-notifications/"><![CDATA[<p>Sometimes you may receive a notification in the GitHub UI that you can’t get rid of. These “ghost notifications” can arise from being notified by an account or repository that has since been deleted, for example, because it was a <a href="https://www.bleepingcomputer.com/news/security/github-notifications-abused-to-impersonate-y-combinator-for-crypto-theft/">spam account</a>. Regardless of the reason, you won’t be able to clear the notification using GitHub’s web interface, and you will seemingly be stuck with that little blue dot on your inbox. Luckily, there is a straightforward way to clear the notification using GitHub’s <a href="https://docs.github.com/en/rest?apiVersion=2022-11-28">REST API</a>.</p>

<p>First, you need to <a href="https://github.com/settings/tokens/new">create a token</a>. I created a classic token, and gave it the <code class="language-plaintext highlighter-rouge">notifications</code> scope. (Be sure to set a reasonable expiration date, or remember to delete the token after you have cleared the ghost notification.)</p>

<p>Once you have created the token, you can use GitHub’s notifications API to <a href="https://docs.github.com/en/rest/activity/notifications?apiVersion=2022-11-28#mark-notifications-as-read">mark all notifications as read</a>. You should definitely click that link and read the docs for this endpoint before running the following code to make sure you’re familiar with the options! (Alternatively, if you don’t want to use curl, you could also try the <code class="language-plaintext highlighter-rouge">gh</code> command as explained in this GitHub <a href="https://github.com/orgs/community/discussions/174283#discussioncomment-14473564">discussion comment</a>.)</p>

<p>Now, hop into your favorite shell and run something like the following:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Set your generated token to a shell variable.</span>
<span class="nb">set </span>GH_NOTIFICATION_TOKEN my_secret_github_token

<span class="c"># Call the GitHub API with curl</span>
curl <span class="nt">-L</span> <span class="se">\</span>
  <span class="nt">-X</span> PUT <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Accept: application/vnd.github+json"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Authorization: Bearer </span><span class="nv">$GH_NOTIFICATION_TOKEN</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"X-GitHub-Api-Version: 2022-11-28"</span> <span class="se">\</span>
  https://api.github.com/notifications <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"last_read_at":"2022-06-10T00:00:00Z","read":true}'</span></code></pre></figure>

<p>In the above code, we set the generated token to a shell variable, and then use curl to hit the GitHub API to mark all notifications as read.</p>

<p><em>Note: That code is for <a href="https://fishshell.com/">fish</a> shell. Adjust the command to set a shell variable as required by your shell of choice. Additionally, check out curl’s <a href="https://curl.se/docs/manpage.html">manpage</a> if you need a refresher on its options.</em></p>

<p>In my case, I only had the one ghost notification, but if you have some legitimate notifications that you don’t want to mark as read, you should check out the <a href="https://docs.github.com/en/rest/activity/notifications?apiVersion=2022-11-28-use">docs</a> and adjust your command as required.</p>
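<p>If you would rather script this than type out curl flags, here is a minimal Python sketch that builds the same request using only the standard library. The function name <code class="language-plaintext highlighter-rouge">build_mark_read_request</code> is my own invention for illustration, and nothing is sent until you call <code class="language-plaintext highlighter-rouge">urlopen</code> yourself. Passing a <code class="language-plaintext highlighter-rouge">last_read_at</code> timestamp is one way to limit what gets marked as read, but double-check the docs for the exact behavior before relying on it.</p>

```python
import json
import urllib.request


def build_mark_read_request(token, last_read_at=None):
    """Build (but do not send) the PUT request shown in the curl command.

    build_mark_read_request is a hypothetical helper name, not part of any
    GitHub library. The URL, headers, and body fields mirror the curl call.
    """
    body = {"read": True}
    if last_read_at is not None:
        # ISO 8601 timestamp, same format as in the curl example.
        body["last_read_at"] = last_read_at
    return urllib.request.Request(
        "https://api.github.com/notifications",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


req = build_mark_read_request("my_secret_github_token", "2022-06-10T00:00:00Z")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

<p>Keeping the request construction separate from sending it lets you inspect the method, headers, and body before anything touches your real inbox.</p>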

<p>And that’s it! The ghost notifications should be cleared.</p>]]></content><author><name>Ryan Moore</name></author><category term="blog" /><summary type="html"><![CDATA[Learn how to remove stuck GitHub ghost notifications using the REST API and curl. Fix phantom alerts and clean up your GitHub inbox fast.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/github-ghost-notifications/ghost_notifications.webp" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/github-ghost-notifications/ghost_notifications.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Bioinformatics by hand: Neighbor-joining trees</title><link href="https://www.tenderisthebyte.com/blog/2022/08/31/neighbor-joining-trees/" rel="alternate" type="text/html" title="Bioinformatics by hand: Neighbor-joining trees" /><published>2022-08-31T00:00:00+00:00</published><updated>2022-08-31T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2022/08/31/neighbor-joining-trees</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2022/08/31/neighbor-joining-trees/"><![CDATA[<div class="post-toc">

  <h4 class="post-toc--header" id="contents">Contents</h4>

  <ul>
    <li><a href="#bioinformatics-by-hand">Bioinformatics by hand</a></li>
    <li><a href="#neighbor-joining-trees">Neighbor-joining trees</a></li>
    <li><a href="#pros-and-cons-of-neighbor-joining-trees">Pros and cons of neighbor-joining trees</a></li>
    <li><a href="#how-to-neighbor-join">How to neighbor-join</a></li>
    <li><a href="#formulas">Formulas</a></li>
    <li><a href="#example-1">Example 1</a></li>
    <li><a href="#on-distance-matrices">On Distance Matrices</a></li>
    <li><a href="#example-2">Example 2</a></li>
    <li><a href="#wrapping-up">Wrapping up</a></li>
  </ul>

</div>

<h2 id="bioinformatics-by-hand">Bioinformatics by hand</h2>

<p>I’ve been teaching bioinformatics at the University of Delaware for roughly the last year now. I had never been in a bioinformatics class prior to teaching; my degrees are in ecology and marine science, so all of my bioinformatics knowledge came from research experience. It’s been really interesting to see bioinformatics taught in a formal setting. One thing I’ve noticed is the disconnect that can occur between students and instructors when students without programming experience are asked to perform “hands-on” exercises.</p>

<p>In an effort to de-mystify bioinformatics, instructors often have students manually perform a task that would normally be done computationally. While these exercises are valuable and often succeed in their goal, I have noticed that many students who are not used to being presented with code or equations tend to have difficulty implementing algorithms by hand, regardless of difficulty. This can cause students to shut down and question whether they are in the correct field, rather than empower them.</p>

<p>When this occurs, there seem to be two underlying issues: First, even at the collegiate level, many students are not confident in their ability to do math. This issue I will leave alone, as it cannot be solved in a single course or assignment at the graduate level. Second, the way that a computer would perform a procedure is <a href="https://news.mit.edu/2009/brain-data-0825">not necessarily the same</a> way a human would perform it. Sometimes, this can create a gap between students with little or no computing background and instructors who are highly familiar with algorithms.</p>

<p>In this post, I’ll walk you through the process of building neighbor-joining trees. Building phylogenetic trees by hand seems at first like a daunting task, but I promise it’s much easier than you think!</p>

<h2 id="neighbor-joining-trees">Neighbor-joining trees</h2>

<p>Neighbor-joining (NJ) is one of many methods used for creating phylogenetic (evolutionary) and phenetic (trait-based similarity) trees. The method was first introduced in a <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040454">1987 paper</a> and is still in use today.</p>

<p>Neighbor-joining uses a distance matrix to construct a tree by determining which leaves are “neighbors” (i.e., children of the same internal parent node) via an iterative clustering process. A neighbor joining tree aims to show the minimum amount of evolution needed to explain differences among objects, which makes it a <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040056">minimum evolution method</a>.</p>

<p>There has been <a href="https://doi.org/10.1093/molbev/msl072">some debate</a> about the mathematical behavior of neighbor-joining trees. Originally, neighbor joining was thought to be most closely related to tree methods that use <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares</a> to estimate branch lengths, but <a href="https://doi.org/10.1093/molbev/msl072">further investigation</a> showed that they actually shared more properties with “balanced” minimum evolution methods. You don’t need to know anything about these different methods in order to perform neighbor joining, but if you would like to read more about them, there is an excellent explanation in <a href="https://doi.org/10.1007/s11538-010-9510-y">this paper</a>.</p>

<p>The type of tree produced depends on the input. If you provide a distance matrix based on evolutionary data (e.g., multiple sequence alignment), you will get a phylogenetic tree. If you input distances based on non-evolutionary data (e.g., phenotypic traits), then you will get a phenetic tree. Note that an NJ tree doesn’t have to contain only organisms. You can make NJ trees for anything you can represent/compare with a distance matrix.</p>

<p>NJ trees are simple to make and require only basic operations (addition, subtraction, division), but can seem daunting because of the number of steps required. Here, I will show you how to make two small neighbor-joining trees by hand (or, by spreadsheet).</p>

<h2 id="pros-and-cons-of-neighbor-joining-trees">Pros and cons of neighbor-joining trees</h2>

<p>There are a lot of different ways to build phylogenetic and other trees, so how does neighbor-joining compare?</p>

<h3 id="advantages">Advantages</h3>

<ul>
  <li>It’s simple and easy to understand.</li>
  <li>It’s <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040126">fast</a> and computationally inexpensive compared to other popular methods. Maximum-likelihood and Bayesian methods especially are <a href="https://doi.org/10.1093/molbev/msw042">much slower</a>.</li>
  <li>It works. Neighbor-joining has been found to be <a href="https://doi.org/10.1007/s00453-007-9116-4">topologically accurate</a> and to sometimes <a href="https://doi.org/10.1093/molbev/msw042">out-perform more complicated methods</a> like maximum-likelihood and Bayesian inference.</li>
</ul>

<h3 id="disadvantages">Disadvantages</h3>

<ul>
  <li>You lose data. When you squish down sequence alignment or other data into distances, you are performing <a href="https://en.wikipedia.org/wiki/Data_reduction">data reduction</a>. This isn’t necessarily a bad thing (ordination methods like <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> also do this), but you should keep it in mind.</li>
  <li>You only get one possible tree. Other methods such as maximum-likelihood and Bayesian inference return multiple different trees, i.e. evolutionary hypotheses, which can be useful for some analyses.</li>
  <li>Neighbor-joining can sometimes result in <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040126">negative branch lengths</a>. Note that this does not affect the topology of the tree, just branch lengths.</li>
</ul>

<h2 id="how-to-neighbor-join">How to neighbor-join</h2>

<p>To begin neighbor-joining, you need a distance matrix. A distance matrix is a square matrix containing pairwise distances between members of some group. It must be symmetric (e.g., the distance from A to B is the same as the distance from B to A) and the distance from an object to itself must be 0. The distance does not necessarily need to be <a href="https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition">metric</a>, but in at least one instance <a href="https://doi.org/10.1093/oxfordjournals.molbev.a040454">a metric distance slightly outperformed a non-metric distance</a>.</p>
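<p>Before starting, it can be worth checking that your matrix really is square, symmetric, and zero on the diagonal. Here is a small Python sketch of that check; the function name <code class="language-plaintext highlighter-rouge">is_distance_matrix</code> and the tolerance argument are my own choices for illustration, and the matrix is the one used in Example 1 later in this post.</p>

```python
def is_distance_matrix(d, tol=1e-9):
    """Return True if d (a list of rows) is square, symmetric,
    and has zeros on the diagonal. Hypothetical helper for illustration."""
    n = len(d)
    if any(len(row) != n for row in d):
        return False  # not square
    for i in range(n):
        if abs(d[i][i]) > tol:
            return False  # distance from an object to itself must be 0
        for j in range(i + 1, n):
            if abs(d[i][j] - d[j][i]) > tol:
                return False  # d(ij) must equal d(ji)
    return True


example = [
    [0, 4, 5, 10],
    [4, 0, 7, 12],
    [5, 7, 0, 9],
    [10, 12, 9, 0],
]
print(is_distance_matrix(example))  # True
```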

<p>Once you have a matrix, you can begin neighbor-joining.</p>

<p>The neighbor-joining process consists of three steps:</p>

<ol>
  <li>Initiation</li>
  <li>Iteration</li>
  <li>Termination</li>
</ol>

<p><em>A quick note on the formulas (which can be found in the section below this one): You may notice a slight difference in the equations between this tutorial and another. Do not panic. These are only slight algebraic differences that do not affect the final answer, only the intermediate numbers.</em></p>

<h3 id="initiation">Initiation</h3>

<p>In the <strong>initiation</strong> step, we define a set of leaf nodes, <code class="language-plaintext highlighter-rouge">T</code>, and set <code class="language-plaintext highlighter-rouge">L</code> equal to the number of leaf nodes. These are the nodes at the “ends” of trees and therefore do not have any child nodes. You should have one leaf node for each item you want to compare. For example, if you are placing sequences on a tree, you will have one leaf node per sequence.</p>

<h3 id="iteration">Iteration</h3>

<p>The <strong>iteration</strong> step is where most of the action takes place. Virtually all of our calculations are made in this step, and, as the name implies, we will repeat these calculations over and over until some conclusion is reached.</p>

<p>First, we calculate the <strong>net divergence (r)</strong> of each leaf node. You can think of this as essentially the summed distance from each leaf node to all of the others, scaled by <code class="language-plaintext highlighter-rouge">1/(L-2)</code>.</p>

<p>Next, we calculate the <strong>adjusted distance (D)</strong> between each pair of nodes, which is based on the pairwise distance in the starting matrix and the divergence of each node. The pair of nodes with the lowest adjusted distance are <strong>neighbors</strong> and share a parent node.</p>

<p>Next, we declare the parent node and calculate the distance from each of the neighbors to the shared parent. This is also the step where I like to add the siblings and parent to the tree.</p>

<p>At this point, our goal is to construct a new distance matrix. To do this, we remove the two nodes that we earlier determined to be neighbors from the distance matrix and replace them with the newly formed parent node. New <strong>pairwise distances (d)</strong> are calculated between the new parent node and other nodes in the matrix. Any other distances (i.e., pairwise comparisons present in the new matrix and the previous matrix) can simply be transferred to the new matrix.</p>

<p><em>Note: In the formulas and calculations below, adjusted distances use a capital <code class="language-plaintext highlighter-rouge">D</code>, whereas pairwise distances use a lowercase <code class="language-plaintext highlighter-rouge">d</code>.  Try not to get them mixed up!</em></p>

<p>One thing to be aware of is that, after the first iteration, the neighbors are not restricted to being leaves, and may in fact be internal parent nodes.</p>

<p>Each iteration step ends with a new distance matrix that is one node smaller than the one in the previous step (e.g., <code class="language-plaintext highlighter-rouge">(L-1) by (L-1)</code> after the first iteration). Iteration continues until there are only two nodes remaining in the matrix.</p>

<h3 id="termination">Termination</h3>

<p>The final step is <strong>termination</strong>.</p>

<p>The only task remaining is to join the two nodes that remain after iteration with a single edge to complete the tree!</p>

<p>Now that we’ve braved the written explanation, it’s time to look at some examples to make all of these steps clearer!</p>

<h2 id="formulas">Formulas</h2>

<p>These are the formulas for each of the calculations we will perform (you can find a more nicely formatted version in the <a href="/assets/data/posts/nj_trees/neighbor-joining_examples_spreadsheet.xlsx">Excel file</a> containing the examples).</p>

<h3 id="net-divergence">Net divergence</h3>

<p>Net divergence <code class="language-plaintext highlighter-rouge">r</code> for a node <code class="language-plaintext highlighter-rouge">i</code> with 3 other nodes <code class="language-plaintext highlighter-rouge">(j, k, and l)</code>:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">r(i) = [1/(L-2)] * [d(ij) + d(ik) + d(il)]</code></pre></figure>
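<p>If you prefer code to notation, here is a minimal Python sketch of the net divergence calculation. It assumes the distance matrix is stored as a dict of dicts (my choice, not a requirement), and the numbers are the Example 1 starting matrix from later in this post.</p>

```python
def net_divergence(d):
    """r(i) = [1/(L-2)] * (sum of distances from i to every other node)."""
    L = len(d)
    return {i: sum(d[i][j] for j in d if j != i) / (L - 2) for i in d}


# Example 1 starting matrix as a dict of dicts (an assumed representation).
d = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
r = net_divergence(d)
print(r)  # {'A': 9.5, 'B': 11.5, 'C': 10.5, 'D': 15.5}
```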

<h3 id="adjusted-distance">Adjusted distance</h3>

<p>Adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for two nodes <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">j</code>:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">D(ij) = d(ij) - [r(i) + r(j)]</code></pre></figure>
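<p>As a rough Python sketch, the adjusted distance for every pair can be computed in one go. The dict-of-dicts layout and the function name <code class="language-plaintext highlighter-rouge">adjusted_distances</code> are assumptions for illustration; the values of <code class="language-plaintext highlighter-rouge">d</code> and <code class="language-plaintext highlighter-rouge">r</code> come from the first iteration of Example 1 later in this post.</p>

```python
from itertools import combinations


def adjusted_distances(d, r):
    """D(ij) = d(ij) - [r(i) + r(j)] for every unordered pair of nodes."""
    return {(i, j): d[i][j] - (r[i] + r[j]) for i, j in combinations(d, 2)}


# Pairwise distances and net divergences from Example 1, iteration 1.
d = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
r = {"A": 9.5, "B": 11.5, "C": 10.5, "D": 15.5}

D = adjusted_distances(d, r)
# min() returns the first pair with the lowest value, so ties go to the
# pair that appears first in the matrix order.
neighbors = min(D, key=D.get)
print(neighbors, D[neighbors])  # ('A', 'B') -17.0
```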

<h3 id="distance-from-child-to-parent">Distance from child to parent</h3>

<p>Distance from child <code class="language-plaintext highlighter-rouge">i</code> to parent <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">d(ik)</code>, where <code class="language-plaintext highlighter-rouge">j</code> is the neighbor of <code class="language-plaintext highlighter-rouge">i</code>:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ik) = [d(ij) + r(i) - r(j)] / 2</code></pre></figure>
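<p>Here is a tiny Python sketch of the child-to-parent calculation, written the way it is used in the worked examples, i.e. <code class="language-plaintext highlighter-rouge">d(AZ) = [d(AB) + r(A) - r(B)]/2</code>. The function name <code class="language-plaintext highlighter-rouge">branch_lengths</code> is made up for illustration, and the inputs are the Example 1 values for neighbors <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>.</p>

```python
def branch_lengths(d_ij, r_i, r_j):
    """Return (d(ik), d(jk)): distances from neighbors i and j to parent k."""
    d_ik = (d_ij + r_i - r_j) / 2
    d_jk = (d_ij + r_j - r_i) / 2
    return d_ik, d_jk


# Example 1, iteration 1: d(AB) = 4, r(A) = 9.5, r(B) = 11.5
print(branch_lengths(4, 9.5, 11.5))  # (1.0, 3.0)
```

<p>Notice that the two branch lengths always add up to the original pairwise distance between the neighbors, which is a handy sanity check.</p>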

<h3 id="distance-from-non-child-to-new-node">Distance from non-child to new node</h3>

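<p>Distance from another (non-child) node <code class="language-plaintext highlighter-rouge">m</code> to the new parent node <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">d(mk)</code>, where <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">j</code> are the children of <code class="language-plaintext highlighter-rouge">k</code>:</p>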

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(mk) = [d(im) + d(jm) - d(ij)] / 2</code></pre></figure>
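<p>And a matching Python sketch for the distance from any remaining node to the new parent. Again, the dict-of-dicts matrix and the function name <code class="language-plaintext highlighter-rouge">distance_to_parent</code> are my own assumptions; the numbers are from Example 1, where <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> are joined under a new node.</p>

```python
def distance_to_parent(d, i, j, m):
    """d(mk) = [d(im) + d(jm) - d(ij)] / 2, where k is the parent of i and j."""
    return (d[i][m] + d[j][m] - d[i][j]) / 2


# Example 1 starting matrix; A and B are the neighbors being joined.
d = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
print(distance_to_parent(d, "A", "B", "C"))  # 4.0
print(distance_to_parent(d, "A", "B", "D"))  # 9.0
```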

<h2 id="example-1">Example 1</h2>

<p>There’s a good chance that even if you read the description of neighbor-joining above, you still don’t have a great idea of how to do it. That should become clearer with some examples.</p>

<p>Here is our starting matrix:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>A</strong></th>
      <th style="text-align: center"><strong>B</strong></th>
      <th style="text-align: center"><strong>C</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>A</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">5</td>
      <td style="text-align: center">10</td>
    </tr>
    <tr>
      <td><strong>B</strong></td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">7</td>
      <td style="text-align: center">12</td>
    </tr>
    <tr>
      <td><strong>C</strong></td>
      <td style="text-align: center">5</td>
      <td style="text-align: center">7</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">9</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">10</td>
      <td style="text-align: center">12</td>
      <td style="text-align: center">9</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>

<h4 id="step-1-initiation">Step 1: Initiation</h4>

<p>All we do here is define a set of leaf nodes, <code class="language-plaintext highlighter-rouge">T</code>, and set <code class="language-plaintext highlighter-rouge">L</code> equal to the number of leaf nodes.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">T = { A, B, C, D }

L = 4</code></pre></figure>

<h4 id="step-2-iteration">Step 2: Iteration</h4>

<p>Now for the real action. Remember, this will consist of multiple iterations.</p>

<h5 id="iteration-1">Iteration 1</h5>

<p>First, we calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">r(A) = [1/(L-2)] * [d(AB) + d(AC) + d(AD)] = (1/2) * (4 + 5 + 10) = 9.5

r(B) = [1/(L-2)] * [d(AB) + d(BC) + d(BD)] = (1/2) * (4 + 7 + 12) = 11.5

r(C) = [1/(L-2)] * [d(AC) + d(BC) + d(CD)] = (1/2) * (5 + 7 + 9) = 10.5

r(D) = [1/(L-2)] * [d(AD) + d(BD) + d(CD)] = (1/2) * (10 + 12 + 9) = 15.5</code></pre></figure>

<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">D(AB) = d(AB) - [r(A) + r(B)] = 4 - (9.5 + 11.5) = -17

D(AC) = d(AC) - [r(A) + r(C)] = 5 - (9.5 + 10.5) = -15

D(AD) = d(AD) - [r(A) + r(D)] = 10 - (9.5 + 15.5) = -15

D(BC) = d(BC) - [r(B) + r(C)] = 7 - (11.5 + 10.5) = -15

D(BD) = d(BD) - [r(B) + r(D)] = 12 - (11.5 + 15.5) = -15

D(CD) = d(CD) - [r(C) + r(D)] = 9 - (10.5 + 15.5) = -17</code></pre></figure>

<p>The pair of nodes with the smallest adjusted distance are neighbors. In this case, we have a tie between the pairs <code class="language-plaintext highlighter-rouge">AB</code> and <code class="language-plaintext highlighter-rouge">CD</code>. We can only move forward with one pair, so we’ll pick <code class="language-plaintext highlighter-rouge">AB</code>. We now define a new node that connects these neighbors; we’ll call this new node <code class="language-plaintext highlighter-rouge">Z</code>.</p>

<p>We’re close now to constructing our first bit of the tree. To do that, we need to calculate the distance from each neighbor (child) node to the connecting (parent) node.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(AZ) = [d(AB) + r(A) - r(B)]/2 = (4 + 9.5 - 11.5)/2 = 1

d(BZ) = [d(AB) + r(B) - r(A)]/2 = (4 + 11.5 - 9.5)/2 = 3</code></pre></figure>

<p>With this information, we can draw the first two branches on our tree:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example1_iteration1.png" alt="Example 1 tree first iteration" />
    <figcaption>Example 1 tree first iteration</figcaption>
</figure>

<p>Lastly, we need to reconstruct the distance matrix, replacing <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> with <code class="language-plaintext highlighter-rouge">Z</code>. Some distances can be transferred, but others (represented by question marks) need to be calculated:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>Z</strong></th>
      <th style="text-align: center"><strong>C</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Z</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">?</td>
      <td style="text-align: center">?</td>
    </tr>
    <tr>
      <td><strong>C</strong></td>
      <td style="text-align: center">?</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">9</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">?</td>
      <td style="text-align: center">9</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>

<p>Here are the formulas for calculating <code class="language-plaintext highlighter-rouge">d(ZC)</code> and <code class="language-plaintext highlighter-rouge">d(ZD)</code>.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZC) = [d(AC) + d(BC) - d(AB)]/2 = (5 + 7 - 4)/2 = 4

d(ZD) = [d(AD) + d(BD) - d(AB)]/2 = (10 + 12 - 4)/2 = 9</code></pre></figure>

<p>With these calculations done, we can replace the question marks in our distance matrix:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>Z</strong></th>
      <th style="text-align: center"><strong>C</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Z</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">9</td>
    </tr>
    <tr>
      <td><strong>C</strong></td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">9</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">9</td>
      <td style="text-align: center">9</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>

<p>And we’re done…with the first iteration. Remember, the iteration step ends when there are only two nodes left in the matrix, and we have three. On to the next iteration!</p>

<h5 id="iteration-2">Iteration 2</h5>

<p>For this iteration, we use the latest version of the distance matrix, constructed at the end of the previous iteration, and reset <code class="language-plaintext highlighter-rouge">L</code> (the number of nodes in the matrix).</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">L = 3</code></pre></figure>

<p>Calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">r(Z) = [1/(L-2)] * [d(ZC) + d(ZD)] = 1 * (4 + 9) = 13

r(C) = [1/(L-2)] * [d(ZC) + d(CD)] = 1 * (4 + 9) = 13

r(D) = [1/(L-2)] * [d(ZD) + d(CD)] = 1 * (9 + 9) = 18</code></pre></figure>

<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">D(ZC) = d(ZC) - [r(Z) + r(C)] = 4 - (13 + 13) = -22

D(ZD) = d(ZD) - [r(Z) + r(D)] = 9 - (13 + 18) = -22

D(CD) = d(CD) - [r(C) + r(D)] = 9 - (13 + 18) = -22</code></pre></figure>

<p>All of the pairs are tied for lowest adjusted distance, so we’ll select <code class="language-plaintext highlighter-rouge">ZC</code> because it’s first in the list and define a new node <code class="language-plaintext highlighter-rouge">Y</code> that connects the neighbors.</p>

<p>Calculate the distances from the new parent node to its children:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZY) = [d(ZC) + r(Z) - r(C)]/2 = (4 + 13 - 13)/2 = 2

d(CY) = [d(ZC) + r(C) - r(Z)]/2 = (4 + 13 - 13)/2 = 2</code></pre></figure>

<p>Add the new branches to the tree:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example1_iteration2.png" alt="Example 1 tree second iteration" />
    <figcaption>Example 1 tree second iteration</figcaption>
</figure>

<p>Calculate any other new distances and construct the new distance matrix:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(YD) = [d(ZD) + d(CD) - d(ZC)]/2 = (9 + 9 - 4)/2 = 7</code></pre></figure>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>Y</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Y</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">7</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">7</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>

<h4 id="step-3-termination">Step 3: Termination</h4>

<p>The distance matrix now consists of only 2 nodes (<code class="language-plaintext highlighter-rouge">Y</code> and <code class="language-plaintext highlighter-rouge">D</code>), so we add the edge between them to finish the tree:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example1_termination.png" alt="Example 1 tree termination" />
    <figcaption>Example 1 tree termination</figcaption>
</figure>

<h4 id="summary">Summary</h4>

<p>And with that, we’ve built our first neighbor-joining tree! Here is the tree coming together in each step:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example1_tree_step-by-step.png" alt="Example 1 tree step-by-step" />
    <figcaption>Example 1 tree step-by-step</figcaption>
</figure>
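<p>For the curious, here is how the whole of Example 1 might look as a short Python program. Treat it as a sketch under a few assumptions of mine rather than a reference implementation: the matrix is a dict of dicts, internal node names are supplied by the caller, and ties are broken by taking the first pair in matrix order (which happens to match the choices made above).</p>

```python
from itertools import combinations


def neighbor_joining(d, new_names):
    """Return a list of (node, node, branch_length) edges.

    d is a dict-of-dicts distance matrix. new_names is an iterator of labels
    for internal nodes (a hypothetical convenience, e.g. iter("ZYXW")).
    """
    d = {i: dict(row) for i, row in d.items()}  # work on a copy
    edges = []
    while len(d) > 2:  # iteration step
        L = len(d)
        r = {i: sum(d[i][j] for j in d if j != i) / (L - 2) for i in d}
        D = {(i, j): d[i][j] - (r[i] + r[j]) for i, j in combinations(d, 2)}
        i, j = min(D, key=D.get)  # ties go to the first pair in matrix order
        k = next(new_names)
        edges.append((i, k, (d[i][j] + r[i] - r[j]) / 2))
        edges.append((j, k, (d[i][j] + r[j] - r[i]) / 2))
        # Build the smaller matrix: k replaces i and j, other distances carry over.
        rest = [m for m in d if m not in (i, j)]
        to_k = {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in rest}
        new_d = {k: {k: 0, **to_k}}
        for m in rest:
            new_d[m] = {k: to_k[m], **{n: d[m][n] for n in rest}}
        d = new_d
    a, b = d  # termination step: join the last two nodes
    edges.append((a, b, d[a][b]))
    return edges


example1 = {
    "A": {"A": 0, "B": 4, "C": 5, "D": 10},
    "B": {"A": 4, "B": 0, "C": 7, "D": 12},
    "C": {"A": 5, "B": 7, "C": 0, "D": 9},
    "D": {"A": 10, "B": 12, "C": 9, "D": 0},
}
tree = neighbor_joining(example1, iter("ZY"))
for edge in tree:
    print(edge)
```

<p>The edges it prints should match the branch lengths we calculated by hand: 1 and 3 for <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, 2 and 2 for <code class="language-plaintext highlighter-rouge">Z</code> and <code class="language-plaintext highlighter-rouge">C</code>, and 7 for the final edge between <code class="language-plaintext highlighter-rouge">Y</code> and <code class="language-plaintext highlighter-rouge">D</code>.</p>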

<h2 id="on-distance-matrices">On Distance Matrices</h2>

<p>Now, you may have noticed that to build the tree in Example 1, we didn’t actually need all of those formulas. In iteration 1, for example, we can figure out the distance from <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> to their parent just by noticing that <code class="language-plaintext highlighter-rouge">B</code> is always 2 units further from other nodes than <code class="language-plaintext highlighter-rouge">A</code>. Therefore, <code class="language-plaintext highlighter-rouge">d(BZ)</code> must equal <code class="language-plaintext highlighter-rouge">d(AZ) + 2</code>. If their combined distance from <code class="language-plaintext highlighter-rouge">Z</code> is 4, then the only possible branch lengths are 1 and 3.</p>

<p>So, why did we go through the trouble of neighbor-joining? And when do we actually need neighbor-joining?</p>

<h3 id="additive-matrices">Additive matrices</h3>

<p>The distance matrix that we used for example 1 is what’s called an <strong>additive</strong> matrix. Simply put, a matrix is additive if you are able to reproduce the starting matrix by adding together the branch lengths along the paths between nodes. To demonstrate this, let’s look back at example 1.</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example1_reconstruct.png" alt="Reconstruct the example 1 distance matrix from the tree" />
    <figcaption>Reconstruct the example 1 distance matrix from the tree</figcaption>
</figure>

<p>In the figure above, I’ve deconstructed the tree so that you can see the individual paths between each pair of leaf nodes. Notice that we can reconstruct the starting matrix exactly using only the distances on the tree, which is the main trait of an additive matrix (for a more technical and thorough look at additive matrices, <a href="http://people.cs.uchicago.edu/~ridg/digbio08/talkaddree.pdf">see this presentation</a>).</p>
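<p>You can also let the computer do the reconstruction. The Python sketch below takes the Example 1 branch lengths, adds them up along the path between each pair of leaves, and compares the result with the starting matrix. The edge-list representation and the helper name <code class="language-plaintext highlighter-rouge">path_length</code> are assumptions I made for illustration.</p>

```python
from itertools import combinations

# Branch lengths from the finished Example 1 tree.
edges = [("A", "Z", 1), ("B", "Z", 3), ("Z", "Y", 2), ("C", "Y", 2), ("Y", "D", 7)]

# Undirected adjacency list: node -> [(neighbor, branch_length), ...]
adj = {}
for a, b, w in edges:
    adj.setdefault(a, []).append((b, w))
    adj.setdefault(b, []).append((a, w))


def path_length(adj, start, goal, seen=None):
    """Sum of branch lengths along the unique tree path from start to goal."""
    if start == goal:
        return 0
    seen = (seen or set()) | {start}
    for nxt, w in adj[start]:
        if nxt not in seen:
            rest = path_length(adj, nxt, goal, seen)
            if rest is not None:
                return w + rest
    return None  # dead end on this branch


original = {
    ("A", "B"): 4, ("A", "C"): 5, ("A", "D"): 10,
    ("B", "C"): 7, ("B", "D"): 12, ("C", "D"): 9,
}
reconstructed = {(i, j): path_length(adj, i, j) for i, j in combinations("ABCD", 2)}
print(reconstructed == original)  # True
```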

<p>I like to use an additive matrix as the first neighbor-joining example because, 1) it gives me an excuse to discuss additive matrices, and 2) it’s very easy to check your work. If you are unable to reconstruct the starting matrix in example 1 using the tree, you know you have a problem in your calculations, which is harder to catch with non-additive matrices.</p>

<p>Alright, so if we don’t need neighbor-joining for additive distance matrices, then when do we need it? Neighbor-joining is said to work best for near-additive matrices, i.e., matrices for which the tree <em>almost</em> reconstructs the starting matrix, though neighbor-joining trees have been reported to be <a href="https://doi.org/10.1007/s00453-007-9116-4">topologically accurate</a> even when this is not the case. And I should note here that the vast majority of distance matrices based on biological data are <a href="https://doi.org/10.1016/j.tcs.2008.12.040">not additive or even nearly additive</a>.</p>

<p>Without further ado, here is another example using a nearly additive matrix.</p>

<h2 id="example-2">Example 2</h2>

<p>Here is our starting matrix:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>A</strong></th>
      <th style="text-align: center"><strong>B</strong></th>
      <th style="text-align: center"><strong>C</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>A</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">2</td>
    </tr>
    <tr>
      <td><strong>B</strong></td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">3</td>
      <td style="text-align: center">2</td>
    </tr>
    <tr>
      <td><strong>C</strong></td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">3</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">2</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>

<h3 id="step-1-initiation-1">Step 1: Initiation</h3>

<p>Again, we define <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">L</code>. They are the same as in example 1.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">T = { A, B, C, D }

L = 4</code></pre></figure>

<h3 id="step-2-iteration-1">Step 2: Iteration</h3>

<h4 id="iteration-1-1">Iteration 1</h4>

<p>First, we calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">r(A) = [1/(L-2)] * [d(AB) + d(AC) + d(AD)] = (1/2) * (2 + 2 + 2) = 3

r(B) = [1/(L-2)] * [d(AB) + d(BC) + d(BD)] = (1/2) * (2 + 3 + 2) = 3.5

r(C) = [1/(L-2)] * [d(AC) + d(BC) + d(CD)] = (1/2) * (2 + 3 + 2) = 3.5

r(D) = [1/(L-2)] * [d(AD) + d(BD) + d(CD)] = (1/2) * (2 + 2 + 2) = 3</code></pre></figure>
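
<p>If you would rather check these sums in code than by hand, here is a small Python sketch of the same calculation. Storing the matrix as a dictionary keyed by node pairs and the name <code class="language-plaintext highlighter-rouge">net_divergence</code> are my own choices for illustration.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"># Example 2 starting matrix, keyed by unordered node pair.
d = {
    frozenset("AB"): 2, frozenset("AC"): 2, frozenset("AD"): 2,
    frozenset("BC"): 3, frozenset("BD"): 2, frozenset("CD"): 2,
}
nodes = ["A", "B", "C", "D"]


def net_divergence(nodes, d):
    """r(i) = [1/(L-2)] * sum of d(i, k) over all other nodes k."""
    L = len(nodes)
    return {
        i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (L - 2)
        for i in nodes
    }


r = net_divergence(nodes, d)
print(r)</code></pre></figure>

<p>The printed values should match the four results above: 3 for <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">D</code>, and 3.5 for <code class="language-plaintext highlighter-rouge">B</code> and <code class="language-plaintext highlighter-rouge">C</code>.</p>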

<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">D(AB) = d(AB) - [r(A) + r(B)] = 2 - (3 + 3.5) = -4.5

D(AC) = d(AC) - [r(A) + r(C)] = 2 - (3 + 3.5) = -4.5

D(AD) = d(AD) - [r(A) + r(D)] = 2 - (3 + 3) = -4

D(BC) = d(BC) - [r(B) + r(C)] = 3 - (3.5 + 3.5) = -4

D(BD) = d(BD) - [r(B) + r(D)] = 2 - (3.5 + 3) = -4.5

D(CD) = d(CD) - [r(C) + r(D)] = 2 - (3.5 + 3) = -4.5</code></pre></figure>

<p>A lot of ties here. Again, we’ll pick the tied pair that is closest to the top of the list, <code class="language-plaintext highlighter-rouge">AB</code>, and assign them a parent node, <code class="language-plaintext highlighter-rouge">Z</code>.</p>
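
<p>The adjusted distances and the “first tied pair wins” rule can be sketched in Python as follows. This is only an illustration of the hand calculation, with variable names of my own choosing; it relies on <code class="language-plaintext highlighter-rouge">itertools.combinations</code> producing the pairs in the same order as the list above.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from itertools import combinations

# Values from iteration 1 of Example 2.
d = {("A", "B"): 2, ("A", "C"): 2, ("A", "D"): 2,
     ("B", "C"): 3, ("B", "D"): 2, ("C", "D"): 2}
r = {"A": 3.0, "B": 3.5, "C": 3.5, "D": 3.0}
nodes = ["A", "B", "C", "D"]

# combinations() yields the pairs in list order: AB, AC, AD, BC, BD, CD.
pairs = list(combinations(nodes, 2))
D = {(i, j): d[(i, j)] - (r[i] + r[j]) for i, j in pairs}

# min() returns the first minimal item, so ties go to the earliest pair.
best = min(pairs, key=lambda pair: D[pair])
print(D)
print(best)</code></pre></figure>

<p>Four pairs tie at -4.5, and <code class="language-plaintext highlighter-rouge">AB</code> is selected because it comes first.</p>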

<p>Now, calculate the distance from each neighbor (child) node to the connecting (parent) node.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(AZ) = [d(AB) + r(A) - r(B)]/2 = (2 + 3 - 3.5)/2 = 0.75

d(BZ) = [d(AB) + r(B) - r(A)]/2 = (2 + 3.5 - 3)/2 = 1.25</code></pre></figure>

<p>And draw the first two branches on our tree:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example2_iteration1.png" alt="Example 2 tree first iteration" />
    <figcaption>Example 2 tree first iteration</figcaption>
</figure>
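
<p>The two branch lengths drawn above come from a pair of symmetric formulas, which a short Python sketch makes easy to reuse. The function name <code class="language-plaintext highlighter-rouge">branch_lengths</code> is just an illustrative choice.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">def branch_lengths(d_ij, r_i, r_j):
    """Distances from neighbors i and j to their new parent node."""
    d_i_parent = (d_ij + r_i - r_j) / 2
    d_j_parent = (d_ij + r_j - r_i) / 2
    return d_i_parent, d_j_parent


# Iteration 1 of Example 2: neighbors A and B, parent Z.
d_az, d_bz = branch_lengths(d_ij=2, r_i=3.0, r_j=3.5)
print(d_az, d_bz)</code></pre></figure>

<p>Note that the two results always add up to the original distance between the neighbors (here, 0.75 + 1.25 = 2), which is a handy sanity check.</p>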

<p>Lastly, we calculate new distances and reconstruct the distance matrix:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZC) = [d(AC) + d(BC) - d(AB)]/2 = (2 + 3 - 2)/2 = 1.5

d(ZD) = [d(AD) + d(BD) - d(AB)]/2 = (2 + 2 - 2)/2 = 1</code></pre></figure>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>Z</strong></th>
      <th style="text-align: center"><strong>C</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Z</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">1.5</td>
      <td style="text-align: center">1</td>
    </tr>
    <tr>
      <td><strong>C</strong></td>
      <td style="text-align: center">1.5</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">2</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>
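
<p>The reduced matrix above can also be sketched in Python. Only the distances that involve the new node <code class="language-plaintext highlighter-rouge">Z</code> need to be calculated; everything else carries over. The helper name <code class="language-plaintext highlighter-rouge">distance_to_parent</code> is hypothetical.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">def distance_to_parent(d_ik, d_jk, d_ij):
    """d(parent, k) = [d(i, k) + d(j, k) - d(i, j)] / 2."""
    return (d_ik + d_jk - d_ij) / 2


# Iteration 1 of Example 2: A and B were joined under Z.
d_ab = 2
d_zc = distance_to_parent(d_ik=2, d_jk=3, d_ij=d_ab)  # uses d(AC), d(BC)
d_zd = distance_to_parent(d_ik=2, d_jk=2, d_ij=d_ab)  # uses d(AD), d(BD)

# C and D were not joined, so d(CD) carries over unchanged.
reduced = {("Z", "C"): d_zc, ("Z", "D"): d_zd, ("C", "D"): 2}
print(reduced)</code></pre></figure>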

<p>On to the next iteration!</p>

<h4 id="iteration-2-1">Iteration 2</h4>

<p>For this iteration, we use the latest version of the distance matrix, constructed at the end of the previous iteration, and reset <code class="language-plaintext highlighter-rouge">L</code>.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">L = 3</code></pre></figure>

<p>Calculate the net divergence <code class="language-plaintext highlighter-rouge">r</code> of each node:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">r(Z) = [1/(L-2)] * [d(ZC) + d(ZD)] = 1 * (1.5 + 1) = 2.5

r(C) = [1/(L-2)] * [d(ZC) + d(CD)] = 1 * (1.5 + 2) = 3.5

r(D) = [1/(L-2)] * [d(ZD) + d(CD)] = 1 * (1 + 2) = 3</code></pre></figure>

<p>Next, the adjusted distance <code class="language-plaintext highlighter-rouge">D</code> for each node pair:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">D(ZC) = d(ZC) - [r(Z) + r(C)] = 1.5 - (2.5 + 3.5) = -4.5

D(ZD) = d(ZD) - [r(Z) + r(D)] = 1 - (2.5 + 3) = -4.5

D(CD) = d(CD) - [r(C) + r(D)] = 2 - (3.5 + 3) = -4.5</code></pre></figure>

<p>All of the pairs are tied for lowest adjusted distance, so we’ll select <code class="language-plaintext highlighter-rouge">ZC</code> because it’s first in the list and define a new node <code class="language-plaintext highlighter-rouge">Y</code> that connects the neighbors.</p>

<p>Calculate the distances from the new parent node to its children:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(ZY) = [d(ZC) + r(Z) - r(C)]/2 = (1.5 + 2.5 - 3.5)/2 = 0.25

d(CY) = [d(ZC) + r(C) - r(Z)]/2 = (1.5 + 3.5 - 2.5)/2 = 1.25</code></pre></figure>

<p>Add the new branches to the tree:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example2_iteration2.png" alt="Example 2 tree second iteration" />
    <figcaption>Example 2 tree second iteration</figcaption>
</figure>

<p>Calculate any other new distances and construct the new distance matrix:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">d(YD) = [d(ZD) + d(CD) - d(ZC)]/2 = (1 + 2 - 1.5)/2 = 0.75</code></pre></figure>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>Y</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Y</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">0.75</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">0.75</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>

<h4 id="step-3-termination-1">Step 3: Termination</h4>

<p><code class="language-plaintext highlighter-rouge">L</code> is now 2 (only <code class="language-plaintext highlighter-rouge">Y</code> and <code class="language-plaintext highlighter-rouge">D</code> remain), so we add the edge between them to finish the tree:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example2_termination.png" alt="Example 2 tree termination" />
    <figcaption>Example 2 tree termination</figcaption>
</figure>
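
<p>Putting the three steps together, the whole procedure for this example can be sketched as one short Python function. This is a minimal sketch rather than a reference implementation: it assumes ties are broken by list order (as we did by hand), that each new parent node is placed at the front of the node list, and that the parent names are supplied by the caller. All function and variable names are my own.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from itertools import combinations


def neighbor_joining(nodes, d, parent_names):
    """Return a list of (node, node, length) edges.

    `d` maps frozenset({i, j}) to a distance. Ties are broken by
    taking the first pair in list order, as in the worked examples.
    Assumes at least two starting nodes.
    """
    nodes = list(nodes)
    d = dict(d)
    parent_names = iter(parent_names)
    edges = []
    while len(nodes) != 2:
        L = len(nodes)
        # Net divergence of every node still in the list.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (L - 2)
             for i in nodes}
        # Pair with the lowest adjusted distance (first one wins a tie).
        pairs = list(combinations(nodes, 2))
        i, j = min(pairs,
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]))
        d_ij = d[frozenset((i, j))]
        parent = next(parent_names)
        # Branches from the two neighbors to their new parent.
        edges.append((i, parent, (d_ij + r[i] - r[j]) / 2))
        edges.append((j, parent, (d_ij + r[j] - r[i]) / 2))
        # Distances from the new parent to every remaining node.
        rest = [k for k in nodes if k not in (i, j)]
        for k in rest:
            d[frozenset((parent, k))] = (
                d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        nodes = [parent] + rest  # the new node goes to the front of the list
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))  # termination: final edge
    return edges


start = {frozenset("AB"): 2, frozenset("AC"): 2, frozenset("AD"): 2,
         frozenset("BC"): 3, frozenset("BD"): 2, frozenset("CD"): 2}
for edge in neighbor_joining("ABCD", start, parent_names="ZY"):
    print(edge)</code></pre></figure>

<p>For the Example 2 matrix, the five printed edges should have the same branch lengths as the tree above: 0.75, 1.25, 0.25, 1.25, and 0.75.</p>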

<h4 id="summary-1">Summary</h4>

<p>Here is our completed second tree:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//nj_trees/Example2_step-by-step.png" alt="Example 2 tree step-by-step" />
    <figcaption>Example 2 tree step-by-step</figcaption>
</figure>

<p>Lastly, let’s make a distance matrix using the tree to provide the distances. Notice that these distances are just a little bit off from the starting matrix. Hence, “near-additive”.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center"><strong>A</strong></th>
      <th style="text-align: center"><strong>B</strong></th>
      <th style="text-align: center"><strong>C</strong></th>
      <th style="text-align: center"><strong>D</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>A</strong></td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">2.25</td>
      <td style="text-align: center">1.75</td>
    </tr>
    <tr>
      <td><strong>B</strong></td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">2.75</td>
      <td style="text-align: center">2.25</td>
    </tr>
    <tr>
      <td><strong>C</strong></td>
      <td style="text-align: center">2.25</td>
      <td style="text-align: center">2.75</td>
      <td style="text-align: center">0</td>
      <td style="text-align: center">2</td>
    </tr>
    <tr>
      <td><strong>D</strong></td>
      <td style="text-align: center">1.75</td>
      <td style="text-align: center">2.25</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">0</td>
    </tr>
  </tbody>
</table>
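
<p>Rebuilding the matrix from the tree is just a matter of summing branch lengths along the path between each pair of leaves. Here is a minimal Python sketch of that bookkeeping, using the branch lengths from the finished tree; the adjacency-map layout and the <code class="language-plaintext highlighter-rouge">path_length</code> helper are my own illustrative choices.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">from itertools import combinations

# Branch lengths from the finished Example 2 tree.
edges = [("A", "Z", 0.75), ("B", "Z", 1.25), ("Z", "Y", 0.25),
         ("C", "Y", 1.25), ("Y", "D", 0.75)]

# Build an undirected adjacency map: node to {neighbor: length}.
tree = {}
for u, v, length in edges:
    tree.setdefault(u, {})[v] = length
    tree.setdefault(v, {})[u] = length


def path_length(tree, start, goal, came_from=None):
    """Sum branch lengths along the unique path from start to goal."""
    if start == goal:
        return 0.0
    for neighbor, length in tree[start].items():
        if neighbor == came_from:
            continue  # do not walk back the way we came
        rest = path_length(tree, neighbor, goal, came_from=start)
        if rest is not None:
            return length + rest
    return None  # goal is not reachable through this branch


rebuilt = {(i, j): path_length(tree, i, j)
           for i, j in combinations("ABCD", 2)}
print(rebuilt)</code></pre></figure>

<p>The results match the table above, and no entry differs from the starting matrix by more than 0.25.</p>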

<h2 id="wrapping-up">Wrapping up</h2>

<p>Having reached the end of this lesson, you should have learned how to construct neighbor-joining trees by hand from additive and nearly additive matrices. If you want to take a closer look at the examples (and access one additional example), you can check out <a href="/assets/data/posts/nj_trees/neighbor-joining_examples_spreadsheet.xlsx">this Excel file</a>.</p>]]></content><author><name>Amelia Harrision</name></author><category term="blog" /><summary type="html"><![CDATA[Bioinformatics algorithms can be intimidating, but many are much simpler than you think. In this post, we show how to calculate neighbor-joining trees "by hand" without any computational assistance beyond a spreadsheet program.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/nj_trees/Example1_reconstruct.png" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/nj_trees/Example1_reconstruct.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Generating Python bindings for OCaml with pyml_bindgen</title><link href="https://www.tenderisthebyte.com/blog/2022/04/12/ocaml-python-bindgen/" rel="alternate" type="text/html" title="Generating Python bindings for OCaml with pyml_bindgen" /><published>2022-04-12T00:00:00+00:00</published><updated>2022-04-12T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2022/04/12/ocaml-python-bindgen</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2022/04/12/ocaml-python-bindgen/"><![CDATA[<p><code class="language-plaintext highlighter-rouge">pyml_bindgen</code> is a command-line app that generates Python bindings via <a href="https://github.com/thierry-martinez/pyml">pyml</a> directly from OCaml value specifications.  
While you could write <code class="language-plaintext highlighter-rouge">pyml</code> bindings by hand, it can get repetitive, especially if you are binding a decent-sized Python library.</p>

<p>In this post, I will introduce <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> and go through a couple of common tasks.</p>

<div class="post-toc">

  <h4 class="post-toc--header" id="contents">Contents</h4>

  <ul>
    <li><a href="#install">Install</a></li>
    <li><a href="#a-simple-example">A simple example</a></li>
    <li><a href="#controlling-the-bindings">Controlling the bindings</a></li>
    <li><a href="#binding-cyclic-python-classes">Binding cyclic Python classes</a></li>
    <li><a href="#other-stuff">Other stuff</a></li>
    <li><a href="#wrap-up">Wrap-up</a></li>
  </ul>

</div>

<h2 id="install">Install</h2>

<p>To get started with <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, you will need to install it.  It is available on <a href="https://opam.ocaml.org/packages/pyml_bindgen/">opam</a> (<code class="language-plaintext highlighter-rouge">opam install pyml_bindgen</code>).</p>

<h2 id="a-simple-example">A simple example</h2>

<p>Let’s start with a simple example.</p>

<h3 id="python-code">Python code</h3>

<p>Here is the Python class that we want to bind (<code class="language-plaintext highlighter-rouge">hobbit.py</code>).</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Hobbit</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>

    <span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">'Hobbit -- </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">age</span><span class="si">}</span><span class="s">'</span></code></pre></figure>

<p>As you can see, it’s pretty simple! It has just an <code class="language-plaintext highlighter-rouge">__init__</code> method to create an instance and a <code class="language-plaintext highlighter-rouge">__str__</code> method for converting it to a string with the Python <code class="language-plaintext highlighter-rouge">str</code> or <code class="language-plaintext highlighter-rouge">print</code> functions.</p>

<p>Here’s an example of using it in Python.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">hobbit</span> <span class="kn">import</span> <span class="n">Hobbit</span>
<span class="n">bilbo</span> <span class="o">=</span> <span class="n">Hobbit</span><span class="p">(</span><span class="s">'Bilbo'</span><span class="p">,</span> <span class="mi">111</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">bilbo</span><span class="p">)</span>
<span class="c1">#=&gt; Hobbit -- Bilbo, 111</span></code></pre></figure>

<h3 id="write-value-specifications">Write value specifications</h3>

<p>To bind Python classes with <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, you first need to write value specifications to define the OCaml interface for the Python code we are binding.</p>

<p>To start, we will keep the function and argument names the same.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">__init__</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-&gt;</span> <span class="n">age</span><span class="o">:</span><span class="kt">int</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="n">t</span>
<span class="k">val</span> <span class="n">__str__</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="kt">string</span>
<span class="k">val</span> <span class="n">name</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">string</span>
<span class="k">val</span> <span class="n">age</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">int</span></code></pre></figure>

<p>There are a few things to call your attention to here:</p>

<ul>
  <li>I haven’t defined <code class="language-plaintext highlighter-rouge">type t</code> anywhere yet. Depending on the command line arguments you pass to <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, it will take care of this for you.</li>
  <li>For the <code class="language-plaintext highlighter-rouge">__init__</code> function, I have used all named arguments plus the <code class="language-plaintext highlighter-rouge">unit</code> argument.  The <code class="language-plaintext highlighter-rouge">unit</code> argument tells <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> that you are binding a normal Python method or function call (as opposed to a Python attribute or property).</li>
  <li>The <code class="language-plaintext highlighter-rouge">__str__</code> function takes <code class="language-plaintext highlighter-rouge">t</code> as the first argument.  Value specifications that start with <code class="language-plaintext highlighter-rouge">t</code> will bind to object method calls on the Python side.</li>
  <li><code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">age</code> both take <code class="language-plaintext highlighter-rouge">t</code> as the first and only argument.  If a value specification takes <code class="language-plaintext highlighter-rouge">t</code> and nothing else, it binds to the Python attribute of that name.</li>
</ul>
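
<p>To make those rules concrete, here is roughly what each value specification corresponds to on the Python side. This is only a sketch of the mapping, not the code that <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> generates; the <code class="language-plaintext highlighter-rouge">Hobbit</code> class is repeated so that the snippet runs on its own.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f'Hobbit -- {self.name}, {self.age}'


# val __init__ : name:string -&gt; age:int -&gt; unit -&gt; t
bilbo = Hobbit(name='Bilbo', age=111)  # labeled arguments become keywords

# val __str__ : t -&gt; unit -&gt; string
as_string = bilbo.__str__()  # method call on the object passed as t

# val name : t -&gt; string
# val age : t -&gt; int
name, age = bilbo.name, bilbo.age  # plain attribute lookups</code></pre></figure>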

<p>Save the above in a file called <code class="language-plaintext highlighter-rouge">hobbit.txt</code>.</p>

<h3 id="generate-bindings">Generate bindings</h3>

<p>Now, we’re ready to generate the OCaml bindings.</p>

<p>Here’s how you would run <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> for this example.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>pyml_bindgen hobbit.txt hobbit Hobbit <span class="se">\</span>
  <span class="nt">--of-pyo-ret-type</span> no_check <span class="se">\</span>
  <span class="o">&gt;</span> hobbit.ml</code></pre></figure>

<p>Let’s unpack that.</p>

<ul>
  <li>The first three arguments are the path to the OCaml value specifications, the name of the Python module we are binding, and the Python class name.
    <ul>
      <li>Since we named the Python file <code class="language-plaintext highlighter-rouge">hobbit.py</code>, its module name is <code class="language-plaintext highlighter-rouge">hobbit</code>.</li>
      <li>Depending on the directory structure you’re using, this may change.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">--of-pyo-ret-type</code> specifies the return type for functions that generate Python objects.
    <ul>
      <li>Using <code class="language-plaintext highlighter-rouge">no_check</code> means the generated functions will assume the Python object is the correct type.</li>
      <li>You can also use <code class="language-plaintext highlighter-rouge">option</code> or <code class="language-plaintext highlighter-rouge">or_error</code>.</li>
    </ul>
  </li>
  <li>The output is redirected to a file called <code class="language-plaintext highlighter-rouge">hobbit.ml</code>.  Thus, our generated code will be in a module called <code class="language-plaintext highlighter-rouge">Hobbit</code>.</li>
  <li>We did not tell <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> that it should generate a full module with a signature, so it will just write the implementation.
    <ul>
      <li>In this example it is fine, but you will often want to generate the module and signature, so that your types will be abstract.</li>
      <li>For example, you could use <code class="language-plaintext highlighter-rouge">--caml-module Hobbit --split-caml-module .</code> to generate both an <code class="language-plaintext highlighter-rouge">ml</code> and <code class="language-plaintext highlighter-rouge">mli</code> file.</li>
    </ul>
  </li>
  <li>If you look at the generated code, it will be kind of messy.  I usually run it through <code class="language-plaintext highlighter-rouge">ocamlformat</code> if I need to edit the output or check the generated code into version control.</li>
</ul>

<h3 id="test-it-out">Test it out</h3>

<p>Now we can make a program to test it out.  Don’t forget to call <a href="https://github.com/thierry-martinez/pyml#getting-started">initialize</a> before running the rest of your code!</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="n">initialize</span> <span class="bp">()</span>

<span class="k">let</span> <span class="n">bilbo</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__init__</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"Bilbo"</span> <span class="o">~</span><span class="n">age</span><span class="o">:</span><span class="mi">111</span> <span class="bp">()</span>

<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
  <span class="k">assert</span> <span class="p">(</span><span class="s2">"Hobbit -- Bilbo, 111"</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__str__</span> <span class="n">bilbo</span> <span class="bp">()</span><span class="p">);</span>
  <span class="k">assert</span> <span class="p">(</span><span class="s2">"Bilbo"</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">name</span> <span class="n">bilbo</span><span class="p">);</span>
  <span class="k">assert</span> <span class="p">(</span><span class="mi">111</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">age</span> <span class="n">bilbo</span><span class="p">)</span></code></pre></figure>

<p>Since we didn’t generate a signature to go with our implementation, the type of the value returned by <code class="language-plaintext highlighter-rouge">Hobbit.__init__</code> will be <code class="language-plaintext highlighter-rouge">Pytypes.pyobject</code>.  In this way, we can pass any <code class="language-plaintext highlighter-rouge">pyobject</code> to the <code class="language-plaintext highlighter-rouge">Hobbit.__str__</code> function.  Let’s see.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Int</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">1234</span>

<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="n">print_endline</span> <span class="o">@@</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__str__</span> <span class="n">x</span> <span class="bp">()</span></code></pre></figure>

<p>If you run that, it will print <code class="language-plaintext highlighter-rouge">1234</code>.  Huh?  Well, if you look at the generated code for the <code class="language-plaintext highlighter-rouge">Hobbit.__str__</code> function, it looks something like this:</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">__str__</span> <span class="n">t</span> <span class="bp">()</span> <span class="o">=</span>
  <span class="k">let</span> <span class="n">callable</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Object</span><span class="p">.</span><span class="n">find_attr_string</span> <span class="n">t</span> <span class="s2">"__str__"</span> <span class="k">in</span>
  <span class="k">let</span> <span class="n">kwargs</span> <span class="o">=</span> <span class="n">filter_opt</span> <span class="bp">[]</span> <span class="k">in</span>
  <span class="nn">Py</span><span class="p">.</span><span class="nn">String</span><span class="p">.</span><span class="n">to_string</span>
  <span class="o">@@</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Callable</span><span class="p">.</span><span class="n">to_function_with_keywords</span> <span class="n">callable</span> <span class="p">[</span><span class="o">||</span><span class="p">]</span> <span class="n">kwargs</span></code></pre></figure>

<p>Without going into too much detail, essentially all it is doing is calling the <code class="language-plaintext highlighter-rouge">__str__</code> method on the Python object passed in.  While this is fine on the Python side, it doesn’t work the way we might want it to on the OCaml side.  Ideally, we only want the <code class="language-plaintext highlighter-rouge">Hobbit</code> module functions to work on values of type <code class="language-plaintext highlighter-rouge">Hobbit.t</code>.</p>
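
<p>You can see the same behavior from the Python side. The sketch below mirrors what the generated OCaml does (look up <code class="language-plaintext highlighter-rouge">__str__</code> on whatever object comes in, then call it); the function name <code class="language-plaintext highlighter-rouge">generated_str</code> is made up for illustration.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">def generated_str(obj):
    # Mirrors the generated OCaml: look up "__str__" on whatever object
    # comes in, then call it with no arguments.
    method = getattr(obj, '__str__')
    return method()


print(generated_str(1234))</code></pre></figure>

<p>Nothing here checks that <code class="language-plaintext highlighter-rouge">obj</code> is a <code class="language-plaintext highlighter-rouge">Hobbit</code>, so an integer works just as well and prints <code class="language-plaintext highlighter-rouge">1234</code>.</p>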

<h3 id="generating-abstract-types">Generating abstract types</h3>

<p>If we were writing the bindings by hand, we would make <code class="language-plaintext highlighter-rouge">Hobbit.t</code> abstract.  With <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, we can do that using the <code class="language-plaintext highlighter-rouge">--caml-module</code> option.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>pyml_bindgen hobbit.txt hobbit Hobbit <span class="se">\</span>
  <span class="nt">--of-pyo-ret-type</span> no_check <span class="se">\</span>
  <span class="nt">--caml-module</span> Hobbit <span class="se">\</span>
  <span class="nt">--split-caml-module</span> <span class="nb">.</span> <span class="se">\</span>
  <span class="o">&gt;</span> hobbit.ml</code></pre></figure>

<p>Notice that I also used <code class="language-plaintext highlighter-rouge">--split-caml-module .</code>, which tells <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> to split the implementation and signature into separate <code class="language-plaintext highlighter-rouge">ml</code> and <code class="language-plaintext highlighter-rouge">mli</code> files, and to put the output in the directory in which the command is run.  You can pass in whatever directory you want to this option.</p>

<p>Now if we tried something like this:</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="nn">Int</span><span class="p">.</span><span class="n">of_int</span> <span class="mi">1234</span>

<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="n">print_endline</span> <span class="o">@@</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">__str__</span> <span class="n">x</span> <span class="bp">()</span></code></pre></figure>

<p>It would be a compile-time error.</p>

<h2 id="controlling-the-bindings">Controlling the bindings</h2>

<p>Let’s clean up this example a little bit.</p>

<h3 id="using-different-function-names">Using different function names</h3>

<p>While <code class="language-plaintext highlighter-rouge">__init__</code> and <code class="language-plaintext highlighter-rouge">__str__</code> are fine for OCaml function names, they don’t feel quite right.  <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> lets you bind Python functions to different names on the OCaml side using <a href="https://ocaml.org/manual/attributes.html">attributes</a> on the value specifications.  To bind to a different function name, we use the <code class="language-plaintext highlighter-rouge">py_fun_name</code> attribute.  Check it out.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">create</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-&gt;</span> <span class="n">age</span><span class="o">:</span><span class="kt">int</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="n">t</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__init__</span><span class="p">]</span>

<span class="k">val</span> <span class="n">to_string</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="kt">string</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__str__</span><span class="p">]</span></code></pre></figure>

<p>We bind the <code class="language-plaintext highlighter-rouge">__init__</code> function to an OCaml function called <code class="language-plaintext highlighter-rouge">create</code>, and the Python function <code class="language-plaintext highlighter-rouge">__str__</code> to the OCaml function <code class="language-plaintext highlighter-rouge">to_string</code>.  That’s much more natural!</p>

<p>As you can see, the syntax is like this: <code class="language-plaintext highlighter-rouge">[@@attr-id attr-payload]</code>.  In this case, the attribute id is <code class="language-plaintext highlighter-rouge">py_fun_name</code> and the payload is the name of the Python function that we want to bind.  Put another way, the attribute payload should be the name of the function as it is defined in the Python library you are binding to (i.e., <code class="language-plaintext highlighter-rouge">__init__</code> is the name of the function on the Python side, not <code class="language-plaintext highlighter-rouge">create</code>).</p>

<p>Putting it together, you get <code class="language-plaintext highlighter-rouge">[@@py_fun_name __init__]</code> for the Python <code class="language-plaintext highlighter-rouge">__init__</code> function and <code class="language-plaintext highlighter-rouge">[@@py_fun_name __str__]</code> for the Python <code class="language-plaintext highlighter-rouge">__str__</code> function.</p>

<h3 id="using-different-argument-names">Using different argument names</h3>

<p>The other available attribute is <code class="language-plaintext highlighter-rouge">py_arg_name</code>.  With this, we can bind arguments to different names on the OCaml and Python sides.  This can be useful in situations in which Python argument names use reserved OCaml keywords, or simply to make the generated API feel more natural for use in OCaml.</p>

<p>For example, you may have a Python function that has an argument named <code class="language-plaintext highlighter-rouge">method</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">cluster</span><span class="p">(</span><span class="n">method</span><span class="o">=</span><span class="s">'ward'</span><span class="p">):</span>
    <span class="p">...</span></code></pre></figure>

<p>Since <code class="language-plaintext highlighter-rouge">method</code> is a <a href="https://ocaml.org/manual/lex.html#sss:keywords">reserved keyword</a> in OCaml, we can’t use it directly.  Instead, we want to name it <code class="language-plaintext highlighter-rouge">method_</code> in our OCaml code.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">cluster</span> <span class="o">:</span> <span class="n">method_</span><span class="o">:</span><span class="kt">string</span> <span class="o">-&gt;</span> <span class="o">...</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_arg_name</span> <span class="n">method_</span> <span class="n">method</span><span class="p">]</span></code></pre></figure>

<p>In this case, the payload is two items: the first is the argument name on the OCaml side, and the second is the argument name on the Python side.</p>

<p>Note that in cases in which you need <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples/attributes#multiple-attributes">multiple attributes</a> per specification, they must be placed one per line.  (This is a <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> specific restriction.)  E.g., something like this:</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">run_clustering</span> <span class="o">:</span> <span class="n">method_</span><span class="o">:</span><span class="kt">string</span> <span class="o">-&gt;</span> <span class="o">...</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">cluster</span><span class="p">]</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_arg_name</span> <span class="n">method_</span> <span class="n">method</span><span class="p">]</span></code></pre></figure>

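<p>This will bind the OCaml function <code class="language-plaintext highlighter-rouge">run_clustering</code> to the corresponding Python function <code class="language-plaintext highlighter-rouge">cluster</code>, and the OCaml argument <code class="language-plaintext highlighter-rouge">method_</code> to the Python argument <code class="language-plaintext highlighter-rouge">method</code>.</p>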
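
<p>To picture what happens on the Python side, here is a rough sketch.  The body of <code class="language-plaintext highlighter-rouge">cluster</code> is a made-up stand-in (the real one is elided above), and I’m assuming that the generated binding passes labeled OCaml arguments as Python keyword arguments.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"># Stand-in for the library function; the return value is hypothetical.
def cluster(method='ward'):
    return f'clustering with method={method}'

# An OCaml call such as `run_clustering ~method_:"average" ...` would then
# correspond roughly to this Python call (assuming keyword-argument passing).
result = cluster(method='average')
print(result)</code></pre></figure>

<p>The point is that the Python library only ever sees its own names, <code class="language-plaintext highlighter-rouge">cluster</code> and <code class="language-plaintext highlighter-rouge">method</code>; the renaming lives entirely on the OCaml side.</p>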

<h2 id="binding-cyclic-python-classes">Binding cyclic Python classes</h2>

<p>Often you will need to bind Python classes that refer to each other.  One way to bind these is to use <a href="https://ocaml.org/manual/recursivemodules.html">recursive modules</a>.  Let’s update our Hobbit example to show how you can do this in <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Hobbit</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">house</span> <span class="o">=</span> <span class="bp">None</span>

    <span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">'Hobbit -- </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, age: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">age</span><span class="si">}</span><span class="s">, house: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">house</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">'</span>

    <span class="k">def</span> <span class="nf">buy_house</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">house</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">house</span> <span class="o">=</span> <span class="n">house</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">house</span><span class="p">.</span><span class="n">owner</span> <span class="o">=</span> <span class="bp">self</span>

<span class="k">class</span> <span class="nc">House</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">owner</span> <span class="o">=</span> <span class="bp">None</span>

    <span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">'House -- </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">, owner: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">owner</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">'</span></code></pre></figure>

<p>So this is a pretty silly example, but it’s just to illustrate the point.  In this case, a <code class="language-plaintext highlighter-rouge">Hobbit</code> can own a <code class="language-plaintext highlighter-rouge">House</code> and a <code class="language-plaintext highlighter-rouge">House</code> can have a <code class="language-plaintext highlighter-rouge">Hobbit</code> for an owner.</p>
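
<p>To make the cycle concrete, here is a minimal plain-Python sketch of the interaction we will want to reproduce from OCaml.  It repeats the two classes from above so that it runs on its own; the variable names <code class="language-plaintext highlighter-rouge">bilbo</code> and <code class="language-plaintext highlighter-rouge">bag_end</code> are only there for illustration.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">class Hobbit:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.house = None

    def __str__(self):
        return f'Hobbit -- {self.name}, age: {self.age}, house: {self.house.name}'

    def buy_house(self, house):
        # After this call, each object holds a reference to the other.
        self.house = house
        self.house.owner = self


class House:
    def __init__(self, name):
        self.name = name
        self.owner = None

    def __str__(self):
        return f'House -- {self.name}, owner: {self.owner.name}'


bilbo = Hobbit('Bilbo', 111)
bag_end = House('Bag End')
bilbo.buy_house(bag_end)

# The references now form a cycle: bilbo -> bag_end -> bilbo.
assert bilbo.house is bag_end
assert bag_end.owner is bilbo

print(bilbo)    # Hobbit -- Bilbo, age: 111, house: Bag End
print(bag_end)  # House -- Bag End, owner: Bilbo</code></pre></figure>

<p>Because <code class="language-plaintext highlighter-rouge">Hobbit</code> mentions <code class="language-plaintext highlighter-rouge">House</code> and vice versa, neither OCaml module can simply be defined before the other, which is why recursive modules come in handy.</p>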

<p>To bind these classes, I will use the <code class="language-plaintext highlighter-rouge">gen_multi</code> and <code class="language-plaintext highlighter-rouge">combine_rec_modules</code> helper programs that come with <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>.</p>

<h3 id="gen_multi">gen_multi</h3>

<p><code class="language-plaintext highlighter-rouge">gen_multi</code> is a wrapper script that runs <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> multiple times to generate multiple OCaml modules in one go.  It takes a TSV (tab-separated values) file specifying the same set of options that you would pass in to <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> if you used it directly.</p>

<p>Assume this is in a file called <code class="language-plaintext highlighter-rouge">gen_multi_cli.tsv</code>.</p>

<table class="scroll">
  <thead>
    <tr>
      <th>signatures</th>
      <th>py_module</th>
      <th>py_class</th>
      <th>associated_with</th>
      <th>caml_module</th>
      <th>split_caml_module</th>
      <th>embed_python_source</th>
      <th>of_pyo_ret_type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>hobbit.txt</td>
      <td>hobbit</td>
      <td>Hobbit</td>
      <td>class</td>
      <td>Hobbit</td>
      <td>NA</td>
      <td>hobbit.py</td>
      <td>no_check</td>
    </tr>
    <tr>
      <td>house.txt</td>
      <td>house</td>
      <td>House</td>
      <td>class</td>
      <td>House</td>
      <td>NA</td>
      <td>house.py</td>
      <td>no_check</td>
    </tr>
  </tbody>
</table>

<p>The order of the columns must be as shown above.  <em>(For more info on each of these options, run <code class="language-plaintext highlighter-rouge">pyml_bindgen --help</code>.)</em></p>
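
<p>Since the table above is rendered as HTML, here is a small Python sketch that produces the same content as tab-separated text.  It assumes that the header row shown in the table is part of the file; the variable names are just for illustration.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import csv
import io

# Column order copied from the table above.
columns = ['signatures', 'py_module', 'py_class', 'associated_with',
           'caml_module', 'split_caml_module', 'embed_python_source',
           'of_pyo_ret_type']

# One row per Python class to bind.
rows = [
    ['hobbit.txt', 'hobbit', 'Hobbit', 'class', 'Hobbit', 'NA', 'hobbit.py', 'no_check'],
    ['house.txt', 'house', 'House', 'class', 'House', 'NA', 'house.py', 'no_check'],
]

# Build the TSV in memory, one tab-separated line per row.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(columns)
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text, end='')</code></pre></figure>

<p>Writing <code class="language-plaintext highlighter-rouge">tsv_text</code> to <code class="language-plaintext highlighter-rouge">gen_multi_cli.tsv</code> would give you a file with the layout described above.</p>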

<p>You will see that we refer to <code class="language-plaintext highlighter-rouge">hobbit.txt</code> and <code class="language-plaintext highlighter-rouge">house.txt</code>.  These are the value specifications for each of the Python classes.  Here are their contents.</p>

<p><code class="language-plaintext highlighter-rouge">hobbit.txt</code></p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">create</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-&gt;</span> <span class="n">age</span><span class="o">:</span><span class="kt">int</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="n">t</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__init__</span><span class="p">]</span>

<span class="k">val</span> <span class="n">to_string</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="kt">string</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__str__</span><span class="p">]</span>

<span class="k">val</span> <span class="n">buy_house</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="n">house</span><span class="o">:</span><span class="nn">House</span><span class="p">.</span><span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="kt">unit</span></code></pre></figure>

<p><code class="language-plaintext highlighter-rouge">house.txt</code></p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">val</span> <span class="n">create</span> <span class="o">:</span> <span class="n">name</span><span class="o">:</span><span class="kt">string</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="n">t</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__init__</span><span class="p">]</span>

<span class="k">val</span> <span class="n">to_string</span> <span class="o">:</span> <span class="n">t</span> <span class="o">-&gt;</span> <span class="kt">unit</span> <span class="o">-&gt;</span> <span class="kt">string</span>
<span class="p">[</span><span class="o">@@</span><span class="n">py_fun_name</span> <span class="n">__str__</span><span class="p">]</span></code></pre></figure>

<h3 id="combine_rec_modules">combine_rec_modules</h3>

<p><code class="language-plaintext highlighter-rouge">combine_rec_modules</code> takes a file of OCaml modules and “converts” them into recursive modules.  It does this using a simple text transformation.</p>

<p>Often you will want to pipe the output of <code class="language-plaintext highlighter-rouge">gen_multi</code> directly into <code class="language-plaintext highlighter-rouge">combine_rec_modules</code>.</p>
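
<p>To give a rough feel for the kind of text transformation mentioned above, here is a toy Python sketch.  It is not the real implementation: the function name is made up, and it assumes that each generated module starts with a line of the form <code class="language-plaintext highlighter-rouge">module Name : sig</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import re


def to_rec_modules_sketch(source):
    """Toy transformation: the first `module X : sig` line becomes
    `module rec X : sig`, and every later one becomes `and X : sig`."""
    seen = 0

    def rewrite(match):
        nonlocal seen
        seen += 1
        keyword = 'module rec' if seen == 1 else 'and'
        return f'{keyword} {match.group(1)} : sig'

    # MULTILINE lets ^ match at the start of every line, not just the first.
    return re.sub(r'^module (\w+) : sig', rewrite, source, flags=re.MULTILINE)


# Simplified stand-in for generated modules (not actual pyml_bindgen output).
generated = """module Hobbit : sig
  type t
end = struct
  type t = unit
end

module House : sig
  type t
end = struct
  type t = unit
end
"""

combined = to_rec_modules_sketch(generated)
print(combined)</code></pre></figure>

<p>In the result, <code class="language-plaintext highlighter-rouge">Hobbit</code> is introduced with <code class="language-plaintext highlighter-rouge">module rec</code> and <code class="language-plaintext highlighter-rouge">House</code> with <code class="language-plaintext highlighter-rouge">and</code>, which is the shape OCaml expects for mutually recursive modules.</p>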

<h3 id="generate-the-modules--test-it-out">Generate the modules &amp; test it out</h3>

<p>Now let’s see it in action.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>gen_multi gen_multi_cli.tsv | combine_rec_modules /dev/stdin <span class="o">&gt;</span> lib.ml</code></pre></figure>

<p>We put that in a module called <code class="language-plaintext highlighter-rouge">Lib</code>.  And here is how we might use that.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">open</span> <span class="nc">Lib</span>

<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Py</span><span class="p">.</span><span class="n">initialize</span> <span class="bp">()</span>

<span class="k">let</span> <span class="n">bilbo</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"Bilbo"</span> <span class="o">~</span><span class="n">age</span><span class="o">:</span><span class="mi">111</span> <span class="bp">()</span>

<span class="k">let</span> <span class="n">bag_end</span> <span class="o">=</span> <span class="nn">House</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"Bag End"</span> <span class="bp">()</span>

<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">buy_house</span> <span class="n">bilbo</span> <span class="o">~</span><span class="n">house</span><span class="o">:</span><span class="n">bag_end</span> <span class="bp">()</span>

<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
  <span class="k">assert</span> <span class="p">(</span>
    <span class="s2">"Hobbit -- Bilbo, age: 111, house: Bag End"</span> <span class="o">=</span> <span class="nn">Hobbit</span><span class="p">.</span><span class="n">to_string</span> <span class="n">bilbo</span> <span class="bp">()</span><span class="p">)</span></code></pre></figure>

<h2 id="other-stuff">Other stuff</h2>

<p>Let me mention a couple of other things before we go…</p>

<ul>
  <li>In this post we ran <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> (or its helper scripts) manually, but it’s not too hard to set up Dune <a href="https://dune.readthedocs.io/en/stable/dune-files.html#rule">rules</a> to generate the bindings automatically.  See the <code class="language-plaintext highlighter-rouge">dune</code> files in the <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples">examples</a> directory on the <code class="language-plaintext highlighter-rouge">pyml_bindgen</code> GitHub for more information.</li>
  <li>While I only showed how to bind to Python classes, you can also bind to functions associated with modules rather than with classes.</li>
  <li>Another cool feature is that you can embed Python source code directly into your generated OCaml modules.  See <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples/embedding_python_source">here</a> for more details.</li>
</ul>

<h2 id="wrap-up">Wrap-up</h2>

<p><code class="language-plaintext highlighter-rouge">pyml_bindgen</code> is a command line app for generating Python bindings using pyml.  It makes incorporating Python libraries into your OCaml projects as easy as writing regular OCaml value specifications.</p>

<p>To get more information on setting up and using <code class="language-plaintext highlighter-rouge">pyml_bindgen</code>, including ideas on how to structure your projects, check out the <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/examples">examples</a>, <a href="https://github.com/mooreryan/ocaml_python_bindgen/tree/main/test">tests</a>, and <a href="https://mooreryan.github.io/ocaml_python_bindgen/">docs</a>.</p>]]></content><author><name>Ryan Moore</name></author><category term="blog" /><summary type="html"><![CDATA[This post provides an introduction to using pyml_bindgen, a command line application that generates Python bindings via pyml directly from OCaml value specifications.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/ocaml_python_bindgen/ocaml_pyml_bindgen.png" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/ocaml_python_bindgen/ocaml_pyml_bindgen.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">An introduction to the re2 regular expression library for OCaml</title><link href="https://www.tenderisthebyte.com/blog/2021/10/02/ocaml-re2-tutorial/" rel="alternate" type="text/html" title="An introduction to the re2 regular expression library for OCaml" /><published>2021-10-02T00:00:00+00:00</published><updated>2021-10-02T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2021/10/02/ocaml-re2-tutorial</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2021/10/02/ocaml-re2-tutorial/"><![CDATA[<p>In this tutorial, we will talk about <a href="https://github.com/janestreet/re2">re2</a>, an OCaml library providing bindings to <a href="https://github.com/google/re2">RE2</a>, Google’s regular expression library.</p>

<p>This post is intended for newer OCaml programmers, or those who want to use the <code class="language-plaintext highlighter-rouge">re2</code> library, but could use a couple of examples to help get started.  This is not a general introduction to regular expressions, however.  If you have never used regular expressions before, read up a little bit on the syntax before tackling this post.</p>

<div class="post-toc">

  <h4 class="post-toc--header" id="contents">Contents</h4>

  <ul>
    <li><a href="#overview">Overview</a></li>
    <li><a href="#creating-regular-expressions">Creating regular expressions</a></li>
    <li><a href="#checking-for-a-match">Checking for a match</a></li>
    <li><a href="#finding-matches">Finding matches</a></li>
    <li><a href="#finding-submatches">Finding submatches</a></li>
    <li><a href="#splitting-strings">Splitting strings</a></li>
    <li><a href="#replacing">Replacing</a></li>
    <li><a href="#miscellaneous-info">Miscellaneous info</a></li>
    <li><a href="#wrap-up">Wrap up</a></li>
  </ul>

</div>

<h2 id="overview">Overview</h2>

<p>There are a few choices for regular expression libraries available for OCaml on <a href="https://opam.ocaml.org/">Opam</a>.  Some of the most popular include</p>

<ul>
  <li><a href="https://opam.ocaml.org/packages/re">re</a>, a pure OCaml library (installed 7667 times last month),</li>
  <li><a href="https://opam.ocaml.org/packages/pcre">pcre</a>, bindings to the Perl Compatible Regular Expressions (<a href="https://www.pcre.org/">PCRE</a>) library (installed 1115 times last month), and</li>
  <li><a href="https://opam.ocaml.org/packages/re2">re2</a>, OCaml bindings for RE2, Google’s regular expression library (installed 114 times last month).</li>
</ul>

<p>The first two are by far the most popular in terms of raw Opam install counts.  However, <code class="language-plaintext highlighter-rouge">re2</code> integrates nicely into the Jane Street Base/Core/Async ecosystem (it’s a Jane Street package after all!), and is covered under the MIT license rather than the <a href="https://spdx.org/licenses/OCaml-LGPL-linking-exception.html">LGPL with OCaml linking exception</a>, which may be appealing depending on your situation.</p>

<p><em>Note: According to this <a href="https://blog.janestreet.com/what-the-interns-have-wrought-2020/">blog post</a> and this <a href="https://github.com/janestreet/re2/issues/26#issuecomment-395870146">GitHub issue</a>, Jane Street is phasing out its use of re2. The <a href="https://github.com/janestreet/re2">re2 GitHub</a> does have recent commits, though, so your mileage may vary.</em></p>

<p>One issue that newcomers may face when getting started with the <code class="language-plaintext highlighter-rouge">re2</code> library is the slightly terse <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html">API documentation</a>.  While it is detailed and thorough, it can be hard to get started with if you’re not already used to reading Jane Street <code class="language-plaintext highlighter-rouge">mli</code> files and source code.</p>

<p><em>Note: if you want to follow along, you can paste the examples into the toplevel (or <a href="https://opam.ocaml.org/blog/about-utop/">utop</a>).  However, don’t paste in lines starting with <code class="language-plaintext highlighter-rouge">- :</code>.  These lines show the type of the expression as reported by <code class="language-plaintext highlighter-rouge">utop</code>.</em></p>

<h2 id="creating-regular-expressions">Creating regular expressions</h2>

<p>You create regular expressions with <code class="language-plaintext highlighter-rouge">Re2.create</code> and <code class="language-plaintext highlighter-rouge">Re2.create_exn</code>.  The former returns <code class="language-plaintext highlighter-rouge">Re2.t Or_error.t</code> and the latter <code class="language-plaintext highlighter-rouge">Re2.t</code>.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Or_error</span><span class="p">.</span><span class="n">ok_exn</span> <span class="o">@@</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create</span> <span class="s2">"apple"</span><span class="p">;;</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span><span class="p">;;</span></code></pre></figure>

<h3 id="matching-options">Matching options</h3>

<p>You can control how regular expression matching works by passing the <code class="language-plaintext highlighter-rouge">options</code> argument to the <code class="language-plaintext highlighter-rouge">create</code> and <code class="language-plaintext highlighter-rouge">create_exn</code> functions.  If you omit this argument, the default options will be passed.  Here they are:</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">default</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
<span class="p">{</span>
  <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">case_sensitive</span> <span class="o">=</span> <span class="bp">true</span><span class="p">;</span>
  <span class="n">dot_nl</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">encoding</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="nn">Encoding</span><span class="p">.</span><span class="nc">Utf8</span><span class="p">;</span>
  <span class="n">literal</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">log_errors</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">longest_match</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">max_mem</span> <span class="o">=</span> <span class="mi">8388608</span><span class="p">;</span>
  <span class="n">never_capture</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">never_nl</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">one_line</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">perl_classes</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">posix_syntax</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
  <span class="n">word_boundary</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>For a more detailed description of these options, see the <a href="https://github.com/janestreet/re2/blob/89373a48bc786be9b2a7f530dd5954222515c048/src/re2_c/libre2/re2/re2.h#L509">re2.h</a> header file.</p>

<p>By default, <code class="language-plaintext highlighter-rouge">re2</code> uses case-sensitive matching.  To create a case-insensitive regex, pass in an options record like so.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re_i</span> <span class="o">=</span>
  <span class="k">let</span> <span class="n">options</span> <span class="o">=</span> <span class="p">{</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">default</span> <span class="k">with</span> <span class="n">case_sensitive</span> <span class="o">=</span> <span class="bp">false</span> <span class="p">}</span> <span class="k">in</span>
  <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="o">~</span><span class="n">options</span> <span class="s2">"abc"</span></code></pre></figure>

<h2 id="checking-for-a-match">Checking for a match</h2>

<p>Perhaps the most basic regex task is to check if a string matches a given regular expression.  You can use <code class="language-plaintext highlighter-rouge">Re2.matches</code> for this.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="c">(* Case sensitive *)</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"apple pie"</span><span class="p">);</span>
<span class="k">assert</span> <span class="p">(</span><span class="n">not</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"Apple pie"</span><span class="p">));;</span>

<span class="c">(* Case insensitive *)</span>
<span class="k">let</span> <span class="n">re</span> <span class="o">=</span>
  <span class="k">let</span> <span class="n">options</span> <span class="o">=</span> <span class="p">{</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Options</span><span class="p">.</span><span class="n">default</span> <span class="k">with</span> <span class="n">case_sensitive</span> <span class="o">=</span> <span class="bp">false</span> <span class="p">}</span> <span class="k">in</span>
  <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="o">~</span><span class="n">options</span> <span class="s2">"apple"</span> 
<span class="k">in</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"apple pie"</span><span class="p">);</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span> <span class="n">re</span> <span class="s2">"Apple pie"</span><span class="p">);;</span></code></pre></figure>

<h2 id="finding-matches">Finding matches</h2>

<p>To extract the parts of a string that match a regular expression, you can use the <code class="language-plaintext highlighter-rouge">find_*</code> functions.</p>

<h3 id="find-first-match">Find first match</h3>

<p>To return the first match in the query string, use <code class="language-plaintext highlighter-rouge">find_first</code> or <code class="language-plaintext highlighter-rouge">find_first_exn</code>.  These functions return the matched string rather than the underlying <code class="language-plaintext highlighter-rouge">Re2.Match.t</code>.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="n">re</span> <span class="s2">"apple pie is made from apples"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"apple"</span>

<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[ab]{2}"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="n">re</span> <span class="s2">"ababa"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"ab"</span></code></pre></figure>

<h3 id="find-all-matches">Find all matches</h3>

<p>While <code class="language-plaintext highlighter-rouge">find_first</code> returns the first match in a query string, <code class="language-plaintext highlighter-rouge">find_all</code> and <code class="language-plaintext highlighter-rouge">find_all_exn</code> return lists of all non-overlapping matches in the query string.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all</span> <span class="n">re</span> <span class="s2">"apple pie"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="nn">Or_error</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span> <span class="nn">Result</span><span class="p">.</span><span class="nc">Ok</span> <span class="p">[</span><span class="s2">"apple"</span><span class="p">]</span>

<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="n">re</span> <span class="s2">"apple pie is made from apples"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"apple"</span><span class="p">;</span> <span class="s2">"apple"</span><span class="p">]</span></code></pre></figure>

<h4 id="submatches-and-capturing-groups">Submatches and capturing groups</h4>

<p>You can use the <code class="language-plaintext highlighter-rouge">sub</code> argument to return submatches defined by capturing groups rather than the whole match.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">s</span> <span class="o">=</span> <span class="s2">"ab ac ab"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="n">re</span> <span class="n">s</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"b"</span><span class="p">;</span> <span class="s2">"c"</span><span class="p">;</span> <span class="s2">"b"</span><span class="p">]</span></code></pre></figure>

<p>Be aware that passing an index greater than the number of capturing groups will raise an exception.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">s</span> <span class="o">=</span> <span class="s2">"ab ac ab"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">10</span><span class="p">)</span> <span class="n">re</span> <span class="n">s</span><span class="p">;;</span>
<span class="nc">Exception</span><span class="o">:</span> <span class="nn">Re2__Regex</span><span class="p">.</span><span class="nn">Exceptions</span><span class="p">.</span><span class="nc">Regex_no_such_subpattern</span><span class="p">(</span><span class="mi">10</span><span class="o">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span></code></pre></figure>

<h4 id="or_error-returning-vs-exception-raising">Or_error returning vs. Exception raising</h4>

<p>Like most of the functions in the <code class="language-plaintext highlighter-rouge">Re2</code> module, the <code class="language-plaintext highlighter-rouge">find</code> functions come in both <code class="language-plaintext highlighter-rouge">Or_error.t</code>-returning and exception-raising versions.  If the regular expression doesn’t match, <code class="language-plaintext highlighter-rouge">find_all</code> returns a <code class="language-plaintext highlighter-rouge">Result.Error</code>, whereas <code class="language-plaintext highlighter-rouge">find_all_exn</code> raises an exception.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all</span> <span class="n">re</span> <span class="s2">"peach pie"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="nn">Or_error</span><span class="p">.</span><span class="n">t</span> <span class="o">=</span>
<span class="nn">Result</span><span class="p">.</span><span class="nc">Error</span>
 <span class="p">(</span><span class="s2">"Re2__Regex.Exceptions.Regex_match_failed(</span><span class="se">\"</span><span class="s2">apple</span><span class="se">\"</span><span class="s2">)"</span><span class="p">)</span>

<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"apple"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="n">re</span> <span class="s2">"peach pie"</span><span class="p">;;</span>
<span class="nc">Exception</span><span class="o">:</span> <span class="nn">Re2__Regex</span><span class="p">.</span><span class="nn">Exceptions</span><span class="p">.</span><span class="nc">Regex_match_failed</span><span class="p">(</span><span class="s2">"apple"</span><span class="p">)</span><span class="o">.</span>
<span class="c">(* ...output omitted... *)</span></code></pre></figure>
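
<p>If a failed match is an ordinary outcome rather than a bug, you can pattern match on the <code class="language-plaintext highlighter-rouge">Or_error.t</code> directly.  Here is a minimal sketch; <code class="language-plaintext highlighter-rouge">count_matches</code> is a made-up helper for illustration, not something <code class="language-plaintext highlighter-rouge">Re2</code> provides, and it assumes <code class="language-plaintext highlighter-rouge">Core</code> and <code class="language-plaintext highlighter-rouge">re2</code> are available.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

(* count_matches is a hypothetical helper, not part of Re2. *)
let count_matches re s =
  match Re2.find_all re s with
  | Ok matches -&gt; List.length matches
  (* find_all reports "no match" as an Error, so treat it as zero. *)
  | Error _ -&gt; 0
;;

let re = Re2.create_exn "apple";;

assert (count_matches re "apple pie is made from apples" = 2);;
assert (count_matches re "peach pie" = 0);;</code></pre></figure>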

<p>It is important to remember that the <code class="language-plaintext highlighter-rouge">find_all</code> functions return <em>non-overlapping</em> matches.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[ab]{2}"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_all_exn</span> <span class="n">re</span> <span class="s2">"ababa"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"ab"</span><span class="p">;</span> <span class="s2">"ab"</span><span class="p">]</span></code></pre></figure>

<h2 id="finding-submatches">Finding submatches</h2>

<p>If you need a bit more control than is provided by <code class="language-plaintext highlighter-rouge">find_all</code> with the <code class="language-plaintext highlighter-rouge">sub</code> argument (e.g., <code class="language-plaintext highlighter-rouge">find_all ~sub:(` Index 1)</code>), then you may need to use <code class="language-plaintext highlighter-rouge">find_submatches</code> or <code class="language-plaintext highlighter-rouge">find_submatches_exn</code>.  These return the first match in the query string.  The match is returned as a <code class="language-plaintext highlighter-rouge">string option array</code>, where the first element is the whole match, and subsequent elements are submatches as defined by any capturing groups.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])([de])"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">find_submatches_exn</span> <span class="n">re</span> <span class="s2">"abdace"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="n">option</span> <span class="kt">array</span> <span class="o">=</span> <span class="p">[</span><span class="o">|</span><span class="nc">Some</span> <span class="s2">"abd"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"b"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"d"</span><span class="o">|</span><span class="p">]</span></code></pre></figure>

<p>You may wonder why <code class="language-plaintext highlighter-rouge">find_submatches_exn</code> returns a <code class="language-plaintext highlighter-rouge">string option array</code> and not simply a <code class="language-plaintext highlighter-rouge">string array</code>.  <code class="language-plaintext highlighter-rouge">find_submatches_exn</code> uses <code class="language-plaintext highlighter-rouge">Match.get</code> <a href="https://github.com/janestreet/re2/blob/72e01a088b48791aa6387dc3a093d3806122e2bd/src/regex.ml#L307">under-the-hood</a>.  Basically, <code class="language-plaintext highlighter-rouge">find_submatches_exn</code> processes a <code class="language-plaintext highlighter-rouge">Match.t Sequence.t</code> of matches, calling <code class="language-plaintext highlighter-rouge">get</code> on each one.  And the <code class="language-plaintext highlighter-rouge">Match.get</code> function <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/Match/index.html#val-get">returns</a> a <code class="language-plaintext highlighter-rouge">string option</code>.</p>

<p>This little code snippet will hopefully give you an idea of what’s going on.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])([de])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">match_</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">first_match_exn</span> <span class="n">re</span> <span class="s2">"abdace"</span> <span class="k">in</span>
<span class="p">[</span><span class="o">|</span>
  <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">0</span><span class="p">);</span>
  <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">);</span>
  <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">2</span><span class="p">);</span>
  <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get</span> <span class="n">match_</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">3</span><span class="p">);</span>
<span class="o">|</span><span class="p">]</span>
<span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="n">option</span> <span class="kt">array</span> <span class="o">=</span> <span class="p">[</span><span class="o">|</span> <span class="nc">Some</span> <span class="s2">"abd"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"b"</span><span class="p">;</span> <span class="nc">Some</span> <span class="s2">"d"</span><span class="p">;</span> <span class="nc">None</span> <span class="o">|</span><span class="p">]</span></code></pre></figure>

<p>If the <code class="language-plaintext highlighter-rouge">Index</code> you pass to <code class="language-plaintext highlighter-rouge">~sub</code> is greater than the number of capturing groups (i.e., at least the number returned from <code class="language-plaintext highlighter-rouge">Re2.num_submatches</code>, which counts the whole match as well as each capturing group), <code class="language-plaintext highlighter-rouge">None</code> is returned.</p>
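
<p>To make that boundary concrete, here is a small sketch that compares <code class="language-plaintext highlighter-rouge">Re2.num_submatches</code> with the indices that <code class="language-plaintext highlighter-rouge">Match.get</code> accepts, reusing the regex from the snippet above.  It assumes <code class="language-plaintext highlighter-rouge">Core</code> and <code class="language-plaintext highlighter-rouge">re2</code> are available.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "a([bc])([de])";;

(* The whole match plus two capturing groups. *)
let n = Re2.num_submatches re;;

assert (n = 3);;

let m = Re2.first_match_exn re "abdace";;

(* The last index that yields a value is n - 1. *)
assert (Option.is_some (Re2.Match.get m ~sub:(`Index (n - 1))));;

(* Index n is one past the last capturing group, so we get None. *)
assert (Option.is_none (Re2.Match.get m ~sub:(`Index n)));;</code></pre></figure>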

<h3 id="more-complicated-submatch-interface">More complicated submatch interface</h3>

<p>If you want to work with the <code class="language-plaintext highlighter-rouge">Re2.Match.t</code> directly, you can use functions from the <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#complicated-interface">complicated interface</a> like <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#val-first_match">first_match</a> and <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#val-get_matches">get_matches</a>.</p>

<p>If you need to work with submatches of every match in a string rather than just the first, and you need direct access to the <code class="language-plaintext highlighter-rouge">Match.t</code>, you will want to use <code class="language-plaintext highlighter-rouge">get_matches</code> or <code class="language-plaintext highlighter-rouge">get_matches_exn</code>.  Let’s try it out with a weird, little example.</p>

<p>Say we have a string made up of chunks.  Each chunk is a number followed by an <code class="language-plaintext highlighter-rouge">A</code> (for add) or an <code class="language-plaintext highlighter-rouge">S</code> (for subtract) (e.g., <code class="language-plaintext highlighter-rouge">50A</code> and <code class="language-plaintext highlighter-rouge">3S</code>).  The chunk describes an arithmetic operation: <code class="language-plaintext highlighter-rouge">12A</code> means add 12 to the previous total; <code class="language-plaintext highlighter-rouge">3S</code> means subtract 3 from the previous total.</p>

<p>A full string then might look something like this: <code class="language-plaintext highlighter-rouge">10A5S2S3A</code>, which represents the following sequence of operations: <code class="language-plaintext highlighter-rouge">0 + 10 - 5 - 2 + 3</code>.</p>

<p>One way to solve this little problem is to use regexes and the <code class="language-plaintext highlighter-rouge">get_matches</code> function.  Let’s see how it might go.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">total</span> <span class="o">=</span>
  <span class="k">let</span> <span class="n">s</span> <span class="o">=</span> <span class="s2">"10A5S2S3A"</span> <span class="k">in</span>
  <span class="c">(* Make the regex *)</span>
  <span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([0-9]+)([AS])"</span> <span class="k">in</span>
  <span class="c">(* Get a Match.t list *)</span>
  <span class="k">let</span> <span class="n">matches</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">get_matches_exn</span> <span class="n">re</span> <span class="n">s</span> <span class="k">in</span>
  <span class="c">(* Fold over the matches to get the total. *)</span>
  <span class="nn">List</span><span class="p">.</span><span class="n">fold</span> <span class="n">matches</span> <span class="o">~</span><span class="n">init</span><span class="o">:</span><span class="mi">0</span> <span class="o">~</span><span class="n">f</span><span class="o">:</span><span class="p">(</span><span class="k">fun</span> <span class="n">total</span> <span class="n">m</span> <span class="o">-&gt;</span>
      <span class="c">(* The first capturing group is the "count". *)</span>
      <span class="k">let</span> <span class="n">number</span> <span class="o">=</span> <span class="nn">Int</span><span class="p">.</span><span class="n">of_string</span> <span class="o">@@</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="k">in</span>
      <span class="c">(* The second capturing group represents the operation. *)</span>
      <span class="k">match</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">2</span><span class="p">)</span> <span class="k">with</span>
      <span class="o">|</span> <span class="s2">"A"</span> <span class="o">-&gt;</span> <span class="n">total</span> <span class="o">+</span> <span class="n">number</span>
      <span class="o">|</span> <span class="s2">"S"</span> <span class="o">-&gt;</span> <span class="n">total</span> <span class="o">-</span> <span class="n">number</span>
      <span class="o">|</span> <span class="n">_</span> <span class="o">-&gt;</span> <span class="k">assert</span> <span class="bp">false</span><span class="p">)</span>
<span class="p">;;</span>

<span class="k">assert</span> <span class="p">(</span><span class="n">total</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">+</span> <span class="mi">10</span> <span class="o">-</span> <span class="mi">5</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">3</span><span class="p">);;</span></code></pre></figure>

<p><em>Note: This weird format is actually loosely based on the <a href="https://en.wikipedia.org/wiki/Sequence_alignment#Representations">CIGAR</a> strings found in <a href="http://samtools.github.io/hts-specs/SAMv1.pdf">SAM files</a> describing <a href="https://en.wikipedia.org/wiki/Sequence_alignment">biological sequence alignments</a>.</em></p>

<h3 id="controlling-submatches">Controlling submatches</h3>

<p>In the last two examples, we used the <code class="language-plaintext highlighter-rouge">sub</code> argument along with a polymorphic variant to select capture groups.  Let’s take a closer look at the type used for that.</p>

<p>To select submatches, we use <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#type-id_t">id_t</a>, which looks like this:</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">type</span> <span class="n">id_t</span> <span class="o">=</span> <span class="p">[</span> <span class="err">`</span> <span class="nc">Index</span> <span class="k">of</span> <span class="kt">int</span> <span class="o">|</span> <span class="err">`</span> <span class="nc">Name</span> <span class="k">of</span> <span class="kt">string</span> <span class="p">]</span></code></pre></figure>

<p>This type is used to refer to submatches.  E.g., <code class="language-plaintext highlighter-rouge">` Index 1</code> would be the result of the first capturing group, <code class="language-plaintext highlighter-rouge">` Index 2</code> the second, etc.  Remember that <code class="language-plaintext highlighter-rouge">` Index 0</code> refers to the whole match.</p>

<p>In addition to referring to submatches/capturing groups by index, you can refer to them by name.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a(?P&lt;second_letter&gt;[bc])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">m</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">first_match_exn</span> <span class="n">re</span> <span class="s2">"abc"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Name</span> <span class="s2">"second_letter"</span><span class="p">)</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">y</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="k">in</span>
<span class="k">assert</span> <span class="nn">String</span><span class="p">.(</span><span class="n">x</span> <span class="o">=</span> <span class="n">y</span><span class="p">);;</span></code></pre></figure>

<p>When using a complicated regular expression with multiple capturing groups, it is often less error prone to use named submatches rather than numbered ones.</p>
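
<p>As an illustration, here is the add/subtract example from earlier rewritten with named groups.  This is only a sketch: the group names <code class="language-plaintext highlighter-rouge">number</code> and <code class="language-plaintext highlighter-rouge">op</code> are made up for this example, and it assumes <code class="language-plaintext highlighter-rouge">Core</code> and <code class="language-plaintext highlighter-rouge">re2</code> are available.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let total =
  (* The group names "number" and "op" are arbitrary choices for this sketch. *)
  let re = Re2.create_exn "(?P&lt;number&gt;[0-9]+)(?P&lt;op&gt;[AS])" in
  Re2.get_matches_exn re "10A5S2S3A"
  |&gt; List.fold ~init:0 ~f:(fun total m -&gt;
    let number = Int.of_string (Re2.Match.get_exn m ~sub:(`Name "number")) in
    match Re2.Match.get_exn m ~sub:(`Name "op") with
    | "A" -&gt; total + number
    | "S" -&gt; total - number
    | op -&gt; failwithf "unexpected operation: %s" op ())
;;

assert (total = 0 + 10 - 5 - 2 + 3);;</code></pre></figure>

<p>Reordering or adding capturing groups in the pattern no longer changes which group each lookup refers to.</p>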

<p><em>Note:  It is not a compile-time error to try to access a capturing group that doesn’t exist in the regular expression.  Depending on the function, you may get <code class="language-plaintext highlighter-rouge">None</code> or raise an exception.</em></p>

<h3 id="using-id_t-to-control-match-efficiency">Using <code class="language-plaintext highlighter-rouge">id_t</code> to control match efficiency</h3>

<p>Many of the regex matching functions take a <code class="language-plaintext highlighter-rouge">?sub:id_t</code> argument.</p>

<p>In some cases, you can increase the efficiency of matching by restricting the number of submatches.  If you only care about whether a pattern matches, and not about submatches, you could pass in <code class="language-plaintext highlighter-rouge">~sub:(` Index -1)</code> to many of the above functions.</p>

<p>You get progressively more information as you increase the index <code class="language-plaintext highlighter-rouge">n</code>.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="c">(* Get only the whole match. *)</span>
<span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">0</span><span class="p">)</span>

<span class="c">(* Get the whole match and first submatch. *)</span>
<span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span> <span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span></code></pre></figure>

<p><a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#type-id_t">This section</a> of the documentation has more info on how specifying the <code class="language-plaintext highlighter-rouge">sub</code> argument can have an impact on regex performance, and which functions are affected by its usage.</p>

<h2 id="splitting-strings">Splitting strings</h2>

<p>Another common regex task is splitting an input string based on a regular expression pattern.  <code class="language-plaintext highlighter-rouge">Re2</code> provides the <code class="language-plaintext highlighter-rouge">split</code> function for this purpose.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[.,! ]+"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="n">re</span> <span class="s2">"Hello, world! I like pie."</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"Hello"</span><span class="p">;</span> <span class="s2">"world"</span><span class="p">;</span> <span class="s2">"I"</span><span class="p">;</span> <span class="s2">"like"</span><span class="p">;</span> <span class="s2">"pie"</span><span class="p">;</span> <span class="s2">""</span><span class="p">]</span></code></pre></figure>

<p>If you need to include the actual matches in the output, you can.  Passing <code class="language-plaintext highlighter-rouge">~include_matches:true</code> ensures the “separators” are in there with the rest of the output.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"[.,! ]+"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="o">~</span><span class="n">include_matches</span><span class="o">:</span><span class="bp">true</span> <span class="n">re</span> <span class="s2">"Hello, world! I like pie."</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span>
<span class="p">[</span><span class="s2">"Hello"</span><span class="p">;</span> <span class="s2">", "</span><span class="p">;</span> <span class="s2">"world"</span><span class="p">;</span> <span class="s2">"! "</span><span class="p">;</span> <span class="s2">"I"</span><span class="p">;</span> <span class="s2">" "</span><span class="p">;</span> <span class="s2">"like"</span><span class="p">;</span> <span class="s2">" "</span><span class="p">;</span> <span class="s2">"pie"</span><span class="p">;</span> <span class="s2">"."</span><span class="p">;</span> <span class="s2">""</span><span class="p">]</span></code></pre></figure>

<p>Just be aware of that final empty string at the end!</p>
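
<p>If the empty strings get in the way, one option is to filter them out after splitting.  A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">Core</code> and <code class="language-plaintext highlighter-rouge">re2</code> are available:</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "[.,! ]+";;

let words =
  Re2.split re "Hello, world! I like pie."
  (* Drop the empty string left over after the trailing separator. *)
  |&gt; List.filter ~f:(fun s -&gt; not (String.is_empty s))
;;

assert (List.equal String.equal words [ "Hello"; "world"; "I"; "like"; "pie" ]);;</code></pre></figure>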

<p>You can also limit the number of splits with the <code class="language-plaintext highlighter-rouge">max</code> argument, which splits the string only at the leftmost <code class="language-plaintext highlighter-rouge">max</code> matches.  You could use this to get the first value separated from the remaining values in a string of tab-separated values, for example.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"</span><span class="se">\t</span><span class="s2">"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="o">~</span><span class="n">max</span><span class="o">:</span><span class="mi">1</span> <span class="n">re</span> <span class="s2">"apple</span><span class="se">\t</span><span class="s2">pie</span><span class="se">\t</span><span class="s2">is</span><span class="se">\t</span><span class="s2">good"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"apple"</span><span class="p">;</span> <span class="s2">"pie</span><span class="se">\t</span><span class="s2">is</span><span class="se">\t</span><span class="s2">good"</span><span class="p">]</span></code></pre></figure>

<p>If the regular expression has no matches in the query string, then a one-element list containing the whole query string is returned.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"</span><span class="se">\t</span><span class="s2">"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">split</span> <span class="o">~</span><span class="n">max</span><span class="o">:</span><span class="mi">1</span> <span class="n">re</span> <span class="s2">"apple pie is good"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="kt">list</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"apple pie is good"</span><span class="p">]</span></code></pre></figure>

<h2 id="replacing">Replacing</h2>

<h3 id="using-rewrite">Using <code class="language-plaintext highlighter-rouge">rewrite</code></h3>

<p>The simpler interface for regex replacement consists of the <code class="language-plaintext highlighter-rouge">rewrite</code> and <code class="language-plaintext highlighter-rouge">rewrite_exn</code> functions.  The <code class="language-plaintext highlighter-rouge">template</code> argument defines how you want to replace any matches in the query string.  In this case, we replace any matches with a capital A.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">rewrite_exn</span> <span class="n">re</span> <span class="o">~</span><span class="n">template</span><span class="o">:</span><span class="s2">"A"</span> <span class="s2">"apple peach"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"Apple peAch"</span></code></pre></figure>

<p>You can reference the submatches in the template string using the syntax <code class="language-plaintext highlighter-rouge">\\n</code>.  Check it out.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([ae])"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">rewrite_exn</span> <span class="n">re</span> <span class="o">~</span><span class="n">template</span><span class="o">:</span><span class="s2">"( </span><span class="se">\\</span><span class="s2">1 )"</span> <span class="s2">"apple peach"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"( a )ppl( e ) p( e )( a )ch"</span></code></pre></figure>

<p>If you have multiple submatches, just keep referring to them in the same way: <code class="language-plaintext highlighter-rouge">\\1 ... \\2 ...</code> etc.</p>
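
<p>For instance, this small sketch swaps the two halves of a hyphenated word by referring to both groups in the template.  The pattern and input are made up for illustration, and it assumes <code class="language-plaintext highlighter-rouge">Core</code> and <code class="language-plaintext highlighter-rouge">re2</code> are available.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml">open Core

let re = Re2.create_exn "([a-z]+)-([a-z]+)";;

(* Emit the second group, a hyphen, and then the first group. *)
let swapped = Re2.rewrite_exn re ~template:"\\2-\\1" "apple-pie";;

assert (String.equal swapped "pie-apple");;</code></pre></figure>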

<p>If you need to check whether your rewrite template is valid before running <code class="language-plaintext highlighter-rouge">rewrite</code>, use the <code class="language-plaintext highlighter-rouge">valid_rewrite_template</code> function.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([ae])([io])([uy])"</span> <span class="k">in</span>
<span class="k">let</span> <span class="n">template</span> <span class="o">=</span> <span class="s2">"</span><span class="se">\\</span><span class="s2">3 - </span><span class="se">\\</span><span class="s2">2 - </span><span class="se">\\</span><span class="s2">1"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">valid_rewrite_template</span> <span class="n">re</span> <span class="o">~</span><span class="n">template</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span></code></pre></figure>

<h3 id="using-replace">Using <code class="language-plaintext highlighter-rouge">replace</code></h3>

<p>The <code class="language-plaintext highlighter-rouge">re2</code> library also provides more powerful replacing functions:  <code class="language-plaintext highlighter-rouge">replace</code> and <code class="language-plaintext highlighter-rouge">replace_exn</code>.  You can use them if you need direct access to the <code class="language-plaintext highlighter-rouge">Match.t</code>.</p>

<p>Here is a silly example that picks a different replacement value depending on the match.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"([ae])"</span> <span class="k">in</span>
<span class="nn">Re2</span><span class="p">.</span><span class="n">replace_exn</span> <span class="n">re</span> <span class="s2">"apple peach"</span> <span class="o">~</span><span class="n">f</span><span class="o">:</span><span class="p">(</span><span class="k">fun</span> <span class="n">m</span> <span class="o">-&gt;</span>
  <span class="k">match</span> <span class="nn">Re2</span><span class="p">.</span><span class="nn">Match</span><span class="p">.</span><span class="n">get_exn</span> <span class="n">m</span> <span class="o">~</span><span class="n">sub</span><span class="o">:</span><span class="p">(</span><span class="err">`</span><span class="nc">Index</span> <span class="mi">1</span><span class="p">)</span> <span class="k">with</span>
  <span class="o">|</span> <span class="s2">"a"</span> <span class="o">-&gt;</span> <span class="s2">"u"</span>
  <span class="o">|</span> <span class="s2">"e"</span> <span class="o">-&gt;</span> <span class="s2">"o"</span>
  <span class="o">|</span> <span class="n">_</span> <span class="o">-&gt;</span> <span class="k">assert</span> <span class="bp">false</span><span class="p">)</span>
<span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"upplo pouch"</span></code></pre></figure>

<p>While the <code class="language-plaintext highlighter-rouge">replace</code> function is more complicated than <code class="language-plaintext highlighter-rouge">rewrite</code>, it gives you more control and has a few <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html#val-replace">other options</a> you may find useful.</p>

<h2 id="miscellaneous-info">Miscellaneous info</h2>

<h3 id="escaping-strings-for-regular-expressions">Escaping strings for regular expressions</h3>

<p>Properly escaping regular expressions can sometimes be tricky, especially if you want to avoid illegal backslash characters in your strings.</p>

<p><code class="language-plaintext highlighter-rouge">Re2</code> provides a function <code class="language-plaintext highlighter-rouge">escape</code> that escapes its input in such a way that if you create a regex from the resulting escaped string, it would match the original string.  Here’s how it works.</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="nn">Re2</span><span class="p">.</span><span class="n">escape</span> <span class="s2">"Apple. (Pie)!!"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">string</span> <span class="o">=</span> <span class="s2">"Apple</span><span class="se">\\</span><span class="s2">.</span><span class="se">\\</span><span class="s2"> </span><span class="se">\\</span><span class="s2">(Pie</span><span class="se">\\</span><span class="s2">)</span><span class="se">\\</span><span class="s2">!</span><span class="se">\\</span><span class="s2">!"</span>

<span class="nn">Re2</span><span class="p">.</span><span class="n">matches</span>
  <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="o">@@</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">escape</span> <span class="s2">"Apple. (Pie)!!"</span><span class="p">)</span>
  <span class="s2">"Apple. (Pie)!!"</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span></code></pre></figure>

<p>Depending on how many special characters are in the string you use to build the regex, escaping can be pretty noisy!  In these cases, <code class="language-plaintext highlighter-rouge">escape</code> is especially useful.</p>

<h3 id="infix-matching-operator">Infix matching operator</h3>

<p>If you’re feeling nostalgic for Perl, feel free to use the <code class="language-plaintext highlighter-rouge">=~</code> infix operator!</p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"ab"</span><span class="p">;;</span>

<span class="nn">Re2</span><span class="p">.</span><span class="nn">Infix</span><span class="p">.(</span><span class="s2">"abc"</span> <span class="o">=~</span> <span class="n">re</span><span class="p">);;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span>

<span class="c">(* Let's get crazy and open the module! *)</span>
<span class="k">open</span> <span class="nn">Re2</span><span class="p">.</span><span class="nc">Infix</span><span class="p">;;</span>

<span class="s2">"abc"</span> <span class="o">=~</span> <span class="n">re</span><span class="p">;;</span>
<span class="o">-</span> <span class="o">:</span> <span class="kt">bool</span> <span class="o">=</span> <span class="bp">true</span></code></pre></figure>

<h3 id="precompiling-your-regular-expressions">“Precompiling” your regular expressions</h3>

<p>Unless you have a good reason not to, you will probably want to create your regular expression outside of the function that will be using it.</p>

<p>To see why, let’s check out this little benchmark program that compares two functions.  The first one reuses a regex that is created outside of the function, whereas the second one creates a new regex each time the function is called.</p>

<p><em>Note:  This benchmark program uses Jane Street’s <a href="https://github.com/janestreet/core_bench">core_bench</a> micro-benchmarking library.</em></p>

<figure class="highlight"><pre><code class="language-ocaml" data-lang="ocaml"><span class="k">open</span><span class="o">!</span> <span class="nc">Core</span>
<span class="k">open</span><span class="o">!</span> <span class="nc">Core_bench</span>

<span class="k">let</span> <span class="n">re</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span>

<span class="k">let</span> <span class="n">find</span> <span class="n">re</span> <span class="n">s</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="n">re</span> <span class="n">s</span>
<span class="k">let</span> <span class="n">find'</span> <span class="n">s</span> <span class="o">=</span> <span class="nn">Re2</span><span class="p">.</span><span class="n">find_first_exn</span> <span class="p">(</span><span class="nn">Re2</span><span class="p">.</span><span class="n">create_exn</span> <span class="s2">"a([bc])"</span><span class="p">)</span> <span class="n">s</span>

<span class="k">let</span> <span class="bp">()</span> <span class="o">=</span>
  <span class="nn">Command</span><span class="p">.</span><span class="n">run</span>
    <span class="p">(</span><span class="nn">Bench</span><span class="p">.</span><span class="n">make_command</span>
       <span class="p">[</span>
         <span class="nn">Bench</span><span class="p">.</span><span class="nn">Test</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"outside"</span> <span class="p">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
             <span class="n">find</span> <span class="n">re</span> <span class="s2">"abcabcabc"</span><span class="p">);</span>
         <span class="nn">Bench</span><span class="p">.</span><span class="nn">Test</span><span class="p">.</span><span class="n">create</span> <span class="o">~</span><span class="n">name</span><span class="o">:</span><span class="s2">"inside"</span> <span class="p">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span>
             <span class="n">find'</span> <span class="s2">"abcabcabc"</span><span class="p">);</span>
       <span class="p">])</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th style="text-align: right">Time/Run</th>
      <th style="text-align: right">mWd/Run</th>
      <th style="text-align: right">Percentage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>outside</td>
      <td style="text-align: right">272.60 ns</td>
      <td style="text-align: right">2.00 w</td>
      <td style="text-align: right">3.74%</td>
    </tr>
    <tr>
      <td>inside</td>
      <td style="text-align: right">7_281.55 ns</td>
      <td style="text-align: right">91.00 w</td>
      <td style="text-align: right">100.00%</td>
    </tr>
  </tbody>
</table>

<p>As you can see, reusing a regex rather than creating a new one each time a function is called makes a big difference in this benchmark.  Keep in mind that this is a micro-benchmark, and that this difference may not be that important to the run time of your program as a whole.  That said, if you had the slow version of the above function in a hot loop, it could really be wasting a lot of CPU cycles.</p>

<h2 id="wrap-up">Wrap up</h2>

<p>Hopefully this overview helps you get started with using <code class="language-plaintext highlighter-rouge">re2</code>!</p>

<p>To get more info about using <code class="language-plaintext highlighter-rouge">re2</code>, check out the <a href="https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2/index.html">API docs</a>.  Additionally, the <code class="language-plaintext highlighter-rouge">re2</code> <a href="https://github.com/janestreet/re2/tree/master/src">source code</a> is quite readable.  I encourage you to take a look at how the functions are defined–it may help clear up any additional questions you have!</p>]]></content><author><name>Ryan Moore</name></author><category term="blog" /><summary type="html"><![CDATA[In this post, I give an introduction and guide to help you get started using OCaml's re2 regular expression library, which provides OCaml bindings for Google's regular expression library, RE2.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/ocaml_re2_tutorial/ocaml_regex.png" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/ocaml_re2_tutorial/ocaml_regex.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Styling plots in base R graphics to match ggplot2 classic theme</title><link href="https://www.tenderisthebyte.com/blog/2021/05/09/pretty-plots-with-base-r-grahpics/" rel="alternate" type="text/html" title="Styling plots in base R graphics to match ggplot2 classic theme" /><published>2021-05-09T00:00:00+00:00</published><updated>2021-05-09T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2021/05/09/pretty-plots-with-base-r-grahpics</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2021/05/09/pretty-plots-with-base-r-grahpics/"><![CDATA[<p><a href="https://ggplot2.tidyverse.org/">ggplot2</a> is an R package for creating graphics in a declarative way and is based on <a href="https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html">The Grammar of Graphics</a>.  
If you have never used ggplot2, it’s a nice library for making publication ready figures with much less hassle than the base R graphics.</p>

<p>Something I think is pretty fun is to try and recreate ggplot2 style figures using base R graphics.  Sometimes, I look at the actual plotting code in the ggplot2 package, but I think it is more fun to just make a figure with ggplot and then try and get a reasonable match with base R.  Doing so, you really get an appreciation of the convenience of the ggplot2 package.</p>

<p>With that, let’s try and recreate a figure using the “classic” ggplot2 theme: <a href="https://ggplot2.tidyverse.org/reference/ggtheme.html">theme_classic</a>.</p>

<p><em>If you want to learn more about base R graphics, check out my <a href="https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/">deep dive into rotating axis labels in base R plots</a>.</em></p>

<div class="post-toc">

  <h4 class="post-toc--header" id="contents">Contents</h4>

  <ul>
    <li><a href="#set-up">Set up</a></li>
    <li><a href="#fixing-the-axes">Fixing the axes</a></li>
    <li><a href="#fixing-the-points">Fixing the points</a></li>
    <li><a href="#adding-a-legend">Adding a legend</a></li>
    <li><a href="#some-final-touchups">Some final touchups</a></li>
    <li><a href="#wrap-up">Wrap up</a></li>
  </ul>

</div>

<h2 id="set-up">Set up</h2>

<p>First, here is some “set up” code where we create some data and set some variables to hold colors and stuff like that.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">

</span><span class="n">k_purple</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"#875692"</span><span class="w">
</span><span class="n">k_orange</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"#F38400"</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">12341234</span><span class="p">)</span><span class="w">

</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">group</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s2">"B"</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">))</span></code></pre></figure>

<p>With that out of the way, let’s see the ggplot2 classic theme that we will try and match.  Here it is:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="p">),</span><span class="w">
       </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">geom_point</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">k_orange</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
    </span><span class="n">theme_classic</span><span class="p">()</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/ggplot_theme_classic.png" alt="ggplot2 classic theme" />
    <figcaption>ggplot2 classic theme</figcaption>
</figure>

<p>And finally, let’s compare the simplest possible base R graphics plot.  I’m sure that you’re familiar with what it looks like!</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base.png" alt="Base R graphics plot" />
    <figcaption>Base R graphics plot</figcaption>
</figure>

<p>You can see that that plot is pretty far from where we want to be.  Let’s go step-by-step getting closer to the <code class="language-plaintext highlighter-rouge">theme_classic</code> ggplot version each time.</p>

<h2 id="fixing-the-axes">Fixing the axes</h2>

<p>The first thing you see is that box around the plot that isn’t present in the ggplot version.  Let’s remove it by passing <code class="language-plaintext highlighter-rouge">bty = "n"</code> to the plot function.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Remove the box around the plot.</span><span class="w">
     </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_no_box.png" alt="Removing the box" />
    <figcaption>Removing the box</figcaption>
</figure>

<p>You can see that the axes are a bit different than in the ggplot2 version.  Here, the final ticks are the edges of the axis.  The ggplot version has a nice, solid line for the x and y axes that connects at the bottom left corner.  You can get that effect with the <code class="language-plaintext highlighter-rouge">bty</code> option to <code class="language-plaintext highlighter-rouge">plot</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">bty</code> parameter is an interesting one.  Here is the section from the <code class="language-plaintext highlighter-rouge">par</code> help file describing <code class="language-plaintext highlighter-rouge">bty</code>:</p>

<blockquote>
  <p>‘bty’ A character string which determined the type of box which
     is drawn about plots.  If ‘bty’ is one of ‘"o"’ (the
     default), ‘"l"’, ‘"7"’, ‘"c"’, ‘"u"’, or ‘"]"’ the resulting
     box resembles the corresponding upper case letter.  A value
     of ‘"n"’ suppresses the box.</p>
</blockquote>

<p>Those options look pretty weird, but each one shows the “shape” of the box it draws: <code class="language-plaintext highlighter-rouge">l</code> looks like an upper case <code class="language-plaintext highlighter-rouge">L</code>, so it draws lines on the left and the bottom only.  The <code class="language-plaintext highlighter-rouge">7</code> looks sort of like a <code class="language-plaintext highlighter-rouge">7</code>, so it draws the box lines on the top and right only.  Since we want lines on the left and bottom, we can use <code class="language-plaintext highlighter-rouge">bty = "l"</code>.  I will also remove the default x and y axes (using <code class="language-plaintext highlighter-rouge">xaxt</code> and <code class="language-plaintext highlighter-rouge">yaxt</code>) since we don’t want them to overlap the lines of the box.  We can also increase the line width a bit with <code class="language-plaintext highlighter-rouge">lwd</code>.</p>

<p>While you can control the box inside the plot function, I will use the <code class="language-plaintext highlighter-rouge">box</code> function instead.  That way, it will be a little easier to customize.  To do that, we will keep the <code class="language-plaintext highlighter-rouge">bty = "n"</code> in the plot function to turn the box off, then add it back in after with <code class="language-plaintext highlighter-rouge">box</code>.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Remove box.</span><span class="w">
     </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Remove default x and y axis.</span><span class="w">
     </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w">
    </span><span class="c1">## Add 'box' lines to the bottom and left of the plot.</span><span class="w">
    </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w">
    </span><span class="c1">## Increase width of box lines.</span><span class="w">
    </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb.png" alt="With nice axis lines" />
    <figcaption>With nice axis lines</figcaption>
</figure>
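<p>If you want to see all of the box types described above side by side, something like the following should work.  This is just a quick sketch: the panel layout, the margins, and the placeholder data are arbitrary choices made for illustration.</p>

```r
## Sketch: draw one small panel per `bty` value to compare the box shapes.
bty_values <- c("o", "l", "7", "c", "u", "]", "n")

## Lay the panels out in a 2 x 4 grid, saving the old settings.
old_par <- par(mfrow = c(2, 4), mar = c(1, 1, 3, 1))

for (bty in bty_values) {
  ## Placeholder data; turn off the default box and axes.
  plot(1:10, 1:10, bty = "n", xaxt = "n", yaxt = "n",
       xlab = "", ylab = "", main = sprintf("bty = \"%s\"", bty))
  ## Add the box back using the current box type.
  box("plot", bty = bty, lwd = 2)
}

## Restore the previous graphical parameters.
par(old_par)
```

<p>The last panel (<code class="language-plaintext highlighter-rouge">bty = "n"</code>) should come out with no box at all.</p>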

<h3 id="add-the-tick-marks">Add the tick marks</h3>

<p>Now let’s add the axis ticks and labels back in.  For that we use the
<code class="language-plaintext highlighter-rouge">axis</code> function.  We will change a few of the options at once, so I
will go over them first.  The <code class="language-plaintext highlighter-rouge">side</code> parameter controls where the axis
is drawn with respect to the plot: 1 = below, 2 = to the left, 3 =
above, and 4 = to the right.  Remember how the axis is drawn with the
line by default?  We turn that off with <code class="language-plaintext highlighter-rouge">lwd = 0</code> and then we set the
tick width to match the box width using <code class="language-plaintext highlighter-rouge">lwd.ticks = 2</code>.  Finally, we
want to <a href="https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/">rotate the tick labels of the y
axis</a>
so they are perpendicular to the axis.  Here it is.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">## X Axis</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Don't draw the axis line.</span><span class="w">
     </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
     </span><span class="c1">##  Match the width of the tick marks to the box lines.</span><span class="w">
     </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">## Y axis</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Rotate tick labels perpendicular to the axis.</span><span class="w">
     </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes.png" alt="With ticks and tick labels" />
    <figcaption>With ticks and tick labels</figcaption>
</figure>
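<p>If the <code class="language-plaintext highlighter-rouge">side</code> numbers are hard to remember, a throwaway plot like this one can help.  It is only a sketch: the margins, the placeholder data, and the <code class="language-plaintext highlighter-rouge">side_names</code> vector are made up for illustration.</p>

```r
## Sketch: draw an axis on every side of the plot and label it with
## its `side` number.
old_par <- par(mar = c(4, 4, 4, 4))

plot(1:10, 1:10, bty = "n", xaxt = "n", yaxt = "n", xlab = "", ylab = "")
box("plot", bty = "o", lwd = 2)

side_names <- c("below", "left", "above", "right")

for (side in 1:4) {
  ## No axis line, tick width matched to the box, and tick labels
  ## perpendicular to the axis.
  axis(side = side, lwd = 0, lwd.ticks = 2, las = 2)
  ## Write the side number in the margin on that side.
  mtext(sprintf("side = %d (%s)", side, side_names[side]),
        side = side, line = 2.5)
}

## Restore the previous margins.
par(old_par)
```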

<h3 id="adjusting-ticks-and-tick-labels">Adjusting ticks and tick labels</h3>

<p>Next, we are going to make some adjustments to the length of the tick
marks and to where the axis labels are drawn.  This can get a little
weird, and there are multiple ways to do it.  Let’s go through some of
the options we will need.</p>

<p>The <code class="language-plaintext highlighter-rouge">mgp</code> parameter is <a href="https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/#the-las-and-mgp-parameters">a little
tricky</a>.
It is a three part vector that controls the margin for the axis title
(<code class="language-plaintext highlighter-rouge">mgp[1]</code>), axis (tick) labels (<code class="language-plaintext highlighter-rouge">mgp[2]</code>), and the axis line
(<code class="language-plaintext highlighter-rouge">mgp[3]</code>).  The default value is <code class="language-plaintext highlighter-rouge">c(3, 1, 0)</code>.  The units are in
lines of text.</p>

<p>We want to move the axis labels and tick labels closer to the axis, so
we need to reduce the first two numbers in that vector.  This time,
I’m going to use the
<a href="https://stat.ethz.ch/R-manual/R-patched/library/graphics/html/par.html">par</a>
function to set the parameter since I want it to apply to all the
plotting functions.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">## Move the axis label and tick labels closer to the axis line.</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted.png" alt="Adjusting the axis labels" />
    <figcaption>Adjusting the axis labels</figcaption>
</figure>

<h3 id="adjusting-tick-label-length">Adjusting tick mark length</h3>

<p>Now that we’ve tweaked the label positions, we need to adjust the
tick length.  We do that with the <code class="language-plaintext highlighter-rouge">tcl</code> parameter to the <code class="language-plaintext highlighter-rouge">par</code> function,
which specifies tick mark length as a fraction of the height of a line
of text.  So <code class="language-plaintext highlighter-rouge">tcl = 1</code> will make tick marks the same height as a line
of text, and <code class="language-plaintext highlighter-rouge">tcl = -0.5</code> (the default) will make them 1/2 the line
height.  The sign of the argument controls the direction the ticks
point: positive values point into the chart, negative values point
away.  Let’s make them half as long as they are now with <code class="language-plaintext highlighter-rouge">tcl =
-0.25</code>.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
    </span><span class="c1">## Reduce the size of the tick marks.</span><span class="w">
    </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_2.png" alt="Shrinking the tick marks" />
    <figcaption>Shrinking the tick marks</figcaption>
</figure>
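
<p>To see how the sign and size of <code class="language-plaintext highlighter-rouge">tcl</code> interact, here is a small sketch that draws the same plot three times with different values.  The data and the names <code class="language-plaintext highlighter-rouge">demo_x</code>, <code class="language-plaintext highlighter-rouge">demo_y</code>, and <code class="language-plaintext highlighter-rouge">tick_len</code> are made up for this illustration; they are not the vectors used in the rest of the post.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">## Hypothetical data, only for this sketch.
set.seed(1)
demo_x &lt;- rnorm(20)
demo_y &lt;- rnorm(20)

## Save the current settings so we can restore them afterwards.
old_par &lt;- par(no.readonly = TRUE)
par(mfrow = c(1, 3))
for (tick_len in c(-0.5, -0.25, 0.25)) {
    ## Negative values point away from the plot, positive values point in.
    par(tcl = tick_len)
    plot(demo_x, demo_y, main = paste("tcl =", tick_len))
}
## Put everything back the way it was.
par(old_par)</code></pre></figure>

<p>Saving and restoring the settings is optional, but it keeps an experiment like this from leaking into later plots.</p>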

<h3 id="moving-the-x-labels-a-bit-more">Moving the x labels a bit more</h3>

<p>That’s pretty good, but to my eye, the x axis tick labels are still a
bit too far away from the ticks.  To fix that, we can pass the <code class="language-plaintext highlighter-rouge">mgp</code>
param directly to the <code class="language-plaintext highlighter-rouge">axis</code> function that we use to draw the axis.
It will override the global value set by the <code class="language-plaintext highlighter-rouge">par</code> function, but only
for that one call.  The 2nd element in the <code class="language-plaintext highlighter-rouge">mgp</code> vector
controls the axis tick labels, so we will reduce it from <code class="language-plaintext highlighter-rouge">0.4</code> to
<code class="language-plaintext highlighter-rouge">0.2</code>.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Reducing the 2nd element from 0.4 to 0.2 moves the x axis</span><span class="w">
     </span><span class="c1">## tick labels closer to the axis line.</span><span class="w">
     </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3.png" alt="Moving the x axis labels in" />
    <figcaption>Moving the x axis labels in</figcaption>
</figure>

<p>That’s better!</p>
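
<p>If you want to convince yourself that passing <code class="language-plaintext highlighter-rouge">mgp</code> to <code class="language-plaintext highlighter-rouge">axis</code> really is temporary, you can check the global setting after the axis is drawn.  This is just a minimal sketch with stand-in data (<code class="language-plaintext highlighter-rouge">1:10</code>) and a made-up variable name (<code class="language-plaintext highlighter-rouge">mgp_after</code>).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">par(mgp = c(1.5, 0.4, 0))
plot(1:10, xaxt = "n", yaxt = "n")
## Override mgp for this one call only.
axis(side = 1, mgp = c(1.5, 0.2, 0))
## The global setting should still be the one we gave to par.
mgp_after &lt;- par("mgp")
print(mgp_after)</code></pre></figure>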

<h2 id="fixing-the-points">Fixing the points</h2>

<p>Now that the axes are looking pretty good, let’s move on to the
points.  To change the type of point that is plotted, you use the
<code class="language-plaintext highlighter-rouge">pch</code> parameter.  I like <code class="language-plaintext highlighter-rouge">pch = 20</code> for little dots, but <code class="language-plaintext highlighter-rouge">pch = 16</code>
could work as well.  We can also change the size of the points with
the <code class="language-plaintext highlighter-rouge">cex</code> parameter.  The default size is <code class="language-plaintext highlighter-rouge">cex = 1</code> and increasing the
number will increase the size (e.g., <code class="language-plaintext highlighter-rouge">cex = 2</code> will be twice as big).
We will use <code class="language-plaintext highlighter-rouge">cex = 1.4</code> to approximate the size of the ggplot points.</p>
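
<p>If you’d like to compare the options side by side before picking, something like the following sketch can help.  It draws <code class="language-plaintext highlighter-rouge">pch = 16</code> and <code class="language-plaintext highlighter-rouge">pch = 20</code> at a few values of <code class="language-plaintext highlighter-rouge">cex</code>.  The <code class="language-plaintext highlighter-rouge">sizes</code> and <code class="language-plaintext highlighter-rouge">shapes</code> vectors are names I made up for the illustration.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">## Hypothetical values to compare; adjust to taste.
sizes &lt;- c(1, 1.4, 2)
shapes &lt;- c(16, 20)

## Set up an empty plot with one column per size and one row per shape.
plot(0, 0, type = "n", xlim = c(0.5, 3.5), ylim = c(0.5, 2.5),
     xaxt = "n", yaxt = "n", xlab = "cex", ylab = "pch")
axis(side = 1, at = seq_along(sizes), labels = sizes)
axis(side = 2, at = seq_along(shapes), labels = shapes, las = 2)

## Draw each shape at every size.
for (i in seq_along(shapes)) {
    points(seq_along(sizes), rep(i, length(sizes)),
           pch = shapes[i], cex = sizes)
}</code></pre></figure>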

<p>Finally, to change the color, we will use the <code class="language-plaintext highlighter-rouge">col</code> parameter to the
<code class="language-plaintext highlighter-rouge">plot</code> function.  For this parameter, we can pass in a vector the same
length as the <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> data vectors to specify the color for each
data point.  The <code class="language-plaintext highlighter-rouge">group</code> vector we created at the beginning gives two
groups, <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, for the points.  We want to associate each group
with a color so we make a named color vector like this: <code class="language-plaintext highlighter-rouge">colors &lt;- c(A
= k_purple, B = k_orange)</code>.  Then we use the <code class="language-plaintext highlighter-rouge">group</code> vector to index
the <code class="language-plaintext highlighter-rouge">colors</code> vector like this: <code class="language-plaintext highlighter-rouge">colors[group]</code>.</p>

<p>If that doesn’t make sense, here is a simple example.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tastiness</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">Cookie</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"yummy"</span><span class="p">,</span><span class="w"> </span><span class="n">Cake</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"yucky"</span><span class="p">)</span><span class="w">
</span><span class="n">desserts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Cookie"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cake"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cookie"</span><span class="p">)</span><span class="w">
</span><span class="n">tastiness</span><span class="p">[</span><span class="n">desserts</span><span class="p">]</span><span class="w">
</span><span class="c1">##   Cookie  Cake    Cookie</span><span class="w">
</span><span class="c1">##   "yummy" "yucky" "yummy"</span></code></pre></figure>

<p>Let’s use that idea for our plot.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">## Associate group A with purple and group B with orange.</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Draw filled in dots instead of open circles.</span><span class="w">
     </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Increase the size of the dots.</span><span class="w">
     </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Set the color of each dot based on its group.</span><span class="w">
     </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points.png" alt="Fixing the points" />
    <figcaption>Fixing the points</figcaption>
</figure>

<p>Now that’s looking pretty good!</p>
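
<p>One thing to watch for: looking colors up by name like this relies on the index being a character vector.  If your grouping variable happens to be a factor (an assumption for this sketch; it may not apply to the <code class="language-plaintext highlighter-rouge">group</code> vector in this post), R indexes by the factor’s integer codes instead of its labels, which can silently swap colors.  Here is a small illustration with made-up names and colors.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">## Hypothetical colors, listed in a different order than the factor levels.
demo_colors &lt;- c(B = "orange", A = "purple")
demo_group &lt;- factor(c("A", "B", "A"))

## Indexing with a factor uses its integer codes (1, 2, 1), not the names.
by_code &lt;- unname(demo_colors[demo_group])
## Converting to character first looks the colors up by name.
by_name &lt;- unname(demo_colors[as.character(demo_group)])

print(by_code)
print(by_name)</code></pre></figure>

<p>Wrapping the index in <code class="language-plaintext highlighter-rouge">as.character</code> is harmless when it is already a character vector, so it is a cheap safeguard.</p>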

<h2 id="adding-a-legend">Adding a legend</h2>

<p>It’s time now to put in the legend.  We will start with something
basic and then adjust it to match the legend in the ggplot2 figure.</p>

<p>To make a legend in base R graphics, use the
<a href="https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/legend.html">legend</a>
function.  We set the legend location with the <code class="language-plaintext highlighter-rouge">x</code> parameter.  To put
the legend on the right side of the plot, we use <code class="language-plaintext highlighter-rouge">x = "right"</code>.  We
use the <code class="language-plaintext highlighter-rouge">legend</code> param to actually tell the legend the names of the
groups: <code class="language-plaintext highlighter-rouge">legend = c("A", "B")</code>.  Now for the points, we specify the
style we used (<code class="language-plaintext highlighter-rouge">pch = 20</code>) and the different colors for each group
(<code class="language-plaintext highlighter-rouge">col = colors</code>).  Here it is.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
     </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1">## Add a legend to the right side of the plot.</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Specify the group names.</span><span class="w">
       </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w">
       </span><span class="c1">## And the colors of the dots.</span><span class="w">
       </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w">
       </span><span class="c1">## And the shape of the dots.</span><span class="w">
       </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend.png" alt="Adding a legend" />
    <figcaption>Adding a legend</figcaption>
</figure>

<p>That’s not bad, but not quite the look we are going for.  We need to
add a legend title, remove the box around the legend, and tweak the
size and spacing of the elements.</p>
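
<p>As a side note, since the color vector is already named by group, one option is to take the legend labels straight from those names so the labels and colors can’t drift out of sync.  This is only a sketch: <code class="language-plaintext highlighter-rouge">demo_colors</code>, <code class="language-plaintext highlighter-rouge">demo_group</code>, and the plotted numbers are stand-ins, not the data from this post.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">## Hypothetical colors and groups, only for this sketch.
demo_colors &lt;- c(A = "purple", B = "orange")
demo_group &lt;- c("A", "B", "A", "B")

plot(1:4, c(2, 4, 1, 3), pch = 20, cex = 1.4,
     col = demo_colors[demo_group])
## Use the names of the color vector as the legend labels.
legend(x = "right", legend = names(demo_colors),
       col = demo_colors, pch = 20)</code></pre></figure>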

<h3 id="adjusting-the-legend">Adjusting the legend</h3>

<p>To set the title, we can do this: <code class="language-plaintext highlighter-rouge">title = "group"</code>.  Removing the box
is done as in the main plot by setting <code class="language-plaintext highlighter-rouge">bty = "n"</code>.  I think it looks
nice when the size of the points in a legend matches the size of the
points in the plot.  To do that, we can use the <code class="language-plaintext highlighter-rouge">pt.cex</code> option.  We
set it to <code class="language-plaintext highlighter-rouge">1.4</code> to match the <code class="language-plaintext highlighter-rouge">cex</code> parameter that we passed in to
<code class="language-plaintext highlighter-rouge">plot</code> like so: <code class="language-plaintext highlighter-rouge">pt.cex = 1.4</code>.</p>

<p>It’s a subtle thing, but the legend elements in
the ggplot figure are a bit more spaced out than in the base graphics
figure.  To adjust that, we use the <code class="language-plaintext highlighter-rouge">x.intersp</code> and <code class="language-plaintext highlighter-rouge">y.intersp</code>
parameters, which adjust the spacing in the horizontal and
vertical directions (both act as multipliers on the default spacing).  The default
is <code class="language-plaintext highlighter-rouge">1</code> for both.  Since we want a little more space, we increase them
to something like this: <code class="language-plaintext highlighter-rouge">x.intersp = 1.4, y.intersp = 1.15</code>.</p>

<p>Here’s what those changes look like.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
     </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Add a title</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Remove the box around the legend.</span><span class="w">
       </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Increase the size of the points to match those in the plot.</span><span class="w">
       </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Increase the spacing in the x and y directions.</span><span class="w">
       </span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_2.png" alt="Adjusting the legend" />
    <figcaption>Adjusting the legend</figcaption>
</figure>

<h3 id="move-the-legend-outside-of-the-plotting-area">Move the legend outside of the plotting area</h3>

<p>Next we need to adjust the position of the whole legend.  Do you see
how it is actually inside the plot in the base graphics version, but
outside of it in the ggplot version?  We can move the legend around
with the <code class="language-plaintext highlighter-rouge">inset</code> parameter, which is given as a fraction of the
plot region.  The default value is <code class="language-plaintext highlighter-rouge">0</code>.  If you pass in
a positive number, the legend moves into the plot, whereas if you pass
in a negative number, the legend moves out away from the plot.  We will
pass in <code class="language-plaintext highlighter-rouge">inset = -0.1</code> to bump it to the right to get it outside of
the plot.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
     </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
       </span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Nudge the legend to the right.</span><span class="w">
       </span><span class="n">inset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.1</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_3.png" alt="Moving the legend outside of the plot area" />
    <figcaption>Moving the legend outside of the plot area</figcaption>
</figure>

<p>Whoops!  Do you see how the legend went right off the chart?  To make
sure the legend doesn’t get clipped, we need to pass in <code class="language-plaintext highlighter-rouge">xpd = TRUE</code>
to the <code class="language-plaintext highlighter-rouge">legend</code> function.  The <code class="language-plaintext highlighter-rouge">xpd</code> parameter controls where plot
elements are clipped: by default they are clipped to the plot region,
whereas with <code class="language-plaintext highlighter-rouge">xpd = TRUE</code> they are clipped to the figure region, which
includes the margins.  Here is
how you move the legend outside of the plotting area using the <code class="language-plaintext highlighter-rouge">xpd</code>
parameter.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">)</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
     </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
       </span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">,</span><span class="w">
       </span><span class="n">inset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.1</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Ensure the legend is not clipped even though it is</span><span class="w">
       </span><span class="c1">## outside of the plotting area.</span><span class="w">
       </span><span class="n">xpd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_4.png" alt="Do not clip the legend outside the plotting area" />
    <figcaption>Do not clip the legend outside the plotting area</figcaption>
</figure>

<h2 id="some-final-touchups">Some final touchups</h2>

<p>We’re almost there now!  Just a few more adjustments to make: tick
label size, plot element colors, and plot margins.</p>

<h3 id="tick-label-size">Tick label size</h3>

<p>Right now, the tick labels are a lot bigger than they are in the
ggplot version.  To fix that, we can pass in <code class="language-plaintext highlighter-rouge">cex.axis = 0.85</code> to the
<code class="language-plaintext highlighter-rouge">par</code> function.  That way, it will be applied to both the x and y axes
and we don’t have to specify it twice.  Remember that the default <code class="language-plaintext highlighter-rouge">cex</code>
is 1, so any number less than that shrinks the text relative to the default.</p>
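
<p>Here is a minimal sketch of that setting on its own.  The data
(<code class="language-plaintext highlighter-rouge">demo_x</code>, <code class="language-plaintext highlighter-rouge">demo_y</code>) and the <code class="language-plaintext highlighter-rouge">old_par</code> variable are made up for
this illustration and are not part of the plot we have been building.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">## Made-up data, only for illustration (not the data used in this post).
demo_x &lt;- 1:10
demo_y &lt;- demo_x^2

## Set cex.axis once, keeping the old value so it can be restored.
old_par &lt;- par(cex.axis = 0.85)

plot(demo_x, demo_y, xaxt = "n", yaxt = "n")
## Both calls pick up cex.axis = 0.85 from par().
axis(side = 1)
axis(side = 2, las = 2)

## Put the previous setting back.
par(old_par)</code></pre></figure>

<p>Saving the return value of <code class="language-plaintext highlighter-rouge">par</code> and passing it back in afterwards
keeps the change from leaking into later plots.</p>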

<h3 id="plot-element-colors">Plot element colors</h3>

<p>Setting the plot element colors can be a little tricky because we have
to specify them in a few different places.  I should mention that
there are quite a few ways to control the colors in plots made with
base R graphics.  It can get a little confusing as to what parameter
is controlling what aspect of the plot, especially when you consider
that the options passed in to the <code class="language-plaintext highlighter-rouge">par</code> function control lots of
different plot elements.  For example, <code class="language-plaintext highlighter-rouge">par(fg = "green")</code> will turn a
lot of plot elements green, but not all of them.  Rather than do that,
we will adjust colors mostly inside the functions that they will
affect.</p>
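
<p>If you want to see this for yourself, here is a quick sketch using
made-up data (the <code class="language-plaintext highlighter-rouge">old_par</code> name is just for this example).  Going by the
<code class="language-plaintext highlighter-rouge">par</code> documentation, setting <code class="language-plaintext highlighter-rouge">fg</code> through <code class="language-plaintext highlighter-rouge">par</code> also sets <code class="language-plaintext highlighter-rouge">col</code>, so I
would expect the points, box, and axis lines to turn green, while the
tick labels and axis labels keep their own colors, since those are
controlled by <code class="language-plaintext highlighter-rouge">col.axis</code> and <code class="language-plaintext highlighter-rouge">col.lab</code>.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">## Change the foreground color, keeping the old value.
old_par &lt;- par(fg = "green")

## Made-up data, only for illustration.
plot(1:10)

## Put the previous foreground color back.
par(old_par)</code></pre></figure>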

<p>We will first set a variable to hold the color and use that:
<code class="language-plaintext highlighter-rouge">base_color &lt;- "#444444"</code>.  The axis label color is controlled with
the <code class="language-plaintext highlighter-rouge">col.lab</code> parameter to the <code class="language-plaintext highlighter-rouge">par</code> function (<code class="language-plaintext highlighter-rouge">col.lab =
base_color</code>).  To change the axis (box) line color, we pass in <code class="language-plaintext highlighter-rouge">col =
base_color</code> to the <code class="language-plaintext highlighter-rouge">box</code> function.  For the axis ticks and tick
labels, we pass the <code class="language-plaintext highlighter-rouge">col</code> and <code class="language-plaintext highlighter-rouge">col.axis</code> parameters to the <code class="language-plaintext highlighter-rouge">axis</code> function
to control the tick color and the tick label color, respectively
(e.g., <code class="language-plaintext highlighter-rouge">col = base_color, col.axis = base_color</code>).  To change the
legend text color, we pass <code class="language-plaintext highlighter-rouge">text.col = base_color</code> directly to the <code class="language-plaintext highlighter-rouge">legend</code>
function.</p>

<h3 id="plot-margins">Plot margins</h3>

<p>As with many other things in base R graphics, there are a couple of ways
to control the plot margins.  We are going to use the <code class="language-plaintext highlighter-rouge">mar</code>
parameter to the <code class="language-plaintext highlighter-rouge">par</code> function.  To do so, you pass in a four-element
vector specifying the size of the margin (in lines of text) of the
bottom, left, top, and right sides of the plot, in that order.  The
default is <code class="language-plaintext highlighter-rouge">c(5, 4, 4, 2) + 0.1</code>.  We will shrink all the margins
except for the right, which we need to increase to make enough room
for our legend: <code class="language-plaintext highlighter-rouge">mar = c(3, 3, 1, 3.5)</code>.  Just to make it clear, that
is three lines of text for the bottom and left margins, one line of
text for the top margin, and 3.5 lines of text for the right margin.</p>
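
<p>As a small sketch of the margin setting by itself (again with made-up
data and a made-up <code class="language-plaintext highlighter-rouge">old_par</code> variable), you can check the current margins,
change them for one plot, and then restore them.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">## Look at the current margins: bottom, left, top, right.
par("mar")

## Set the new margins, keeping the old ones so they can be restored.
old_par &lt;- par(mar = c(3, 3, 1, 3.5))

## Made-up data, only for illustration.
plot(1:10)

## Put the previous margins back.
par(old_par)</code></pre></figure>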

<h3 id="all-the-final-adjustments">All the final adjustments</h3>

<p>Let’s put all the final touchups in now.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">base_color</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"#444444"</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">tcl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.25</span><span class="p">,</span><span class="w">
    </span><span class="c1">## Shrink the tick labels.</span><span class="w">
    </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.85</span><span class="p">,</span><span class="w">
    </span><span class="c1">## Set the axis label color</span><span class="w">
    </span><span class="n">col.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">,</span><span class="w">
    </span><span class="c1">## Adjust the margin:  bottom, left, top, right</span><span class="w">
    </span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3.5</span><span class="p">))</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_purple</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k_orange</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
     </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">[</span><span class="n">group</span><span class="p">])</span><span class="w">
</span><span class="n">box</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
    </span><span class="c1">## Set the box color.</span><span class="w">
    </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mgp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
     </span><span class="c1">## Set the axis tick and tick label colors.</span><span class="w">
     </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">,</span><span class="w"> </span><span class="n">col.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lwd.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
     </span><span class="c1">## Set the axis tick and tick label colors.</span><span class="w">
     </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">,</span><span class="w"> </span><span class="n">col.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"right"</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colors</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
       </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"group"</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w">
       </span><span class="n">x.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.4</span><span class="p">,</span><span class="w"> </span><span class="n">y.intersp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.15</span><span class="p">,</span><span class="w">
       </span><span class="n">inset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="n">xpd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
       </span><span class="c1">## Set the legend text color.</span><span class="w">
       </span><span class="n">text.col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">base_color</span><span class="p">)</span></code></pre></figure>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_5.png" alt="Applying the final adjustments" />
    <figcaption>Applying the final adjustments</figcaption>
</figure>

<p>Looking good!  That’s almost the same as the original “classic”
theme ggplot2 plot.  One thing you may notice is that the number of
tick marks on the axes is different.  You can actually adjust
this in base R graphics, but it can be a little bit tricky, so we will
leave that for another post.</p>

<h2 id="wrap-up">Wrap up</h2>

<p>Whew, that was a lot of stuff!  As we saw, copying the style of the
ggplot <code class="language-plaintext highlighter-rouge">theme_classic</code> requires quite a bit of fiddling with
many different parameters to a few different functions.  If I were
making a plot for a publication or blog post or something, I would
definitely just use ggplot, but it can be fun and educational to try
to reproduce something that an awesome library does with base R
graphics.  Hopefully, you enjoyed the process and learned a lot about
base R graphics!</p>]]></content><author><name>Ryan Moore</name></author><category term="blog" /><summary type="html"><![CDATA[The ggplot2 package makes some really nice looking plots. In this post, we give a step-by-step guide to styling plots, including moving the legend outside the plotting area, to match the ggplot2 classic theme using base R graphics.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_5.png" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/pretty_plots_in_base_r/base_box_lb_axes_adjusted_3_fix_points_legend_5.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Computational lab notebooks using git and git-annex</title><link href="https://www.tenderisthebyte.com/blog/2021/05/07/computational-lab-notebooks/" rel="alternate" type="text/html" title="Computational lab notebooks using git and git-annex" /><published>2021-05-07T00:00:00+00:00</published><updated>2021-05-07T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2021/05/07/computational-lab-notebooks</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2021/05/07/computational-lab-notebooks/"><![CDATA[<p><em>Disclaimer: if you need a lab notebook for legal records, copyright,
patent rights, or anything like that, then this article probably isn’t
for you.  This post is <strong>not</strong> providing any recommendations for those
cases.</em></p>

<div class="post-toc">

  <h4 class="post-toc--header" id="contents">Contents</h4>

  <ul>
    <li><a href="#overview">Overview</a></li>
    <li><a href="#provenance-tracking">Provenance tracking</a></li>
    <li><a href="#a-git-based-lab-notebook">A git-based lab notebook</a></li>
    <li><a href="#a-cli-app-to-help-manage-git-based-lab-notebooks">A CLI app to help manage git-based lab notebooks</a></li>
    <li><a href="#a-super-simple-example">A super simple example</a></li>
  </ul>

</div>

<p><em><strong>Too long; didn’t read</strong>: Check out the <a href="https://github.com/mooreryan/computational_lab_notebooks">cln
app</a> on
GitHub.  It helps you manage a computational lab notebook using git
and git-annex.  You can find the documentation
<a href="https://mooreryan.github.io/computational_lab_notebooks/">here</a>.</em></p>

<h2 id="overview">Overview</h2>

<p>Keeping a good lab notebook for your computational work is important,
but it can be challenging.  A quick Google search will show you lots
of examples of people talking about it:</p>

<ul>
  <li><a href="https://doi.org/10.1371/journal.pcbi.1004385">Ten Simple Rules for a Computational Biologist’s Laboratory Notebook</a></li>
  <li><a href="https://ori.hhs.gov/education/products/wsu/data.html">Notebook &amp; Data Management</a></li>
  <li><a href="https://scicomp.stackexchange.com/questions/35854/lab-notebooks-for-computational-science">Lab Notebooks for Computational Science</a></li>
  <li><a href="https://blog.addgene.org/how-to-keep-a-lab-notebook-for-bioinformatic-analyses">How to Keep a Lab Notebook for Bioinformatic Analyses</a></li>
  <li><a href="https://www.reddit.com/r/labrats/comments/66dlgq/keeping_a_good_lab_notebook_in_a_computational/">Keeping a good lab notebook in a computational field?</a></li>
</ul>

<p>I have tried a lot of different methods, but they all more or less
boil down to a workflow sort of like this:</p>

<ul>
  <li>Write down some summary of what I’m about to do and why.</li>
  <li>Run some commands, programs, or bash stuff.</li>
  <li>Copy what I did into a document (e.g., <a href="https://www.markdownguide.org/getting-started/">Markdown
notes</a> files,
<a href="https://tiddlywiki.com/">TiddlyWiki</a>, etc.).</li>
  <li>Write a bit more about what happened.</li>
  <li>Rinse and repeat.</li>
</ul>

<p>Then, depending on my needs, I may clean up the analysis and put it
into an <a href="https://rmarkdown.rstudio.com/">R Markdown</a> or <a href="https://jupyter.org/">Jupyter</a>
notebook so it will be easier to
reproduce later.</p>

<p>One problem with this general workflow is that it requires tracking a
lot of things manually (e.g., copying and pasting).  Whenever you do a
lot of that, you will inevitably forget to paste a command into your
notebook.  You might make a mistake or typo when running a command,
and rather than noting it down in your notebook, you just rerun it and
pretty soon your lab notebook is out of sync with the commands that
you have actually run.  Another issue is that you may be running a
bunch of commands quickly, just testing some ideas out.  When doing
this, you end up needing to track a ton of things in an ad-hoc manner
leading to a messy lab notebook that you need to come back to later
and reorganize.</p>

<p>In other words, you need to manually track a lot of information, and
it can be quite a challenge to keep track of everything!</p>

<h2 id="provenance-tracking">Provenance tracking</h2>

<p>One approach to dealing with this problem is by tracking the
provenance of files.  An example of this is how <a href="https://doi.org/10.1038/s41587-019-0209-9">QIIME
2</a> includes metadata in
their artifact files (<code class="language-plaintext highlighter-rouge">.qza</code> files) to <a href="https://docs.qiime2.org/2021.2/concepts/#data-files-qiime-2-artifacts">track things that were done in
an
analysis</a>.</p>

<p>I like the idea of provenance tracking, but even if you do use QIIME,
there are a lot of things you need to do outside of QIIME that will
need tracking.  While not quite the same, this sort of provenance
tracking reminds me a bit of using git or other version control
software.  <a href="https://git-scm.com/">Git</a> is software used to track
changes in a set of files, and is often used by programmers during
software development.</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//computational_lab_notebooks/git_logo.png" alt="git -- a distributed version control system" />
    <figcaption>git -- a distributed version control system</figcaption>
</figure>

<p><em>Note: If you have never used git before, the <a href="https://git-scm.com/doc">official
docs</a> have a lot of info that may be of use
to you.  I have also written a <a href="https://mooreryan.github.io/computational_lab_notebooks/git/">small git
tutorial</a>
that you may find useful!</em></p>

<p>While I had used git while working on software, I had never tried
using it to manage a computational lab notebook.  One reason is that
it <a href="https://stackoverflow.com/questions/3055506/git-is-very-very-slow-when-tracking-large-binary-files">doesn’t handle large files
well</a>.
For computational work, whether bioinformatics or data science, you
will be dealing with a lot of large files.  Sequencing files easily
get over 10 GB in size, so using git alone is going to be problematic.
However, there are extensions to git like <a href="https://git-lfs.github.com/">Git Large File
Storage</a> and
<a href="https://git-annex.branchable.com/">git-annex</a> that help to address
this problem.  (Essentially, git-annex tracks <a href="https://en.wikipedia.org/wiki/Symbolic_link">symbolic
links</a> in the git
repository rather than the file itself.  There is a lot more to it
than that, so check out the <a href="https://git-annex.branchable.com/walkthrough/">git-annex
walkthrough</a> if you
want to know more.)</p>
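
<p>To make the symlink idea a bit more concrete, here is a tiny sketch that moves a file into a content-addressed store and leaves a symbolic link in its place.  This is only an illustration of the general idea: the <code class="language-plaintext highlighter-rouge">store</code> directory, the file name, and the use of <code class="language-plaintext highlighter-rouge">sha256sum</code> (GNU coreutils) are my assumptions, and git-annex’s real layout and key format are different.</p>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"># Simplified illustration only; this is NOT how git-annex lays out its files.
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
mkdir store

# Stand-in for a large data file (tee writes the file and echoes it).
printf 'ACGTACGT\n' | tee reads.fastq

# Name the content by its checksum and move it into the store.
key="$(sha256sum reads.fastq | cut -d' ' -f1)"
mv reads.fastq "store/$key"

# Leave a symbolic link where the file used to be.
ln -s "store/$key" reads.fastq

# The link itself is tiny, but reading through it gives the original content.
readlink reads.fastq
cat reads.fastq</code></pre></figure>

<p>A version control system only has to track the small link, while the large content lives elsewhere.</p>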

<h2 id="a-git-based-lab-notebook">A git-based lab notebook</h2>

<p><em>Note: I’m not the first one to think of using git to help manage a
computational lab notebook.  In fact, you can find some interesting
discussion on whether version control is even useful for lab notebooks
<a href="http://ivory.idyll.org/blog/is-version-control-an-electronic-lab-notebook.html">here</a>,
<a href="https://kbroman.org/blog/2013/08/20/electronic-lab-notebook/">here</a>,
and
<a href="https://yossadh.github.io/posts/2018/12/lab-notebook-part-2/">here</a>.</em></p>

<p>Using git and git-annex, I figured that I could get a pretty decent
workflow going for my computational lab notebook.  After playing
around with it for a while (and seeing that git-annex was a good
solution to git’s large file problem), I settled into a pretty
familiar workflow:</p>

<ul>
  <li>Run a program, script, whatever.</li>
  <li>Track any new files or changes with git.</li>
  <li>Commit the changes.</li>
  <li>Repeat.</li>
</ul>

<p>One key difference from my “typical” workflow is that instead of
putting the commands that I ran and their explanations into some
external document like a markdown file, I would put all the
information into the commit message.  That way, all the info about how
and why I did something would be tracked in the git repository along
with the actual files and changes.</p>
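
<p>As a rough sketch, one pass through that loop might look like the following.  The temporary repository, the user name and email, and the wording of the commit message are all made up for illustration; the only git features used are <code class="language-plaintext highlighter-rouge">git add</code> and repeated <code class="language-plaintext highlighter-rouge">-m</code> flags, which git joins into separate paragraphs of the commit message.</p>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"># Illustrative sketch; names and message text are placeholders.
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Run the analysis step (a toy command here; tee writes the file).
printf 'I like apple pie\n' | tee msg.txt

# Track the new file, then record what was run, and why, in the commit message.
git add msg.txt
git -c user.name='Example User' -c user.email='user@example.com' \
  commit -q \
  -m 'Created the msg.txt file' \
  -m 'Details: needed a small file describing something I like.' \
  -m 'Command: printf "I like apple pie\n" | tee msg.txt'

# The how and why now live right next to the change itself.
git log -1 --format=%B</code></pre></figure>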

<p>That works pretty well, but you still run into the issue of having to
remember what you ran, copy and paste it correctly into the commit
message, blah blah blah.  In other words, it’s still a bit of a pain.
While you get the added benefits of git logs and history tracking, you
have to do a lot of repetitive, annoying stuff to get things to work.
So, of course, I wrote a little program to help automate some of the
tedious stuff!</p>

<h2 id="a-cli-app-to-help-manage-git-based-lab-notebooks">A CLI app to help manage git-based lab notebooks</h2>

<p>While working with the above workflow, in addition to QIIME’s
provenance tracking, I was also reminded of <a href="https://en.wikipedia.org/wiki/Schema_migration">database
migrations</a>.
Basically, the way they work is that you write some script that says
how the database is supposed to change (e.g., add column <code class="language-plaintext highlighter-rouge">first_name</code>
to table <code class="language-plaintext highlighter-rouge">authors</code>), and then <a href="https://guides.rubyonrails.org/active_record_migrations.html#running-migrations">some migration
tool</a>
handles actually making any changes to the database.  In theory, this
gives you a simpler way to track how your database has changed over
time–you can just follow the paper trail of your migration files.</p>

<p>The app I wrote works in a similar way, except that instead of making
incremental changes to a database, you are formalizing making changes
to the repository itself.  The app is called <code class="language-plaintext highlighter-rouge">cln</code> (it stands for
“computational lab notebooks”…clever, I know!).  You can find it on
<a href="https://github.com/mooreryan/computational_lab_notebooks">GitHub</a>.
There is also some pretty extensive <a href="https://mooreryan.github.io/computational_lab_notebooks/">documentation
available</a>
to help you get started using the software.</p>

<p>While I suggest you check out the docs for a more detailed explanation
of its installation and usage, I want to show a quick, little
example to give you a flavor of how the <code class="language-plaintext highlighter-rouge">cln</code> program can help you
manage your git-based lab notebook.</p>

<h2 id="a-super-simple-example">A super simple example</h2>

<p>The <code class="language-plaintext highlighter-rouge">cln</code> command provides a couple of subcommands to help you manage
your lab notebook with git and git-annex.  (For more details on
individual subcommands, see
<a href="https://mooreryan.github.io/computational_lab_notebooks/usage/">here</a>).</p>

<h3 id="create-a-project">Create a project</h3>

<p>To start, you make a new project.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ mkdir -p ~/projects/cln_example &amp;&amp; cd ~/projects/cln_example
$ cln init 'Example Project'
$ tree -a -I .git
.
├── .actions
│   ├── completed
│   ├── failed
│   ├── ignored
│   └── pending
└── README.md</code></pre></figure>

<p>The <code class="language-plaintext highlighter-rouge">cln init</code> command initializes a new project, creates a git
repository, and generates some scaffolding for actions and git commit
templates.</p>

<h3 id="prepare-an-action">Prepare an action</h3>

<p>Next, you prepare an action to run.  (Again, this is just a silly
example…for a more in-depth tutorial, see the
<a href="https://mooreryan.github.io/computational_lab_notebooks/">documentation</a>).</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln prepare 'printf "I like apple pie\n" &gt; msg.txt'</code></pre></figure>

<p>In this case the action is just running a <code class="language-plaintext highlighter-rouge">printf</code> command and saving
the contents in a file.  Of course, you can prepare an action
containing anything that you would normally run at the command line.
For example, you could prepare a crazy action like this:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln prepare "$(cat &lt;&lt;'EOF'
cut -f2 seq_information.seq_id_eco.tsv \
  | cut -d';' -f5 \
  | ruby -e 'h = Hash.new 0; \
      ARGF.each {|l| h[l.chomp] += 1 }; \
      h.sort_by {|_, count| count }.reverse. \
      each {|eco, count| puts "#{eco}\t#{count}" }' \
  | column -t \
  &gt; seq_eco_counts.txt
EOF
)"</code></pre></figure>

<p><em>Note: That’s actually an action I prepared and ran in a real project.
Previously, I would have put that little ad-hoc
<a href="https://www.ruby-lang.org/en/">Ruby</a> script into a file and run it in
a way that is easier to track, but with <code class="language-plaintext highlighter-rouge">cln</code> to help me manage
things, everything will be nicely tracked automatically.</em></p>

<p>The <code class="language-plaintext highlighter-rouge">cln prepare</code> command creates an action file and a <a href="https://git-scm.com/docs/git-commit/2.10.5#Documentation/git-commit.txt---templateltfilegt">git commit
template</a>.
The action file is simply a bash script with the command you want to
run, but having it there in your repository as a standalone script
helps you see what is going on if you’re running a complicated command
or when you come back to the project a couple of months later.</p>
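
<p>To give a feel for the “prepare, then run” idea, here is a hypothetical sketch of the pending-to-completed flow using plain shell.  This is not how <code class="language-plaintext highlighter-rouge">cln</code> is implemented: the action file name and the logic below are assumptions for illustration, and only the directory names come from the project tree shown earlier.</p>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"># Hypothetical sketch of the pending to completed idea; not cln's actual code.
set -eu
proj="$(mktemp -d)"
cd "$proj"
mkdir -p .actions/pending .actions/completed .actions/failed

# "Prepare": save the command as a standalone script (made-up file name).
action=".actions/pending/action__example.sh"
printf '%s\n' 'printf "I like apple pie\n" | tee msg.txt' | tee "$action"

# "Run": execute the script, then file it according to the outcome.
if bash "$action"; then
  mv "$action" .actions/completed/
else
  mv "$action" .actions/failed/
fi

# The script that produced msg.txt is still around to inspect later.
ls .actions/completed</code></pre></figure>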

<h3 id="run-the-pending-action">Run the pending action</h3>

<p>Next, you can check that everything is okay by doing a <a href="https://en.wikipedia.org/wiki/Dry_run_(testing)">dry
run</a>.  It will spit
out some stuff to the terminal to let you know what’s going on and
suggest what steps to take next.  <em>Note: I’ve edited the terminal
output a bit.</em></p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln run -dry-run
~~~
~~~
~~~ Hi!  I just previewed an action for you.
~~~
~~~ I plan to run this action file:
~~~   '.actions/pending/action__ ...'
~~~
~~~ It's contents are:
~~~
printf "I like apple pie\n" &gt; msg.txt

~~~
~~~ If that looks good, you can run the action:
~~~   $ cln run
~~~
~~~</code></pre></figure>

<p>If it looks good, you can go ahead and run the action.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ cln run
  ~~~
  ~~~
  ~~~ Hi!  I just ran an action for you.
  ~~~
  ~~~ * The pending action was '.actions/pending/action__REDACTED.sh'.
  ~~~ * The completed action is '.actions/completed/action__REDACTED.sh'.
  ~~~
  ~~~ Now, there are a couple of things you should do.
  ~~~
  ~~~ * Check which files have changed:
  ~~~     $ git status
  ~~~ * Add actions and commit templates:
  ~~~     $ git add .actions
  ~~~ * Unless they are small, add other new files with git annex:
  ~~~     $ git annex add blah blah blah...
  ~~~ * After adding files, commit changes using the template:
  ~~~     $ git commit -t '.actions/completed/action__REDACTED.gc_template.txt'
  ~~~
  ~~~ After that you are good to go!
  ~~~
  ~~~ * You can now check the logs with git log,
  ~~~   or use a GUI like gitk to view the history.
  ~~~
  ~~~</code></pre></figure>

<p>See how the <code class="language-plaintext highlighter-rouge">cln run</code> command gives you hints on what to do next?  I
tried to make all the <code class="language-plaintext highlighter-rouge">cln</code> commands spit out helpful info like that
to the terminal.</p>

<h3 id="track-and-commit-changes">Track and commit changes</h3>

<p>Now, you will be able to see any files that were created or changed as
the result of running the action using <code class="language-plaintext highlighter-rouge">git status</code>.  Depending on the
size(s) of the file(s) that were created or changed, you can add them
to the <a href="https://mooreryan.github.io/computational_lab_notebooks/git/#what-is-an-index">git
index</a>
with either <code class="language-plaintext highlighter-rouge">git add</code> or <code class="language-plaintext highlighter-rouge">git-annex add</code>.  Finally, you commit the
changes using the git commit template that was made when you prepared
the action.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ git commit -t '.actions/completed/action__REDACTED.gc_template.txt'</code></pre></figure>

<p>The template file will look something like this:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">PUT COMMIT MSG HERE.

== Details ==
PUT DETAILS HERE.

== Command(s) ==
printf "I like apple pie\n" &gt; msg.txt

== Action file ==
action__REDACTED.sh</code></pre></figure>

<p>When you run the <code class="language-plaintext highlighter-rouge">git commit</code> command, a text editor will pop up with
the contents of the git template file ready for you to fill out.  This
is nice because you can avoid manually copying in the commands you
ran.  For such a small example it’s not really a big deal, but if
you’re running some complicated bioinformatics software with a lot of
flags and options, it’s pretty convenient!</p>

<h3 id="browse-the-git-history">Browse the git history</h3>

<p>After editing the message and saving the commit, you can browse
through your nicely organized repository history and see something
like this:</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ git log
commit ebf738 (HEAD -&gt; master)
Author: Ryan Moore &lt;moorer@udel.edu&gt;
Date:   Mon Apr 5 18:44:54 2021 -0400

    Created the msg.txt file

    == Details ==
    I needed to create a file that describes something that I like.  I
    used the `printf` rather than `echo` because it is more portable.
    (See https://stackoverflow.com/a/11530298 for a discussion of this on
    stack overflow).

    == Command(s) ==
    /usr/bin/printf "I like apple pie\n" &gt; msg.txt

    == Action file ==
    action__460986084__2021-04-05_18:02:37.sh

commit 1a2e90
Author: Ryan Moore &lt;moorer@udel.edu&gt;
Date:   Mon Apr 5 17:43:50 2021 -0400

    Initial commit</code></pre></figure>

<p>Notice how I put a short, descriptive commit message for the first
line, and then added in any additional details that I think I will
need later.  The <code class="language-plaintext highlighter-rouge">== Details ==</code> section would hold all the extra
stuff I would put in my lab notebook anyway, but it is really
convenient to have it right there in the git log.</p>

<p>Having the command that you ran, the details about that command, and
the changes that command effected in your repository opens up some
really powerful ways to track your analyses.</p>

<h3 id="get-individual-file-provenance-info">Get individual file provenance info</h3>

<p>For example, you can use the <code class="language-plaintext highlighter-rouge">git</code> CLI app (e.g., <code class="language-plaintext highlighter-rouge">git whatchanged</code> or
<code class="language-plaintext highlighter-rouge">git log</code>) or a GUI like <a href="https://git-scm.com/docs/gitk/">gitk</a> to get
detailed info about the provenance of any files in the repository.
You could run something like this to see all the history for the
<code class="language-plaintext highlighter-rouge">msg.txt</code> file.</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">$ git log --stat --follow -p -- msg.txt
commit ... (HEAD -&gt; master)
Author: Ryan Moore &lt;moorer@udel.edu&gt;
Date:   ....

    Created the msg.txt file

    == Details ==
    I needed to create a file that describes something that I like.  I
    used the `printf` rather than `echo` because it is more portable.
    (See https://stackoverflow.com/a/11530298 for a discussion of this on
    stack overflow).

    == Command(s) ==
    printf "I like apple pie\n" &gt; msg.txt

    == Action file ==
    action__467354640__.....sh
---
 msg.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/msg.txt b/msg.txt
new file mode 100644
index 0000000..135d9d6
--- /dev/null
+++ b/msg.txt
@@ -0,0 +1 @@
+I like apple pie</code></pre></figure>

<p>As you can imagine, having output like that for all the files in your
project folder as well as the chronological logs is a very powerful
way to track your analyses and makes managing a computational lab
notebook so much easier.</p>
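
<p>If the commit messages follow a predictable layout, you can also query them from scripts.  The sketch below sets up a throwaway repository (the names and message text are placeholders), then asks git which commit first added a file and pulls the recorded command back out of that commit’s message.  It relies on the <code class="language-plaintext highlighter-rouge">--diff-filter=A</code> and <code class="language-plaintext highlighter-rouge">--format</code> options of <code class="language-plaintext highlighter-rouge">git log</code>, and it assumes the command sits on the line right after the <code class="language-plaintext highlighter-rouge">== Command(s) ==</code> header.</p>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"># Illustrative sketch; repository contents and names are placeholders.
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
printf 'I like apple pie\n' | tee msg.txt
git add msg.txt
git -c user.name='Example User' -c user.email='user@example.com' \
  commit -q -m 'Created the msg.txt file' \
  -m '== Command(s) ==
printf "I like apple pie\n" | tee msg.txt'

# Which commit first added the file?
git log --diff-filter=A --format='%h %s' -- msg.txt

# Pull the recorded command back out of that commit's message.
git log --diff-filter=A --format=%b -- msg.txt | grep -A1 -F '== Command(s) =='</code></pre></figure>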

<h2 id="wrap-up">Wrap up</h2>

<p>Managing a computational lab notebook is tricky.  I have found that
using git and git-annex can be a good way to keep all the info you
need right in the same directory as all your data files, scripts, and
analysis code.  To help you more easily manage lab notebooks using git
and git-annex, I created a command line app called <code class="language-plaintext highlighter-rouge">cln</code>.  You can
find the code on
<a href="https://github.com/mooreryan/computational_lab_notebooks">GitHub</a>.
Installation instructions and usage examples can be found in the
<a href="https://mooreryan.github.io/computational_lab_notebooks/">documentation</a>.</p>]]></content><author><name>Ryan Moore</name></author><category term="blog" /><summary type="html"><![CDATA[Managing a computational lab notebook can be tricky. Here I discuss a workflow and command line app for helping you to set up and manage your lab notebook with git and git-annex.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/computational_lab_notebooks/git_log_672_high.jpg" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/computational_lab_notebooks/git_log_672_high.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">divnet-rs: A Rust implementation for DivNet</title><link href="https://www.tenderisthebyte.com/blog/2021/01/18/divnet-rust-implementation/" rel="alternate" type="text/html" title="divnet-rs: A Rust implementation for DivNet" /><published>2021-01-18T00:00:00+00:00</published><updated>2021-01-18T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2021/01/18/divnet-rust-implementation</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2021/01/18/divnet-rust-implementation/"><![CDATA[<ul>
  <li><em>Update: divnet-rs now has a way to parallelize the bootstrapping procedure. With enough RAM, it can give <a href="https://github.com/mooreryan/divnet-rs/issues/4#issuecomment-955592257">approximately linear decreases</a> in run time with increasing number of cores. Consider it an <a href="https://github.com/mooreryan/divnet-rs/blob/main/CHANGELOG.md#unreleased">experimental</a> feature for now.</em></li>
  <li><em>Update 2022-04-06: On the <a href="https://doi.org/10.3389/fmicb.2015.01470">Lee dataset</a>, <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.3.0">v0.3.0</a> is around 3x faster and uses ~60% of the memory as compared to <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.2.1">v0.2.1</a>.</em></li>
  <li><em>Update 2021-01-22: <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.2.1">v0.2.1</a> further decreases the run time and required memory</em></li>
  <li><em>Update 2021-01-19: As of <code class="language-plaintext highlighter-rouge">divnet-rs</code> <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.2.0">v0.2.0</a>, users can manually set the random seed. Also, <code class="language-plaintext highlighter-rouge">v0.2.0</code> uses only about 2/3 the memory that was used by <a href="https://github.com/mooreryan/divnet-rs/releases/tag/v0.1.1">v0.1.1</a>.</em></li>
</ul>

<h2 id="background">Background</h2>

<p>One reason for doing microbiome sequencing is to learn about the microbial diversity of the ecosystems of interest. Estimating the diversity of microbial communities is hard. Essentially every step of a sample-to-sequence pipeline <a href="https://doi.org/10.7554/eLife.46923">introduces biases</a> into your analyses, meaning the community composition you observe is likely quite different from the true community composition. Further, <a href="https://doi.org/10.3389/fmicb.2017.02224">microbiome datasets are compositional</a>, and must be treated with <a href="https://doi.org/10.1093/gigascience/giz107">statistical and computational methods</a> designed to handle such data.</p>

<p>Most communities are incredibly complex, so you will nearly always have issues with undersampling – there are just too many microbes to sequence them all, so you have to work with samples. Even though you cannot practically observe all the taxa in your environment, you still need to estimate the diversity of that environment. So why don’t we just “plug in” our data into one of the common diversity indices borrowed from macroecology like Shannon or Simpson and be done with it? You will actually see this a lot in the literature: plugging in the observed relative abundances (sometimes after <a href="https://doi.org/10.1371/journal.pcbi.1003531">rarefying</a> the data first) from our samples into standard “plug-in” diversity formulas.</p>

<p>There are a couple of problems with this. Undersampling is problematic because alpha diversity metrics are <a href="https://doi.org/10.3389/fmicb.2019.02407">heavily biased when there are unobserved taxa</a>. The random sampling variation combined with biases introduced in the sample-to-sequence pipeline means your observed relative abundances probably don’t faithfully represent the true community you want to study. Additionally, many commonly used methods for generating confidence intervals assume that taxa are independent (i.e., if taxon A is present in a community, it doesn’t provide any information about whether taxon B is there too).</p>
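
<p>For reference, the “plug-in” calculation discussed above really is just a few lines.  Here is a minimal sketch that takes one count per line and computes the Shannon index (H = -sum of p ln p) and the Simpson index (D = sum of p squared) from the observed relative abundances.  The function name <code class="language-plaintext highlighter-rouge">plugin_diversity</code> and the counts are made up for illustration, and this is the naive estimate, not what DivNet does.</p>

<figure class="highlight"><pre><code class="language-shell" data-lang="shell"># Naive plug-in diversity from raw counts (one count per line on stdin).
# Hypothetical helper for illustration; it ignores unobserved taxa entirely.
plugin_diversity() {
  awk '
    { count[NR] = $1; total += $1 }
    END {
      for (i in count) {
        p = count[i] / total      # observed relative abundance
        if (p) {
          shannon -= p * log(p)   # H = -sum p ln p
          simpson += p * p        # D = sum p^2
        }
      }
      printf "Shannon: %.4f\nSimpson: %.4f\n", shannon, simpson
    }'
}

# Toy counts for four taxa in one sample (made-up numbers).
printf '%s\n' 40 30 20 10 | plugin_diversity</code></pre></figure>

<p>For these toy counts the output is <code class="language-plaintext highlighter-rouge">Shannon: 1.2799</code> and <code class="language-plaintext highlighter-rouge">Simpson: 0.3000</code>.  Notice that any taxon with a zero count contributes nothing, which is exactly where the undersampling bias comes from.</p>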

<h3 id="what-is-divnet">What is DivNet?</h3>

<p>So how are you supposed to measure diversity of microbial communities then? One method that is designed to address a lot of these problems is <a href="https://github.com/adw96/DivNet">DivNet</a>, an R package for estimating diversity when taxa in the community occur in an ecological network (i.e., a pattern of microbial co-occurrence). DivNet leverages info from multiple samples and can estimate the relative abundance of a taxon in communities where it was unobserved. It also gives accurate estimates of variance in the measured diversity by taking into account sample metadata/covariates.</p>

<p>Probably the most interesting aspect of DivNet is that it allows you to account for ecological networks where taxa positively and negatively co-occur. DivNet estimates diversity using models from <a href="https://en.wikipedia.org/wiki/Compositional_data">compositional data analysis</a> that can handle co-occurrence networks. This is in contrast to most common diversity estimates that are based on the <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">multinomial model</a> that makes assumptions about sampling that prohibit ecological networks (i.e., situations in which taxa positively and negatively co-occur). (<em>Note: you may know the multinomial model from your stats courses in modeling the probability of counts for dice rolls or as a generalization of the <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a>.</em>)</p>

<p>You can find a lot more information about DivNet, including algorithmic details, validation, comparison to other methods of estimating diversity, and some important details to keep in mind when using DivNet on your data in the <a href="https://doi.org/10.1093/biostatistics/kxaa015">DivNet manuscript</a>.</p>

<h3 id="why-make-divnet-rs">Why make divnet-rs?</h3>

<p>In the <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/vignettes/getting-started.Rmd">getting started tutorial</a>, there is a section called “What does DivNet do that I can’t do already?” (it is worth reading if you haven’t!). So I thought it would be good to answer the question, “What does <code class="language-plaintext highlighter-rouge">divnet-rs</code> do that the R implementation of DivNet can’t do already?” The answer is simple: <code class="language-plaintext highlighter-rouge">divnet-rs</code> gives you the ability to apply the DivNet algorithm to large datasets. For those without easy access to high performance computing facilities, you will be able to run <code class="language-plaintext highlighter-rouge">divnet-rs</code> on typically sized SSU rRNA microbiome datasets on your laptop. <code class="language-plaintext highlighter-rouge">divnet-rs</code> is both faster and much more memory efficient than the R implementation. Of course, bioinformatics software is all about tradeoffs and <code class="language-plaintext highlighter-rouge">divnet-rs</code> is no different. <a href="#differences-in-the-implementations">Compared to the R implementation</a>, it’s harder to install, you have to write some R code specifically to get data in and out of <code class="language-plaintext highlighter-rouge">divnet-rs</code>, and not all network and bootstrapping options offered by the R implementation are available in the Rust implementation. That said, I think <code class="language-plaintext highlighter-rouge">divnet-rs</code> still fulfills a useful niche by allowing researchers to apply the DivNet algorithm to datasets that are currently too large for the R implementation to handle.</p>

<h2 id="comparing-run-time-and-memory-usage">Comparing run time and memory usage</h2>

<h3 id="set-up">Set up</h3>

<p>While developing <code class="language-plaintext highlighter-rouge">divnet-rs</code>, I spent a good amount of time profiling and optimizing the code. Rather than talk about that, I wanted to get a high level overview of how the performance of the R and Rust implementations compared on a real dataset. The data I used was the <a href="https://doi.org/10.3389/fmicb.2015.01470">Lee dataset</a> that is included with the DivNet R package. It has 1490 <a href="https://doi.org/10.1038/ismej.2017.119">amplicon sequence variants</a> (ASVs), 16 samples, and associated taxonomy and sample info.</p>

<p>So what did I do? First, I took the Lee data and sorted the ASV table in decreasing abundance order. Then I created new datasets from the top 10, 20, 40, 80, 160, 320, 640, and 1280 ASVs. In addition to the full 16 sample datasets, I also created test datasets with only eight samples by randomly picking samples from the ASV table, removing any ASVs that had zero count in the remaining samples, and then taking the top 10, 20, …, 1280 ASVs just like for the 16 sample datasets. I ran everything with the default algorithm tuning in DivNet (6 expectation maximization (EM) iterations (3 burn), 500 Monte-Carlo (MC) iterations (250 burn)) and 2 replicates. I would probably use the “careful” setting (10 EM iterations and 1000 MC iterations) as well as running more replicates if I were actually analyzing data, but this was good enough for this little profiling experiment.</p>

<p>This isn’t the most scientific profiling job ever, but it should give you a sense of how the run time and memory scale with the number of taxa and samples for both the R and Rust versions of DivNet. For the timing, I ran each dataset three times, and I used the <code class="language-plaintext highlighter-rouge">time</code> function to get the elapsed time and the max memory used for each run. Since loading all the R dependencies takes a large proportion of the total run time in the smaller DivNet-R runs, I got the elapsed time of just the <code class="language-plaintext highlighter-rouge">divnet</code> function using the <a href="https://cran.r-project.org/web/packages/tictoc/index.html">tictoc</a> R package. I still used <code class="language-plaintext highlighter-rouge">time</code> to get the max memory for these runs though.</p>

<p>One other thing to mention: I ran all of these on a compute cluster. I didn’t think about it until after I had already run everything, but I compiled both <code class="language-plaintext highlighter-rouge">divnet-rs</code> and <code class="language-plaintext highlighter-rouge">OpenBLAS</code> on a different node than the one that I used to actually run the tests. The compute cluster that I used has a bunch of different types of nodes, so the compiled output of both may not be ideal for the node I actually ran the timings on (e.g., different <a href="https://en.wikipedia.org/wiki/SIMD">SIMD instructions</a>, different CPU architectures, etc.). While the timing experiments were running, there were other jobs on the same node running at the same time, so that is another thing that may have influenced the results.</p>

<p>For the R tests, I used R v3.6.2 linked against <a href="https://www.openblas.net/">OpenBLAS</a> v0.3.7 and DivNet v0.3.6. I set DivNet to use only 1 core (<code class="language-plaintext highlighter-rouge">ncores = 1</code>) because in all my tests (and on multiple different machines), DivNet is actually slower when using more than one core. For <code class="language-plaintext highlighter-rouge">divnet-rs</code> I used v0.1.1 linked against OpenBLAS v0.3.13. I also forced OpenBLAS to use only 1 core (<code class="language-plaintext highlighter-rouge">OPENBLAS_NUM_THREADS=1</code>) as that is how R was using OpenBLAS. (<em>As an aside, if you don’t have <a href="https://csantill.github.io/RPerformanceWBLAS/">R linking against an optimized BLAS implementation</a>, you should. It will give you a big performance increase.</em>)</p>

<p>Just keep all this stuff in mind while taking a look at these results.</p>

<h3 id="results">Results</h3>

<p>Here are the run time and memory profiling results:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//divnet_rs_intro/timing_425_350.svg" alt="DivNet timing and memory requirements" />
    <figcaption>DivNet timing and memory requirements</figcaption>
</figure>

<p>Let’s break down a couple of things. The Rust version is faster and more memory efficient, but that’s not surprising – a Rust program should be faster than an R program, and I spent a good amount of time profiling and optimizing the code. In this test, the Rust version is about 20 times faster than the R version.</p>

<p>The other interesting thing to measure is max memory usage. For the largest dataset that I tested (16 samples, 1280 taxa), the Rust version used ~300 MB of RAM as compared to the ~6000 MB used by the R version. When implementing DivNet in Rust, I spent a good amount of time and effort optimizing the run time, and much less worrying about the memory, so it was nice to see it being relatively frugal with the memory.</p>

<p>As you might expect, the 16 sample datasets took longer and used more memory than the 8 sample datasets, but not twice as much time and memory. There was a weird thing in the 1280 taxa test set in the Rust implementation. The 8 sample set actually took a bit more time (but still used less memory) than the 16 sample set. I thought this was strange so I actually ran the 16x1280 and 8x1280 datasets many more times to see if there was some weird random variation in the timings, or if I made some mistake in the testing and mislabeled the datasets or something, but each run gave me roughly the same result as you see here. I’m honestly not sure why this is, but like I mentioned above, these benchmarks aren’t perfect and could be improved.</p>

<h2 id="differences-in-the-implementations">Differences in the implementations</h2>

<p>Before wrapping up, I want to take a little time to highlight some of the more important differences in the R and Rust implementations of DivNet.</p>

<h3 id="estimating-the-network">Estimating the network</h3>

<p>While the original DivNet R code has multiple options for the <code class="language-plaintext highlighter-rouge">network</code> parameter, the only network option in <code class="language-plaintext highlighter-rouge">divnet-rs</code> is “diagonal”. To explain why this is, here is an excerpt from a <a href="https://github.com/adw96/DivNet/issues/32">GitHub issue</a> where <a href="https://github.com/adw96/DivNet/issues/32#issuecomment-521727997">Amy Willis is talking</a> about using DivNet on large datasets:</p>

<blockquote>
  <p>I would recommend network=”diagonal” for a dataset of this size. This means you’re allowing overdispersion (compared to a plugin aka multinomial model) but not a network structure. This isn’t just about computational expense – it’s about the reliability of the network estimates. Essentially estimating network structure on 20k variables (taxa) with 50 samples with any kind of reliability is going to be very challenging, and I don’t think that it’s worth doing here. In our simulations we basically found that overdispersion contributes the bulk of the variance to diversity estimation (i.e. overdispersion is more important than network structure), so I don’t think you are going to lose too much anyway.</p>
</blockquote>

<p>Another benefit of the diagonal network is that it is fast: it’s a simple, vectorizable mathematical operation, as compared to the <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/R/MCmat.R#L83">default method</a>, which will need to do either a Cholesky decomposition or a generalized matrix inversion, or to the <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/R/MCmat.R#L114">“stars”</a> method, which does a whole lot more operations.</p>
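
<p>To give a feel for why the diagonal case is so cheap, here is a small sketch. It is not code from DivNet or <code class="language-plaintext highlighter-rouge">divnet-rs</code>: it assumes that the “diagonal” option amounts to keeping only the per-taxon variances of a covariance matrix, in which case inverting that matrix is just an elementwise reciprocal. The function names and the tiny matrix are hypothetical.</p>

```javascript
// Sample variance of each column of a samples-by-taxa matrix.
function columnVariances(matrix) {
  const n = matrix.length;
  return matrix[0].map((_, j) => {
    const column = matrix.map((row) => row[j]);
    const mean = column.reduce((sum, x) => sum + x, 0) / n;
    const sumSquares = column.reduce((sum, x) => sum + (x - mean) ** 2, 0);
    return sumSquares / (n - 1);
  });
}

// If the covariance matrix is diagonal, its inverse is also diagonal,
// and each entry is just the reciprocal of the matching variance.
// No Cholesky decomposition or generalized inverse is needed.
function diagonalPrecision(variances) {
  return variances.map((v) => 1 / v);
}

// Made-up data: 3 samples (rows) by 2 taxa (columns).
const exampleMatrix = [
  [1, 2],
  [3, 6],
  [5, 10],
];

const variances = columnVariances(exampleMatrix);
const precision = diagonalPrecision(variances);
console.log(variances, precision);
```

<p>Each taxon is handled independently here, so the work grows linearly with the number of taxa rather than with the cost of factoring a full taxa-by-taxa matrix.</p>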

<p><code class="language-plaintext highlighter-rouge">divnet-rs</code> isn’t a replacement for DivNet. Its focus is on allowing the core algorithm to be applied to datasets that are too large for the R implementation to handle, and so, only the diagonal network is available in <code class="language-plaintext highlighter-rouge">divnet-rs</code>. If your data is small enough that the R implementation can handle it, then I recommend using the original!</p>

<h3 id="bootstrapping">Bootstrapping</h3>

<p>Another difference from the original is that only the parametric bootstrap is available – you can’t do the nonparametric bootstrap. The parametric bootstrap is the default in the R implementation, and, if you check out the <a href="https://doi.org/10.1093/biostatistics/kxaa015">DivNet manuscript</a>, you’ll see that the parametric and nonparametric bootstraps perform similarly.</p>

<h3 id="setting-the-random-seed">Setting the random seed</h3>

<p><code class="language-plaintext highlighter-rouge">divnet-rs</code> currently does not allow you to set the seed for the random number generator, which will have an impact on reproducibility across runs. While the DivNet R implementation does allow you to set the random seed prior to the run (for example, just use <code class="language-plaintext highlighter-rouge">set.seed(5623472)</code> before running the <code class="language-plaintext highlighter-rouge">divnet</code> function), there is a <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/vignettes/getting-started.Rmd#L64">caveat about setting the random seed when running DivNet on multiple cores</a> that you should be aware of. In practice, if you are getting more variability across runs than desired, you can up the EM iterations, the MC iterations, and the replicates, and it <a href="https://github.com/adw96/DivNet/blob/31e04e29e4f3c02ea07c7f35873ee6743b79170a/vignettes/getting-started.Rmd#L64">should take care of things</a>.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>In this post, I introduced <code class="language-plaintext highlighter-rouge">divnet-rs</code>, a Rust implementation of the <a href="https://github.com/adw96/DivNet">DivNet R package</a>. It is both faster and more memory efficient than the original, allowing you to run much larger datasets even on your laptop, but it has fewer features and isn’t as straightforward to use. As with any bioinformatics software, there are always tradeoffs, so I encourage you to pick the right tool for the right job: if you have small enough datasets, stick with the R implementation, but if R keeps crashing on you or DivNet is just too slow for whatever reason, give <a href="https://github.com/mooreryan/divnet-rs">divnet-rs</a> a try.</p>]]></content><author><name>Ryan Moore</name></author><category term="blog" /><summary type="html"><![CDATA[DivNet is an R package for estimating diversity when taxa occur in an ecological network. Here I talk about why you may want to use DivNet, introduce my Rust implementation of the algorithm, and compare its performance to the original R package.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/divnet_rs_intro/timing_425_350.png" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/divnet_rs_intro/timing_425_350.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">A simple dashboard for COVID-19 case counts</title><link href="https://www.tenderisthebyte.com/blog/2020/12/30/covid-19-dashboard/" rel="alternate" type="text/html" title="A simple dashboard for COVID-19 case counts" /><published>2020-12-30T00:00:00+00:00</published><updated>2020-12-30T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2020/12/30/covid-19-dashboard</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2020/12/30/covid-19-dashboard/"><![CDATA[<p>I made a simple <a 
href="https://www.tenderisthebyte.com/apps/covid19dashboard">COVID-19 dashboard</a> that lets you compare the confirmed case counts for multiple counties as well as viewing the raw counts and the counts per 100,000 people.  It plots the case counts over time for as many counties as you want to compare and lets you download the resulting chart.  Here is an example for Delaware’s three counties:</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//covid_19_dashboard/delaware_covid_chart.svg" alt="Confirmed COVID-19 Cases for Delaware Counties" />
    <figcaption>Confirmed COVID-19 Cases for Delaware Counties</figcaption>
</figure>

<p>Being a Delaware resident, I like to pretend everyone already knows everything about Delaware, but <em>just in case</em> you don’t, here you go:  New Castle county is in the north and has Wilmington (our largest city) and Newark, home of the University of Delaware.  Kent county is in the middle and has Dover (the state capital), and Sussex county is in the south with Lewes and all the beaches.  It’s interesting to see the differences between New Castle and Kent counties, which look pretty similar to one another, and Sussex county.  At some point, I would like to overlay some demographic or socio-economic data on this to look for any trends, but that’s for a different day.</p>

<h2 id="the-data">The data</h2>

<p>The COVID-19 case data is from the <a href="https://github.com/CSSEGISandData/COVID-19">Center for Systems Science and Engineering (CSSE) at Johns Hopkins University</a>.  Their data is aggregated from a ton of different sources and I encourage you to check out <a href="https://github.com/CSSEGISandData/COVID-19">their GitHub page</a> for more information about the data.  If you’re interested, they have <a href="https://doi.org/10.1016/S1473-3099(20)30120-1">an article</a> in the Lancet talking about the data and <a href="https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6">their dashboard</a>.  Of course, their dashboard has a lot more bells and whistles than mine!</p>

<p>For the county level population info, I used data from the <a href="https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/">Atlas of Rural and Small-Town America</a> from the <a href="https://www.ers.usda.gov/">USDA Economic Research Service</a>.  It is a really cool and in-depth county level dataset.  In addition to the population data, you can find info about jobs, income, veterans and more.  They also have a <a href="https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/go-to-the-atlas/">nice interactive map</a> to view everything county-by-county.  If you want to download and remix the data yourself, it is all available in CSV and Excel format <a href="https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/download-the-data/">on their site</a>.</p>

<p>One thing to note is that the county level population data is mostly from 2019 estimates.  So, while weighting the case counts by the population data gives a nice way to compare COVID-19 cases across counties, just keep in mind that the population estimates are from last year.</p>
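
<p>For reference, the per 100,000 people scaling is a one-liner. This is just a sketch in JavaScript rather than the dashboard’s Elm code, and the function name and the numbers in the example are hypothetical.</p>

```javascript
// Scale a raw case count to cases per 100,000 people.
// Multiplying before dividing keeps whole-number inputs exact
// for as long as possible.
function casesPer100k(cases, population) {
  if (!(population > 0)) {
    throw new RangeError("population must be a positive number");
  }
  return (cases * 100000) / population;
}

// Made-up example: 250 confirmed cases in a county of 50,000 people.
const exampleRate = casesPer100k(250, 50000);
console.log(exampleRate);
```

<p>Two counties with the same raw count can end up with very different rates, which is why the dashboard offers both views.</p>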

<h2 id="the-code">The code</h2>

<p>If you’re interested in the source code for the dashboard, you can find it on my <a href="https://github.com/mooreryan/Covid19Dashboard">GitHub page</a>.</p>

<p>It is an <a href="https://elm-lang.org/">Elm app</a>.  I hadn’t used Elm much before this project, but it was very easy to get started with.  The <a href="https://guide.elm-lang.org/">documentation</a> was awesome and the <a href="https://elmlang.herokuapp.com/">Elm Slack channel</a> is full of helpful people.  I think having some experience in <a href="https://www.rust-lang.org/">Rust</a> and <a href="https://clojure.org/">Clojure</a> helped me feel right at home using Elm.  Elm seems a bit like a gateway to <a href="https://github.com/alpacaaa/elm-to-purescript-cheatsheet">PureScript</a> or <a href="https://www.reddit.com/r/haskell/comments/6wbzer/elm_as_a_gateway_to_learn_haskell/">Haskell</a>, so I’m thinking of checking those out as well.</p>

<p>The charts are made with <a href="https://vega.github.io/vega-lite/">Vega-Lite</a>, a nice tool for data visualization based on <a href="https://vega.github.io/vega/">Vega</a> and the <a href="https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html">Grammar of Graphics</a>.  It’s <a href="https://en.wikipedia.org/wiki/Declarative_programming">declarative</a>, in that you write <a href="https://www.json.org/json-en.html">JSON</a> specifications, Vega-Lite compiles the spec to Vega, and Vega’s runtime handles rendering the chart.  To generate the Vega-Lite specs, I used <a href="https://package.elm-lang.org/packages/gicentre/elm-vegalite/latest/VegaLite">this Elm package</a> in conjunction with Elm <a href="https://guide.elm-lang.org/interop/ports.html">ports</a>.</p>]]></content><author><name>Ryan Moore</name></author><category term="blog" /><summary type="html"><![CDATA[Introducing the COVID-19 dashboard I made to compare case counts across U.S. counties.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/covid_19_dashboard/delaware_covid_chart.png" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/covid_19_dashboard/delaware_covid_chart.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">COVID-19 Dashboard</title><link href="https://www.tenderisthebyte.com/app/2020/12/28/covid-19-dashboard/" rel="alternate" type="text/html" title="COVID-19 Dashboard" /><published>2020-12-28T00:00:00+00:00</published><updated>2020-12-28T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/app/2020/12/28/covid-19-dashboard</id><content type="html" xml:base="https://www.tenderisthebyte.com/app/2020/12/28/covid-19-dashboard/"><![CDATA[<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />

    <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/base-min.css" />
    <link rel="stylesheet" href="https://unpkg.com/purecss@2.0.3/build/pure-min.css" />
    <link rel="stylesheet" href="https://unpkg.com/purecss@2.0.3/build/grids-responsive-min.css" />


    <script src="https://cdn.jsdelivr.net/npm/vega@5.17.0"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-lite@4.17.0"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-embed@6.12.2"></script>

    <style>
        body {
            margin-left: 1em;
            color: #333333;
        }
        
        html,
        button,
        input,
        select,
        textarea,
        .pure-g [class*="pure-u"] {
            /* Set your content font stack here: */
            font-family: Verdana, Arial, sans-serif;
        }
        
        .button-margin {
            margin-top: 1em;
            margin-right: 0.5em;
            margin-bottom: 1em;
        }
        
        .small-font {
            font-size: 0.75em;
        }
        
        .gray-font {
            color: #666666
        }
        
        .push-down {
            margin-top: 0.67em;
        }
            
        .extra-bottom-margin {
            margin-bottom: 2em;
        }
    </style>

    <script src="/assets/js/CovidDashboard.min.js"></script>

    <!-- Favicon
    –––––––––––––––––––––––––––––––––––––––––––––––––– -->
    <link rel="apple-touch-icon-precomposed" sizes="57x57" href=" /assets/img/favicon/apple-touch-icon-57x57.png " />
    <link rel="apple-touch-icon-precomposed" sizes="114x114" href=" /assets/img/favicon/apple-touch-icon-114x114.png " />
    <link rel="apple-touch-icon-precomposed" sizes="72x72" href=" /assets/img/favicon/apple-touch-icon-72x72.png " />
    <link rel="apple-touch-icon-precomposed" sizes="144x144" href=" /assets/img/favicon/apple-touch-icon-144x144.png " />
    <link rel="apple-touch-icon-precomposed" sizes="120x120" href=" /assets/img/favicon/apple-touch-icon-120x120.png " />
    <link rel="apple-touch-icon-precomposed" sizes="152x152" href=" /assets/img/favicon/apple-touch-icon-152x152.png " />
    <link rel="icon" type="image/png" href=" /assets/img/favicon/favicon-32x32.png " sizes="32x32" />
    <link rel="icon" type="image/png" href=" /assets/img/favicon/favicon-16x16.png " sizes="16x16" />
    <meta name="application-name" content="Tender Is The Byte" />
    <meta name="msapplication-TileColor" content="#FFFFFF" />
    <meta name="msapplication-TileImage" content=" /assets/img/favicon/mstile-144x144.png " /> <!-- Begin Jekyll SEO tag v2.8.0 -->
<title>COVID-19 Dashboard | Tender Is The Byte</title>
<meta name="generator" content="Jekyll v4.4.1" />
<meta property="og:title" content="COVID-19 Dashboard" />
<meta name="author" content="Ryan Moore" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="Track and compare COVID-19 confirmed cases across US counties." />
<meta property="og:description" content="Track and compare COVID-19 confirmed cases across US counties." />
<link rel="canonical" href="https://www.tenderisthebyte.com/app/2020/12/28/covid-19-dashboard/" />
<meta property="og:url" content="https://www.tenderisthebyte.com/app/2020/12/28/covid-19-dashboard/" />
<meta property="og:site_name" content="Tender Is The Byte" />
<meta property="og:image" content="https://www.tenderisthebyte.com/assets/img/apps/delaware_covid_chart.png" />
<meta property="og:type" content="article" />
<meta property="article:published_time" content="2020-12-28T00:00:00+00:00" />
<meta name="twitter:card" content="summary" />
<meta property="twitter:image" content="https://www.tenderisthebyte.com/assets/img/apps/delaware_covid_chart.png" />
<meta property="twitter:title" content="COVID-19 Dashboard" />
<meta name="twitter:site" content="@TenderIsTheByte" />
<meta name="twitter:creator" content="@TenderIsTheByte" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"Ryan Moore"},"dateModified":"2020-12-28T00:00:00+00:00","datePublished":"2020-12-28T00:00:00+00:00","description":"Track and compare COVID-19 confirmed cases across US counties.","headline":"COVID-19 Dashboard","image":"https://www.tenderisthebyte.com/assets/img/apps/delaware_covid_chart.png","mainEntityOfPage":{"@type":"WebPage","@id":"https://www.tenderisthebyte.com/app/2020/12/28/covid-19-dashboard/"},"publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://www.tenderisthebyte.com/assets/img/favicon/favicon-196x196.png"},"name":"Ryan Moore"},"url":"https://www.tenderisthebyte.com/app/2020/12/28/covid-19-dashboard/"}</script>
<!-- End Jekyll SEO tag -->

</head>

<body>
    <div id="app"></div>
    <script>
        var hasLocalStorage = true;
        var ls, storedData, startingData;
        try {
            ls = window.localStorage;
            storedData = ls.getItem("data");
            startingData = storedData ? JSON.parse(storedData) : null;
        } catch (e) {
            console.warn("could not use localStorage!");
            hasLocalStorage = false;
            startingData = null;
        }

        var app = Elm.Main.init({
            node: document.getElementById("app"),
            flags: {
                hasLocalStorage: hasLocalStorage,
                startingData: startingData,
                windowWidth: window.innerWidth,
                windowHeight: window.innerHeight
            },
        });

        var requestAnimationFrame =
            window.requestAnimationFrame ||
            window.mozRequestAnimationFrame ||
            window.webkitRequestAnimationFrame ||
            window.msRequestAnimationFrame;

        let updateChart = function(spec) {
            requestAnimationFrame(function() {
                // Only embed the chart if the case-count-chart element exists.
                if (document.getElementById("case-count-chart")) {
                    vegaEmbed("#case-count-chart", spec, {
                        actions: {
                            export: true,
                            source: false,
                            compiled: false,
                            editor: false
                        },
                        renderer: "canvas"
                    }).catch(
                        console.warn
                    );

                }
            });
        };

        app.ports.sendToVegaLite.subscribe(updateChart);

        if (hasLocalStorage) {
            app.ports.storeData.subscribe(function(data) {
                if (data.length > 0) {
                    var dataJson = JSON.stringify(data);
                    try {
                        ls.setItem("data", dataJson);
                    } catch (e) {
                        console.warn("could not save data to localStorage!");
                    }
                }
            });
        }
    </script>
</body>

</html>]]></content><author><name>Ryan Moore</name><email>moorer@udel.edu</email></author><category term="app" /><summary type="html"><![CDATA[Track and compare COVID-19 confirmed cases across US counties.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/apps/delaware_covid_chart.png" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/apps/delaware_covid_chart.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Virome Bytes: Microdiversity of Mediterranean Sea Viruses</title><link href="https://www.tenderisthebyte.com/blog/2020/02/29/virome-bytes-mediterranean-sea-virus-microdiversity/" rel="alternate" type="text/html" title="Virome Bytes: Microdiversity of Mediterranean Sea Viruses" /><published>2020-02-29T00:00:00+00:00</published><updated>2020-02-29T00:00:00+00:00</updated><id>https://www.tenderisthebyte.com/blog/2020/02/29/virome-bytes-mediterranean-sea-virus-microdiversity</id><content type="html" xml:base="https://www.tenderisthebyte.com/blog/2020/02/29/virome-bytes-mediterranean-sea-virus-microdiversity/"><![CDATA[<h2 id="virus-microdiversity">Virus microdiversity</h2>

<p>Marine viruses are probably the most well-characterized group of environmental viruses.  <a href="https://doi.org/10.1038/nature04160">The oceans were one of the first ecosystems where the abundance and importance of environmental viruses were truly realized</a>, and the relative ease of collecting viruses from seawater (as compared to, say, soils) has helped further their study in this environment.  However, even within marine habitats, there’s still a lot that we don’t know about viruses and their ecology.</p>

<p>The microdiversity of viruses is a relatively new area of study in environmental viral ecology.  Microdiversity, here, refers to mutation frequencies in genomes within the same population.  It accompanies trends like the <a href="https://doi.org/10.1038/ismej.2017.119">shift from OTUs to ASVs</a> in focusing on smaller differences in environmental DNA sequences.  In a paper entitled <a href="https://doi.org/10.1128/mSystems.00554-19">Trends of Microdiversity Reveal Depth-Dependent Evolutionary Strategies of Viruses in the Mediterranean</a>, Felipe Coutinho and colleagues use microdiversity to study the selective pressures exerted on viral genomes at different depths in the ocean and Mediterranean Sea.</p>

<p>Coutinho et al. examined four viral shotgun metagenomes (viromes) sampled from the surface, the <a href="https://en.wikipedia.org/wiki/Deep_chlorophyll_maximum">deep chlorophyll maximum</a> (DCM), and the <a href="https://en.wikipedia.org/wiki/Bathyal_zone">bathypelagic</a>.  To increase their sample size, the researchers supplemented their own samples with viromes from the <a href="https://oceans.taraexpeditions.org/en/m/about-tara/les-expeditions/tara-oceans/"><em>Tara</em> Oceans expedition</a> and <a href="http://aco-ssds.soest.hawaii.edu/ALOHA/">Station ALOHA</a>, which were also sampled over multiple depths.  Microdiversity was measured using pN/pS ratios, which are similar to dN/dS ratios and are calculated as the ratio of the number of nonsynonymous polymorphisms per nonsynonymous site to the number of synonymous polymorphisms per synonymous site.</p>
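
<p>As a rough illustration of that definition, the ratio can be sketched in a few lines of JavaScript.  This is not the pipeline used in the paper; the function name, the field names, and the counts are hypothetical and only show how the four quantities combine.</p>

```javascript
// pN/pS for one gene: nonsynonymous polymorphisms per nonsynonymous
// site, divided by synonymous polymorphisms per synonymous site.
// The ratio is undefined when there are no synonymous polymorphisms,
// so this sketch returns null in that case.
function pnps({ nonsynPolymorphisms, nonsynSites, synPolymorphisms, synSites }) {
  if (synPolymorphisms === 0) {
    return null;
  }
  const pN = nonsynPolymorphisms / nonsynSites;
  const pS = synPolymorphisms / synSites;
  return pN / pS;
}

// Made-up counts for a single gene.
const exampleRatio = pnps({
  nonsynPolymorphisms: 6,
  nonsynSites: 300,
  synPolymorphisms: 4,
  synSites: 100,
});
console.log(exampleRatio);
```

<p>Normalizing by the number of sites matters because a typical coding sequence has more nonsynonymous sites than synonymous ones, so raw polymorphism counts are not directly comparable.</p>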

<h2 id="different-depths-different-selective-pressures">Different depths, different selective pressures</h2>

<p>The authors concluded that marine viruses at different depths show signs of being under different primary selection pressures.</p>

<figure class="figure figure--center figure--border">
    <img src="/assets/img/posts//mediterranean_virus_microdiversity/microdiversity_cartoon.jpg" alt="The authors' model of the observed patterns of microdiversity" />
    <figcaption>The authors' model of the observed patterns of microdiversity</figcaption>
</figure>

<p>In the deep ocean, <a href="https://doi.org/10.1126/sciadv.1602565">where cells and viruses are found in lower numbers</a>, viral metabolism proteins are under the greatest selection pressure.  This is presumably to help increase traits such as burst size that would maximize the number of viral progeny produced, thereby increasing the likelihood that one of those phages encounters a suitable host.</p>

<p>In the DCM, viruses accumulate mutations in genes used for host recognition, so that they can expand their host range to compete with other phages.  This is necessary because while phage populations in the DCM are large, this study found them to be highly clonal (low diversity).  Having lots of copies of the same phage would presumably make competition for hosts intense and encourage host switching.</p>

<p>Viruses from the surface samples had, on average, the greatest number of mutations, but the lowest rates of microdiversity.  The high rate of mutation was attributed to high levels of UV radiation in surface waters.  The low rate of microdiversity may be due to relatively high viral counts combined with intermediate diversity.  This would result in lower rates of competition for host cells and less need to increase traits like burst size, which may be more important in low cell count environments.</p>

<p>Overall, this is an interesting study that used environmental gradients to examine specific factors driving viral ecology and evolution in the natural environment.</p>

<p class="gray"><em>Citation: Coutinho, FH. et al.  Trends of Microdiversity Reveal Depth-Dependent Evolutionary Strategies of Viruses in the Mediterranean.  mSystems 4 (6) e00554-19 (2019). <a href="https://doi.org/10.1128/mSystems.00554-19">doi: 10.1128/mSystems.00554-19</a>.</em></p>]]></content><author><name>Amelia Harrision</name></author><category term="blog" /><summary type="html"><![CDATA[In today's edition of Virome Bytes, Amelia Harrison discusses a paper looking at the microdiversity of Mediterranean Sea viruses!]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.tenderisthebyte.com/assets/img/posts/mediterranean_virus_microdiversity/microdiversity_cartoon_small.jpg" /><media:content medium="image" url="https://www.tenderisthebyte.com/assets/img/posts/mediterranean_virus_microdiversity/microdiversity_cartoon_small.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>