Automating Static Website Deployment

Table of Contents

Setting up GitHub Pages
Setting up the private, code repository
Configuring the GitHub Action for publishing

In this post I am going to document the steps I took to implement a fully automated deployment of my blog using GitHub Actions and GitHub Pages.

As always, I started my journey with the definition of what I really wanted to get at the end:

The website is published on GitHub Pages

Since the website is static and all of its content can easily be downloaded using a web crawler (like wget --mirror https://website.tld), I was OK with exposing the structure in a public repository, which is what GitHub offers on the free plan.

The code to generate the website should be private

I do a lot of work on the SSG itself (Pelican, in my case): I extend it with plug-ins that may contain API tokens for third-party APIs, hack the core code when I want to quickly test something, and so on. So I really had no desire to publish all the commotion going on in the background (sometimes I make more than a hundred commits per day just to experiment with different ideas).

There should be a valid history of changes in both repositories

Well, I would get the history in my private repository for free, since that is the core value of keeping a repository in a VCS, but I also wanted to have a clean history of changes to the content I publish publicly.

It would be a pleasant bonus if the changes in the public repository could refer back to the corresponding commit in the private repository.

One may say that to do what I set out to do I would need to subscribe for a paid account with GitHub since according to their help page GitHub Pages for private repositories are only available on the paid plans.

However, as I pointed out above, it does not make sense to hide the content of the actual static website, so all I needed was a way to “publish” the resulting artefact to the GitHub Pages repository and, preferably, to have that “publishing” happen on GitHub’s side.

Luckily for me, GitHub started to support GitHub Actions on the free plan some time ago and as long as it is not abused according to their terms and conditions, it is a perfect vehicle for what I am trying to do, in my opinion.

Setting up GitHub Pages

There are multiple howtos and tutorials on the Internet regarding how to set GitHub Pages up, including the official help section on this topic, so I will only elaborate on details where I did something specific for the purposes of achieving my goals.

There are different types of GitHub Pages:

user or organisation
per-project

The difference between the two is subtle (the former requires a dedicated repository for your website, while the latter allows you to keep it in a branch of an existing repository), but for the purposes of this article I am assuming that we are working with the user-level GitHub Pages, which reside in the repository named “<username>.github.io” (where <username> is your GitHub user name), as per the official documentation.

A few caveats I found and spent some time solving after following the official documentation are listed below:

GitHub’s documentation assumes use of Jekyll for the site generation.

It is not obvious how to use a different SSG (like Pelican). As far as I understand, there are multiple triggers for GitHub to consider the web site to be in a “published” state, so just ignore any references to Jekyll in the documentation: you will trip one of the triggers sooner or later, for example by pushing HTML files into your repository.

Configure your DNS before setting the custom domain name in GitHub Pages.

Pushing the CNAME file with the name of your custom domain inside will trigger a DNS check from GitHub to see that your custom domain name is pointing back to GitHub Pages.

DNS relies heavily on caching, so depending on the TTL settings in your zone, a failed check (that is, when GitHub fails to retrieve the corresponding record) will likely leave you waiting for quite a while before GitHub retries.

Setting up the CNAME record in advance and verifying it with a query before you commit the CNAME file to your repository ensures that you get the quickest validation response from GitHub. For example, I set up my CNAME record and then verified it from the command line before submitting the request to GitHub:
```
[user@localhost ~]$ host -t cname dmitry.khlebnikov.net 8.8.8.8
Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases:

dmitry.khlebnikov.net is an alias for galaxy4public.github.io.
```
There are some shenanigans with the “Enforce HTTPS” option.

It is not obvious from the documentation, but the enforcement of HTTPS for custom domains on GitHub’s side depends on several things:
- before the checkbox is enabled, your custom domain name should be confirmed by GitHub (your CNAME file is in place and the repository settings show that the name was recognised);
- the CNAME record should point to your “<username>.github.io.” DNS record (or you can use A records pointing directly to the GitHub Pages IP addresses if you want to conceal the repository name in the DNS output);
- if GitHub did not like something and you adjusted anything in the above points, the only way to trigger the enforcement of HTTPS is to re-submit the CNAME file to the repository (yes, you read that right: you need to delete the file and push it to the repository again);
- removing the CNAME file from the repository is a disruptive action: the site will not be accessible while the file is missing.
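
Going back to the DNS caveat above, the pre-check can be wrapped in a small helper so it is easy to repeat until the record is visible. This is only a sketch: the function names cname_target and check_cname are hypothetical, and it assumes the host utility prints an “is an alias for” line in the same format as in the session shown earlier.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `host -t cname` output on stdin and print the
# alias target without its trailing dot. Prints nothing if there is no alias.
cname_target() {
    sed -nE 's/^.* is an alias for ([^[:space:]]+)\.$/\1/p'
}

# Hypothetical wrapper: succeed only if DOMAIN is an alias for EXPECTED
# according to RESOLVER (8.8.8.8 by default, as in the session above).
# Needs the `host` utility and network access.
check_cname() {
    local domain="$1" expected="$2" resolver="${3:-8.8.8.8}"
    local actual
    actual=$(host -t cname "$domain" "$resolver" | cname_target)
    [ "$actual" = "$expected" ]
}
```

For my setup the call would look like check_cname dmitry.khlebnikov.net galaxy4public.github.io && echo "DNS is ready"; only once it succeeds would I push the CNAME file.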

OK, you have your public repository configured the way you want, so let’s look at the settings we need to be able to publish our code to this public repository.

When I try to automate something, I usually start with writing down manual steps I would do to achieve the results. This helps me to see patterns and to understand what I can easily automate and what will require some brain-storming to resolve.

In the case of updating the repository it is quite trivial: if I were to push updates manually, all I need is a private SSH key with the corresponding public SSH key configured with write privileges for the repository and I could push with git push from my local copy of the repository.

My private keys are called “private” for a reason: they are not supposed to leave my device(s) under any circumstances (except for backup purposes, such as storing them in a safe). So please pay attention when you read or hear somebody advising you to upload your private keys somewhere: it is usually bad advice.

For integration purposes, GitHub provides so-called “Deploy keys” and “Personal access tokens”. The former is just an SSH key pair associated with a particular repository (you can configure it in the repository’s settings), while the latter is an OAuth access token associated with your account.

While you can successfully use both, I would recommend using “Deploy keys” only: although you can try to scope down the access of a personal token, it would not be narrow enough, and the actions performed using that token will look as if you executed them yourself.

To configure a “Deploy key” we need to do two things:

Generate an SSH key pair, e.g.:

```
[user@localhost ~]$ ssh-keygen -t ed25519 -N '' -C 'Updating the blog from GH Action' -f ~/gh-action
Generating public/private ed25519 key pair.
Your identification has been saved in /home/user/gh-action
Your public key has been saved in /home/user/gh-action.pub
The key fingerprint is:
SHA256:7W9pUV5IlrRVE0RAVkjLgKJz4RdtVbH7GKQu8AfYITw Updating the blog from GH Action
The key's randomart image is:
+--[ED25519 256]--+
|          o.+BOX*|
|       + o o+.=oo|
|      o E +  =oo |
|     o o B . oo o|
|      o S + .o.o |
|         + o. .o.|
|          + oo. .|
|           ++    |
|           o.    |
+----[SHA256]-----+
```

Here, I chose the ed25519 key type since it is the shortest of the key types GitHub supports at the moment, yet it is strong enough.

I also made the key pair passphrase-less (-N '') since the purpose of the key pair is to automate things in an unattended fashion and there will be nobody to type in the passphrase.

The key pair comment just makes it easier to maintain your keys, but is optional.

Finally, the -f ~/gh-action option specifies where the generated private key is going to be stored. The public counterpart will use the same path with the .pub suffix appended to it.

Set the newly generated public key up as the “Deploy key”:

All you need to do is to go to the repository settings page for the public repository you created for GitHub Pages, click on “Deploy keys” in the left side menu, then click on the “Add deploy key” button in the upper right corner.

On the next page, provide a sensible description for the deploy key (I used the same text as I put into the key’s comment, i.e. “Updating the blog from GH Action”) and copy and paste the recently generated public key. GitHub does not allow you to upload files there, so you need to copy the content of the public key file and paste it into the form, e.g.:
```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGHCy+stVCBjsrVO2ld1DwKCwcKL9+i1sjxcZu4u4lFQ Updating the blog from GH Action
```
NOTE: You need to ensure that you tick the “Allow write access” checkbox, otherwise it would not be possible to push to the repository with the corresponding private key.
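
If you expect to repeat the key generation (for example, when rotating the deploy key), the two steps above could be wrapped roughly as follows. This is a sketch under assumptions: the function names make_deploy_key and show_deploy_key are hypothetical, and the only ssh-keygen options used are the ones from the session above plus -q (quiet) and -l (print the fingerprint).

```shell
#!/usr/bin/env bash

# Hypothetical helper: create a passphrase-less ed25519 key pair at KEYFILE
# with the given COMMENT, refusing to overwrite an existing private key.
make_deploy_key() {
    local keyfile="$1" comment="$2"
    if [ -e "$keyfile" ]; then
        echo "refusing to overwrite $keyfile" >&2
        return 1
    fi
    ssh-keygen -q -t ed25519 -N '' -C "$comment" -f "$keyfile"
}

# Hypothetical helper: print the fingerprint of the public key, followed by
# the single line that has to be pasted into the "Add deploy key" form.
show_deploy_key() {
    local keyfile="$1"
    ssh-keygen -l -f "$keyfile.pub"
    cat "$keyfile.pub"
}
```

A typical call would be make_deploy_key ~/gh-action 'Updating the blog from GH Action' && show_deploy_key ~/gh-action. The overwrite guard is my own addition: silently replacing a key that is already installed on GitHub would break the deployment.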

This, actually, concludes the configuration of the GitHub Pages repository for now – in later articles I will document how one could leverage the repository Issues for managing comments on the web site and maintain the counters for likes on the pages, but it would be a completely separate post :).

Setting up the private, code repository

A typical Pelican repository layout is quite simple and comprises one mandatory directory, one semi-mandatory file, and everything else is optional, but could be used to enhance your experience.

The mandatory directory is the so-called “content” directory (in Pelican’s terms). The name of the directory can be anything you want, but it had better be reflected in the PATH setting of the settings file.

I am saying “had better be” since Pelican can operate without any configuration files, but the result will be limited, hence I call the pelicanconf.py file (which is the default name for the configuration file) “semi-mandatory”. The name of the configuration file can also be anything you like; however, I suggest sticking with the default for now.

Basically, you can quickly start by following the Pelican documentation and doing something as follows:

```
[user@localhost ~]$ virtualenv ~/venv/pelican
created virtual environment CPython3.8.2.final.0-64 in 477ms
  creator CPython3Posix(dest=/home/user/venv/pelican, clear=False, global=False)
  seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/home/user/.local/share/virtualenv/seed-app-data/v1.0.1)
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
[user@localhost ~]$ . ~/venv/pelican/bin/activate
(pelican) $ mkdir ~/blog
(pelican) $ cd ~/blog
(pelican) $ git init
Initialized empty Git repository in /home/user/blog/.git/
(pelican) $ git config --local user.email "your@github-email.here"
(pelican) $ git config --local user.name "Joe Happy"
(pelican) $ pelican-quickstart
Welcome to pelican-quickstart v4.2.0.

This script will help you create a new Pelican-based website.

Please answer the following questions so this script can generate the files
needed by Pelican.


> Where do you want to create your new web site? [.]
> What will be the title of this web site? My Awesome Blog
> Who will be the author of this web site? Joe Happy
> What will be the default language of this web site? [en]
> Do you want to specify a URL prefix? e.g., https://example.com   (Y/n) n
> Do you want to enable article pagination? (Y/n)
> How many articles per page do you want? [10]
> What is your time zone? [Europe/Paris] Australia/Melbourne
> Do you want to generate a tasks.py/Makefile to automate generation and publishing? (Y/n) n
Done. Your new project is available at /home/user/blog
(pelican) $ ls -l
total 16
drwxr-xr-x 2   user   user 4096 May 10 00:31 content
drwxr-xr-x 2   user   user 4096 May 10 00:31 output
-rw-r--r-- 1   user   user  869 May 10 00:31 pelicanconf.py
-rw-r--r-- 1   user   user  589 May 10 00:31 publishconf.py
(pelican) $ rm -rf output publishconf.py
(pelican) $ git add content pelicanconf.py
(pelican) $ git commit -m 'Initial commit'
[master (root-commit) f077002] Initial commit
 1 file changed, 35 insertions(+)
 create mode 100644 pelicanconf.py
```

A short break down of the above session snippet is:

On line 1 we create a virtual Python environment, so we could install Pelican locally;
We enter the newly created virtual environment on line 2, which makes Pelican available to us;
We create an empty repository (~/blog) and initialise it using Pelican’s quickstart;
Since we are not using the default publishing capabilities and we are not interested in storing the generated pages in our code repository, we clean things up a bit;
Finally, we commit the generated skeleton to Git.

A good test at this stage would be to ensure that Pelican is working and likes our structure:

```
[user@localhost ~]$ pelican
WARNING: No valid files found in content for the active readers:
  | BaseReader (static)
  | HTMLReader (htm, html)
  | MarkdownReader (md, markdown, mkd, mdown)
  | RstReader (rst)
Done: Processed 0 articles, 0 drafts, 0 pages, 0 hidden pages and 0 draft pages in 0.07 seconds.
```

So far so good, but it is not a real test, since there are no source files to generate something from, so let’s give Pelican something to work on:

```
[user@localhost ~]$ printf 'Title: First Post\nDate: 2020-05-10\n\n#First post\nPelican is awesome!' > content/first.md
[user@localhost ~]$ pelican
Done: Processed 1 article, 0 drafts, 0 pages, 0 hidden pages and 0 draft pages in 0.12 seconds.
[user@localhost ~]$ elinks -dump output/index.html
                               [1]My Awesome Blog

     • [2]misc

                                 [3]First Post

   Published: Sun 10 May 2020

    By [4]Joe Happy

   In [5]misc.

                                   First post

   Pelican is awesome!
```

The output from elinks was truncated on purpose since I just wanted to showcase that Pelican has indeed generated the structure for a static website from just one article file we created.

Before we push our local repository to GitHub we may want to do some housekeeping first, e.g. create the .gitignore file and list the temporary things we do not want Git to track. A good enough version of the .gitignore file I am using for my code repository is the following:

```
*~
*.pyc
.*.swp

**/__pycache__

/output
```

Do not forget to actually commit that .gitignore file to your local repository with git add .gitignore && git commit -m 'Added .gitignore', by the way.
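
To double-check that the patterns do what you expect, you can ask Git directly with git check-ignore, which exits with 0 for an ignored path and 1 otherwise. The wrapper below is just a sketch; the name report_ignored is hypothetical, and it has to be run from inside the repository.

```shell
#!/usr/bin/env bash

# Hypothetical helper: for each path given, print whether the ignore rules
# of the current repository match it ("ignored") or not ("kept").
report_ignored() {
    local path
    for path in "$@"; do
        if git check-ignore -q -- "$path"; then
            echo "ignored: $path"
        else
            echo "kept: $path"
        fi
    done
}
```

With the .gitignore above, report_ignored output/index.html content/first.md should report the first path as ignored and the second as kept.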

Now, we need to create a private repository on GitHub, so jump into your browser, go to your GitHub account, press the “+” icon in the upper right corner (right next to your profile icon), and select “New repository”.

On the “Create repository” page put whatever you desire as the name and the description of the repository you are about to create. Ensure that the “Private” radio button is selected and uncheck the “Initialize this repository with a README” if it was checked.

Once the repository is created, you will be presented with a page that enumerates your options for the next step, but I will just go ahead and show a session dump of what you will need to do. In the following session snippet blog is the repository name I chose for my private code repository and you will need to replace it with your private repository name (the working directory is our newly created local repository):

```
[user@localhost ~]$ git remote add origin git@github.com:galaxy4public/blog.git
[user@localhost ~]$ git push -u origin master
Enter passphrase for key '/home/user/.ssh/keys/github':
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), 1008 bytes | 1008.00 KiB/s, done.
Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:galaxy4public/blog.git
 * [new branch]      master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
```

Do you remember how we generated a deploy key pair earlier and installed the public key part into the public blog repository, so GitHub would allow the bearer of the private key to authenticate and deploy changes to the public blog repository? Well, since the purpose of this article is to introduce the full automation, the bearer of the key would be the GitHub Action associated with the private repository, hence we need to provide the action with the private key somehow.

GitHub has a feature called “repository secrets” and it is a perfect candidate to pass the private key to the GitHub Action. We need to follow the official documentation for the feature and create a secret called “DEPLOY_KEY” with the content of the private part of the deploy key. This will be used in the last step of the GitHub Action we are about to define.

Configuring the GitHub Action for publishing

Everything is well and good, but “where is the automation?” you may ask. After all, I suspect this was the primary reason you are reading this post. Well, we are about to start to look into the automation part and it is rather short in comparison to all the steps we did to set repositories up.

Our automation relies on the GitHub Action feature of GitHub. In plain terms, GitHub Action is a free compute resource provided by GitHub (there are some limits, but for the purposes of a personal blog it is unlikely that you will ever hit these limits).

Each GitHub Action is associated with a specific repository and is defined using quite a simple YAML configuration file which instructs GitHub on how to provision a required compute environment and what to run inside that environment. The YAML file can be arbitrarily named and resides in the .github/workflows/ subdirectory (starting from the root of the corresponding repository).

The GitHub Action I am using for my blog web site is stored in .github/workflows/pelican.yml and contains the following (we will dissect it further down the post):

```
name: Static Website Generator

on:
  push:
    branches: [ master ]

env:
  LANG: en_AU.UTF-8

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - name: Initialise locale
      run: |
        if [ "$LANG" != 'C' -a "${LANG:0:2}" != 'C.' ]; then
          CP="${LANG#*.}" && [ -z "$CP" ] && CP=UTF-8 ||:
          sudo sed -i -E "/^\s*$LANG(\s|\$)/{:a;n;ba;q};\$a$LANG $CP" /etc/locale.gen
          sudo locale-gen
          sudo localectl set-locale LANG="${LANG:-C.UTF-8}"
        fi
        locale -a
    - name: Checkout the primary repo
      uses: actions/checkout@v2
      with:
        fetch-depth: 0
        submodules: recursive
    - name: Restore modification times for content
      run: |
        git log --pretty=tformat:"%at" --name-status --no-merges -- \
            content \
            themes/mind-drops/content \
        | sed -nE '
            /^\s*$/d;/^[[:digit:]]+$/{h;d};
            /^[UXB]/d;
            /^[AMT]/{s,^\S\s+,,;G;s,^(.+)\n(.+),\2 \1,;p};
            /^[DR]/{
              s,^(D|[CR][[:digit:]]+)\s+,\1 ,;G;
              s,^(\S+) ([^[=\t=]]+[[=\t=]])?(.*)\n(.+),\4\1 \3,;
              p
            }' \
        | LC_ALL=C sort -k2 -k1rn | uniq -f1 \
        | sed -E '/^[[:digit:]]+D /d' \
        | while read TSTAMP FILE; do
            if [ -f "$FILE" ]; then
                echo "$FILE => $(date -d @$TSTAMP)"
                touch -m -d "@$TSTAMP" -- "$FILE"
            fi
        done
    - name: Checkout Pages repo
      uses: actions/checkout@v2
      with:
        repository: galaxy4public/galaxy4public.github.io
        path: output
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: 3.x
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Generate the website
      run: |
        ls -laR content/
        rm -rf output/*
        ls -la output/
        TIMEZONE=$(sed -nE 's|^\s*TIMEZONE\s*=\s*['"'"'"]([^'"'"'"]+)['"'"'"].*|\1|;T;p' pelicanconf.py)
        TZ="${TIMEZONE:-UTC}" pelican
    - name: Publish to GitHub Pages
      env:
        DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}
      run: |
        cd output
        git config --local user.email "action@github.com"
        git config --local user.name "GitHub Action"
        if output=$(git status --porcelain) && [ -z "$output" ]; then
          echo "No new content was generated, exiting gracefully"
        else
          git add -A
          git commit -m "Updated content on $(date)"
          eval $(ssh-agent)
          echo "$DEPLOY_KEY" | ssh-add -t 5m /dev/stdin
          ssh-add -l
          git push git@github.com:galaxy4public/galaxy4public.github.io.git
          ssh-add -D
          ssh-agent -k
        fi
        cd ..
        echo Completed
```

This is a copy of my live GitHub Action for deploying my blog that you are most likely reading right now and I decided not to edit anything, so if you just want to re-use it you will need to replace a few things, namely:

en_AU.UTF-8 => to a locale you are using (you can run locale -a if you are running Linux to see the list of locales available on your system);
content => you may need to change that to the name of your content directory (if you did not use the default name);
themes/mind-drops/content => you will need to drop this line since it is my theme’s content directory and you would not have it;
galaxy4public/galaxy4public.github.io => to <your_username/your_blog_repo_name>, obviously :)

Let’s look a bit more closely to understand how this GitHub Action is structured and what each step is doing.

It all starts with the definition of the action itself, the conditions of how it is triggered, and how it runs: you can get a formal description of the YAML structure of this configuration file in the official GitHub documentation on Workflows.

Here, we are only going to focus on steps defined under the “jobs:” section of the file since these steps are defining the logic we are after.

The “Initialise locale” step is quite important for Pelican since with a misconfigured locale Pelican tends to produce incorrect output (which is kind of expected). So in this step we are trying to determine whether the user (us :) ) has supplied the LANG variable and if they did we update /etc/locale.gen file, run the locale-gen command to update the corresponding files, and set the locale of the container to the requested locale.
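
The character set handling of that step can be illustrated in isolation. The sketch below is not the code from the workflow: the function names locale_charset and locale_gen_line are hypothetical, and it treats a locale name without a dot explicitly, because the ${LANG#*.} expansion leaves such a name unchanged rather than producing an empty string.

```shell
#!/usr/bin/env bash

# Print the character set part of a locale name such as "en_AU.UTF-8".
# Falls back to UTF-8 when the name carries no ".charset" suffix.
locale_charset() {
    local name="$1" charset=''
    case "$name" in
        *.*) charset="${name#*.}" ;;
    esac
    printf '%s\n' "${charset:-UTF-8}"
}

# Build the "<locale> <charset>" line in the format used by /etc/locale.gen.
locale_gen_line() {
    printf '%s %s\n' "$1" "$(locale_charset "$1")"
}
```

For example, locale_gen_line en_AU.UTF-8 prints “en_AU.UTF-8 UTF-8”, which is the line the workflow step appends to /etc/locale.gen for my locale.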

The “Checkout the primary repo” step is leveraging the official “Checkout V2” Action and checks out a full copy of the source code repository of our blog and all the linked submodules. Initially, I was using a shallow copy using fetch-depth: 1, but the next step was requiring the full repository history to do its job reliably and I changed it to be a full history clone.

Since git does not store timestamps for the files and directories under its control, yet Pelican relies on timestamps to populate the modification time of the artefacts – we need to find a way to reconstruct at least file timestamps after the tree was checked out. One of the possible approaches would be to create a plugin that could determine whether we are inside a git working tree or not and depending on that apply different timestamp extraction policies, but I thought that a much easier way would be to prepare the checked out tree, hence making it compatible with the way Pelican expects things to be.

The “Restore modification times for content” step is my variant of how one could reconstruct the timestamps for files close enough to make it possible to use with Pelican. The approach relies on the fact that git records the timestamp of each commit including adding, updating, and deleting files. We create a list of all these file events using git log for file trees under “content” (where our blog content lives) and “themes/mind-drops/content” (where my custom theme injects some content such as the Web service worker script), then we use sed to filter and to re-arrange the output a bit, followed by reverse sorting to help to remove the entries that were introduced and later deleted. In the end, we have a list of file names with timestamps, so we go through the list in a loop and set the timestamps to files using touch.
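
If the sed pipeline looks intimidating, the same idea can be expressed in a much simpler, although slower, form. The sketch below is not what my workflow runs: the function name restore_mtimes is hypothetical, it calls git log once per tracked file instead of processing a single log stream, and it makes no attempt to follow renames. It assumes bash, git, and GNU touch.

```shell
#!/usr/bin/env bash

# Simplified variant: for every tracked file under the given directories,
# take the author timestamp of the last commit that touched the file and
# apply it as the modification time.
restore_mtimes() {
    local file tstamp
    git ls-files -z -- "$@" | while IFS= read -r -d '' file; do
        tstamp=$(git log -1 --format=%at -- "$file")
        if [ -n "$tstamp" ] && [ -f "$file" ]; then
            touch -m -d "@$tstamp" -- "$file"
        fi
    done
}
```

It would be called from the root of the checked out repository as restore_mtimes content themes/mind-drops/content. Like the original step, it needs the full history, so a shallow clone would give every file the timestamp of the only commit available.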

The “Checkout Pages repo” step is cloning the public blog repository into the “output” directory where Pelican will put the generated files. This is needed to ensure that we can track the changes to the public repository, since Pelican is careful enough (if not instructed otherwise) only to update the files it generates and leave everything else in place as is. We use this later to determine whether any new content has been generated or not.

The “Set up Python” and the “Install dependencies” steps are pretty generic: the former is using the official GitHub Action to install and configure the latest available version of Python 3.x and the latter is leveraging pip to install all blog’s dependencies (including Pelican itself).

The “Generate the website” step is running pelican to process our articles and pages and to generate the result in the “output” directory. There are a couple of tricks with this step, though.

The first trick, which is not that obvious, is that we are removing the content of the “output” directory. It seems a bit weird since we just checked it out several steps before, doesn’t it? Well, we are removing everything except hidden files and directories, which happen to include the “.git” subdirectory with all the actual data about the repository. Why do we do it? It is simple: this helps us to detect when some file or directory has been removed, so we can propagate that knowledge to the public blog repository. If we did not clean up the content of the “output” directory we would only append new changes and would never remove anything – this is how it was before I stumbled upon the problem, by the way. :)
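
The reason “.git” survives is that, by default, the shell’s “*” pattern does not match names starting with a dot. A minimal sketch of that behaviour is below; the function name clean_output_dir is hypothetical, and the explicit shopt -u dotglob is my own precaution in case the calling shell has that bash option switched on.

```shell
#!/usr/bin/env bash

# Remove everything in DIR except hidden entries such as .git.
# Runs in a subshell so the dotglob setting of the caller is left untouched.
clean_output_dir() {
    local dir="$1"
    [ -d "$dir" ] || return 1
    (
        shopt -u dotglob      # make sure "*" skips names starting with "."
        rm -rf -- "$dir"/*
    )
}
```

Keep in mind that every non-hidden file is removed, including files like CNAME, so anything of that kind has to be produced again by Pelican on each run.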

The second trick of the “Generate the website” step is extracting the time zone information from the configuration file and setting the TZ variable correctly just before we call pelican. Without this, Pelican may either fail or, if it does not, produce UTC-based dates and times, which would be undesirable (at least for me, since my time zone is in Australia).
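
The extraction itself is easier to read when pulled out into a function. The sketch below uses a slightly simplified version of the sed expression from the workflow; the function name pelican_timezone is hypothetical, and it only understands a plain quoted assignment such as TIMEZONE = 'Australia/Melbourne'.

```shell
#!/usr/bin/env bash

# Print the value assigned to TIMEZONE in the given Pelican settings file.
# Prints nothing if the file has no simple quoted TIMEZONE assignment.
pelican_timezone() {
    sed -nE "s/^[[:space:]]*TIMEZONE[[:space:]]*=[[:space:]]*['\"]([^'\"]+)['\"].*/\1/p" "$1"
}
```

It can then be combined with the same UTC fallback as in the workflow: tz=$(pelican_timezone pelicanconf.py) followed by TZ="${tz:-UTC}" pelican.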

The final step is to push the updated content to the public blog repository, which will make it visible via GitHub Pages. Several things to notice there are:

In the env: section we are setting up the DEPLOY_KEY variable: the ${{ secrets.DEPLOY_KEY }} syntax retrieves the named value from the secrets of the repository the workflow runs in. This is the private repository secret in which we stored the private part of the deploy key we specifically generated for this purpose at the beginning of this article.
git status is used to determine whether there are any changes between what we have in the working tree and the repository index. If no changes were detected we just exit gracefully.
If any change to the generated content was detected, we temporarily load the private part of the deploy key into the ssh-agent (for 5 minutes), push changes to the public blog repository, then clean up the key from the agent and kill the agent itself.
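
The change detection from the second point can be tried out locally without pushing anything. This is a sketch only: the function name has_changes is hypothetical, and it relies on git status --porcelain printing nothing when the working tree is clean.

```shell
#!/usr/bin/env bash

# Exit status 0: the working tree at DIR has modified, deleted, or untracked
# files; 1: the tree is clean; 2: git status itself failed.
has_changes() {
    local dir="$1" status
    status=$(git -C "$dir" status --porcelain) || return 2
    [ -n "$status" ]
}
```

In the publishing step this would read as: if has_changes output; then commit and push; otherwise print the “No new content was generated” message and stop.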

From this point on, any push to the private codebase repository will trigger the GitHub Action and if the change has resulted in any updated content such content will be published to GitHub Pages!

There are quite a few things we could improve: such as introducing a broken links check, doing some sanity checks, etc. – but this would be for another article, I guess. :)