Platform-free bots - jimkang.com

I’ve made several bots over the years. They’re mostly Twitter bots. Some of them are throwaway larks, and some of them only work in the moment. If Twitter becomes too harmful to humanity to gift with free content, I’m OK with letting those go. However, there are many bots whose fate I want to keep in my own hands, rather than Twitter’s. To that end, I built a static site updater.

Motivation

For a long time, I’ve been bothered by concerns that are often discussed in the bot community, such as:

What if the company behind Twitter (or any other platform, really) goes out of business?
What if people just stop using Twitter?
What if Twitter just decides to ban your bot and not explain it? Or hellban it?
What if, in the future, internet access is hard to come by?
What if you want to find old tweets by your bot? This is notoriously difficult for any Twitter user right now.

Twitter does have an archive request feature. You can request your account archive right now from the settings in the Twitter web app. However, this doesn’t give me much comfort because:

The archives do not include images or video, which are the main medium of some bots.
Under duress, I’m not sure you’ll be able to get the archives. When Vine shut down, I requested my archive. I got a notification that said the archiving process started, but it didn’t complete before the company and service shut down. I never got my archive.
Twitter can stop offering the archives at any time.

I could have my bots crosspost to Tumblr and Mastodon, and some do. The issue is that, although the eggs are now in multiple baskets, the baskets still have many of the same problems: various sorts of potential inaccessibility.

There is something that is likely to last longer than any social media platform, though: HTML — the nearly three-decades-old Hypertext Markup Language.

That is the format I need my bots’ output to be in. I meant to make them post to plain HTML for a while, but inertia is hard to fight.

When Vine shut down, I had a Vine account that was very valuable to my extended family. So, I built a service that handled simple video posting on my own server. Doing this gave me some code I could reuse to finally make my platform-free bot archives.

How to use it

static-web-archive-on-git is a Node module that will maintain a static weblog for you. From the README:

The idea here is that you have a GitHub repo that is the source for a lightweight static weblog, and you have a program that you want to update it programmatically.

If you have a bot written in Node, it can take whatever it posts to Twitter or Mastodon (some text and an image or a video) and also post that to its static web archive by calling a function.

By doing so, it will build an archive of the bot’s content — including media — that is extremely portable.

It stores the HTML (the index pages and single-entry pages, classic weblog-style) in a GitHub repo. With this, you can do any of the following:

Make the default branch of that repo gh-pages and automatically get a working web site at your-username.github.io/repo-name.
Pull down the contents of that repo to some server you control that has a web server on it and serve it from there.
Send the contents to the Internet Archive.
Zip it all up and keep it wherever you think is safe and let future archaeologists find it. (If they have a thing that can read images and text, they’ll figure it out if they understand your language. If they have a thing that reads HTML, then they’ll be golden.)

Whatever you do with your HTML, it won’t matter what Twitter does.
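To make the "index pages and single-entry pages" idea concrete, here is a minimal sketch of rendering bot posts to plain HTML files. It is not the code from static-web-archive-on-git. The entry shape (`id`, `date`, `caption`, `mediaFilename`), the `media/` folder, and the function names `renderEntryPage` and `renderIndexPage` are all assumptions made up for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Escape text so a caption cannot break the surrounding markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render one post as a standalone page.
// Hypothetical entry shape: { id, date, caption, mediaFilename? }
function renderEntryPage(entry) {
  const caption = escapeHtml(entry.caption);
  const isoDate = entry.date.toISOString();
  // Assumption: media files sit in a "media" folder next to the pages.
  const media = entry.mediaFilename
    ? `<img src="media/${escapeHtml(entry.mediaFilename)}" alt="${caption}">`
    : '';
  return `<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>${caption}</title></head>
<body>
<article id="${escapeHtml(entry.id)}">
<time datetime="${isoDate}">${isoDate}</time>
${media}
<p>${caption}</p>
</article>
</body>
</html>`;
}

// Render an index page that links to every entry, newest first.
function renderIndexPage(title, entries) {
  const newestFirst = [...entries].sort((a, b) => b.date - a.date);
  const items = newestFirst
    .map((e) => `<li><a href="${escapeHtml(e.id)}.html">${escapeHtml(e.caption)}</a></li>`)
    .join('\n');
  return `<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>${escapeHtml(title)}</title></head>
<body>
<h1>${escapeHtml(title)}</h1>
<ul>
${items}
</ul>
</body>
</html>`;
}

// Demo: write two made-up posts and an index into a temporary folder.
const entries = [
  { id: 'post-1', date: new Date('2018-01-15T12:00:00Z'), caption: 'First <find>', mediaFilename: 'post-1.png' },
  { id: 'post-2', date: new Date('2018-01-16T12:00:00Z'), caption: 'Second find' }
];
const outDir = fs.mkdtempSync(path.join(os.tmpdir(), 'bot-archive-'));
for (const entry of entries) {
  fs.writeFileSync(path.join(outDir, `${entry.id}.html`), renderEntryPage(entry));
}
fs.writeFileSync(path.join(outDir, 'index.html'), renderIndexPage('My bot archive', entries));
console.log(`Wrote ${entries.length + 1} files to ${outDir}`);
```

The output is just files in a folder, so any of the options above (a gh-pages branch, your own web server, the Internet Archive, a zip file) can take it as-is.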

I’ve already integrated it into my bot linkfinds. Its archive is over at jimkang.com/linkfound. I’m going to try to get all of my bots posting to archives. I’m sure I’ll find bugs, but I’m fairly sure it’ll eventually work for most of them.

Update, 1/31/2018: So, I’ve set up archiving for most of my bots. It’s surprisingly nice to see them outside of the Twitter frame. The design is not complete for some of them, but they’re all functional. Check ‘em out:

Limitations

There are quite a few limitations to this approach, many of which you may have noticed:

This only works for bots whose meaning can be divorced from a particular medium. Much of what people like about godtributes, for example, is how it responds to them. The lion’s share of its tweets are responses, and they are mostly only meaningful to the person it’s replying to. That cannot be captured through this kind of archiving, as it’s only part of the story.
Archives cannot replicate the experience of bots dropping into your life by making tiny, occasional visits.
The current implementation depends on another massive VC-funded company, GitHub. Like all companies in this position, it may not survive that long or may become too detrimental to the world to touch.

The shutting-down concern is mitigated by the fact that it’s pretty easy to regularly and incrementally download the contents posted to GitHub (that’s what git is made for).

If I need to get off of the platform because I want nothing to do with it, it is actually easy to remove the GitHub parts from static-web-archive-on-git and have it maintain a local archive of HTML files.
- This, however, comes with the trade-off that, should it mess up in a destructive way, everything could be lost. You would not be able to get everything back after an accidental wipe by reverting to a previous commit as you could with a git-backed archive.
  - However, this problem itself could be mitigated by backing the whole thing up regularly.
It is Node-specific. At some point, I can plug this into a simple REST service so that any bot can use it, regardless of how it is built, as long as it can make an HTTP request. I don’t know if anyone really needs this yet, so I haven’t bothered.
Updating the archive is slow because there are stalling setTimeout calls everywhere.

Internally, the GitHub API makes updates via commits. The update REST method does not take a parent commit SHA because it figures out which one it should use on its own. Unfortunately, it also seems to respond to REST calls before the commit it creates has actually been applied everywhere.

I’ve observed the following race condition:
- We send an HTTP request for Update 1 via the GitHub API update file method.
- The GitHub API sends a response to Update 1 that indicates that it worked fine.
- Thinking that the branch is updated, we send a request for Update 2.
- The GitHub API responds to Update 2 with an error status code (409), citing that it’s a commit based on a commit that is not the current tip of the branch.
I don’t know what’s happening internally, but my guess is that a commit created for an API call does not propagate to all API servers by the time a response is sent back for that API call. For example, something like this may be going on:
- Commit 0 is the tip of the branch on all GitHub machines.
- Update 1 is received by Machine A, and Commit 1 is created, using Commit 0 as a parent.
- Machine B applies Commit 1; Commit 1 is the tip of the branch on Machines B, D, and some other GitHub machines now, but not on Machine C.
- A response for Update 1 is sent back by some machine.
- Update 2 happens to be received by Machine C, which does not have the latest yet, and Commit 2 is created, using Commit 0 as the parent.
- Machine D attempts to apply Commit 2. Here, Commit 1 is the tip of the branch, so it fails.
- A response for Update 2 is sent back with an error message like "is at 1 but expected 0".
The above scenario is pure speculation, but the following is definitely true: Updates are not atomic.

The setTimeouts are an attempt to space out update calls in order to avoid these wrong-commit-parent clashes. They don’t always prevent them, but they do most of the time. I’ve used this code for about eight months as of 1/15/2018, and it’s mostly been fine, FWIW. Obviously, it’s not going to work out for extremely real-time-dependent archiving purposes, though.
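Here is a rough sketch of the spacing idea, run against a fake API rather than GitHub. `createLaggyApi` is a made-up stand-in that mimics the behavior I observed (a success response arrives before the new branch tip is visible to the next request), and `updateSpaced` is a hypothetical helper, not the module’s actual code. The retry on 409 is an extra I’m adding for illustration; the approach described above only spaces the calls out.

```javascript
// Made-up stand-in for a remote "update file" API: it reports success
// right away, but the new branch tip only becomes visible to later
// requests after lagMs milliseconds.
function createLaggyApi({ lagMs }) {
  let actualTip = 0;  // the real tip of the branch
  let visibleTip = 0; // the tip the next incoming request will see
  return {
    update(content) {
      return new Promise((resolve, reject) => {
        if (visibleTip !== actualTip) {
          // The request was built on a stale tip: mimic the 409 conflict.
          const err = new Error(`is at ${actualTip} but expected ${visibleTip}`);
          err.status = 409;
          reject(err);
          return;
        }
        actualTip += 1;
        const newTip = actualTip;
        setTimeout(() => { visibleTip = newTip; }, lagMs);
        resolve({ status: 200, commit: newTip, content });
      });
    }
  };
}

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Send updates one at a time, pausing spacingMs after each one.
// On a 409, wait a bit longer each time and retry, up to maxRetries.
async function updateSpaced(api, contents, { spacingMs, maxRetries }) {
  const results = [];
  for (const content of contents) {
    let attempt = 0;
    for (;;) {
      try {
        results.push(await api.update(content));
        break;
      } catch (err) {
        if (err.status !== 409 || attempt >= maxRetries) {
          throw err;
        }
        attempt += 1;
        await delay(spacingMs * attempt);
      }
    }
    await delay(spacingMs); // the "stalling" pause between updates
  }
  return results;
}

// Demo: a 50 ms pause comfortably covers the fake API's 20 ms lag.
updateSpaced(createLaggyApi({ lagMs: 20 }), ['a', 'b', 'c'], { spacingMs: 50, maxRetries: 3 })
  .then((results) => console.log(results.map((r) => r.commit))); // logs [ 1, 2, 3 ]
```

The trade-off is the same one described above: every pause makes the archive update slower, and a pause that is too short for the real propagation delay still produces conflicts.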

Well, that’s it. May you enjoy being platform-free via one method or another!