Interplanetary Versioned File System

Ever feel like things are sometimes just a little harder than they should be? That’s the way I felt when I wanted to save version-controlled Markdown and SECST documents to IPFS from an editor I am developing. If you don’t know what IPFS is, then this article may not be for you. Take a look at IPFS and come back if it is.

By its very nature, IPFS creates a new version of a document every time you save it. Unfortunately, it does not provide a simple way to track the versions and keep them related to each other.

There was a substantive attempt at creating a comprehensive mechanism for doing this with the Interplanetary Version Control (IPVC) system; however, work on the project has been suspended by the author. Additionally, its powers are way beyond what I was seeking. It is modeled on Git, so its power comes with complexities beyond the ken of a typical document author.

I wanted something lightweight and easy to wrap in a user interface so that casual users will have the ability to track and retrieve old versions of files, either through automatic numbering or using user-provided names (as many people still do when sharing versions of files with each other or keeping track of things on their computer).

In addition to IPVC, I found some instructions for using IPLD (Interplanetary Linked Data). It seems complex and also resource intensive. The approach keeps entire copies of files around, even if just a few characters have changed. Although IPFS automatically de-duplicates data at the block level behind the scenes, character sequence changes in documents happen at a far finer level of granularity than a block.

Unable to find what I needed, I wrote the Interplanetary Versioned File System (IPVFS). In this article, I describe how to use it and how it works, along with design alternatives and tradeoffs. (IPVFS is currently in an alpha state, so things could change!)

How To Use IPVFS

IPVFS has only three API end-points:

An initialization function ipvfs which takes an ipfs instance as its argument and augments it to support file versioning then returns the instance.
read(path,options) which is similar to ipfs.files.read with additional options.
write(path,content,options) which is similar to ipfs.files.write with additional options.

The read and write methods exist on an object versioned added to the ipfs.files property. They can be accessed as ipfs.files.versioned.read and ipfs.files.versioned.write. Ultimately, they will be an API superset of the standard functions; hence, so long as pointers to the original versions are kept around, you could theoretically elevate the versioned methods to replace the standard versions.

Here are a few lines of standard IPFS code followed by similar IPVFS code:

import ipvfs from "../index.js";
import {create} from "ipfs";
import {all} from "@anywhichway/all";

// create returns a promise, so resolve it before handing the instance to ipvfs
let ipfs = await ipvfs(await create({repo:"hackernoon-filestore"}));

await ipfs.files.write("/hello-world.txt","hello there peter!",{create:true});
// log contents
console.log((await all(ipfs.files.read("/hello-world.txt"))).toString()); 
// truncate:true drops the leftover bytes of the longer original content
await ipfs.files.write("/hello-world.txt","hello there paul!",{create:true,truncate:true});
// log new contents, but access to the old version is not available
console.log((await all(ipfs.files.read("/hello-world.txt"))).toString()); 

await ipfs.files.versioned.write("/hello-world-versioned.txt","hello there peter!");
// log contents
console.log(await ipfs.files.versioned.read("/hello-world-versioned.txt",{all:true})); 
await ipfs.files.versioned.write("/hello-world-versioned.txt","hello there paul!");
// log new contents
console.log(await ipfs.files.versioned.read("/hello-world-versioned.txt",{all:true}));
// log first version contents
console.log(await ipfs.files.versioned.read("/hello-world-versioned.txt#1",{all:true}));

To retrieve an old version of a file, you just append #<number> to the file name, where <number> is the sequential version.

You may have noted the use of the function all from the package @anywhichway/all. IPFS read returns chunks of data asynchronously; the all function just collects them into a single buffer. Without this function, you would have to write your own function to collect the chunks in a for loop.
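
For readers curious what such a hand-written collector might look like, here is a minimal sketch. It is not the implementation of @anywhichway/all. The fakeRead generator is a hypothetical stand-in for ipfs.files.read, assuming only that read yields Uint8Array chunks.

```javascript
// Collect every chunk from an async iterable of Uint8Array into one Uint8Array.
async function collectChunks(source) {
  const chunks = [];
  let total = 0;
  for await (const chunk of source) {
    chunks.push(chunk);
    total += chunk.length;
  }
  // Copy the chunks, in arrival order, into a single buffer.
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    joined.set(chunk, offset);
    offset += chunk.length;
  }
  return joined;
}

// Hypothetical stand-in for ipfs.files.read(path): yields the content in two chunks.
async function* fakeRead() {
  const encoder = new TextEncoder();
  yield encoder.encode("hello there ");
  yield encoder.encode("peter!");
}

// Decode the joined bytes back into a string and log it.
const collectedText = collectChunks(fakeRead()).then((bytes) =>
  new TextDecoder().decode(bytes)
);
collectedText.then((text) => console.log(text));
```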

IPVFS also allows you to pass {all:true} as an option, in which case the chunks are concatenated for you. Furthermore, because IPVFS keeps additional metadata about what it is storing, you do not have to convert returned data to a string. Since a string was saved, a string is returned.

You can also name versions and retrieve them by appending @<version name>.

await ipfs.files.versioned.write("/hello-world-versioned.txt","hello there mary!",{metadata:{version:"Mary Version"}});
console.log(await ipfs.files.versioned.read("/hello-world-versioned.txt@Mary Version",{all:true}));

IPVFS does not enforce any particular naming convention, but you could use this approach to implement semantic versioning, e.g. {version:"0.0.3"} could be retrieved using @0.0.3.
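
To make the two selector styles concrete, the sketch below shows one way a path such as /file.txt#2 or /file.txt@0.0.3 could be split and matched against a list of version records. The helper names parseVersionedPath and findRecordIndex are hypothetical and are not part of the IPVFS API. The record shape is reduced to a version property for illustration.

```javascript
// Hypothetical helper: split a versioned path into the file path and a version selector.
// Limitation of this sketch: a version name containing "/" is not handled.
function parseVersionedPath(versionedPath) {
  const lastSlash = versionedPath.lastIndexOf("/");
  const fileName = versionedPath.slice(lastSlash + 1);
  const match = /[#@]/.exec(fileName);
  if (!match) return { path: versionedPath, selector: null };
  const path = versionedPath.slice(0, lastSlash + 1 + match.index);
  const value = fileName.slice(match.index + 1);
  return match[0] === "#"
    ? { path, selector: { type: "number", value: Number(value) } }
    : { path, selector: { type: "name", value } };
}

// Hypothetical helper: resolve a selector to an index in the record list, or -1.
function findRecordIndex(records, selector) {
  if (selector === null) return records.length - 1; // no selector means latest
  if (selector.type === "number") {
    const index = selector.value - 1; // sequential versions start at 1
    return Number.isInteger(index) && index >= 0 && index < records.length ? index : -1;
  }
  return records.findIndex((record) => record.version === selector.value);
}

const sampleRecords = [{ version: 1 }, { version: 2 }, { version: "0.0.3" }];
const byNumber = parseVersionedPath("/hello-world-versioned.txt#2");
const byName = parseVersionedPath("/hello-world-versioned.txt@0.0.3");
console.log(byNumber.path, findRecordIndex(sampleRecords, byNumber.selector));
console.log(byName.path, findRecordIndex(sampleRecords, byName.selector));
```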

Finally, you can add arbitrary metadata to files (anything other than version), e.g.

await ipfs.files.versioned.write("/hello-world-versioned.txt","hello there mary!",{
  metadata:{
    version:"Mary Version",
    author:"John Jones",
  }
});

This data can be retrieved by passing withMetadata:true to read, in which case an object is returned instead of just the content, e.g.

const result = await ipfs.files.versioned.read("/hello-world-versioned.txt@Mary Version",{withMetadata:true,all:true}),
  {content,metadata} = result,
  {version,author} = metadata;

More on the metadata structure and how to get a version history is covered below. For additional read and write options, visit the documentation on GitHub.

How IPVFS Is Implemented

Currently, IPVFS stores the first version of a file’s content as a standalone un-named CID hashed block in IPFS. A pointer to this block is kept in a named file along with some metadata that includes an array of transformations that are required to convert the original text into the most current version. The alpha release of IPVFS does not automatically pin this content, but it should be pinned.

When a write operation is performed, a test is made to see if the content or custom metadata being provided is different from the most recent version. If the content is different, the library little-diff is used to discover the differences. The difference, if any, and any new custom metadata are used to create a change record which is added to the array of transformations.
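
The write path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the IPVFS source: naiveDiff is a hypothetical single-splice diff standing in for little-diff, the [start, deleteCount, insertedText] delta shape is inferred from the sample history shown later in this article, and the content hash is omitted to keep the example dependency-free.

```javascript
// Hypothetical stand-in for little-diff: trims the common prefix and suffix and
// returns one [start, deleteCount, insertedText] record, or [] if nothing changed.
function naiveDiff(before, after) {
  if (before === after) return [];
  let start = 0;
  const max = Math.min(before.length, after.length);
  while (start < max && before[start] === after[start]) start++;
  let endBefore = before.length;
  let endAfter = after.length;
  while (endBefore > start && endAfter > start && before[endBefore - 1] === after[endAfter - 1]) {
    endBefore--;
    endAfter--;
  }
  return [[start, endBefore - start, after.slice(start, endAfter)]];
}

// Append a change record only when the content or the custom metadata differs.
// Assumes records already holds at least the first-version record.
function appendChange(records, previousContent, newContent, metadata = {}, now = Date.now()) {
  const delta = naiveDiff(previousContent, newContent);
  const effective = Object.assign({}, ...records); // metadata accumulated so far
  const metadataChanged = Object.keys(metadata).some((key) => effective[key] !== metadata[key]);
  if (delta.length === 0 && !metadataChanged) return records; // nothing to record
  // A version supplied in metadata overrides the default sequential number.
  const record = { version: records.length + 1, kind: "String", delta, mtime: now, ...metadata };
  return [...records, record];
}

// "example-cid" is a placeholder, not a real CID.
const firstRecords = [{ path: "example-cid", version: 1, kind: "String", delta: [], btime: 1, mtime: 1 }];
const afterEdit = appendChange(firstRecords, "hello there peter!", "hello there paul!", {}, 2);
console.log(JSON.stringify(afterEdit[1]));
```

Writing identical content with identical metadata returns the record list unchanged, which mirrors the "test is made" step in the paragraph above.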

When a version of the file content is requested, it is generated from the first version and the array of transformations up to the version requested. The little-diff library is used to convert the actual content and simple object assignment is used for custom metadata.
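
A minimal sketch of that reconstruction step follows. It assumes each delta is a [start, deleteCount, insertedText] splice applied in order, which is consistent with the sample history shown later in this article but is not taken from little-diff's documentation. The function names and the author property on the second record are hypothetical.

```javascript
// Apply one splice-style delta to a string (assumed delta shape, see lead-in).
function applyDelta(content, [start, deleteCount, inserted]) {
  return content.slice(0, start) + inserted + content.slice(start + deleteCount);
}

// Rebuild the content and metadata of a sequential version (1-based) from the
// first version's content and the ordered change records.
function rebuildVersion(firstContent, records, versionNumber) {
  let content = firstContent;
  const metadata = {};
  for (const record of records.slice(0, versionNumber)) {
    // Apply each delta of the record in order.
    for (const delta of record.delta) content = applyDelta(content, delta);
    // Simple object assignment: later records overwrite earlier properties.
    const { delta, ...rest } = record;
    Object.assign(metadata, rest);
  }
  return { content, metadata };
}

// Delta values copied from this article's sample history; author is hypothetical.
const exampleRecords = [
  { version: 1, delta: [] },
  { version: 2, delta: [[17, 1, ""], [13, 4, "aul!"]], author: "John Jones" },
  { version: 3, delta: [[12, 1, "m"], [14, 2, "ry"]] },
];
for (const n of [1, 2, 3]) {
  console.log(n, rebuildVersion("hello there peter!", exampleRecords, n).content);
}
```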

Design Alternatives And Tradeoffs

Keeping the original content in a separate CID hashed block is a time/space tradeoff.

The original content could be stored in the named file along with the metadata. This would save one write and one IPFS CID entry. However, this would mean the metadata and all the content would need to be read prior to returning anything to the requestor. For large files this could have both a negative performance and RAM impact. By using a pointer to a separate CID hashed block, IPVFS can use the metadata to assemble ordered change sets that can be applied as content streams from the separate block to the requestor. In some sense, IPVFS is acting as a pipe. This makes it both time and memory efficient at scale.

A pointer to a separate CID hashed block could be created and saved for every version, but this would ultimately take a lot of space. It could also subject the system to larger than necessary writes and network traffic. The design would potentially fail with respect to time, memory, network and storage efficiency.

Some version management systems only keep the most recent copy of file contents and use backward transformations to create older versions. This is arguably better since people are more likely to want a recent version. IPVFS could be modified to do this: a new CID hashed block could be created for each change and its CID could replace the pointer. However, this would require an extra write operation and subsequent network traffic as the new block is propagated. It might also result in management overhead as attempts are made to “remove” the old CID hashed block, which is now garbage from a version management perspective. The word “remove” is in quotes because it is not really possible to remove hashed content. In some sense, unpinned content simply expires when the creating IPFS node stops, so long as nobody else has created an identical CID hashed block (which is entirely possible and actually quite likely for small files). The design would potentially fail with respect to time and network efficiency, and the code might be considerably more complex.
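
To illustrate what a backward transformation could look like, the sketch below inverts a list of forward deltas so that the newer content can be turned back into the older one. It assumes the same [start, deleteCount, insertedText] splice shape seen in this article's sample history. The function names are hypothetical and this is not something IPVFS currently does.

```javascript
// Apply one splice-style delta to a string (assumed delta shape, see lead-in).
function applyDelta(content, [start, deleteCount, inserted]) {
  return content.slice(0, start) + inserted + content.slice(start + deleteCount);
}

// Given the older content and the forward deltas, return the newer content and
// the backward deltas that undo the change when applied in order.
function invertDeltas(olderContent, forwardDeltas) {
  const backward = [];
  let content = olderContent;
  for (const [start, deleteCount, inserted] of forwardDeltas) {
    // Remember the text this step removes so the inverse can restore it.
    const removed = content.slice(start, start + deleteCount);
    // Inverses must run in reverse order, so add each one to the front.
    backward.unshift([start, inserted.length, removed]);
    content = applyDelta(content, [start, deleteCount, inserted]);
  }
  return { newerContent: content, backward };
}

const inverted = invertDeltas("hello there peter!", [[17, 1, ""], [13, 4, "aul!"]]);
const restored = inverted.backward.reduce(
  (content, delta) => applyDelta(content, delta),
  inverted.newerContent
);
console.log(inverted.newerContent, "->", restored);
```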

Metadata Structure and Version History

In order to optimize content access and delivery or implement more sophisticated version management, IPVFS makes its metadata available via the read function using the withMetadata or withHistory options. However, it is also possible to get just the version history and metadata without the actual file content by using the standard ipfs.files.read function. This defers the CID lookup and content reads until the requesting program decides to make them.
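
As a rough sketch of that second route, the function below reads the named file through the standard ipfs.files.read call and parses it into an array of change records. It assumes the named file holds UTF-8 encoded JSON, as the listing in this article suggests. The readHistory name and the stubIpfs object are hypothetical; the stub exists only so the example runs without an IPFS node.

```javascript
// Read the raw versioned file and parse it as a JSON array of change records.
async function readHistory(ipfs, path) {
  const decoder = new TextDecoder();
  let json = "";
  for await (const chunk of ipfs.files.read(path)) {
    // stream:true keeps multi-byte characters intact across chunk boundaries.
    json += decoder.decode(chunk, { stream: true });
  }
  json += decoder.decode(); // flush any buffered bytes
  return JSON.parse(json);
}

// Hypothetical stub that mimics ipfs.files.read by yielding two byte chunks.
// "example-cid" is a placeholder, not a real CID.
const stubJson = JSON.stringify([
  { path: "example-cid", version: 1, delta: [], btime: 1, mtime: 1 },
  { version: 2, delta: [[17, 1, ""]], mtime: 2 },
]);
const stubIpfs = {
  files: {
    async *read() {
      const encoder = new TextEncoder();
      yield encoder.encode(stubJson.slice(0, 20));
      yield encoder.encode(stubJson.slice(20));
    },
  },
};

// List versions and modification times without touching the content block.
const versionList = readHistory(stubIpfs, "/hello-world-versioned.txt").then((records) =>
  records.map(({ version, mtime }) => ({ version, mtime }))
);
versionList.then((list) => console.log(list));
```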

Below are the contents of a versioned file read using the standard ipfs.files.read function.

The file contains an array of change records. The first includes a CID path to the original content and a btime. The remaining properties are the same for all records:

an SHA-256 hash of the version content
a version that will either be the change index + 1 or a manually provided version string
the kind of data stored
an array of delta records (see little-diff)
the mtime for the change
any other metadata properties provided when the file was written (there are none in this example)

[
  {
    "path": "QmScjZmC4J4ZHq6bGTUyYSESfTKDhxo8X7o3QShSawTsqi",
    "hash": "f7a67e7a0a50e87e59713999562d06cc3d2511709c0a3ded8020d8247e47251c",
    "version": 1,
    "kind": "String",
    "delta": [],
    "btime": 1672768094671,
    "mtime": 1672768094671
  },
  {
    "hash": "4fe36dd2fd280cbdd9414f3efa61d2b49116453e7edad0316b8b6be1d1c64817",
    "version": 2,
    "kind": "String",
    "delta": [
      [
        17,
        1,
        ""
      ],
      [
        13,
        4,
        "aul!"
      ]
    ],
    "mtime": 1672768094748
  },
  {
    "hash": "0c8a635762b80e327d384f660387f3acc5f24363de54366404e4a391260fd5c5",
    "version": 3,
    "kind": "String",
    "delta": [
      [
        12,
        1,
        "m"
      ],
      [
        14,
        2,
        "ry"
      ]
    ],
    "mtime": 1672768094806
  }
]

The above structure can be read and used to optimize file retrieval on a client device by independently accessing the path CID and applying the delta records using little-diff.

IPVFS is currently in alpha. I would love your feedback here or in the comments.

Image: PCB Tech on Pixabay