Processing Text with State Machines

Processing text is a very common problem in computer science. Using state machines or finite automatons (FAs) is a great technique for processing text. This article provides a rough framework for tackling string processing problems.

Finite automatons

An FA is a mathematical model of computation. An FA has one "start" state, one or more "final" states, and zero or more intermediate states. The only information it stores is its current state. At any point of time during a computation, it does not know how it got to its current state. An FA has a transition function that maps the current state and input token to the next state. You can read more about FAs on Wikipedia.

An alternate perspective

Implementations of FAs can be simple or complex based on the input string and the problem being solved. Devs often try to use regular expressions or recursive algorithms to tackle problems that would be much simpler with an FA. I have also come across devs who have rejected this technique, finding it "hacky" and "unreliable". This happens because of a lack of knowledge about automata theory.

However, it is easy to write programs that use FAs without going deep into the theory. We can think of an FA as an alternate way to structure the program. The program goes through a set of states as it processes the input. Behaviour of the program on the next input token is defined by its current state. The FA "accepts" or "rejects" an input string based on its final state.

The sitemap parsing task

An example of a string processing task is to parse a sitemap file and extract all the page links (the contents of the

<loc>

tags). This example is perfect because while the initial problem is simple, it becomes progressively harder as we move towards a more complete sitemap processor.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
   </url>
</urlset>

The library

htmlparser2

will be used for parsing the XML. This parser allows us to hook callbacks which are called whenever an XML tag is opened/closed as the parser processes the input. This makes our task trivial. Think of it as a two-stage parser where the first converts a string into XML tokens and the second converts XML tokens into the required output.

The input for our FA will look like:

urlset
url
loc
/url
url
loc
/url
/urlset

Building an FA

This FA can be used to extract all the page links. Our program should save the text content of a tag only when the current state is

LOC_OPENED

Another convincing argument in favour of FAs is that it is easy to add new states and transitions to it. If the specification of the input string changes, states can be added or removed from an FA to accommodate those changes. This FA handles the

<lastmod>

tags as well.

Implementation

Our FA implementation does not require any extra libraries. A generic hashmap can be used to store the states.

// sitemapStates.js

module.exports = {
  START: {
    transitions: { urlset: 'URLSET_OPENED' },
  },
  URLSET_OPENED: {
    transitions: { url: 'URL_OPENED' },
  },
  URLSET_CLOSED: {
    final: true,
    transitions: {},
  },
  URL_OPENED: {
    transitions: { loc: 'LOC_OPENED' },
  },
  URL_CLOSED: {
    transitions: {
      '/urlset': 'URLSET_CLOSED',
      url: 'URL_OPENED',
    },
  },
  LOC_OPENED: {
    transitions: { '/loc': 'LOC_CLOSED' },
    extractText: true,
  },
  LOC_CLOSED: {
    transitions: {
      '/url': 'URL_CLOSED',
      lastmod: 'LASTMOD_OPENED',
    },
  },
  LASTMOD_OPENED: {
    transitions: {
      '/lastmod': 'LASTMOD_CLOSED',
    },
  },
  LASTMOD_CLOSED: {
    transitions: {
      '/url': 'URL_CLOSED',
      loc: 'LOC_OPENED',
    },
  },
};

// index.js
const fs = require('fs');
const htmlparser2 = require('htmlparser2');
const sitemapStates = require('./sitemapStates');

function parseSitemap(sitemapBuffer) {
  let state = sitemapStates.START;
  const pageLinks = [];

  function onOpenTag(name) {
    const nextStateName = state.transitions[name];
    if (nextStateName === undefined) throw new Error('Failed to process input');
    state = sitemapStates[nextStateName];
  }

  function onCloseTag(name) {
    const tagWithWithSlash = `/${name}`;
    const nextStateName = state.transitions[tagWithWithSlash];
    if (nextStateName === undefined) throw new Error('Failed to process input');
    state = sitemapStates[nextStateName];
  }

  function onText(text) {
    if (state.extractText) pageLinks.push(text);
  }

  const parser = new htmlparser2.Parser({
    onopentag: onOpenTag,
    onclosetag: onCloseTag,
    ontext: onText,
  });

  parser.write(sitemapBuffer);
  if (!state.final) throw new Error('Failed to process input');

  return pageLinks;
}

fs.promises.readFile('sitemap.xml')
  .then(parseSitemap)
  .then(console.log);

Running this will output:

[
  'http://www.example.com/',
  'http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii',
  'http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand'
]

Getting real

A sitemap file has a lot more than what we've dealt with till now. One sitemap file can link to other sitemap files as well. You can read more about the sitemap protocol on sitemaps.org.

A sitemap file can either be a simple sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
   </url>
</urlset>

or a sitemap index file that links to other sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

FAs that can handle this can be constructed easily.

Please note that the sitemap spec defines two more tags <changefreq>
and

<priority>

that we've omitted from our example.

Learning more

I hope this article provides a gist of how FAs can be used. Remember that while finite automatons look simple, more powerful automatons also exist. These are required for more complex parsing tasks (for example: saving the text of the

<loc>

tags only when the date in

<lastmod>

is later than a given date).

This course on edX is a great resource for learning more about automata theory.