Building A Linux Shell - Part II [A Step-by-Step Guide]

Written by MIMA | Published 2020/06/26
Tech Story Tags: programming | linux | shell | terminal | tutorial | c-tutorials | latest-tech-stories | hackernoon-top-story

TLDR: In the first part of this tutorial, we built a simple Linux shell that prints a prompt string, reads input, then echoes the input back to the screen. In this part, we'll update our code to give our shell the ability to parse and execute simple commands. A simple command consists of a list of words, separated by whitespace characters (space, tab, newline). The first word is the command name, and is mandatory. The second and subsequent words are optional; if present, they form the arguments we want the shell to pass to the executed command.

In the first part of this tutorial, we built a simple Linux shell that prints a prompt string, reads input, then echoes the input back to the screen. This isn't very impressive, is it?
In this part, we'll update our code to give our shell the ability to parse and execute simple commands. Let's first have a look at what simple commands are.
NOTE: You can download the complete source code for Part II & III of this tutorial from this GitHub repository.

What is a Simple Command?

A simple command consists of a list of words, separated by whitespace characters (space, tab, newline). The first word is the command name, and is mandatory (otherwise, the shell won't have a command to parse and execute!). The second and subsequent words are optional. If present, they form the arguments we want the shell to pass to the executed command.
For example, the command ls -l consists of two words: ls (the command's name), and -l (the first and sole argument). Similarly, the command gcc -o shell main.c prompt.c (which we used in part I to compile our shell) consists of 5 words: a command name, and a list of 4 arguments.
Now to be able to execute simple commands, our shell needs to perform the following steps:
  • Scan input, one character at a time, to find the next token. We call this process lexical scanning, and the part of the shell that performs this task is known as the lexical scanner, or simply the scanner.
  • Extract input tokens. We call this tokenizing input.
  • Parse the tokens and create an Abstract Syntax Tree, or AST. The part of the shell responsible for doing this is known as the parser.
  • Execute the AST. This is the job of the executor.
The figure below shows the steps performed by the shell in order to parse and execute commands. You can see that the figure contains more steps than are shown in the above list, and that's fine. As our shell grows and becomes more sophisticated, we'll add the other steps as and when needed.
Now let's look at each of the above four steps in detail, and see the code we'll need to implement those features in our shell.

Scanning Input

In order to get the next token, we need to be able to scan our input, one character at a time, so that we can identify the characters that can be part of the token, and those that are delimiter characters. A delimiter character is one that marks the end of a token (and possibly the beginning of another token). Typically, delimiters are whitespace characters (space, tab, newline), but can also include other characters, such as ; and &.
In general, scanning input means we should be able to:
  • Retrieve the next character from input.
  • Return the last character we've read back to input.
  • Lookahead (or peek) to check the next character, without actually retrieving it.
  • Skip over whitespace characters.
We'll define the functions to perform all of these tasks in a minute. But first, let's have a word about abstracting input.
Remember the read_cmd() function, which we defined in part I of this tutorial? That was the function we used to read the user's input and return it as a malloc'd string. We could pass this string directly to our scanner, but that would make the scanning process a bit cumbersome. In particular, it would be very difficult for the scanner to remember the last character it gave us, so that it can move past that character and give us the character that follows.
To make the scanner's job easy, we'll abstract our input by passing the input string as part of a struct source_s structure, which we'll define in the header file source.h. Go ahead and create that file in your source directory, then open it in your favorite text editor and add the following code:
#ifndef SOURCE_H
#define SOURCE_H

#define EOF             (-1)
#define ERRCHAR         ( 0)

#define INIT_SRC_POS    (-2)

struct source_s
{   
    char *buffer;       /* the input text */
    long bufsize;       /* size of the input text */
    long  curpos;       /* absolute char position in source */
};

char next_char(struct source_s *src);
void unget_char(struct source_s *src);
char peek_char(struct source_s *src);
void skip_white_spaces(struct source_s *src);

#endif
Focusing on the structure's definition, you can see that our struct source_s contains a pointer to the input string, in addition to two long fields that hold information about the string's length and our current position in the string (where we'll get the next character from).
Now create another file named source.c, to which you should add the following code:
#include <errno.h>
#include "shell.h"
#include "source.h"

void unget_char(struct source_s *src)
{
    if(src->curpos < 0)
    {
        return;
    }

    src->curpos--;
}


char next_char(struct source_s *src)
{
    if(!src || !src->buffer)
    {
        errno = ENODATA;
        return ERRCHAR;
    }

    /* position the cursor just before the first character on the
     * initial call; the increment below then lands on position 0 */
    if(src->curpos == INIT_SRC_POS)
    {
        src->curpos = -1;
    }

    if(++src->curpos >= src->bufsize)
    {
        src->curpos = src->bufsize;
        return EOF;
    }

    return src->buffer[src->curpos];
}


char peek_char(struct source_s *src)
{
    if(!src || !src->buffer)
    {
        errno = ENODATA;
        return ERRCHAR;
    }

    long pos = src->curpos;

    if(pos == INIT_SRC_POS)
    {
        pos++;
    }
    pos++;

    if(pos >= src->bufsize)
    {
        return EOF;
    }

    return src->buffer[pos];
}


void skip_white_spaces(struct source_s *src)
{
    char c;

    if(!src || !src->buffer)
    {
        return;
    }

    while(((c = peek_char(src)) != EOF) && (c == ' ' || c == '\t'))
    {
        next_char(src);
    }
}
The unget_char() function returns (or ungets) the last character we've retrieved from input, back to the input source. It does this by simply manipulating the source structure's pointers. You'll see the benefit of this function later in this series.
The next_char() function returns the next character of input and updates the source pointer, so that the next call to next_char() returns the following input character. When we reach the last character in input, the function returns the special character EOF, which we defined as -1 in source.h above.
The peek_char() function is similar to next_char() in that it returns the next character of input. The only difference is that peek_char() doesn't update the source pointer, so that the next call to next_char() returns the same input character we've just peeked at. You'll see the benefit of input peeking later in this series.
Finally, the skip_white_spaces() function skips all whitespace characters. This will help us when we've finished reading a token and we want to skip delimiter whitespace before we read the next token.

Tokenizing Input

Now that we have our scanner's functions in place, we'll use those functions in order to extract input tokens. We'll start by defining a new structure, which we'll use to represent our tokens.
Go ahead and create a file named scanner.h in your source directory, then open it and add the following code:
#ifndef SCANNER_H
#define SCANNER_H

struct token_s
{
    struct source_s *src;       /* source of input */
    int    text_len;            /* length of token text */
    char   *text;               /* token text */
};

/* the special EOF token, which indicates the end of input */
extern struct token_s eof_token;

struct token_s *tokenize(struct source_s *src);
void free_token(struct token_s *tok);

#endif
Focusing on the structure definition, our struct token_s contains a pointer to the struct source_s that holds our input. The structure also contains a pointer to the token's text, and a field that tells us the length of that text (so that we don't need to repeatedly call strlen() on the token's text).
Next, we'll write the tokenize() function, which will retrieve the next token from input. We'll also write some helper functions that will help us work with input tokens.
In the source directory, create a file named scanner.c, and enter the following code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include "shell.h"
#include "scanner.h"
#include "source.h"

char *tok_buf = NULL;
int   tok_bufsize  = 0;
int   tok_bufindex = -1;

/* special token to indicate end of input */
struct token_s eof_token = 
{
    .text_len = 0,
};


void add_to_buf(char c)
{
    tok_buf[tok_bufindex++] = c;

    if(tok_bufindex >= tok_bufsize)
    {
        char *tmp = realloc(tok_buf, tok_bufsize*2);

        if(!tmp)
        {
            errno = ENOMEM;
            return;
        }

        tok_buf = tmp;
        tok_bufsize *= 2;
    }
}


struct token_s *create_token(char *str)
{
    struct token_s *tok = malloc(sizeof(struct token_s));
    
    if(!tok)
    {
        return NULL;
    }

    memset(tok, 0, sizeof(struct token_s));
    tok->text_len = strlen(str);
    
    char *nstr = malloc(tok->text_len+1);
    
    if(!nstr)
    {
        free(tok);
        return NULL;
    }
    
    strcpy(nstr, str);
    tok->text = nstr;
    
    return tok;
}


void free_token(struct token_s *tok)
{
    if(tok->text)
    {
        free(tok->text);
    }
    free(tok);
}


struct token_s *tokenize(struct source_s *src)
{
    int  endloop = 0;

    if(!src || !src->buffer || !src->bufsize)
    {
        errno = ENODATA;
        return &eof_token;
    }
    
    if(!tok_buf)
    {
        tok_bufsize = 1024;
        tok_buf = malloc(tok_bufsize);
        if(!tok_buf)
        {
            errno = ENOMEM;
            return &eof_token;
        }
    }

    tok_bufindex     = 0;
    tok_buf[0]       = '\0';

    char nc = next_char(src);

    if(nc == ERRCHAR || nc == EOF)
    {
        return &eof_token;
    }

    do
    {
        switch(nc)
        {
            case ' ':
            case '\t':
                if(tok_bufindex > 0)
                {
                    endloop = 1;
                }
                break;
                
            case '\n':
                if(tok_bufindex > 0)
                {
                    unget_char(src);
                }
                else
                {
                    add_to_buf(nc);
                }
                endloop = 1;
                break;
                
            default:
                add_to_buf(nc);
                break;
        }

        if(endloop)
        {
            break;
        }

    } while((nc = next_char(src)) != EOF);

    if(tok_bufindex == 0)
    {
        return &eof_token;
    }
    
    if(tok_bufindex >= tok_bufsize)
    {
        tok_bufindex--;
    }
    tok_buf[tok_bufindex] = '\0';

    struct token_s *tok = create_token(tok_buf);
    if(!tok)
    {
        fprintf(stderr, "error: failed to alloc buffer: %s\n", strerror(errno));
        return &eof_token;
    }

    tok->src = src;
    return tok;
}
The global variables we defined in this file serve the following purposes:
  • tok_buf is a pointer to the buffer in which we'll store the current token.
  • tok_bufsize is the number of bytes we allocate to the buffer.
  • tok_bufindex is the current buffer index (i.e. it tells us where to add the next input character in the buffer).
  • eof_token is a special token we'll use to signal the end of file/input (EOF).
Now let's have a look at the functions we've defined in this file.
The add_to_buf() function adds a single character to the token buffer. If the buffer is full, the function takes care of extending it.
The create_token() function takes a string and converts it to a struct token_s structure. It takes care of allocating memory for the token's structure and text, and fills in the structure's member fields.
The free_token() function frees the memory used by a token structure, as well as the memory used to store the token's text.
The tokenize() function is the heart and soul of our lexical scanner, and its code is fairly straightforward. It starts by allocating memory for our token buffer (if not already done), then initializes the token buffer and source pointers. It then calls next_char() to retrieve the next input character. When we reach the end of input, tokenize() returns the special eof_token, which marks the end of input.
The function then loops to read input characters, one at a time. If it encounters a whitespace character, it checks the token buffer to see if it's empty or not. If the buffer is not empty, we delimit the current token and break out of the loop. Otherwise, we skip the whitespace characters and move along to the beginning of the next token.
After we've got our token, tokenize() calls create_token(), passing it the token text (which we stored in the buffer). The token text is converted to a token structure, which tokenize() then returns to the caller.

Compiling the Shell

We've certainly made a lot of progress in this part, but our shell still isn't ready to parse and execute commands. Therefore, we won't compile the shell just yet. Our next compilation will come at the end of part III, after we implement our parser and our executor.

What's Next

In this part, we've implemented our lexical scanner. In the next part, we'll write the parser and the executor, after which we'll have a shell that is capable of executing simple commands.
Stay tuned!

Written by MIMA | GNU/Linux system administrator and programmer