Flynn’s Taxonomy and the Concept of Multithreading

Written by durganshu | Published 2024/01/23
Tech Story Tags: multithreading | cpp | parallelism | threading | parallel-computing | parallel-programming | high-performance-computing | flynn's-taxonomy

TL;DR: Flynn's taxonomy forms the basis of modern-day computing architectures. This article marks the beginning of a blog series on multithreading. Discover SISD, SIMD, MISD, and MIMD architectures: these form the basis of the parallel programming models and are crucial for digging deep into the fundamentals of threading.

In the ever-evolving landscape of computing, the quest for optimal performance has led to the exploration of parallelism — the art of breaking down complex tasks into smaller, manageable ones. At the heart of this pursuit lies the fundamental concept of threading, a powerful technique that enables programs to execute multiple tasks concurrently, unleashing the true potential of modern computing systems.

Threading involves breaking a program into smaller, independently executable units called threads, running simultaneously to enhance efficiency and speed. This is one of the fundamental concepts of parallel programming.

Before delving into the intricacies of threading, it’s essential to grasp the theoretical framework that underpins the world of parallel computing. Flynn’s Taxonomy, named after computer scientist Michael Flynn, provides a comprehensive classification system for parallel processing architectures.

Understanding Flynn’s Taxonomy is crucial because it lays the foundation for comprehending the intricate relationships between threads, processes, and the overall structure of parallelized systems. A clear grasp of these concepts is essential for learning how to create multi-threaded applications.

This article marks the beginning of a blog series on multi-threading, covering all the theoretical and programming aspects of multi-threading. We’ll dig deep into threading APIs and libraries (POSIX Threads, C++ threads) and OpenMP in the upcoming blogs.

Without further ado…

Flynn’s Taxonomy

In 1966 [1], computer scientist Dr. Michael J. Flynn proposed a classification of computer architectures based on the number of instruction streams and data streams they can process concurrently. This classification is still relevant and is known as Flynn’s taxonomy.

In his initial paper, Flynn proposed four classifications:

  1. Single instruction stream, single data stream (SISD)
  2. Single instruction stream, multiple data streams (SIMD)
  3. Multiple instruction streams, single data stream (MISD)
  4. Multiple instruction streams, multiple data streams (MIMD)

In a subsequent paper published in 1972 [2], Flynn subdivided SIMD into three further classes: Array Processors, Pipelined Processors, and Associative Processors.

“Stream” refers to the sequence of data or instructions as seen by the machine during the execution of a program. [1]

A single stream issues one element in each time unit, resembling a queue. Conversely, multiple streams typically release numerous elements per time unit, akin to managing multiple queues simultaneously.

Consequently, a single instruction stream (SI) issues only one instruction within a time unit, whereas a multiple instruction stream (MI) releases multiple instructions during the same period. Similarly, a single data stream (SD) releases a single data element per time unit, while a multiple data stream (MD) issues multiple data elements within the same time frame.

Flynn’s taxonomy can be visually represented as a 2×2 matrix: single vs. multiple instruction streams along one axis, and single vs. multiple data streams along the other, yielding the four classes SISD, SIMD, MISD, and MIMD.

Before proceeding, it's important to recall how instructions are executed (the so-called instruction cycle).

The instruction cycle consists of the following steps (a toy software sketch follows the list):

  1. Fetch instruction - read the instruction from memory.
  2. Decode instruction - interpret the requested action.
  3. Fetch data (if needed) - get the data needed for processing the instruction.
  4. Execute instruction (process data) - carry out the requested operation on the data.
  5. Write back data - save/store the results.
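
To make these stages concrete, below is a minimal software sketch of the cycle: a toy interpreter that fetches, decodes, executes, and writes back one instruction per step. The three-instruction ISA and all the names here are invented for illustration; real processors implement these stages in hardware.

// A toy interpreter walking through the five stages of the instruction cycle
#include <cstdint>
#include <iostream>
#include <vector>

enum class Op : uint8_t { Add, Mul, Halt };   // hypothetical 3-instruction ISA
struct Instr { Op op; int src1, src2, dst; };

int main() {
  std::vector<Instr> program = {              // instruction memory
    {Op::Add, 0, 1, 2},                       // data[2] = data[0] + data[1]
    {Op::Mul, 2, 0, 3},                       // data[3] = data[2] * data[0]
    {Op::Halt, 0, 0, 0}
  };
  std::vector<int> data = {3, 4, 0, 0};       // data memory

  for (std::size_t pc = 0; pc < program.size(); ++pc) {
    Instr instr = program[pc];                // 1. fetch the instruction
    Op op = instr.op;                         // 2. decode the requested action
    if (op == Op::Halt) break;
    int a = data[instr.src1];                 // 3. fetch the data
    int b = data[instr.src2];
    int r = (op == Op::Add) ? a + b : a * b;  // 4. execute (process the data)
    data[instr.dst] = r;                      // 5. write back the result
  }
  std::cout << data[3] << '\n';               // prints 21: (3 + 4) * 3
}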

Let’s try to understand what these different classifications mean.

Single instruction stream, single data stream (SISD)

SISD represents a computing architecture where a uniprocessor machine executes a single instruction stream, operating on a single data stream. This architectural design corresponds with the principles of the von Neumann model.

The control unit retrieves a sequence of instructions from memory, decodes them, and dispatches them to the processing unit, which handles the data fetched from memory one unit at a time. Consequently, only one instruction is executed at any given time, reflecting a sequential execution of instructions.

It is crucial to note that SISD processors may exhibit limited concurrency. For instance, the execution phases of sequential instructions can overlap, facilitated by an additional functional unit. SISD processors achieve this limited concurrency through techniques like pipelining or superscalar execution.

Pipelined processors involve executing multiple instructions in the same execution unit concurrently by segmenting the execution unit into different phases, while superscalar processors execute multiple instructions in parallel using multiple execution units.

Before the 2000s, most single-core processors adhered to the SISD architecture, encapsulating the foundational structure of early computing systems.

Single instruction stream, multiple data streams (SIMD)

In a SIMD system, a single instruction stream is broadcast to all the processing elements (PE), and each element independently processes its assigned portion of the data. This allows for efficient parallelization of tasks and can result in significant speedup for specific computations.

SIMD is well-suited for tasks involving performing the same operation on a large data set in parallel, such as graphics processing, scientific simulations, or certain mathematical computations.

An instruction is supplied to numerous data elements concurrently, following a synchronized execution pattern. All instructions are queued in this “lockstep” process, and the associated data is distributed across distinct compute units. Each compute unit executes the first instruction in the queue simultaneously, proceeds to execute the subsequent instruction in sync, and continues this synchronized execution for the following instructions in the queue.

SIMD architecture finds its most prevalent application in Graphics Processing Units (GPUs).


As mentioned above, Flynn’s 1972 research paper subdivided SIMD into three further categories:

  1. Array Processors (SIMT)

    An array (or vector) processor executes an instruction set optimized for efficiently processing large one-dimensional arrays called vectors. In this subtype of SIMD, each parallel processing element has its distinct memory and register file. However, they receive the same instructions.
    Nvidia’s SIMT (Single Instruction, Multiple Threads) concept, implemented in GPGPUs, is the most common example of array processors. SIMT combines SIMD with multi-threading and is the basis of most modern GPUs.

  2. Pipelined Processors (packed SIMD)

    Just like other SIMD variants, pipelined processors receive the same instructions. However, they read data from a central resource and process fragments of that data independently. Finally, they write back the results to the same central resource. Modern SIMD variations, such as Intel’s AVX or ARM’s NEON, use register files as the shared resource (instead of main memory).


    This register-based SIMD is commonly called packed SIMD, underscoring its efficient use of register files for parallel computation.

  3. Associative Processors (predicated/masked SIMD)

    Each parallel processing unit receives the same instruction, but the decision to execute or skip the instruction is made independently based on local data within the unit. Modern terminology characterizes this as “predicated” or “masked” SIMD, showcasing the ability to conditionally execute operations based on specific criteria within individual processing units.
    Modern SIMD ISAs such as AVX-512, as well as GPGPUs, use predication to improve cache usage efficiency and ensure effective pipelining (a short sketch follows below).

Most modern GPUs utilize a combination of these methods, most commonly SIMT with predication.
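
As a concrete illustration of predication, here is a minimal sketch using AVX-512 mask registers. It assumes a CPU with AVX-512F support, and the function name masked_saxpy is an illustrative choice. Every lane receives the same instruction, but the mask k decides which lanes actually load, compute, and store:

// Masked (predicated) SIMD with AVX-512 (sketch; compile with -mavx512f)
#include <immintrin.h>

void masked_saxpy(const float* A, const float* B, float* result, float c, int n) {
  __m512 s = _mm512_set1_ps(c);                 // broadcast c to all 16 lanes
  for (int i = 0; i < n; i += 16) {
    int rem = n - i;                            // elements left in this chunk
    __mmask16 k = (rem >= 16) ? (__mmask16)0xFFFF
                              : (__mmask16)((1u << rem) - 1);
    __m512 x = _mm512_maskz_loadu_ps(k, &A[i]); // inactive lanes read as zero
    __m512 y = _mm512_maskz_loadu_ps(k, &B[i]);
    __m512 z = _mm512_fmadd_ps(s, x, y);        // z = c * x + y in one step
    _mm512_mask_storeu_ps(&result[i], k, z);    // only active lanes are stored
  }
}

This is also the trick hinted at later in this article for handling leftover iterations without a scalar remainder loop.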

Multiple instruction streams, single data stream (MISD)

In MISD systems, multiple instruction streams operate concurrently on a single data set. This type of architecture is less common than other categories in Flynn’s Taxonomy and is often considered more theoretical than practical.

In an MISD system, different processors or processing elements receive distinct instructions, but they all operate on the same data stream. Each processor performs a unique computation on the shared data stream based on its assigned set of instructions. The results from each processor can be compared or combined to achieve a final output.

One possible use case of MISD is fault-tolerant or specialized systems in which redundant computations are performed simultaneously for error checking or reliability.

For instance, some earlier MISD-style systems were designed for fault-tolerant space mission computers; the Space Shuttle’s redundant flight control computers are a frequently cited example.
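
As a software analogy (a sketch, not a description of any real MISD machine), this redundancy can be expressed as triple modular redundancy (TMR): three independently written routines, acting as three instruction streams, process the same input, and a majority voter compares their outputs. All names and variants below are illustrative:

// MISD-style redundancy sketched as triple modular redundancy (TMR)
#include <iostream>

int square_v1(int x) { return x * x; }                   // direct multiplication
int square_v2(int x) {                                   // repeated addition
  int r = 0;
  for (int i = 0; i < x; ++i) r += x;
  return r;
}
int square_v3(int x) { return (x - 1) * (x + 1) + 1; }   // algebraic identity

int vote(int a, int b, int c) {
  if (a == b || a == c) return a; // at least two streams agree with a
  return b;                       // otherwise trust b == c (a real system would flag a mismatch)
}

int main() {
  int data = 7; // the single data stream, fed to all three instruction streams
  std::cout << vote(square_v1(data), square_v2(data), square_v3(data)) << '\n'; // prints 49
}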

Multiple instruction streams, multiple data streams (MIMD)

In MIMD systems, multiple processors or processing elements operate independently, each with its own set of instructions and data. This architecture is well-suited for applications that require parallelism across different tasks or whose workload can be divided into independent subtasks.

MIMD systems are highly flexible and represent the most widely used class of computer architecture.

Depending on how MIMD processors access memory, they are classified into two categories: shared memory and distributed memory.

Examples

Let’s try to understand these classifications using a simple example.


Suppose we have two 1-D vectors, A and B, each having an equal number of single-precision floating-point numbers (floats).


We have to multiply all the elements in A by a scalar c and then add them to the corresponding elements in B. The result must be stored in a third array called result.

Thus, we have:

Vector A: [a1, a2, a3, ...]
Vector B: [b1, b2, b3, ...]
Scalar: c

result: [c * a1 + b1, c * a2 + b2, c * a3 + b3, ...]

SISD

For the given task, the processor would handle each element of the vectors sequentially: it multiplies an element of A by the scalar, adds the corresponding element of B, and stores the result in the third vector. The operations are carried out one at a time. For example, executing this for loop sequentially, without any optimizations, is the SISD way of getting the result vector.

for(int i = 0;  i < n; ++i){
  result[i] = c * A[i] + B[i];
}

SIMD

Suppose our machine supports Advanced Vector Extensions (AVX) SIMD instructions, so a single SIMD register (YMM) can store 8 single-precision floats. The same scalar multiplication and addition can then be applied to 8 elements concurrently, and each element-wise operation is performed in parallel, enhancing efficiency.

For example, assume A and B each contain n elements. We can perform 8 data operations simultaneously, significantly improving our efficiency; only the last n % 8 iterations will be done sequentially. In fact, even those n % 8 elements can be handled efficiently using mask registers. C++ code using SIMD intrinsics can really do wonders here.

// SIMD code (with intrinsics)
#include <immintrin.h> // AVX intrinsics

int ub = n - (n % 8); // 8 floats per SIMD register
__m256 x, y, z, s, temp; // vectors of 8 float entries
s = _mm256_set1_ps(c); // broadcast the scalar c to all 8 lanes
for(int i = 0;  i < ub; i += 8){ // notice the increment of 8
  x = _mm256_loadu_ps(&A[i]); // load 8 elements of A into x
  y = _mm256_loadu_ps(&B[i]); // load 8 elements of B into y
  temp = _mm256_mul_ps(s, x); // multiply the elements of A by c
  z = _mm256_add_ps(temp, y); // add the corresponding elements of B
  _mm256_storeu_ps(&result[i], z); // store the 8 results in result
}
/* Remainder loop for the leftover n % 8 iterations.
 * Mask registers can also be used instead of this. */
for(int i = ub;  i < n; ++i){
  result[i] = c * A[i] + B[i];
}

Don’t worry if all the SIMD jargon feels a bit heavy. SIMD intrinsics deserve an article of their own, and I’ll cover them in a future article. The Intel Intrinsics Guide offers detailed documentation on how to use different intrinsics.

Modern compilers can also vectorize such loops automatically if the user passes the relevant optimization flags. Feel free to check my article on Compiler Optimizations to learn more about such compiler flags.

MIMD

In MIMD, multiple processors operate independently, each executing its own instructions on its own data set. In the given task, different processors might perform operations on different elements of the vectors simultaneously.


Depending on the shared or distributed memory model, several ways exist to solve the given task. For instance, when working with a shared memory programming model like OpenMP, different threads may work together and distribute the loop iterations among themselves (see OpenMP scheduling).
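
For instance, a minimal OpenMP sketch of this shared memory approach could look like the following (compile with -fopenmp; the function name saxpy_omp is just an illustrative choice). The work-sharing directive splits the loop iterations among the available threads:

// Shared-memory MIMD with OpenMP (compile with -fopenmp)
void saxpy_omp(const float* A, const float* B, float* result, float c, int n) {
  // Each thread executes the same loop body on its own chunk of iterations
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) {
    result[i] = c * A[i] + B[i];
  }
}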


Similarly, for a distributed memory programming model like MPI, the vectors may be distributed to different processors such that each calculates its respective chunk of the result. One of the processors may act as a coordinator, distribute the data, and collect the final result from all the processors.

For example, if p processors exist, the data may be divided into chunks of n/p elements each.

// Each processor works on n/p data elements
Processor 1: [c * a1 + b1, ...] -> Result1
Processor 2: [c * ae + be, ...] -> Result2 // e = n/p + 1
Processor 3: [c * af + bf, ...] -> Result3 // f = 2*n/p + 1
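
A minimal MPI sketch of this coordinator pattern is shown below. It is illustrative only: it assumes MPI_Init has already been called, that n is divisible by p, that the buffers A, B, and result are significant at the root (rank 0), and the function name saxpy_mpi is ours for illustration:

// Distributed-memory MIMD with MPI (sketch; assumes n % p == 0)
#include <mpi.h>
#include <vector>

void saxpy_mpi(const float* A, const float* B, float* result, float c, int n) {
  int rank, p;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  int chunk = n / p;
  std::vector<float> a(chunk), b(chunk), r(chunk);

  // The coordinator (rank 0) distributes n/p elements to every process
  MPI_Scatter(A, chunk, MPI_FLOAT, a.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
  MPI_Scatter(B, chunk, MPI_FLOAT, b.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

  for (int i = 0; i < chunk; ++i) // each process computes its own chunk
    r[i] = c * a[i] + b[i];

  // The coordinator collects the partial results back into result
  MPI_Gather(r.data(), chunk, MPI_FLOAT, result, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
}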

What next?

This brings us to the end of the first part of this blog series on multi-threading. In the next part, we’ll dig deep into SIMD and MIMD systems, mainly introducing the programming models used to exploit them.

References

[1] M. J. Flynn, “Very high-speed computing systems,” Proceedings of the IEEE, vol. 54, no. 12, 1966 (IEEE Xplore).

[2] M. J. Flynn, “Some Computer Organizations and Their Effectiveness,” IEEE Transactions on Computers, vol. C-21, no. 9, 1972 (IEEE Xplore).

[3] Intro to Parallelism with Flynn’s Taxonomy (YouTube).



