# The FFP Machine Implemented With An FPGA

Edit: If you're interested in reduction machines for functional programs in general, especially ones that work and have had three years of development as opposed to three weeks, checkout these guys: The Reduceron.
Edit 2: Also check out this Non-Von Neumann computer!

As I previously hinted, I worked on an FFP Machine recently for a school project. Unfortunately I only had about three weeks to work on it (along with everything else going on), so that time was largely spent just understanding the FFP Machine, and I didn't get very far in the code. (The code referenced in the paper to download doesn't even compile to Verilog yet, and I've actually scrapped what's there in a new project that I'll share if I work more on it. Also, the school won't have copyright over the new project.)

# The FFP Machine Implemented With An FPGA - Kevin Secretan

Abstract

While researching hardware specializations for functional programming such as the LISP Machine, I discovered several papers on the Formal Functional Programming (FFP) Machine. The architecture relies on a set of small-grain processors working concurrently on a program expression to reduce it to an answer, which made the project a good candidate for implementation on an FPGA. I proceeded to implement the necessary parts of the FFP Machine, however as of this writing I have not finished a programmable version. With a few more man-days of work, it would still be a limited implementation meant as proof-of-concept and groundwork for future projects. I found it difficult to build as this was my first large-scale project using an FPGA, though I can describe the intended functionality of program reduction in action and explain the parallelism. This paper will focus on a summary of what the FFP Machine is and some implementation details.

## 1 Introduction

### 1.1 Goal

The goal of this project was to implement the necessary pieces of the FFP Machine and demonstrate the functionality to an enough extent that it serves as a proof of concept. The project involved learning about hardware design languages in general, learning the architecture of the FFP Machine, and building it. Due to time constraints a fully general form of the machine could not be built, but I aimed to lay the groundwork to potentially build upon in the future. The source code is in Python using the MyHDL library so it can be extended and edited easily for future work.

I first heard of the FFP Machine while researching hardware implementations for Scheme, a dialect of Lisp. I found the dissertation Three Implementation Models for Scheme[3], which as its third implementation discusses compiling Scheme to the FFP Language for execution on the FFP Machine. To my knowledge, the FFP Machine has never before been implemented on an FPGA, and any implementations at all do not exist in a form that I can access. The dissertation DOT: A distributed operating system model of a tree-structured multiprocessor[2] provides many useful implementation details for building and simulating the FFP Machine and reasoning about the performance of programs written for it, but nevertheless the paper does not reveal an actual implementation to "try at home".

### 1.2 Background

#### 1.2.1 The FFP Language and Machine

The FFP Machine is an architecture designed to execute programs written in the FFP language, first specified by John Backus in his Turing Award Lecture Can Programming Be Liberated from the von Neumann Style? A functional Style and its Algebra of Programs.[1] The FFP language is a (formal) functional programming language similar to Lisp but with many syntactic sugaring removed in order to make it suitable for machine execution. The language works via program reduction rather than evaluation, and by implementing it in a parallel environment such as an FPGA board it can do many operations normally considered to take linear time (such as summing a list of numbers) in constant time.

A simple FFP program is (+ :<4, 6, 8>) which reduces to the value 18. From this simple program an intuition can be formed of what syntactical elements the FFP Language has available to it: all expressions start with a left parenthesis and are immediately followed by a function (or a sub-expression that reduces to a function), such as +, which will correspond to a "machine code" primitive executed by the FPGA, which for the case of this project is written in MyHDL. The colon separates the function from its arguments, and the angle brackets denote a list whose elements are separated by commas. At present the user of my implementation must write the FFP program in a slightly lower-level fashion, but a compiler to that lower-level shouldn't require too much work. Rather than spend time on that, however, it might be better to design a string interpreter directly on the FPGA to evaluate strings on the fly and avoid any compilation at all!

Gyula Magó produced an excellent paper[6] called The FFP Machine which goes into the nitty-gritty details of the architecture and overall design of the FFP Machine and has served as my primary reference, and also is where the examples given in this paper come from. This paper presents a heavily summarized view of the FFP machine and deals more with the specific HDL implementations of the parts I got working rather than the system as a whole.

## 2 Methods, Techniques, and Design

### 2.1 Software

#### 2.1.1 Tools

Originally the goal was to use straight Verilog to implement the FFP Machine, since it was known and compilation was simple using Quartus' IDE. As I learned more about Verilog, however, I discovered MyHDL which provides a mechanism for writing Python code that compiles down to either Verilog or VHDL. I chose to use MyHDL instead as the syntax is cleaner, I have access to Python's power, and certain historical warts involving blocking vs. non-blocking statements and dealing with signed vs. unsigned integers in Verilog could be avoided. In the end I compile everything down to Verilog for final loading into Quartus' IDE which can program Altera's DE2 FPGA development board with my FFP Machine and some program to run.

#### 2.1.2 Building

In order to build a program in the FFP Language and load it onto my FFP Machine, it is necessary to write the principal representation of the FFP expression yourself and stick it into the top_level.py module, then run Python with the MyHDL module to convert it into Verilog which you can then use to program an FPGA. Future work on this step includes having a routine in Verilog that can read the program off an SD card instead of bundling it with the FFP Machine itself.

### 2.2 FFP Machine Design

This is a summary of the FFP Machine architecture described by Magó in his The FFP Machine paper. At the highest level, an FFP Machine consists of an interprocess communication network of T-cells (Tree) connected at the Leaves to an array of L-cells.

Figure 1: FFP high-level overview.

Each L-cell holds exactly one symbol in the FFP expression, hence the program
(+ :<4, 6, 8>) is represented in the array of L-cells, called the Principal Representation, as:

Figure 2: Principal Representation of a program

Note that since each L-cell holds one symbol this reduces the FFP language syntax, no longer requiring colons (since the first symbol after a parenthesis is either a function or a sub-expression that reduces to a function) or commas (since the boundaries of the L-cell act as delimiters).

The Auxiliary Representation of a program is constructed dynamically within the L-cells, and it consists of three pieces of data: some number of selectors, a relative level number (rln), and an index. To illustrate this form of expression, I will use the more interesting program (TR : <<2, 4, 6>, <3, 5, 7>>) where TR is the matrix transpose function.

Figure 3 helps explain selectors, which designate the path to reach some node if the principal representation is made into a tree. Hence the path (2,1,2) represents the number 4, while the path (1) denotes the function TR. (The path could also be represented as (1,0,0), since 0's denote an end.) The RLN is seen as the height of the node in the tree, and the index is simply the index into the array of L-cells for each item. Table 1 shows both the principal representation of this program and the auxiliary representation information.

Figure 3: The tree-form of the FFP expression (TR : <<2, 4, 6>, <3, 5, 7>>). Edges of the tree are the selector values.

       Principal:  (  TR < < 2 4 6 > <                    3   5   7    >   >    )
1st selector 0    1    2    2   2 2 2      2   2    2   2   2    2   2    0
2nd selector  0    0    0    1   1 1 1      1   2    2   2   2    2   0    0
3rd selector 0    0    0    0   1 2 3      0   0    1   2   3    0   0    0
rln     0    1    1    2   3 3 3      2   2    3   3   3    2   1    0
index     1    2    3    4   5 6 7      8   9   10  11  12   13  14   15


Table 1: Auxiliary Representation of (TR : <<2, 4, 6>, <3, 5, 7>>)

Computing the auxiliary representation is described further on.

### 2.3 Running a program

One machine cycle consists of three phases: partitioning, execution, and storage management. This corresponds to having a global phase Signal in HDL that can activate different branches of code.

Partitioning is where the machine locates any programs that can be reduced, then partitions the communication network of T cells to match up with the L cells to reduce. The actual actions it does in the T-network are described later on.

Execution consists of initialization, which will only happen once, and actual execution which may be suspended and resumed in a further machine cycle. Initialization consists of requesting whatever primitive program is needed if an L-cell contains a function, creating the auxiliary representation of the program, and receiving the L-cell primitive. The execution part executes the local program it received by communicating in the T-cell network with message waves, and at the end either requests extra space and suspends execution. My implementation does not contain the space-request feature. The primitive code that is run is different than a standard program and may also use the T-cell network for processing, though not always.

Storage management changes the principal representation in the L-array by erasing cells and rewriting values, also creating empty L-cells and moving things around if any were requested by the execution phase. My implementation just does the remapping then scans for any singular result it can bring back to display on the output.

### 2.4 Message passing

#### 2.4.1 T Cells

The partitioning phase sets the two partitioning switches (shown in Figure 4) in each T cell to one of four possible configurations (shown in Figure 5). The configurations determine which messages can be sent by the two children of each T cell, and they are determined by whether or not subtrees contain parentheses which denote reducible programs.

Figure 4: Shows the partitioning switches. The inner circle is a message processor.

Figure 5: The four partitioning configurations whose leaf nodes contain some portion of an application.

Partitioning is done in a parallel operation where each L-cell sends its information about parentheses up the network of trees. Each T-cell in turn processes the information in the two child-messages and sends it up to its parent T-cell. By the time it reaches the root node, the partitioning has been completed.

Switches are set according to whether subtrees contain parentheses: if the left (and correspondingly right) subtree contains no parentheses, the left (and respectively right) partitioning switch of the T cell is set to the inward position.

Each message consists of one of five possible values (encoded as 0 through 4 (requiring 3 bits)) depending on the information about its children nodes:
• -- : there are no parentheses in the subtree
• (): the leftmost parenthesis is "(" and the rightmost is ")"
• ((: the leftmost is "(" and the rightmost is "("
• )): the leftmost is ")" and the rightmost is ")"
• )(: the leftmost is ")" and the rightmost is "("

For the programs I'm interested in initially demonstrating the messages will be either -- or () at the top of the tree. A handy guide to determine which message packet should be sent to the parent is given in Table 2. (If a subtree contains only one parenthesis, then it's both left and rightmost, so this happens at the T cells closer to the L-cells.)

                      R X L     --   () ((  ))   )(
--     --   () ((  ))   )(
()     ()   () ((  ()   ((
((     ((   () ((  ()*  ((*
))     ))   )) )(  ))   )(
)(     )(   )) )(  ))*  )(*


Table 2: Computing which message to send to the parent T-cell, the horizontal axis is the right subtree, the vertical axis is the left subtree. The four items marked with an asterisk designate that the sending T-cell is an effective "root".

### 2.5 Computing Auxiliary Representation

The auxiliary representation is computed using the T-cells. Each L-cell sends an upsweep message through the network and when that message hits a root node, that root broadcasts a message in a downsweep back to each L-cell which stores the necessary data. Thus it only takes logarithmic time in the number of L-cells to compute the values.

On the upsweep, if a comes in the left child of a T cell and b comes in on the right child communication channel, the T cell sends a+b up to its parent. On the downsweep, if the number d comes in from the parent, the T cell sends d to its left child and a+d (hence it must store a on the upsweep) to its right child. The root node starts with d = 0.

The index computation for this implementation is trivial because it is already known at compile-time, and so it will just be "baked in" to the L cell on creation. (The user must feed in the principal representation as an array at compile-time.) As the program is reduced, however, calculating the index is as simple as making a cumulative sum through each element in the L array.

The relative level number is trickier to calculate. For each element a in the L array, take the input ai = 1 if the L cell holds an opening bracket or parenthesis, ai = -1 if the L cell holds a closing bracket or parenthesis, and ai = 0 otherwise. The cumulative sum gives the RLN at each node, except for those holding opening parentheses and angle brackets, which are one above what they should be, so subtract one from them to fix it. The highest RLN computed is the number of selectors there will be.

Computing the selectors is done first by defining RLN' = 0 if the L cells hold closing parentheses or angle brackets, RLN' = RLN otherwise. A tuple of booleans is then sent as input to the L cell; the tuple's length is the max height of the tree, and the tuple's values are (if the height is 3) <RLN' == 1, RLN' == 2, RLN' == 3>, which corresponds to (by summing up each element) <S1',S2',S3'>. The tricky part here is that S2' resets to 0 if RLN1' == 1, and S3' resets to 0 if RLN2' == 1. After all is done, each Si is the value of Si' if the RLN is larger than or equal to i, and 0 otherwise. (Thus for example a < at RLN level 2 will always have 0 as its third selector.)

### 2.6 HDL Representations

The source code implemented so far is fairly complicated, see below for where to retrieve it. For a brief summary, a T-cell contains 6 different Signals: a generic msg path, the signal line from its left child and from its right child, a signal line to its parent, a signal line controlling the direction of the sweep, and a signal line signaling the phase of the machine cycle. It has storage for its position switch status and whether or not it is a root node.

An L-cell contains signals for the parent T-cell to send messages and the machine phase. It contains storage for the symbol itself, the index, the RLN, and its selectors (the last three determined in the execution phase with the help of the T-cell parent).

#### 2.6.1 Primitive programs

L-cells request programs to act on or with them. The primitive programs implemented in HDL (or in this case, a mixture of Python and HDL) can be thought of as programmed in the Leaf-cell Programming Language (LPL).

So far, only APNDL (append-to-the-left), LENGTH, and + are implemented. APNDL is defined as a simple boolean equation: if (S1 < 2) or (S2 = 2 and S3 = 0), then erase the FFP symbol. LENGTH is defined as each L-cell sending its last selector up the tree, and the max is the result which is stored in the application symbol (all other L-cells are erased). + is defined as each L-cell sending itself and summing takes place along the way up the T-cell network, with the root node finally containing the result and sending it back down to store in the application symbol and erasing all other L-cells. These examples illustrate why the auxiliary representation is so important: it helps the LPL language be parallel and fast.